# RunPod Discord Help Request - UniRig2 Container Issues

## 🚨 **Problem Summary**

My serverless container keeps dying after 55 seconds to 2+ minutes, even though it should run indefinitely while waiting for jobs. GitHub repo builds work fine, but manually pushed Docker images fail.

## 📋 **Container Details**

- **Image**: `nmn28/unirig2:v2-fixed` (Docker Hub)
- **Base**: `nvidia/cuda:12.1.0-runtime-ubuntu22.04`
- **Purpose**: ML model inference (UniRig avatar rigging)
- **Expected Behavior**: Runs indefinitely, as `runpod.serverless.start()` is designed to do

## 🔄 **Timeline of Issues & Attempts**

### **Original Issue**

- GitHub repo build: ✅ Works perfectly
- Manual Docker push: ❌ Dies after 55 seconds

### **Debugging Steps Taken**

1. **Container Execution Test**
   - Created a simple 2-minute test container
   - Result: ✅ Runs exactly 2 minutes, then exits (expected)
   - **Conclusion**: Basic container execution works
2. **Entry Point Fix**
   - Changed `CMD ["python3.11", "handler.py"]` → `ENTRYPOINT ["python3.11", "handler.py"]`
   - Result: ❌ Still dies after ~2 minutes
3. **S3 Credentials Fix**
   - Fixed missing AWS credentials in the S3 client
   - Added the proper environment variables
   - Result: ❌ Still dies
4. **Enhanced Error Handling**
   - Added comprehensive logging around `runpod.serverless.start()`
   - Added try/except with a keep-alive loop on failure
   - Result: ❌ Still dies

## 🔧 **Current Configuration**

### **Dockerfile (Key Parts)**

```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
# ... dependencies ...
WORKDIR /app
COPY . .
ENTRYPOINT ["python3.11", "handler.py"]
```

### **Handler Structure**

```python
import runpod
# ... other imports ...
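# NOTE (editor's sketch): concrete imports standing in for the elided
# "other imports" above. `time` and `logger` are used by the startup code
# below, so something like this must exist in the real handler; the logger
# name here is an assumption.
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("unirig2-handler")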
def handler(job):
    # ML processing logic
    return {"status": "success"}

# Enhanced startup with error handling
try:
    logger.info("=== CALLING runpod.serverless.start ===")
    runpod.serverless.start({
        "handler": handler,
        "return_aggregate_stream": True
    })
except Exception as e:
    logger.error(f"=== FAILED TO START: {e} ===")
    # Keep container alive for debugging
    while True:
        logger.error("Container staying alive for debugging...")
        time.sleep(30)
```

### **Environment Variables Set**

```
AWS_ACCESS_KEY_ID=<redacted>
AWS_SECRET_ACCESS_KEY=<redacted>
AWS_REGION=us-west-1
S3_BUCKET=endure-media
```

## 🤔 **Key Questions for RunPod Team**

1. **Why does the GitHub repo build work while the manual Docker push fails?**
   - Same Dockerfile, same code, different behavior
2. **What's the difference in how RunPod handles these two deployment methods?**
   - Does RunPod execute the containers differently?
   - Are there different networking/environment setups?
3. **Container dies after 2+ minutes - what could cause this?**
   - Not idle timeout (that usually kicks in at 60s+)
   - The container starts successfully (proven by the simple test)
   - `runpod.serverless.start()` might be failing silently
4. **How can I debug what's happening inside the RunPod environment?**
   - Logs show the container starts but then dies
   - No clear error messages
   - Need visibility into what RunPod is doing

## 📊 **Evidence**

### **What Works**

- ✅ GitHub repo builds and deploys successfully
- ✅ Simple test containers run for the expected duration
- ✅ Docker image builds locally without errors
- ✅ All dependencies install correctly

### **What Fails**

- ❌ Manually pushed containers die after 2+ minutes
- ❌ `runpod.serverless.start()` appears to fail silently
- ❌ No clear error messages in the logs

## 🎯 **Specific Help Needed**

1. **Debugging guidance**: How can I get more visibility into what's happening when the container dies?
2. **Deployment method differences**: What's different between GitHub builds and manual pushes?
3. **Container lifecycle**: What, other than idle timeout, could kill a container after 2+ minutes?
4. **RunPod serverless startup**: Are there known issues that make `runpod.serverless.start()` fail?

## 📁 **Additional Context**

- Using RunPod Python SDK version 1.3.0
- Container holds ~6GB of ML models
- Heavy dependencies: PyTorch, Blender, etc.
- Works perfectly when deployed via the GitHub repo

**Please help! This has been blocking production deployment for days.**
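One difference between the two deployment paths worth ruling out (an editor's suggestion, not something confirmed by RunPod): a GitHub-connected build is produced on RunPod's own amd64 build infrastructure, while a local `docker build` on an Apple Silicon machine defaults to `linux/arm64`, and such an image will not run correctly on amd64 GPU hosts. A minimal sketch, assuming only standard `docker inspect` output, for checking the platform of the image you pushed:

```python
import json
import subprocess

def image_platform(inspect_json: str) -> str:
    """Parse `docker inspect <image>` JSON and return 'os/architecture'."""
    info = json.loads(inspect_json)[0]
    return f"{info['Os']}/{info['Architecture']}"

def check_image(image: str) -> str:
    """Return the local image's platform and warn if it isn't linux/amd64."""
    out = subprocess.run(
        ["docker", "inspect", image],
        capture_output=True, text=True, check=True,
    ).stdout
    platform = image_platform(out)
    if platform != "linux/amd64":
        print(f"{image} is {platform}; rebuild with "
              f"`docker buildx build --platform linux/amd64 ...` and re-push")
    return platform
```

For example, `check_image("nmn28/unirig2:v2-fixed")` run on the machine that pushed the image would show whether the manual push differs from the GitHub build in architecture alone.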