# RunPod Discord Help Request - UniRig2 Container Issues

## 🚨 **Problem Summary**

My serverless container keeps dying after 55 seconds to 2+ minutes, even though it should run indefinitely while waiting for jobs. GitHub repo builds work fine, but manually pushed Docker images fail.

## 📋 **Container Details**

- **Image**: `nmn28/unirig2:v2-fixed` (Docker Hub)
- **Base**: `nvidia/cuda:12.1.0-runtime-ubuntu22.04`
- **Purpose**: ML model inference (UniRig avatar rigging)
- **Expected Behavior**: Runs indefinitely, as `runpod.serverless.start()` is designed to do

## 🔄 **Timeline of Issues & Attempts**

### **Original Issue**

- GitHub repo build: ✅ Works perfectly
- Manual Docker push: ❌ Dies after 55 seconds

### **Debugging Steps Taken**

1. **Container Execution Test**
   - Created a simple 2-minute test container
   - Result: ✅ Runs exactly 2 minutes, then exits (expected)
   - **Conclusion**: Basic container execution works
2. **Entry Point Fix**
   - Changed `CMD ["python3.11", "handler.py"]` → `ENTRYPOINT ["python3.11", "handler.py"]`
   - Result: ❌ Still dies after ~2 minutes
3. **S3 Credentials Fix**
   - Fixed missing AWS credentials in the S3 client
   - Added the proper environment variables
   - Result: ❌ Still dies
4. **Enhanced Error Handling**
   - Added comprehensive logging around `runpod.serverless.start()`
   - Added try/except with a keep-alive loop on failure
   - Result: ❌ Still dies

## 🔧 **Current Configuration**

### **Dockerfile (Key Parts)**

```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
# ... dependencies ...
WORKDIR /app
COPY . .
ENTRYPOINT ["python3.11", "handler.py"]
```

### **Handler Structure**

```python
import runpod
# ... other imports ...
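# NOTE (editor's sketch): concrete imports standing in for the elided
# "other imports" above. `time` and `logger` are used by the startup code
# below, so something like this must exist in the real handler; the logger
# name here is an assumption.
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("unirig2-handler")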
def handler(job):
    # ML processing logic
    return {"status": "success"}

# Enhanced startup with error handling
try:
    logger.info("=== CALLING runpod.serverless.start ===")
    runpod.serverless.start({
        "handler": handler,
        "return_aggregate_stream": True
    })
except Exception as e:
    logger.error(f"=== FAILED TO START: {e} ===")
    # Keep container alive for debugging
    while True:
        logger.error("Container staying alive for debugging...")
        time.sleep(30)
```

### **Environment Variables Set**

```
AWS_ACCESS_KEY_ID=<redacted>
AWS_SECRET_ACCESS_KEY=<redacted>
AWS_REGION=us-west-1
S3_BUCKET=endure-media
```

## 🤔 **Key Questions for RunPod Team**

1. **Why does the GitHub repo build work while the manual Docker push fails?**
   - Same Dockerfile, same code, different behavior
2. **What's the difference in how RunPod handles these two deployment methods?**
   - Does RunPod execute the containers differently?
   - Are there different networking/environment setups?
3. **Container dies after 2+ minutes - what could cause this?**
   - Not idle timeout (that usually kicks in at 60s+)
   - The container starts successfully (proven by the simple test)
   - `runpod.serverless.start()` might be failing silently
4. **How can I debug what's happening inside the RunPod environment?**
   - Logs show the container starts but then dies
   - No clear error messages
   - Need visibility into what RunPod is doing

## 📊 **Evidence**

### **What Works**

- ✅ GitHub repo builds and deploys successfully
- ✅ Simple test containers run for the expected duration
- ✅ Docker image builds locally without errors
- ✅ All dependencies install correctly

### **What Fails**

- ❌ Manually pushed containers die after 2+ minutes
- ❌ `runpod.serverless.start()` appears to fail silently
- ❌ No clear error messages in the logs

## 🎯 **Specific Help Needed**

1. **Debugging guidance**: How can I get more visibility into what's happening when the container dies?
2. **Deployment method differences**: What's different between GitHub builds and manual pushes?
3. **Container lifecycle**: What, other than idle timeout, could kill a container after 2+ minutes?
4. **RunPod serverless startup**: Are there known issues that make `runpod.serverless.start()` fail?

## 📁 **Additional Context**

- Using RunPod Python SDK version 1.3.0
- Container holds ~6GB of ML models
- Heavy dependencies: PyTorch, Blender, etc.
- Works perfectly when deployed via the GitHub repo

**Please help! This has been blocking production deployment for days.**
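One difference between the two deployment paths worth ruling out (an editor's suggestion, not something confirmed by RunPod): a GitHub-connected build is produced on RunPod's own amd64 build infrastructure, while a local `docker build` on an Apple Silicon machine defaults to `linux/arm64`, and such an image will not run correctly on amd64 GPU hosts. A minimal sketch, assuming only standard `docker inspect` output, for checking the platform of the image you pushed:

```python
import json
import subprocess

def image_platform(inspect_json: str) -> str:
    """Parse `docker inspect <image>` JSON and return 'os/architecture'."""
    info = json.loads(inspect_json)[0]
    return f"{info['Os']}/{info['Architecture']}"

def check_image(image: str) -> str:
    """Return the local image's platform and warn if it isn't linux/amd64."""
    out = subprocess.run(
        ["docker", "inspect", image],
        capture_output=True, text=True, check=True,
    ).stdout
    platform = image_platform(out)
    if platform != "linux/amd64":
        print(f"{image} is {platform}; rebuild with "
              f"`docker buildx build --platform linux/amd64 ...` and re-push")
    return platform
```

For example, `check_image("nmn28/unirig2:v2-fixed")` run on the machine that pushed the image would show whether the manual push differs from the GitHub build in architecture alone.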