caching latents... 42%|████████████████▎ | 1584/3797 [29:59<41:51, 1.13s/it][E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802833 milliseconds before timing out. [E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down. terminate called after throwing an instance of 'std::runtime_error' what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802833 milliseconds before timing out. 42%|████████████████▎ | 1589/3797 [30:05<41:28, 1.13s/it]Traceback (most recent call last): File "/opt/conda/bin/accelerate", line 8, in sys.exit(main()) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main args.func(args) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command multi_gpu_launcher(args) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher distrib_run.run(args) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run elastic_launch( File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ===================================================== ./sdxl_train.py FAILED ----------------------------------------------------- Failures: ----------------------------------------------------- Root Cause (first observed failure): [0]: time : 2023-12-06_12:27:47 host : 8969749b694f rank : 1 (local_rank: 1) exitcode : -6 (pid: 1143) error_file: traceback : Signal 6 (SIGABRT) received by PID 1143 =====================================================