Can comfyui-xdit run in multiple servers? #350

Open
VincentXWD opened this issue Nov 15, 2024 · 3 comments

@VincentXWD

Hello developers,
I'm trying to use comfyui-xdit from xDiT (version 3.3) on 2 servers with 4 NVIDIA 3090 GPUs in total (2 per server). I use the commands below to start the service:


torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=192.168.10.100 --master_port=6000 comfyui-xdit/host.py --model=/home/mnt/wdxu/models/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree=4 --ulysses_degree=1 --ring_degree=1 --height=512 --width=512

torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=192.168.10.100 --master_port=6000 comfyui-xdit/host.py --model=/home/mnt/wdxu/models/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree=4 --ulysses_degree=1 --ring_degree=1 --height=512 --width=512
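
For reference, torchrun treats these two launches as a single job; the mapping below is derived from the flags above and matches the [Rank 2]/[Rank 3] lines in the worker log further down:

# world_size = nnodes * nproc_per_node = 2 * 2 = 4
#   node_rank=0  ->  global ranks 0,1  (local GPUs 0,1 on 192.168.10.100)
#   node_rank=1  ->  global ranks 2,3  (local GPUs 0,1 on the worker server)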

I cannot start the service on the worker node and get the logs below:

W1115 21:02:04.800000 140165111768896 torch/distributed/run.py:779]
W1115 21:02:04.800000 140165111768896 torch/distributed/run.py:779] *****************************************
W1115 21:02:04.800000 140165111768896 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1115 21:02:04.800000 140165111768896 torch/distributed/run.py:779] *****************************************
/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
WARNING 11-15 21:02:13 [args.py:320] Distributed environment is not initialized. Initializing...
DEBUG 11-15 21:02:13 [parallel_state.py:179] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
WARNING 11-15 21:02:13 [args.py:320] Distributed environment is not initialized. Initializing...
DEBUG 11-15 21:02:13 [parallel_state.py:179] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
[W1115 21:02:13.809330666 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W1115 21:02:13.809348597 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[W1115 21:02:13.809353447 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W1115 21:02:13.809372718 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
INFO 11-15 21:02:18 [config.py:163] Pipeline patch number not set, using default value 4
INFO 11-15 21:02:18 [config.py:163] Pipeline patch number not set, using default value 4
[Rank 2] 2024-11-15 21:02:18 - INFO - Initializing model on GPU: 0
Loading pipeline components...:   0%|                                                           | 0/9 [00:00<?, ?it/s][Rank 3] 2024-11-15 21:02:18 - INFO - Initializing model on GPU: 1
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.02s/it]
Loading pipeline components...:  11%|█████▋                                             | 1/9 [00:06<00:49,  6.16s/it]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...:  56%|████████████████████████████▎                      | 5/9 [00:06<00:05,  1.36s/it]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|███████████████████████████████████████████████████| 9/9 [00:06<00:00,  1.33it/s]
WARNING 11-15 21:02:25 [runtime_state.py:63] Model parallel is not initialized, initializing... | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.05it/s]
Loading pipeline components...: 100%|███████████████████████████████████████████████████| 9/9 [00:08<00:00,  1.01it/s]
WARNING 11-15 21:02:27 [runtime_state.py:63] Model parallel is not initialized, initializing...
INFO 11-15 21:02:52 [base_pipeline.py:290] Transformer backbone found, paralleling transformer...
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping .pos_embed in model class SD3Transformer2DModel with xFuserPatchEmbedWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping pos_embed.module.proj in model class SD3Transformer2DModel with xFuserConv2dWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping transformer_blocks.0.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping transformer_blocks.1.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping transformer_blocks.2.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping transformer_blocks.3.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping transformer_blocks.4.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping transformer_blocks.5.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
INFO 11-15 21:02:52 [base_pipeline.py:290] Transformer backbone found, paralleling transformer...
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping .pos_embed in model class SD3Transformer2DModel with xFuserPatchEmbedWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping pos_embed.module.proj in model class SD3Transformer2DModel with xFuserConv2dWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.0.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.1.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.2.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.3.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.4.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.5.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
  0%|                                                                                           | 0/3 [00:00<?, ?it/s]f039463:3009575:3009575 [1] NCCL INFO cudaDriverVersion 12040
f039463:3009574:3009574 [0] NCCL INFO cudaDriverVersion 12040
f039463:3009575:3009575 [1] NCCL INFO Bootstrap : Using eno1:192.168.10.103<0>
f039463:3009574:3009574 [0] NCCL INFO Bootstrap : Using eno1:192.168.10.103<0>
f039463:3009574:3009574 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
f039463:3009575:3009575 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
f039463:3009575:3009897 [1] NCCL INFO NET/IB : No device found.
f039463:3009575:3009897 [1] NCCL INFO NET/Socket : Using [0]eno1:192.168.10.103<0> [1]enxb03af2b6059f:169.254.3.1<0> [2]virbr0:192.168.122.1<0>
f039463:3009575:3009897 [1] NCCL INFO Using non-device net plugin version 0
f039463:3009575:3009897 [1] NCCL INFO Using network Socket
f039463:3009574:3009896 [0] NCCL INFO NET/IB : No device found.
f039463:3009574:3009896 [0] NCCL INFO NET/Socket : Using [0]eno1:192.168.10.103<0> [1]enxb03af2b6059f:169.254.3.1<0> [2]virbr0:192.168.122.1<0>
f039463:3009574:3009896 [0] NCCL INFO Using non-device net plugin version 0
f039463:3009574:3009896 [0] NCCL INFO Using network Socket
f039463:3009575:3009897 [1] NCCL INFO comm 0x3a7df850 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId c1000 commId 0x7d3bbe1348141982 - Init START
f039463:3009574:3009896 [0] NCCL INFO comm 0x3aadc260 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId 81000 commId 0x7d3bbe1348141982 - Init START
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO comm 0x3a7df850 rank 3 nRanks 4 nNodes 2 localRanks 2 localRank 1 MNNVL 0
f039463:3009575:3009897 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
f039463:3009575:3009897 [1] NCCL INFO P2P Chunksize set to 131072
f039463:3009574:3009896 [0] NCCL INFO comm 0x3aadc260 rank 2 nRanks 4 nNodes 2 localRanks 2 localRank 0 MNNVL 0
f039463:3009574:3009896 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
f039463:3009574:3009896 [0] NCCL INFO P2P Chunksize set to 131072
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/1
f039463:3009574:3009896 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/1
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
f039463:3009575:3009897 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/1
f039463:3009575:3009897 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/1
f039463:3009575:3009898 [1] NCCL INFO misc/socket.cc:505 -> 2 (Operation now in progress)
f039463:3009575:3009898 [1] NCCL INFO misc/socket.cc:570 -> 2
f039463:3009575:3009898 [1] NCCL INFO misc/socket.cc:589 -> 2
f039463:3009575:3009898 [1] NCCL INFO transport/net_socket.cc:339 -> 2
f039463:3009575:3009898 [1] NCCL INFO transport/net.cc:683 -> 2
f039463:3009575:3009897 [1] NCCL INFO transport/net.cc:304 -> 2
f039463:3009575:3009897 [1] NCCL INFO transport.cc:165 -> 2
f039463:3009575:3009897 [1] NCCL INFO init.cc:1222 -> 2
f039463:3009575:3009897 [1] NCCL INFO init.cc:1501 -> 2
f039463:3009575:3009897 [1] NCCL INFO group.cc:64 -> 2 [Async thread]
f039463:3009575:3009575 [1] NCCL INFO group.cc:418 -> 2
f039463:3009575:3009575 [1] NCCL INFO init.cc:1876 -> 2

f039463:3009575:3009898 [1] proxy.cc:1533 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection

f039463:3009575:3009898 [1] proxy.cc:1567 NCCL WARN [Proxy Service 3] Failed to execute operation Connect from rank 3, retcode 3
f039463:3009575:3009575 [1] NCCL INFO comm 0x3a7df850 rank 3 nranks 4 cudaDev 1 busId c1000 - Abort COMPLETE
  0%|                                                                                           | 0/3 [02:12<?, ?it/s]
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/mnt/wdxu/github/xDiT/comfyui-xdit/host.py", line 248, in <module>
[rank3]:
[rank3]:   File "/home/mnt/wdxu/github/xDiT/comfyui-xdit/host.py", line 98, in initialize
[rank3]:     pipe.prepare_run(input_config)
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/model_executor/pipelines/pipeline_stable_diffusion_3.py", line 75, in prepare_run
[rank3]:     self.__call__(
[rank3]:   File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]:     return func(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/model_executor/pipelines/base_pipeline.py", line 166, in data_parallel_fn
[rank3]:     return func(self, *args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/model_executor/pipelines/base_pipeline.py", line 186, in check_naive_forward_fn
[rank3]:     return func(self, *args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/model_executor/pipelines/pipeline_stable_diffusion_3.py", line 348, in __call__
[rank3]:     latents = self._sync_pipeline(
[rank3]:               ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/model_executor/pipelines/pipeline_stable_diffusion_3.py", line 444, in _sync_pipeline
[rank3]:     latents = get_pp_group().pipeline_recv()
[rank3]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/core/distributed/group_coordinator.py", line 925, in pipeline_recv
[rank3]:     self._check_shape_and_buffer(recv_prev=True, name=name, segment_idx=idx)
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/core/distributed/group_coordinator.py", line 796, in _check_shape_and_buffer
[rank3]:     recv_prev_shape = self._communicate_shapes(
[rank3]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/core/distributed/group_coordinator.py", line 859, in _communicate_shapes
[rank3]:     reqs = torch.distributed.batch_isend_irecv(ops)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2160, in batch_isend_irecv
[rank3]:     p2p_op.op(p2p_op.tensor, p2p_op.peer, p2p_op.group, p2p_op.tag)
[rank3]:   File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1914, in irecv
[rank3]:     return pg.recv([tensor], group_src_rank, tag)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank3]: Last error:

W1115 21:05:11.214000 140165111768896 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3009574 closing signal SIGTERM
/home/wdxu/miniconda3/envs/py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
E1115 21:05:11.479000 140165111768896 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 3009575) of binary: /home/wdxu/miniconda3/envs/py311/bin/python
Traceback (most recent call last):
  File "/home/wdxu/miniconda3/envs/py311/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
comfyui-xdit/host.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-15_21:05:11
  host      : f039463.local
  rank      : 3 (local_rank: 1)
  exitcode  : 1 (pid: 3009575)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Could anyone tell me if comfyui-xdit runs on multiple servers? Thanks.
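
For what it's worth, the ncclSystemError above can be checked independently of xDiT: a bare torch.distributed all_reduce launched with the same torchrun flags should succeed if cross-node NCCL networking is healthy. A minimal sketch (the /tmp path and file name are arbitrary and not part of xDiT):

cat > /tmp/nccl_smoke.py <<'EOF'
# Minimal cross-node NCCL check: every rank should print the world size (4 here).
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # reads the env:// values that torchrun sets
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
x = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(x)                       # default op is SUM: one 1 from each of the 4 ranks
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {x.item()}")
dist.destroy_process_group()
EOF

# Run this on the master, and repeat with --node_rank=1 on the worker,
# keeping master_addr/master_port identical on both machines:
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=192.168.10.100 --master_port=6000 /tmp/nccl_smoke.py

If this hangs or fails with the same ncclSystemError, the problem is in the network path between the two hosts rather than in comfyui-xdit itself.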

@Lay2000
Collaborator

Lay2000 commented Nov 18, 2024

@VincentXWD Hello,

  1. comfyui-xdit has been renamed to http-service now, and the previous functionality may vary, so we suggest you update your xDiT to the latest version.

  2. Regarding the problem you've met, we recommend checking the following:

    • Ensure that the master_port is different for each server to avoid port conflicts. You can set different master_port numbers for each server.
    • For the case where each server has two GPUs, you need to set the pipefusion_parallel_degree parameter to 2 to ensure that each GPU is used correctly.

    You can try launching the two services as follows:

    torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=192.168.10.100 --master_port=6000 comfyui-xdit/host.py --model=/home/mnt/wdxu/models/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree=2 --ulysses_degree=1 --ring_degree=1 --height=512 --width=512
    
    torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=192.168.10.100 --master_port=6001 comfyui-xdit/host.py --model=/home/mnt/wdxu/models/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree=2 --ulysses_degree=1 --ring_degree=1 --height=512 --width=512
    

    Please note that I have set the master_port to 6000 and 6001 respectively to avoid port conflicts.

I hope this information helps you resolve the issue. If you have any other questions or need further assistance, please feel free to contact us.

@VincentXWD
Author

VincentXWD commented Nov 18, 2024

@Lay2000 Thanks for your reply. I have now updated xDiT to the latest version and tried your advice. Thanks again!
You mentioned that the master_port should not be identical, but the processes still wait when I set 6000 for the master server and 6001 for the worker server.
And when I set pipefusion_parallel_degree=2, I get the following assertion error:

[rank1]: AssertionError: parallel_world_size 2 must be equal to world_size 4
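
If I read the assertion literally, the product of the configured parallel degrees has to equal the torchrun world size, which is nnodes * nproc_per_node = 4 here, so pipefusion_parallel_degree=2 with ulysses_degree=1 and ring_degree=1 only accounts for 2 of the 4 ranks. Combinations whose product is 4 would look like the following (the ulysses/ring variants are assumptions and depend on SD3 supporting those degrees):

--pipefusion_parallel_degree=4 --ulysses_degree=1 --ring_degree=1   # 4 * 1 * 1 = 4 (the original setting)
--pipefusion_parallel_degree=2 --ulysses_degree=2 --ring_degree=1   # 2 * 2 * 1 = 4
--pipefusion_parallel_degree=2 --ulysses_degree=1 --ring_degree=2   # 2 * 1 * 2 = 4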

Any more suggestions? Thanks!

@VincentXWD
Author

Btw, when I run the commands below:

torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=192.168.10.100 --master_port=6000 http-service/host.py --model=/home/mnt/wdxu/models/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree=4 --ulysses_degree=1 --ring_degree=1 --height=512 --width=512

torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=192.168.10.100 --master_port=6000 http-service/host.py --model=/home/mnt/wdxu/models/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree=4 --ulysses_degree=1 --ring_degree=1 --height=512 --width=512

The worker gets stuck here:

INFO 11-18 16:33:40 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.4.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-18 16:33:40 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.5.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-18 16:33:40 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
  0%|                                                                               | 0/3 [00:00<?, ?it/s]
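
One detail from the NCCL log in the first post that may matter here: NCCL was choosing among three interfaces (eno1, enxb03af2b6059f, virbr0) and the failure happened on the socket transport, so pinning NCCL to the NIC that actually carries the 192.168.10.x addresses is a common workaround worth trying before relaunching the same two commands. A sketch (interface name taken from that log; adjust per machine):

# On BOTH servers, in the shell that will run torchrun:
export NCCL_SOCKET_IFNAME=eno1   # keep NCCL's socket transport off virbr0 and the 169.254.x link-local NIC

This only narrows NCCL's interface selection; it does not change anything in xDiT itself.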
