Can comfyui-xdit run in multiple servers? #350

Open
VincentXWD opened this issue Nov 15, 2024 · 3 comments

@VincentXWD

Hello developers,
I'm trying to use comfyui-xdit from xDiT (version 3.3) on 2 servers with 4 NVIDIA 3090 GPUs in total (2 per server). I use the commands below to start the service:


torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=192.168.10.100 --master_port=6000 comfyui-xdit/host.py --model=/home/mnt/wdxu/models/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree=4 --ulysses_degree=1 --ring_degree=1 --height=512 --width=512

torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=192.168.10.100 --master_port=6000 comfyui-xdit/host.py --model=/home/mnt/wdxu/models/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree=4 --ulysses_degree=1 --ring_degree=1 --height=512 --width=512
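
For reference, torchrun treats these two launches as a single job; the mapping below is derived from the flags above and matches the [Rank 2]/[Rank 3] lines in the worker log further down:

# world_size = nnodes * nproc_per_node = 2 * 2 = 4
#   node_rank=0  ->  global ranks 0,1  (local GPUs 0,1 on 192.168.10.100)
#   node_rank=1  ->  global ranks 2,3  (local GPUs 0,1 on the worker server)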

I cannot start the service on the worker node and get the logs below:

W1115 21:02:04.800000 140165111768896 torch/distributed/run.py:779]
W1115 21:02:04.800000 140165111768896 torch/distributed/run.py:779] *****************************************
W1115 21:02:04.800000 140165111768896 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1115 21:02:04.800000 140165111768896 torch/distributed/run.py:779] *****************************************
/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
WARNING 11-15 21:02:13 [args.py:320] Distributed environment is not initialized. Initializing...
DEBUG 11-15 21:02:13 [parallel_state.py:179] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
WARNING 11-15 21:02:13 [args.py:320] Distributed environment is not initialized. Initializing...
DEBUG 11-15 21:02:13 [parallel_state.py:179] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
[W1115 21:02:13.809330666 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W1115 21:02:13.809348597 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[W1115 21:02:13.809353447 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W1115 21:02:13.809372718 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
INFO 11-15 21:02:18 [config.py:163] Pipeline patch number not set, using default value 4
INFO 11-15 21:02:18 [config.py:163] Pipeline patch number not set, using default value 4
[Rank 2] 2024-11-15 21:02:18 - INFO - Initializing model on GPU: 0
Loading pipeline components...:   0%|                                                           | 0/9 [00:00<?, ?it/s][Rank 3] 2024-11-15 21:02:18 - INFO - Initializing model on GPU: 1
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.02s/it]
Loading pipeline components...:  11%|█████▋                                             | 1/9 [00:06<00:49,  6.16s/it]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...:  56%|████████████████████████████▎                      | 5/9 [00:06<00:05,  1.36s/it]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|███████████████████████████████████████████████████| 9/9 [00:06<00:00,  1.33it/s]
WARNING 11-15 21:02:25 [runtime_state.py:63] Model parallel is not initialized, initializing... | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.05it/s]
Loading pipeline components...: 100%|███████████████████████████████████████████████████| 9/9 [00:08<00:00,  1.01it/s]
WARNING 11-15 21:02:27 [runtime_state.py:63] Model parallel is not initialized, initializing...
INFO 11-15 21:02:52 [base_pipeline.py:290] Transformer backbone found, paralleling transformer...
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping .pos_embed in model class SD3Transformer2DModel with xFuserPatchEmbedWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping pos_embed.module.proj in model class SD3Transformer2DModel with xFuserConv2dWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping transformer_blocks.0.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping transformer_blocks.1.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping transformer_blocks.2.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping transformer_blocks.3.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping transformer_blocks.4.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 3] Wrapping transformer_blocks.5.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
INFO 11-15 21:02:52 [base_pipeline.py:290] Transformer backbone found, paralleling transformer...
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping .pos_embed in model class SD3Transformer2DModel with xFuserPatchEmbedWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping pos_embed.module.proj in model class SD3Transformer2DModel with xFuserConv2dWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.0.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.1.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.2.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.3.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.4.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.5.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-15 21:02:52 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
  0%|                                                                                           | 0/3 [00:00<?, ?it/s]f039463:3009575:3009575 [1] NCCL INFO cudaDriverVersion 12040
f039463:3009574:3009574 [0] NCCL INFO cudaDriverVersion 12040
f039463:3009575:3009575 [1] NCCL INFO Bootstrap : Using eno1:192.168.10.103<0>
f039463:3009574:3009574 [0] NCCL INFO Bootstrap : Using eno1:192.168.10.103<0>
f039463:3009574:3009574 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
f039463:3009575:3009575 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
f039463:3009575:3009897 [1] NCCL INFO NET/IB : No device found.
f039463:3009575:3009897 [1] NCCL INFO NET/Socket : Using [0]eno1:192.168.10.103<0> [1]enxb03af2b6059f:169.254.3.1<0> [2]virbr0:192.168.122.1<0>
f039463:3009575:3009897 [1] NCCL INFO Using non-device net plugin version 0
f039463:3009575:3009897 [1] NCCL INFO Using network Socket
f039463:3009574:3009896 [0] NCCL INFO NET/IB : No device found.
f039463:3009574:3009896 [0] NCCL INFO NET/Socket : Using [0]eno1:192.168.10.103<0> [1]enxb03af2b6059f:169.254.3.1<0> [2]virbr0:192.168.122.1<0>
f039463:3009574:3009896 [0] NCCL INFO Using non-device net plugin version 0
f039463:3009574:3009896 [0] NCCL INFO Using network Socket
f039463:3009575:3009897 [1] NCCL INFO comm 0x3a7df850 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId c1000 commId 0x7d3bbe1348141982 - Init START
f039463:3009574:3009896 [0] NCCL INFO comm 0x3aadc260 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId 81000 commId 0x7d3bbe1348141982 - Init START
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO comm 0x3a7df850 rank 3 nRanks 4 nNodes 2 localRanks 2 localRank 1 MNNVL 0
f039463:3009575:3009897 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
f039463:3009575:3009897 [1] NCCL INFO P2P Chunksize set to 131072
f039463:3009574:3009896 [0] NCCL INFO comm 0x3aadc260 rank 2 nRanks 4 nNodes 2 localRanks 2 localRank 0 MNNVL 0
f039463:3009574:3009896 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
f039463:3009574:3009896 [0] NCCL INFO P2P Chunksize set to 131072
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/1
f039463:3009574:3009896 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/1
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009575:3009897 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
f039463:3009574:3009896 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f039463:3009574:3009896 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
f039463:3009575:3009897 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/1
f039463:3009575:3009897 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/1
f039463:3009575:3009898 [1] NCCL INFO misc/socket.cc:505 -> 2 (Operation now in progress)
f039463:3009575:3009898 [1] NCCL INFO misc/socket.cc:570 -> 2
f039463:3009575:3009898 [1] NCCL INFO misc/socket.cc:589 -> 2
f039463:3009575:3009898 [1] NCCL INFO transport/net_socket.cc:339 -> 2
f039463:3009575:3009898 [1] NCCL INFO transport/net.cc:683 -> 2
f039463:3009575:3009897 [1] NCCL INFO transport/net.cc:304 -> 2
f039463:3009575:3009897 [1] NCCL INFO transport.cc:165 -> 2
f039463:3009575:3009897 [1] NCCL INFO init.cc:1222 -> 2
f039463:3009575:3009897 [1] NCCL INFO init.cc:1501 -> 2
f039463:3009575:3009897 [1] NCCL INFO group.cc:64 -> 2 [Async thread]
f039463:3009575:3009575 [1] NCCL INFO group.cc:418 -> 2
f039463:3009575:3009575 [1] NCCL INFO init.cc:1876 -> 2

f039463:3009575:3009898 [1] proxy.cc:1533 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection

f039463:3009575:3009898 [1] proxy.cc:1567 NCCL WARN [Proxy Service 3] Failed to execute operation Connect from rank 3, retcode 3
f039463:3009575:3009575 [1] NCCL INFO comm 0x3a7df850 rank 3 nranks 4 cudaDev 1 busId c1000 - Abort COMPLETE
  0%|                                                                                           | 0/3 [02:12<?, ?it/s]
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/mnt/wdxu/github/xDiT/comfyui-xdit/host.py", line 248, in <module>
[rank3]:
[rank3]:   File "/home/mnt/wdxu/github/xDiT/comfyui-xdit/host.py", line 98, in initialize
[rank3]:     pipe.prepare_run(input_config)
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/model_executor/pipelines/pipeline_stable_diffusion_3.py", line 75, in prepare_run
[rank3]:     self.__call__(
[rank3]:   File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]:     return func(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/model_executor/pipelines/base_pipeline.py", line 166, in data_parallel_fn
[rank3]:     return func(self, *args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/model_executor/pipelines/base_pipeline.py", line 186, in check_naive_forward_fn
[rank3]:     return func(self, *args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/model_executor/pipelines/pipeline_stable_diffusion_3.py", line 348, in __call__
[rank3]:     latents = self._sync_pipeline(
[rank3]:               ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/model_executor/pipelines/pipeline_stable_diffusion_3.py", line 444, in _sync_pipeline
[rank3]:     latents = get_pp_group().pipeline_recv()
[rank3]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/core/distributed/group_coordinator.py", line 925, in pipeline_recv
[rank3]:     self._check_shape_and_buffer(recv_prev=True, name=name, segment_idx=idx)
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/core/distributed/group_coordinator.py", line 796, in _check_shape_and_buffer
[rank3]:     recv_prev_shape = self._communicate_shapes(
[rank3]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/mnt/wdxu/github/xDiT/xfuser/core/distributed/group_coordinator.py", line 859, in _communicate_shapes
[rank3]:     reqs = torch.distributed.batch_isend_irecv(ops)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2160, in batch_isend_irecv
[rank3]:     p2p_op.op(p2p_op.tensor, p2p_op.peer, p2p_op.group, p2p_op.tag)
[rank3]:   File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1914, in irecv
[rank3]:     return pg.recv([tensor], group_src_rank, tag)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank3]: Last error:

W1115 21:05:11.214000 140165111768896 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3009574 closing signal SIGTERM
/home/wdxu/miniconda3/envs/py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
E1115 21:05:11.479000 140165111768896 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 3009575) of binary: /home/wdxu/miniconda3/envs/py311/bin/python
Traceback (most recent call last):
  File "/home/wdxu/miniconda3/envs/py311/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wdxu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
comfyui-xdit/host.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-15_21:05:11
  host      : f039463.local
  rank      : 3 (local_rank: 1)
  exitcode  : 1 (pid: 3009575)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Could anyone tell me if comfyui-xdit runs on multiple servers? Thanks.
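
For what it's worth, the ncclSystemError above can be checked independently of xDiT: a bare torch.distributed all_reduce launched with the same torchrun flags should succeed if cross-node NCCL networking is healthy. A minimal sketch (the /tmp path and file name are arbitrary and not part of xDiT):

cat > /tmp/nccl_smoke.py <<'EOF'
# Minimal cross-node NCCL check: every rank should print the world size (4 here).
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # reads the env:// values that torchrun sets
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
x = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(x)                       # default op is SUM: one 1 from each of the 4 ranks
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {x.item()}")
dist.destroy_process_group()
EOF

# Run this on the master, and repeat with --node_rank=1 on the worker,
# keeping master_addr/master_port identical on both machines:
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=192.168.10.100 --master_port=6000 /tmp/nccl_smoke.py

If this hangs or fails with the same ncclSystemError, the problem is in the network path between the two hosts rather than in comfyui-xdit itself.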

@Lay2000
Collaborator

Lay2000 commented Nov 18, 2024

@VincentXWD Hello,

  1. comfyui-xdit has been renamed to http-service now, and the previous functionality may vary, so we suggest you update your xDiT to the latest version.

  2. Regarding the problem you've met, we recommend checking the following:

    • Ensure that the master_port is different for each server to avoid port conflicts. You can set different master_port numbers for each server.
    • For the case where each server has two GPUs, you need to set the pipefusion_parallel_degree parameter to 2 to ensure that each GPU is used correctly.

    You can try launching the two services as follows:

    torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=192.168.10.100 --master_port=6000 comfyui-xdit/host.py --model=/home/mnt/wdxu/models/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree=2 --ulysses_degree=1 --ring_degree=1 --height=512 --width=512
    
    torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=192.168.10.100 --master_port=6001 comfyui-xdit/host.py --model=/home/mnt/wdxu/models/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree=2 --ulysses_degree=1 --ring_degree=1 --height=512 --width=512
    

    Please note that I have set the master_port to 6000 and 6001 respectively to avoid port conflicts.

I hope this information helps you resolve the issue. If you have any other questions or need further assistance, please feel free to contact us.

@VincentXWD
Author

VincentXWD commented Nov 18, 2024

@Lay2000 Thanks for your reply. I have now updated xDiT to the latest version and tried your advice. Thanks again!
You mentioned that the master_port should not be identical, but the processes still wait when I set 6000 for the master server and 6001 for the worker server.
And when I set pipefusion_parallel_degree=2, I get the following assertion error:

[rank1]: AssertionError: parallel_world_size 2 must be equal to world_size 4
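
If I read the assertion literally, the product of the configured parallel degrees has to equal the torchrun world size, which is nnodes * nproc_per_node = 4 here, so pipefusion_parallel_degree=2 with ulysses_degree=1 and ring_degree=1 only accounts for 2 of the 4 ranks. Combinations whose product is 4 would look like the following (the ulysses/ring variants are assumptions and depend on SD3 supporting those degrees):

--pipefusion_parallel_degree=4 --ulysses_degree=1 --ring_degree=1   # 4 * 1 * 1 = 4 (the original setting)
--pipefusion_parallel_degree=2 --ulysses_degree=2 --ring_degree=1   # 2 * 2 * 1 = 4
--pipefusion_parallel_degree=2 --ulysses_degree=1 --ring_degree=2   # 2 * 1 * 2 = 4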

Any more suggestions? Thanks!

@VincentXWD
Author

Btw, when I run the commands below:

torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=192.168.10.100 --master_port=6000 http-service/host.py --model=/home/mnt/wdxu/models/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree=4 --ulysses_degree=1 --ring_degree=1 --height=512 --width=512

torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=192.168.10.100 --master_port=6000 http-service/host.py --model=/home/mnt/wdxu/models/stable-diffusion-3-medium-diffusers --pipefusion_parallel_degree=4 --ulysses_degree=1 --ring_degree=1 --height=512 --width=512

The worker gets stuck here:

INFO 11-18 16:33:40 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.4.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-18 16:33:40 [base_model.py:83] [RANK 2] Wrapping transformer_blocks.5.attn in model class SD3Transformer2DModel with xFuserAttentionWrapper
INFO 11-18 16:33:40 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
  0%|                                                                               | 0/3 [00:00<?, ?it/s]
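
One detail from the NCCL log in the first post that may matter here: NCCL was choosing among three interfaces (eno1, enxb03af2b6059f, virbr0) and the failure happened on the socket transport, so pinning NCCL to the NIC that actually carries the 192.168.10.x addresses is a common workaround worth trying before relaunching the same two commands. A sketch (interface name taken from that log; adjust per machine):

# On BOTH servers, in the shell that will run torchrun:
export NCCL_SOCKET_IFNAME=eno1   # keep NCCL's socket transport off virbr0 and the 169.254.x link-local NIC

This only narrows NCCL's interface selection; it does not change anything in xDiT itself.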
