
Commit

Merge pull request #357 from xibosun/cogvideo
Add CogVideoX1.5 support
xibosun authored Nov 20, 2024
2 parents bf953f7 + 7c5fa1d commit 353bba9
Showing 9 changed files with 300 additions and 293 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -93,6 +93,7 @@ Furthermore, xDiT incorporates optimization techniques from [DiTFastAttn](https:

<h2 id="updates">📢 Updates</h2>

* 🎉**November 20, 2024**: xDiT supports [CogVideoX-1.5](https://huggingface.co/THUDM/CogVideoX1.5-5B) and achieved a 6.12x speedup compared to the implementation in diffusers!
* 🎉**November 11, 2024**: xDiT has been applied to [mochi-1](https://github.com/xdit-project/mochi-xdit) and achieved a 3.54x speedup compared to the official open-source implementation!
* 🎉**October 10, 2024**: xDiT applied DiTFastAttn to accelerate single GPU inference for Pixart Models!
* 🎉**September 26, 2024**: xDiT has been officially used by [THUDM/CogVideo](https://github.com/THUDM/CogVideo)! The inference scripts are placed in [parallel_inference/](https://github.com/THUDM/CogVideo/blob/main/tools/parallel_inference) at their repository.
@@ -113,6 +114,7 @@ Furthermore, xDiT incorporates optimization techniques from [DiTFastAttn](https:

| Model Name | CFG | SP | PipeFusion |
| --- | --- | --- | --- |
| [🎬 CogVideoX1.5](https://huggingface.co/THUDM/CogVideoX1.5-5B) | ✔️ | ✔️ ||
| [🎬 Mochi-1](https://github.com/xdit-project/mochi-xdit) | ✔️ | ✔️ ||
| [🎬 CogVideoX](https://huggingface.co/THUDM/CogVideoX-2b) | ✔️ | ✔️ ||
| [🎬 Latte](https://huggingface.co/maxin-cn/Latte-1) || ✔️ ||
21 changes: 18 additions & 3 deletions docs/performance/cogvideo.md
@@ -1,7 +1,11 @@
## CogVideo Performance
## CogVideoX Performance
[Chinese Version](./cogvideo_zh.md)

CogVideo is a model that converts text to video. xDiT currently integrates USP technology (including Ulysses Attention and Ring Attention) and CFG parallel processing to improve inference speed, while work on PipeFusion is ongoing. We conducted a thorough analysis of the performance differences between a single GPU CogVideoX inference based on the diffusers library and our proposed parallel version when generating a 49-frame (6-second) 720x480 resolution video. We can combine different parallel methods arbitrarily to achieve varying performance. In this paper, we systematically tested the acceleration performance of xDiT on 1-12 L40 (PCIe) GPUs.
CogVideoX and CogVideoX1.5 are models that convert text or images to video. xDiT currently integrates USP technology (including Ulysses Attention and Ring Attention) and CFG parallel processing to improve inference speed, while work on PipeFusion is ongoing.

### CogVideoX-2b/5b

We conducted a thorough analysis of the performance differences between single-GPU CogVideoX inference based on the diffusers library and our parallel version when generating a 49-frame (6-second) 720x480 video. Different parallel methods can be combined arbitrarily to trade off performance; here we systematically tested xDiT's acceleration on 1-12 L40 (PCIe) GPUs, with a sample launch sketched below.
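For concreteness, a combined launch looks roughly like the following. This is a minimal sketch: the flags mirror examples/run_cogvideo.sh in this PR, while the torchrun invocation and the `--model` flag are assumptions about the script's elided launch line.

```bash
# Sketch: 6 GPUs split as Ulysses-2 x Ring-3 sequence parallelism (2 * 3 = 6).
# Add --use_cfg_parallel to fold in CFG parallel (one more factor of 2).
torchrun --nproc_per_node=6 examples/cogvideox_example.py \
    --model THUDM/CogVideoX-2b \
    --height 480 --width 720 --num_frames 49 \
    --ulysses_degree 2 --ring_degree 3 \
    --num_inference_steps 50 --warmup_steps 0 \
    --prompt "A small dog"
```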

As shown in the figures, for the base model CogVideoX-2b, significant reductions in inference latency were observed whether using Ulysses Attention, Ring Attention, or Classifier-Free Guidance (CFG) parallelism. Notably, thanks to its lower communication overhead, CFG parallelism outperforms the other two techniques. Combining sequence parallelism with CFG parallelism improves inference efficiency further, and latency continues to decrease as parallelism increases. In the optimal configuration, xDiT achieves a 4.29x speedup over single-GPU inference, reducing each iteration to just 0.49 seconds. Given CogVideoX's default 50 iterations, the diffusion loop takes 24.5 seconds in total, and end-to-end generation of the 6-second video completes in about 30 seconds.

@@ -27,4 +31,15 @@ On systems equipped with A100 GPUs, xDiT demonstrates similar acceleration effec
<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-a100-5b.png"
alt="latency-cogvideo-a100-5b">
</div>
</div>

### CogVideoX1.5-5B

Similarly, we used CogVideoX1.5-5B to generate a 161-frame video at 1360x768 resolution on a system equipped with L40 (PCIe) GPUs, and compared the inference latency of the single-GPU implementation in the diffusers library against xDiT's parallel version.

As shown in the figure, Ulysses Attention, Ring Attention, and CFG parallelism each reduce xDiT's inference latency. With two GPUs, CFG parallelism outperforms Ulysses Attention and Ring Attention thanks to its smaller communication volume. Combining sequence parallelism with CFG parallelism improves inference efficiency further, and latency continues to drop as parallelism increases. On 8 GPUs, xDiT achieves its best performance when mixing Ulysses-2, Ring-2, and CFG-2 (see the sketch after the figure): a 6.12x speedup over single-GPU inference, generating a video in under 10 minutes.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/latency-cogvideo1.5-5b-l40.png"
alt="latency-cogvideo1.5-5b-l40">
</div>
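
For reference, the best 8-GPU plan above corresponds to the following knobs (a sketch using the variables from examples/run_cogvideo.sh in this PR; the three degrees multiply out to the GPU count, 2 x 2 x 2 = 8):

```bash
# Ulysses-2 x Ring-2 sequence parallelism combined with CFG parallel (x2).
N_GPUS=8
PARALLEL_ARGS="--ulysses_degree 2 --ring_degree 2"
CFG_ARGS="--use_cfg_parallel"
```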
19 changes: 16 additions & 3 deletions docs/performance/cogvideo_zh.md
@@ -1,6 +1,8 @@
## CogVideo Performance
## CogVideoX Performance

CogVideo is a text-to-video model. xDiT currently integrates USP technology (including Ulysses Attention and Ring Attention) and CFG parallelism to improve inference speed, while work on PipeFusion is ongoing. We conducted an in-depth analysis of the performance differences between single-GPU CogVideoX inference based on the `diffusers` library and our parallelized version when generating a 49-frame (6-second) 720x480 video. Since different parallel methods can be combined arbitrarily to obtain different performance, we systematically tested xDiT's acceleration on 1-12 L40 (PCIe) GPUs.
CogVideoX and CogVideoX1.5 are models that generate video from text or images. xDiT currently integrates USP technology (including Ulysses Attention and Ring Attention) and CFG parallelism to improve inference speed, while work on PipeFusion is ongoing. We conducted an in-depth analysis of the performance differences between single-GPU CogVideoX inference based on the `diffusers` library and our parallelized version when generating a 49-frame (6-second) 720x480 video. Since different parallel methods can be combined arbitrarily to obtain different performance, we systematically tested xDiT's acceleration on 1-12 L40 (PCIe) GPUs.

### CogVideoX-2b/5b

As shown in the figures, for the base model CogVideoX-2b, significant reductions in inference latency were observed whether using Ulysses Attention, Ring Attention, or Classifier-Free Guidance (CFG) parallelism. Notably, thanks to its lower communication overhead, CFG parallelism outperforms the other two techniques. Combining sequence parallelism with CFG parallelism improves inference efficiency further, and latency continues to decrease as parallelism increases. In the optimal configuration, xDiT achieves a 4.29x speedup over single-GPU inference, reducing each iteration to just 0.49 seconds. Given CogVideoX's default 50 iterations, the diffusion loop takes 24.5 seconds in total, and end-to-end generation of the video completes in about 30 seconds.

@@ -27,4 +29,15 @@ CogVideo is a text-to-video model. xDiT currently integrates USP technology (
<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-a100-5b.png"
alt="latency-cogvideo-a100-5b">
</div>
</div>

### CogVideoX1.5-5B

Similarly, we used CogVideoX1.5-5B to generate a 161-frame video at 1360x768 resolution on a system equipped with L40 (PCIe) GPUs, and compared the inference latency of the single-GPU implementation in the diffusers library against xDiT's parallel version.
As shown in the figure, Ulysses Attention, Ring Attention, and CFG parallelism each reduce xDiT's inference latency. With two GPUs, CFG parallelism outperforms Ulysses Attention and Ring Attention thanks to its smaller communication volume. Combining sequence parallelism with CFG parallelism improves inference efficiency further, and latency continues to drop as parallelism increases. On 8 GPUs, xDiT achieves its best performance when mixing Ulysses-2, Ring-2, and CFG-2: a 6.12x speedup over single-GPU inference, generating a video in under 10 minutes.

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/cogvideo/latency-cogvideo1.5-5b-l40.png"
alt="latency-cogvideo1.5-5b-l40">
</div>

8 changes: 0 additions & 8 deletions examples/cogvideox_example.py
@@ -20,13 +20,6 @@ def main():
args = xFuserArgs.add_cli_args(parser).parse_args()
engine_args = xFuserArgs.from_cli_args(args)

# Check if ulysses_degree is valid
num_heads = 30
if engine_args.ulysses_degree > 0 and num_heads % engine_args.ulysses_degree != 0:
raise ValueError(
f"ulysses_degree ({engine_args.ulysses_degree}) must be a divisor of the number of heads ({num_heads})"
)

engine_config, input_config = engine_args.create_config()
local_rank = get_world_group().local_rank

@@ -75,7 +68,6 @@ def main():
f"pp{engine_args.pipefusion_parallel_degree}_patch{engine_args.num_pipeline_patch}"
)
if is_dp_last_group():
world_size = get_data_parallel_world_size()
resolution = f"{input_config.width}x{input_config.height}"
output_filename = f"results/cogvideox_{parallel_info}_{resolution}.mp4"
export_to_video(output, output_filename, fps=8)
14 changes: 7 additions & 7 deletions examples/run_cogvideo.sh
@@ -5,18 +5,18 @@ export PYTHONPATH=$PWD:$PYTHONPATH

# CogVideoX configuration
SCRIPT="cogvideox_example.py"
MODEL_ID="/cfs/dit/CogVideoX-2b"
INFERENCE_STEP=20
MODEL_ID="/cfs/dit/CogVideoX1.5-5B"
INFERENCE_STEP=50

mkdir -p ./results

# CogVideoX specific task args
TASK_ARGS="--height 480 --width 720 --num_frames 9"
TASK_ARGS="--height 768 --width 1360 --num_frames 17"
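# Note: the CogVideoX1.5 benchmark in docs/performance/cogvideo.md uses 161 frames
# at this 1360x768 resolution; 17 frames keeps this example run short.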

# CogVideoX parallel configuration
N_GPUS=6
PARALLEL_ARGS="--ulysses_degree 2 --ring_degree 3"
#CFG_ARGS="--use_cfg_parallel"
N_GPUS=8
PARALLEL_ARGS="--ulysses_degree 2 --ring_degree 2"
CFG_ARGS="--use_cfg_parallel"
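# ulysses_degree x ring_degree x 2 (CFG parallel) should equal N_GPUS: 2 x 2 x 2 = 8.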

# Uncomment and modify these as needed
# PIPEFUSION_ARGS="--num_pipeline_patch 8"
@@ -33,7 +33,7 @@ $PIPEFUSION_ARGS \
$OUTPUT_ARGS \
--num_inference_steps $INFERENCE_STEP \
--warmup_steps 0 \
--prompt "A small dog" \
--prompt "A little girl is riding a bicycle at high speed. Focused, detailed, realistic." \
$CFG_ARGS \
$PARALLLEL_VAE \
$ENABLE_TILING \
