Dev nodes nexfort booster #911

Merged — 25 commits merged into main from dev_nodes_nexfort_booster on Jun 12, 2024
Conversation

@ccssu (Contributor) commented May 25, 2024

Nexfort

cd ComfyUI

# Enable CUDA Graphs
export NEXFORT_FX_CUDAGRAPHS=1

# Best performance via Inductor max-autotune
export TORCHINDUCTOR_MAX_AUTOTUNE=1
# Enable cuDNN benchmark
export NEXFORT_FX_CONV_BENCHMARK=1
# Faster float32 matmul (TF32)
export NEXFORT_FX_MATMUL_ALLOW_TF32=1

# FX graph cache to speed up recompilation
export TORCHINDUCTOR_FX_GRAPH_CACHE=1

# Persistent cache dir
export TORCHINDUCTOR_CACHE_DIR=~/.torchinductor

# Debugging (uncomment as needed)
# export TORCH_LOGS="+dynamo"
# export TORCHDYNAMO_VERBOSE=1
# export NEXFORT_DEBUG=1 NEXFORT_FX_DUMP_GRAPH=1 TORCH_COMPILE_DEBUG=1

python main.py --gpu-only --disable-cuda-malloc --port 8188 --cuda-device 6
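For convenience, the exports and the launch command above can be collected into one script. This is only a sketch assembled from the flags listed in this PR description; the script name and structure are hypothetical, and the `TORCHINDUCTOR_CACHE_DIR` default-fallback is an added convenience, not part of the original instructions.

```shell
#!/usr/bin/env bash
# run_nexfort.sh (hypothetical) — bundles the env flags above and launches ComfyUI.
set -euo pipefail

export NEXFORT_FX_CUDAGRAPHS=1          # CUDA Graphs
export TORCHINDUCTOR_MAX_AUTOTUNE=1     # best performance
export NEXFORT_FX_CONV_BENCHMARK=1      # cuDNN benchmark
export NEXFORT_FX_MATMUL_ALLOW_TF32=1   # faster float32 matmul
export TORCHINDUCTOR_FX_GRAPH_CACHE=1   # graph cache to speed up compilation
# Keep an existing cache dir if already set, otherwise use the default above
export TORCHINDUCTOR_CACHE_DIR="${TORCHINDUCTOR_CACHE_DIR:-$HOME/.torchinductor}"

python main.py --gpu-only --disable-cuda-malloc --port 8188 --cuda-device 6
```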

How to use Nexfort

Case 1

# Compile arbitrary models (torch.nn.Module)
import torch
import onediff.infer_compiler as infer_compiler

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(100, 10)

    def forward(self, x):
        return torch.nn.functional.relu(self.lin(x))

mod = MyModule().to("cuda").half()
with torch.inference_mode():
    compiled_mod = infer_compiler.compile(mod,
        backend="nexfort",
        options={"mode": "max-autotune:cudagraphs", "dynamic": True, "fullgraph": True},
    )
    print(compiled_mod(torch.randn(10, 100, device="cuda").half()))

Case 2

import torch
import onediff.infer_compiler as infer_compiler

@infer_compiler.compile(
    backend="nexfort",
    options={"mode": "max-autotune:cudagraphs", "dynamic": True, "fullgraph": True},
)
def foo(x):
    return torch.sin(x) + torch.cos(x)

print(foo(torch.randn(10, 10, device="cuda").half()))
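The two cases call the same entry point: once as a plain function call on a module (Case 1) and once as a decorator factory (Case 2). The dual behavior can be sketched in plain Python with a stand-in `compile` that performs no real compilation (all names here are hypothetical, purely to illustrate the calling pattern):

```python
import functools

def compile(model=None, *, backend="nexfort", options=None):
    """Stand-in for infer_compiler.compile: usable as a direct call
    (Case 1) or as a decorator factory (Case 2). No real compilation
    happens; the callable is returned wrapped but unchanged."""
    def wrap(fn):
        @functools.wraps(fn)
        def compiled(*args, **kwargs):
            # A real backend would dispatch to an optimized graph here.
            return fn(*args, **kwargs)
        return compiled

    if model is None:           # used as @compile(backend=..., options=...)
        return wrap
    return wrap(model)          # used as compile(mod, backend=..., options=...)

# Case 1 style: direct call on an existing callable
double = compile(lambda x: 2 * x, backend="nexfort")
assert double(21) == 42

# Case 2 style: decorator
@compile(backend="nexfort", options={"mode": "max-autotune"})
def add_one(x):
    return x + 1

assert add_one(41) == 42
```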

VAE

ComfyUI Workflow

(workflow image: speedup_vae)

Result

{model: sdxl, batch_size: 1, image: 1024x1024, speedup: vae}

| Accelerator | Baseline (non-optimized) | OneDiff (Nexfort) | Percentage improvement |
| --- | --- | --- | --- |
| NVIDIA GeForce RTX 4090 | 3.02 s | 2.95 s | 2.31% |

First compilation time: 321.92 seconds


LoRA

ComfyUI Workflow

(workflow image: speedup_vae_unet)

Result

{model: sdxl, batch_size: 1, image: 1024x1024, speedup: vae + unet}

| Accelerator | Baseline (non-optimized) | OneDiff (Nexfort) | Percentage improvement |
| --- | --- | --- | --- |
| NVIDIA GeForce RTX 4090 | 3.02 s | 1.85 s | 38.07% |

First compilation time: 878.19 seconds

ControlNet

ComfyUI Workflow

(workflow image: cnet_speedup)

Result

{model: sdxl, batch_size: 1, image: 1024x1024, speedup: controlnet}

| Accelerator | Baseline (non-optimized) | OneDiff (Nexfort) | Percentage improvement |
| --- | --- | --- | --- |
| NVIDIA GeForce RTX 4090 | 4.93 s | 4.07 s | 17.44% |

First compilation time: 437.84 seconds
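The percentage-improvement columns in the tables above follow from the per-image timings as (baseline − optimized) / baseline. A quick sanity check against the RTX 4090 numbers (small discrepancies with the tables presumably come from the reported timings being rounded to two decimals):

```python
def improvement(baseline_s, optimized_s):
    """Percentage latency improvement: (baseline - optimized) / baseline * 100."""
    return (baseline_s - optimized_s) / baseline_s * 100

# VAE only: 3.02 s -> 2.95 s
print(round(improvement(3.02, 2.95), 2))   # ~2.32 (table reports 2.31%)

# VAE + UNet: 3.02 s -> 1.85 s
print(round(improvement(3.02, 1.85), 2))   # ~38.74 (table reports 38.07%)

# ControlNet: 4.93 s -> 4.07 s
print(round(improvement(4.93, 4.07), 2))   # 17.44, matching the table
```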

IPAdapter

@ccssu ccssu marked this pull request as draft May 25, 2024 13:00
@strint strint marked this pull request as ready for review June 12, 2024 14:28
@strint strint merged commit 323897c into main Jun 12, 2024
7 checks passed
@strint strint deleted the dev_nodes_nexfort_booster branch June 12, 2024 14:28