
Add llama-cpp-python wheels with tensor cores support #5003

Merged: 2 commits merged into dev from tensorcores on Dec 19, 2023
Conversation

oobabooga (Owner) commented on Dec 19, 2023

The new --tensorcores option is recommended if you have a newer NVIDIA GPU, such as one from the RTX series.

Hopefully this will become a runtime flag in the future. For now, to support this feature, I had to use GitHub Actions to compile a new set of llama_cpp_cuda_tensorcore wheels built without the -DLLAMA_CUDA_FORCE_MMQ=ON flag.
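
Since the tensor-core build ships as a separate wheel rather than as a runtime switch inside llama.cpp itself, the webui has to decide at load time which package to import. Below is a minimal sketch of that selection logic; the module names are assumptions based on the wheel name above, and the real loader code may differ.

```python
# Hypothetical backend selection for the --tensorcores option.
# Module names are assumptions based on the wheel names in this PR.
import importlib

def get_llama_cpp_module(tensorcores: bool = False):
    """Return the llama-cpp-python build to use for inference."""
    name = "llama_cpp_cuda_tensorcore" if tensorcores else "llama_cpp_cuda"
    try:
        return importlib.import_module(name)
    except ImportError:
        # Fall back to the default CPU package if the CUDA wheel is absent.
        return importlib.import_module("llama_cpp")
```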

llama-2-13b.Q3_K_M.gguf

Fully offloaded to the GPU.

without --tensorcores:

```
llama_print_timings:        load time =     508.69 ms
llama_print_timings:      sample time =      91.02 ms /   512 runs   (    0.18 ms per token,  5625.08 tokens per second)
llama_print_timings: prompt eval time =    3545.43 ms /  3200 tokens (    1.11 ms per token,   902.57 tokens per second)
llama_print_timings:        eval time =   14075.88 ms /   511 runs   (   27.55 ms per token,    36.30 tokens per second)
llama_print_timings:       total time =   18381.46 ms
Output generated in 18.61 seconds (27.52 tokens/s, 512 tokens, context 3200, seed 107801409)
```

with --tensorcores:

```
llama_print_timings:        load time =     272.25 ms
llama_print_timings:      sample time =      88.52 ms /   512 runs   (    0.17 ms per token,  5783.87 tokens per second)
llama_print_timings: prompt eval time =    1869.05 ms /  3200 tokens (    0.58 ms per token,  1712.10 tokens per second)
llama_print_timings:        eval time =    9889.70 ms /   511 runs   (   19.35 ms per token,    51.67 tokens per second)
llama_print_timings:       total time =   12484.12 ms
Output generated in 12.70 seconds (40.30 tokens/s, 512 tokens, context 3200, seed 1845973748)
```

mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf

With 20/33 layers offloaded to the GPU.

without --tensorcores:

```
llama_print_timings:        load time =   14624.69 ms
llama_print_timings:      sample time =      92.02 ms /   512 runs   (    0.18 ms per token,  5563.83 tokens per second)
llama_print_timings: prompt eval time =   96262.41 ms /  3167 tokens (   30.40 ms per token,    32.90 tokens per second)
llama_print_timings:        eval time =   45573.66 ms /   511 runs   (   89.19 ms per token,    11.21 tokens per second)
llama_print_timings:       total time =  143372.46 ms
Output generated in 143.60 seconds (3.57 tokens/s, 512 tokens, context 3167, seed 870331982)
```

with --tensorcores:

```
llama_print_timings:        load time =   16319.61 ms
llama_print_timings:      sample time =      99.37 ms /   512 runs   (    0.19 ms per token,  5152.67 tokens per second)
llama_print_timings: prompt eval time =   95983.94 ms /  3167 tokens (   30.31 ms per token,    33.00 tokens per second)
llama_print_timings:        eval time =   44855.77 ms /   511 runs   (   87.78 ms per token,    11.39 tokens per second)
llama_print_timings:       total time =  142425.11 ms
Output generated in 142.65 seconds (3.59 tokens/s, 512 tokens, context 3167, seed 1493079911)
```
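
As a rough comparison, the speedups implied by these logs can be read straight off the tokens-per-second figures; the snippet below is purely illustrative arithmetic on the numbers already reported above, not a new measurement.

```python
# Speedup ratios (with --tensorcores / without), taken from the logs above.
runs = {
    "llama-2-13b, fully offloaded": {"prompt eval": (1712.10, 902.57), "eval": (51.67, 36.30)},
    "mixtral-8x7b, 20/33 layers":   {"prompt eval": (33.00, 32.90),    "eval": (11.39, 11.21)},
}

for model, phases in runs.items():
    for phase, (with_tc, without_tc) in phases.items():
        print(f"{model} | {phase}: {with_tc / without_tc:.2f}x")
```

In these runs, at least, tensor cores give roughly a 1.9x prompt-processing and 1.4x generation speedup when the model is fully offloaded, while the partially offloaded Mixtral case changes by under 2%.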

oobabooga merged commit de138b8 into dev on Dec 19, 2023
oobabooga deleted the tensorcores branch on December 20, 2023