Add llama-cpp-python wheels with tensor cores support #5003

oobabooga · 2023-12-19T20:14:10Z

Using this option is recommended if you have a newer NVIDIA GPU, such as an RTX-series card.

Hopefully this will become a runtime flag in the future. For now, I had to compile a separate set of llama_cpp_cuda_tensorcore wheels with GitHub Actions, built without the -DLLAMA_CUDA_FORCE_MMQ=ON flag, to support this feature.

llama-2-13b.Q3_K_M.gguf

Fully offloaded to the GPU.

without --tensorcores:

llama_print_timings:        load time =     508.69 ms
llama_print_timings:      sample time =      91.02 ms /   512 runs   (    0.18 ms per token,  5625.08 tokens per second)
llama_print_timings: prompt eval time =    3545.43 ms /  3200 tokens (    1.11 ms per token,   902.57 tokens per second)
llama_print_timings:        eval time =   14075.88 ms /   511 runs   (   27.55 ms per token,    36.30 tokens per second)
llama_print_timings:       total time =   18381.46 ms
Output generated in 18.61 seconds (27.52 tokens/s, 512 tokens, context 3200, seed 107801409)

with --tensorcores:

llama_print_timings:        load time =     272.25 ms
llama_print_timings:      sample time =      88.52 ms /   512 runs   (    0.17 ms per token,  5783.87 tokens per second)
llama_print_timings: prompt eval time =    1869.05 ms /  3200 tokens (    0.58 ms per token,  1712.10 tokens per second)
llama_print_timings:        eval time =    9889.70 ms /   511 runs   (   19.35 ms per token,    51.67 tokens per second)
llama_print_timings:       total time =   12484.12 ms
Output generated in 12.70 seconds (40.30 tokens/s, 512 tokens, context 3200, seed 1845973748)
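
To make the comparison above easier to read, here is a minimal sketch that parses the `llama_print_timings` lines from both llama-2-13b runs and computes the speedup per stage. The regular expression and the helper names (`parse_timings`, `speedup`) are illustrative assumptions written for this log format. They are not part of llama.cpp or the web UI.

```python
import re

# Timing lines copied from the llama-2-13b runs above.
WITHOUT_TC = """\
llama_print_timings: prompt eval time =    3545.43 ms /  3200 tokens (    1.11 ms per token,   902.57 tokens per second)
llama_print_timings:        eval time =   14075.88 ms /   511 runs   (   27.55 ms per token,    36.30 tokens per second)
"""
WITH_TC = """\
llama_print_timings: prompt eval time =    1869.05 ms /  3200 tokens (    0.58 ms per token,  1712.10 tokens per second)
llama_print_timings:        eval time =    9889.70 ms /   511 runs   (   19.35 ms per token,    51.67 tokens per second)
"""

# Hypothetical pattern for this log layout: stage name, elapsed ms, token/run count.
PATTERN = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms /"
    r"\s+(?P<n>\d+) (?:tokens|runs)"
)


def parse_timings(log: str) -> dict[str, float]:
    """Return {stage: tokens per second}, recomputed from elapsed ms and count."""
    rates = {}
    for match in PATTERN.finditer(log):
        seconds = float(match.group("ms")) / 1000.0
        rates[match.group("name")] = int(match.group("n")) / seconds
    return rates


def speedup(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Ratio of throughput after/before for every stage present in `before`."""
    return {stage: after[stage] / before[stage] for stage in before}


if __name__ == "__main__":
    ratios = speedup(parse_timings(WITHOUT_TC), parse_timings(WITH_TC))
    for stage, ratio in ratios.items():
        # Roughly 1.90x for prompt eval and 1.42x for eval on these logs.
        print(f"{stage}: {ratio:.2f}x")
```

On these numbers, prompt processing benefits more than token generation, though both improve when the model is fully offloaded.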

mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf

With 20/33 layers offloaded to the GPU.

without --tensorcores:

llama_print_timings:        load time =   14624.69 ms
llama_print_timings:      sample time =      92.02 ms /   512 runs   (    0.18 ms per token,  5563.83 tokens per second)
llama_print_timings: prompt eval time =   96262.41 ms /  3167 tokens (   30.40 ms per token,    32.90 tokens per second)
llama_print_timings:        eval time =   45573.66 ms /   511 runs   (   89.19 ms per token,    11.21 tokens per second)
llama_print_timings:       total time =  143372.46 ms
Output generated in 143.60 seconds (3.57 tokens/s, 512 tokens, context 3167, seed 870331982)

with --tensorcores:

llama_print_timings:        load time =   16319.61 ms
llama_print_timings:      sample time =      99.37 ms /   512 runs   (    0.19 ms per token,  5152.67 tokens per second)
llama_print_timings: prompt eval time =   95983.94 ms /  3167 tokens (   30.31 ms per token,    33.00 tokens per second)
llama_print_timings:        eval time =   44855.77 ms /   511 runs   (   87.78 ms per token,    11.39 tokens per second)
llama_print_timings:       total time =  142425.11 ms
Output generated in 142.65 seconds (3.59 tokens/s, 512 tokens, context 3167, seed 1493079911)
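
The "Output generated" line reports a much lower rate than the `eval time` line because it appears to divide the number of new tokens by the total wall-clock time, prompt processing included. The sketch below checks that reading against the two Mixtral runs. The helper name `end_to_end_rate` is an assumption for illustration, and the figures are copied from the logs above.

```python
def end_to_end_rate(new_tokens: int, seconds: float) -> float:
    """New tokens divided by total wall-clock time, prompt processing included."""
    return new_tokens / seconds


# (new tokens, total seconds, rate reported in the log), Mixtral with 20/33 layers offloaded.
RUNS = {
    "without --tensorcores": (512, 143.60, 3.57),
    "with --tensorcores": (512, 142.65, 3.59),
}

if __name__ == "__main__":
    for label, (tokens, seconds, reported) in RUNS.items():
        rate = end_to_end_rate(tokens, seconds)
        print(f"{label}: {rate:.2f} tokens/s (log reports {reported})")

    # Total-time ratio between the two runs: under 1% difference.
    print(f"speedup: {143.60 / 142.65:.3f}x")
```

With roughly 96 of the 143 seconds spent on prompt evaluation in both runs, the partially offloaded Mixtral setup shows essentially no change from the tensor cores build in this test.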

oobabooga added 2 commits December 19, 2023 12:13

Add llama-cpp-python wheels with tensor cores support

477ceed

Update README

0d9bd0b

oobabooga merged commit de138b8 into dev Dec 19, 2023

oobabooga deleted the tensorcores branch December 20, 2023 17:50
