[perf] improve next token latency when (#threads >= 2 * #heads) by sharding the head into multiple splits #70
Command: XFT_FAKE_MODEL=1 OMP_NUM_THREADS=32 mpirun -n 1 numactl -N 0 -m 0 ./build/example -m examples/model_config/llama-2-7b/ -t examples/model_config/llama-2-7b/tokenizer.model -d fp16 -l 1024 --output_len 10 --loop 10 : -n 1 numactl -N 1 -m 1 ./build/example -m examples/model_config/llama-2-7b/ -t examples/model_config/llama-2-7b/tokenizer.model -d fp16 -l 1024 --output_len 10 --loop 10
Server: a development machine (performance numbers may not be representative)
Perf: next token latency improved from ~40 ms to ~39 ms