[perf] improve next token latency when (#threads >= 2 * #heads) by sharding the head into multiple splits #70
Command: XFT_FAKE_MODEL=1 OMP_NUM_THREADS=32 mpirun -n 1 numactl -N 0 -m 0 ./build/example -m examples/model_config/llama-2-7b/ -t examples/model_config/llama-2-7b/tokenizer.model -d fp16 -l 1024 --output_len 10 --loop 10 : -n 1 numactl -N 1 -m 1 ./build/example -m examples/model_config/llama-2-7b/ -t examples/model_config/llama-2-7b/tokenizer.model -d fp16 -l 1024 --output_len 10 --loop 10
Server: a development machine (performance numbers may not be representative)
Perf: next token latency improved from ~40 ms to ~39 ms