
# Performance Comparison


## Short-sequence training

### NVIDIA A100 * 1

| Method      | Bits | TGS   | VRAM | Speed |
|-------------|------|-------|------|-------|
| HF          | 16   | 2,392 | 18GB | 100%  |
| HF+FA2      | 16   | 2,954 | 17GB | 123%  |
| Unsloth+FA2 | 16   | 4,007 | 16GB | 168%  |
| HF          | 4    | 2,415 | 9GB  | 101%  |
| Unsloth+FA2 | 4    | 3,726 | 7GB  | 160%  |

### NVIDIA A100 * 2

| Method      | Bits | TGS   | VRAM | Speed |
|-------------|------|-------|------|-------|
| HF          | 16   | 2,155 | 29GB | 100%  |
| HF+FA2      | 16   | 2,556 | 28GB | 119%  |
| Unsloth+FA2 | 16   | 3,400 | 27GB | 158%  |

- TGS: tokens per GPU per second
- Speed: relative to the HF 16-bit baseline (100%) in the same table
- Model: LLaMA2-7B
- Batch size: 4
- Gradient accumulation: 2
- LoRA rank: 8
- LoRA modules: all
- Max length: 1024
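
For reference, the setup above can be approximated with Hugging Face transformers + peft. This is a minimal sketch, not the exact benchmark script: values the notes do not specify (`lora_alpha`, learning rate, dataset) are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

# 16-bit rows: bf16 weights; the "+FA2" rows enable FlashAttention-2.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# LoRA rank 8 on all linear projections of LLaMA-2 ("LoRA modules: all").
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,  # assumption: not stated in the benchmark notes
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)

# Batch size 4 with gradient accumulation 2 -> 8 sequences per optimizer
# step; at max length 1,024 that is up to 8 * 1,024 = 8,192 tokens per
# step, which is what the TGS figures above are measured against.
args = TrainingArguments(
    output_dir="benchmark-out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    bf16=True,
)
```

The Unsloth rows additionally route model loading and patching through the unsloth library rather than plain transformers; the sketch covers only the HF/HF+FA2 baselines.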

## Long-sequence training

| VRAM            | 1,024 | 2,048 | 4,096 | 8,192 | 16,384 | 32,768 | 65,536 | 100,000 |
|-----------------|-------|-------|-------|-------|--------|--------|--------|---------|
| FlashAttention2 | 6GB   | 7GB   | 9GB   | 12GB  | 19GB   | 32GB   | OOM    | OOM     |
| Unsloth         | 5GB   | 6GB   | 7GB   | 8GB   | 10GB   | 16GB   | 25GB   | 37GB    |

| TGS             | 1,024 | 2,048 | 4,096 | 8,192 | 16,384 | 32,768 | 65,536 | 100,000 |
|-----------------|-------|-------|-------|-------|--------|--------|--------|---------|
| FlashAttention2 | 2,295 | 2,741 | 2,926 | 3,128 | 3,542  | 2,216  | OOM    | OOM     |
| Unsloth         | 2,556 | 3,178 | 3,413 | 3,632 | 4,050  | 2,456  | 1,820  | 1,202   |
| Improvement     | 111%  | 116%  | 117%  | 116%  | 114%   | 111%   | –      | –       |

- TGS: tokens per GPU per second
- Improvement: Unsloth TGS relative to FlashAttention2 TGS at the same sequence length
- GPU: NVIDIA A100 40GB * 1
- Model: LLaMA2-7B
- Batch size: 1
- Gradient accumulation: 4
- LoRA rank: 8
- LoRA modules: all
- Quantization bit: 4
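
The long-sequence runs train in 4-bit. A minimal sketch of a comparable setup follows, assuming bitsandbytes NF4 (QLoRA-style) quantization, which the notes do not pin down; it covers only the FlashAttention2 baseline rows.

```python
import torch
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model

# "Quantization bit: 4" -- assumed here to be bitsandbytes NF4 with
# bf16 compute, the usual QLoRA configuration.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
)

# LoRA rank 8 on all linear projections ("LoRA modules: all").
model = get_peft_model(model, LoraConfig(
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
))

# Batch size 1 with gradient accumulation 4; the sequence length is the
# variable swept in the tables above (1,024 up to 100,000 tokens).
args = TrainingArguments(
    output_dir="benchmark-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
)
```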