# Performance Comparison
| Method      | Bits | TGS   | VRAM | Speed |
| ----------- | ---- | ----- | ---- | ----- |
| HF          | 16   | 2,392 | 18GB | 100%  |
| HF+FA2      | 16   | 2,954 | 17GB | 123%  |
| Unsloth+FA2 | 16   | 4,007 | 16GB | 168%  |
| HF          | 4    | 2,415 | 9GB  | 101%  |
| Unsloth+FA2 | 4    | 3,726 | 7GB  | 160%  |

| Method      | Bits | TGS   | VRAM | Speed |
| ----------- | ---- | ----- | ---- | ----- |
| HF          | 16   | 2,155 | 29GB | 100%  |
| HF+FA2      | 16   | 2,556 | 28GB | 119%  |
| Unsloth+FA2 | 16   | 3,400 | 27GB | 158%  |
- TGS: tokens per GPU per second (see the sketch after this list)
- Model: LLaMA2-7B
- Batch size: 4
- Gradient accumulation: 2
- LoRA rank: 8
- LoRA modules: all
- Max length: 1024
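
To make the TGS metric concrete, here is a minimal sketch of how tokens per GPU per second follows from the settings above. The helper function, its name, and the step count and wall-clock time plugged in at the bottom are illustrative assumptions, not part of LLaMA-Factory; the sketch only restates the arithmetic: tokens per optimizer step = batch size × gradient accumulation × max length, divided by elapsed time and GPU count.

```python
def tokens_per_gpu_per_second(batch_size: int, grad_accum: int, max_length: int,
                              num_gpus: int, num_steps: int, elapsed_seconds: float) -> float:
    """Hypothetical helper: TGS = total trained tokens / (GPUs * seconds).

    Assumes every sample is padded or packed to `max_length`, which is how
    throughput is typically reported in benchmarks like the ones above.
    """
    tokens_per_step = batch_size * grad_accum * max_length  # tokens per optimizer step
    total_tokens = tokens_per_step * num_steps
    return total_tokens / (num_gpus * elapsed_seconds)

# Settings from the benchmark above: batch size 4, accumulation 2, max length 1024.
# The step count and elapsed time are made-up numbers chosen for illustration.
print(tokens_per_gpu_per_second(batch_size=4, grad_accum=2, max_length=1024,
                                num_gpus=1, num_steps=100, elapsed_seconds=342.0))
# -> roughly 2,395 tokens/GPU/s, in the same ballpark as the 16-bit HF row.
```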
| VRAM / Max length | 1,024 | 2,048 | 4,096 | 8,192 | 16,384 | 32,768 | 65,536 | 100,000 |
| ----------------- | ----- | ----- | ----- | ----- | ------ | ------ | ------ | ------- |
| FlashAttention2   | 6GB   | 7GB   | 9GB   | 12GB  | 19GB   | 32GB   | OOM    | OOM     |
| Unsloth           | 5GB   | 6GB   | 7GB   | 8GB   | 10GB   | 16GB   | 25GB   | 37GB    |

| TGS / Max length | 1,024 | 2,048 | 4,096 | 8,192 | 16,384 | 32,768 | 65,536 | 100,000 |
| ---------------- | ----- | ----- | ----- | ----- | ------ | ------ | ------ | ------- |
| FlashAttention2  | 2,295 | 2,741 | 2,926 | 3,128 | 3,542  | 2,216  | OOM    | OOM     |
| Unsloth          | 2,556 | 3,178 | 3,413 | 3,632 | 4,050  | 2,456  | 1,820  | 1,202   |
| Improvement      | 111%  | 116%  | 117%  | 116%  | 114%   | 111%   | –      | –       |
- TGS: tokens per GPU per second
- GPU: NVIDIA A100 40GB * 1
- Model: LLaMA2-7B
- Batch size: 1
- Gradient accumulation: 4
- LoRA rank: 8
- LoRA modules: all
- Quantization bit: 4
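
The Improvement row above is simply the ratio of Unsloth TGS to FlashAttention2 TGS at each max length, rounded to the nearest percent (no ratio exists for the last two lengths, where FlashAttention2 runs out of memory). A quick check, with variable names of my own choosing:

```python
# Reproduce the Improvement row: Unsloth TGS divided by FlashAttention2 TGS.
fa2_tgs     = [2295, 2741, 2926, 3128, 3542, 2216]  # FA2 OOMs past 32,768
unsloth_tgs = [2556, 3178, 3413, 3632, 4050, 2456]  # first six lengths only

for fa2, uns in zip(fa2_tgs, unsloth_tgs):
    print(f"{round(100 * uns / fa2)}%")
# -> 111% 116% 117% 116% 114% 111%, matching the table above.
```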