Q2 and Q3 quantization #1004
Conversation
The build failures on macOS show that I messed up using AVX2 intrinsics in the AVX part, so this probably won't work on an AVX-only machine without modification. Edit: removed the AVX parts; I had tested them with the wrong compiler flags.
I'm playing with the smallest GPT-2 models and trying to make them work with 4-bit quantization. They keep breaking down completely even after #951. I guess the small number of parameters requires very high precision in the quantization. However, I just noticed that if I keep just the last tensor in F16, they suddenly become coherent. Maybe keeping the last tensor in high precision could further improve your Q2 and Q3 as well.
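As a minimal sketch of that idea, the filter in a quantization loop could look roughly like this; the helper and the tensor name "output.weight" are hypothetical and only illustrate the suggestion, they are not code from this PR or from any converter:

#include <stdbool.h>
#include <string.h>

// Hypothetical predicate: decide whether a tensor gets quantized or stays in F16.
// The "output.weight" name is an assumption for this sketch; the real converter
// may name the final tensor differently.
static bool should_quantize(const char * name, int n_dims) {
    // only quantize 2D weight matrices
    if (n_dims != 2) {
        return false;
    }
    // keep the last (output projection) tensor in high precision
    if (strcmp(name, "output.weight") == 0) {
        return false;
    }
    return true;
}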
Probably nothing new, but here is a quick test of the perplexity performance with Q2_0:
Here is an AVX2 implementation of ggml_vec_dot_q2_0_q8_0:
static inline __m256i bytesFromi2(uint32_t packed1, uint32_t packed2) {
    __m128i bx1 = _mm_set1_epi32(packed1);
    __m128i bx2 = _mm_set1_epi32(packed2);
    __m256i bx = _mm256_set_m128i(bx1, bx2);

    // shift counts to get all bit pairs in lowest position of each byte
    const __m256i shift256 = _mm256_set_epi32(6, 4, 2, 0,
                                              6, 4, 2, 0);
    bx = _mm256_srlv_epi32(bx, shift256);

    const __m256i shufmask = _mm256_set_epi8(15, 11, 7, 3,
                                             14, 10, 6, 2,
                                             13,  9, 5, 1,
                                             12,  8, 4, 0,
                                             15, 11, 7, 3,
                                             14, 10, 6, 2,
                                             13,  9, 5, 1,
                                             12,  8, 4, 0);
    bx = _mm256_shuffle_epi8(bx, shufmask);

    const __m256i mask = _mm256_set1_epi8(3);
    bx = _mm256_and_si256(mask, bx);
    return bx;
}
static void ggml_vec_dot_q2_0_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
    const int nb = n / QK2_0;

    assert(n % QK2_0 == 0);
    assert(nb % 2 == 0);

    const block_q2_0 * restrict x = vx;
    const block_q8_0 * restrict y = vy;

    float sumf = 0.0f;

#if defined(__AVX2__)
    // Initialize accumulator with zeros
    __m256 acc = _mm256_setzero_ps();

    for (int i = 0; i < nb; i += 2) {
        __m256i bx = bytesFromi2(x[i+1].qs, x[i].qs);

        // Compute combined scale for the block
        const __m128 scale1 = _mm_set1_ps(GGML_FP16_TO_FP32(x[i].d)   * y[i/2].d);
        const __m128 scale2 = _mm_set1_ps(GGML_FP16_TO_FP32(x[i+1].d) * y[i/2].d);
        const __m256 scale = _mm256_set_m128(scale2, scale1);

        const __m256i off = _mm256_set1_epi8(2);
        bx = _mm256_sub_epi8(bx, off);

        // Load y vector
        const __m256i by = _mm256_loadu_si256((const __m256i *)y[i/2].qs);

        // Get absolute values of x vectors
        const __m256i ax = _mm256_sign_epi8(bx, bx);
        // Sign the values of the y vectors
        const __m256i sy = _mm256_sign_epi8(by, bx);
        // Perform multiplication and create 16-bit values
        const __m256i dot = _mm256_maddubs_epi16(ax, sy);

        // Convert int16_t to int32_t by adding pairwise
        const __m256i ones = _mm256_set1_epi16(1);
        __m256i i32 = _mm256_madd_epi16(ones, dot);

        // Convert int32_t to float
        __m256 p = _mm256_cvtepi32_ps(i32);
        // Apply the scale, and accumulate
        acc = _mm256_fmadd_ps(scale, p, acc);
    }

    // Return horizontal sum of the acc vector
    __m128 res = _mm256_extractf128_ps(acc, 1);
    res = _mm_add_ps(res, _mm256_castps256_ps128(acc));
    res = _mm_add_ps(res, _mm_movehl_ps(res, res));
    res = _mm_add_ss(res, _mm_movehdup_ps(res));
    sumf = _mm_cvtss_f32(res);
#else
    ...
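For comparison, a plain scalar version of the same dot product might look like this. It is only a sketch: it assumes the block layout used in this PR (block_q2_0 with an FP16 scale d and 16 two-bit quants packed into a uint32_t qs, block_q8_0 with a float scale and 32 int8 quants, QK2_0 = 16, QK8_0 = 32) and the GGML_FP16_TO_FP32 helper from ggml.c.

// Scalar sketch of the Q2_0 x Q8_0 dot product (two Q2_0 blocks share one
// Q8_0 block). Relies on the block_q2_0/block_q8_0 definitions and the
// GGML_FP16_TO_FP32 macro from this PR's ggml.c; not part of the PR itself.
static void ggml_vec_dot_q2_0_q8_0_scalar(const int n, float * restrict s,
                                          const void * restrict vx, const void * restrict vy) {
    assert(n % QK2_0 == 0);
    const int nb = n / QK2_0;

    const block_q2_0 * restrict x = vx;
    const block_q8_0 * restrict y = vy;

    float sumf = 0.0f;
    for (int i = 0; i < nb; i++) {
        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i/2].d;
        const uint32_t qs = x[i].qs;

        int sumi = 0;
        for (int j = 0; j < QK2_0; j++) {
            const int q = (int)((qs >> (2*j)) & 3) - 2;   // quant j, offset by 2 -> [-2, 1]
            sumi += q * y[i/2].qs[(i % 2)*QK2_0 + j];     // matching half of the Q8 block
        }
        sumf += d * (float) sumi;
    }
    *s = sumf;
}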
Thanks @slaren, I just added this. Apparently 2 bits are called a crumb, so I went with that. I originally wrote it for 128 bits because I thought it would be AVX-compatible, but it turns out just avoiding ...
That does help, but with a block size of 6 bytes for 16 weights (i.e. 3 bits per weight), Q2 is already 3-bit in a sense, and this again makes the file bigger. Q3 might be changed to use QK=24, but that doesn't really match the Q8 block size.
Don't know who would want to use the smallest file, except for entertainment. Sometimes it actually talks about building websites, but sometimes it goes off the rails:
Changing the sampling parameters (temperature etc.) may help, but I haven't played with those.
In case this is useful for future comparisons, currently 7B q2_0 AVX2 has a perplexity of 12.6438.
I wrote about this at #456 (comment), but I think there's a pretty obvious win by using a different q3_0 representation. Downside being, it's not the GPTQ format. Code and pull request at sw#1
Thanks to @pubby, the Q3 code is now faster on AVX2 and should be more amenable to other SIMD optimizations. You'll have to re-quantize the model, though.
Here's an attempt at porting to ARM NEON:
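(As a rough illustration of that direction, a 2-bit unpack on AArch64 might look like the following; this sketch, including its byte ordering, is an assumption and not the snippet from this comment.)

#include <arm_neon.h>
#include <stdint.h>

// Sketch: unpack 16 x 2-bit quants from a uint32_t into signed bytes on
// AArch64, with the same offset of 2 as the AVX2 code above (values in [-2, 1]).
static inline int8x16_t bytes_from_crumbs_neon(uint32_t packed) {
    // broadcast the packed word to all four 32-bit lanes
    const uint32x4_t p = vdupq_n_u32(packed);
    // per-lane right shift by 0, 2, 4, 6 (vshlq shifts right for negative counts)
    const int32x4_t shifts = { 0, -2, -4, -6 };
    const uint8x16_t sh = vreinterpretq_u8_u32(vshlq_u32(p, shifts));
    // gather the bytes so that output byte j holds quant j in its low bits
    const uint8x16_t idx = { 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15 };
    const uint8x16_t gathered = vqtbl1q_u8(sh, idx);
    // keep the low 2 bits and subtract the offset
    const int8x16_t q = vreinterpretq_s8_u8(vandq_u8(gathered, vdupq_n_u8(3)));
    return vsubq_s8(q, vdupq_n_s8(2));
}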
WASM:
At this point I'm wondering if we should target a specific model size. Is there any environment (WASM, for example) where the 4 GB 7B Q4_0 is too large? Q2 probably shouldn't be merged as it's not really usable.
Final perplexity for LLaMA 30B:
Rebased onto master, but I kept the tensor/ftype numbering, because @TheBloke has published Alpaca/LoRA model files for Q2. These should still work now, but I haven't tested that. On the other hand, Q4_2 and Q4_3 will not work on this branch. If and when this gets merged, you will have to re-quantize your Q2/Q3 models. As for perplexity, thanks to everyone providing numbers; my machine is too slow for that... But it looks like Q2 isn't really worth it unless you have some extreme file/RAM size restrictions:
Regarding WASM, there is indeed a 32-bit memory model for now, so sizeof(size_t) == 4 and large models cannot be allocated. In practice, on my trial WASM platform (iOS a-Shell, where the whole system is WASM), malloc() calls for a little over 1 GiB start returning 0 (and mmap is just a wrapper around malloc() and pread() here, so it doesn't help). 2-bit quantization of LLaMA 7B wouldn't be sufficient compression for the particular WASM runtime I've been trying without some additional structured pruning and/or ahead-of-time model compilation. In any case, data larger than 4 GB can't be referenced in memory all at once (or "mmap"'d from a file), because the pointers are 32-bit for now.
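A minimal sketch of that constraint, assuming a wasm32 target where size_t is 32-bit; the size used here is just an example, not a measurement from that platform:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // example: roughly the size of a 7B Q4_0 model file
    const uint64_t model_size = 4ull * 1024 * 1024 * 1024;

    // on wasm32, SIZE_MAX is 2^32 - 1, so a >4 GiB model cannot even be addressed
    if (model_size > (uint64_t) SIZE_MAX) {
        fprintf(stderr, "model does not fit in a 32-bit address space\n");
        return 1;
    }

    // even when it fits in the address space, the runtime may cap single
    // allocations much lower (around 1 GiB on the platform described above)
    void * buf = malloc((size_t) model_size);
    if (buf == NULL) {
        fprintf(stderr, "allocation refused by the runtime\n");
        return 1;
    }
    free(buf);
    return 0;
}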
Obsolete thanks to #1684
This adds support for 2-bit and 3-bit quantization with an FP16 shared scale and 16 quants per block.
I don't consider it ready to merge, as we might come up with a different block format. The struct definitions are not portable: they use a #pragma for one, and are unlikely to work on big-endian systems.

#951 really is a game-changer for 2-bit quantization. I have updated my code to use the Q8 intermediate quantization.
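For concreteness, the block layouts could look roughly like this; the field names, the use of #pragma pack, and the exact Q3_0 packing are assumptions based on the description and may differ from the code in this branch:

#include <stdint.h>

#define QK2_0 16
#define QK3_0 16

typedef uint16_t ggml_fp16_t;   // FP16 storage type, as in ggml.h

#pragma pack(push, 1)           // the non-portable packing mentioned above
typedef struct {
    ggml_fp16_t d;              // shared FP16 scale
    uint32_t    qs;             // 16 x 2-bit quants
} block_q2_0;                   // 6 bytes per 16 weights, i.e. 3 bits/weight on disk

typedef struct {
    ggml_fp16_t d;              // shared FP16 scale
    uint16_t    qs[3];          // 16 x 3-bit quants packed into 48 bits
} block_q3_0;                   // 8 bytes per 16 weights, i.e. 4 bits/weight on disk
#pragma pack(pop)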
Q2 sample output (we may need to implement a profanity filter, but swearing is appropriate if you have to use PHP):
Q3 is more sensible, but I haven't played with it a lot.
As mentioned, both new types use FP16 and QK=16. This was easiest to implement for two reasons: I can just use half of a Q8 block, and 16 3-bit values can be mangled in an AVX2 register.
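Purely to illustrate how 16 three-bit values fit in 48 bits, here is a scalar unpack; the bit order (quant j at bits 3j..3j+2) and the offset of 4 are assumptions for this sketch, not necessarily what the SIMD code does:

#include <stdint.h>

// Unpack 16 x 3-bit quants from a 48-bit field into signed bytes in [-4, 3].
// The packing order and offset are assumed for illustration only.
static void unpack_q3_0(const uint16_t qs[3], int8_t out[16]) {
    const uint64_t bits = (uint64_t) qs[0]
                        | ((uint64_t) qs[1] << 16)
                        | ((uint64_t) qs[2] << 32);
    for (int j = 0; j < 16; j++) {
        const int q = (int)((bits >> (3*j)) & 7) - 4;
        out[j] = (int8_t) q;
    }
}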
So I've essentially invented my own new formats, but I'm aware there's a 3-bit GPTQ format. I'd love to re-use what's already been done, but haven't been able to find a clear definition (or model files) of that.
Looking forward to anyone finding better SIMD optimizations, especially for Q3, which is a pain in the butt...
Model file sizes:
Perplexity for 7B: (not going to let this run for over a day, can someone with a faster machine help out?)