Add Q8_0 quantization for intermediate results #951
Conversation
```diff
-        sum01 += y1->m*x1->d*(vaddvq_u8(v0_1l) + vaddvq_u8(v0_1h));
-        sum10 += x1->m*y1->d*(vaddvq_u8(v1_1l) + vaddvq_u8(v1_1h));
+        sum01 += y1->m*x1->d*((uint16_t)vaddvq_u8(v0_1l) + (uint16_t)vaddvq_u8(v0_1h));
+        sum10 += x1->m*y1->d*((uint16_t)vaddvq_u8(v1_1l) + (uint16_t)vaddvq_u8(v1_1h));
```
These casts are to avoid the edge case of 128 + 128 overflowing uint8_t.
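A minimal scalar sketch of the concern (assuming the intermediate sum would otherwise be truncated back to uint8_t; this is an illustration, not code from the PR):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Hypothetical per-half horizontal sums, each hitting the 128 edge case.
    uint8_t lo = 128, hi = 128;

    // If the result is stored back into a uint8_t, 128 + 128 wraps to 0.
    uint8_t narrow = (uint8_t)(lo + hi);

    // Widening each operand first preserves the full value 256.
    uint16_t wide = (uint16_t)lo + (uint16_t)hi;

    printf("narrow = %d, wide = %d\n", narrow, wide); // narrow = 0, wide = 256
    return 0;
}
```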
I'm still waiting on the final perplexity results for this approach on my machine, but if we confirm that this is the right way to go, I am wondering if we should just remove all SIMD implementations of […]. It's possible that the same holds true for […].
Add AVX support
@ggerganov The SIMD quantize functions could still be useful to allow using LoRA with quantized models, either by modifying the quantized model, in which case […]
Intel's guide tells me this is AVX-512? Anyway, I've already started doing AVX and AVX2 on a branch mentioned here: #909 (comment). I'll try to get a PR against this PR ready.
Rebased on latest master. Will run perplexity "without BLAS" once more to confirm we didn't break something with the rounding. The […]

Question: should we add […]?

Thinking that after we merge this, we should focus on implementing the […]
You are right, but I noticed that it works on MSVC, so I didn't change it.
Ok, so I've decided on the following plan:

[…]

If this idea is viable in terms of speed and accuracy, we can think about changing the […]. Will merge this in about an hour if no objections.
For an efficient AVX-512 implementation, would it be acceptable for […]? This might be potentially confusing, since the AVX-512 implementation will use a different format to store the intermediate 8-bit results (as opposed to […]).
@dfyz […]
Thanks! I was just thinking out loud, definitely wasn't proposing an AVX-512 implementation to integrate into this PR (#933 with the […]).

Continuing the "thinking out loud" part: on second thought, changing the format doesn't really buy us anything. Instead of loading the block data and the scales with two separate load instructions, we can just load as many bytes of a […]. Still, it is nice that changing the format is possible in theory. I will try implementing a […].
Other SIMD optimizations may benefit as well, because the quant blocks would be aligned to 32 bytes. On AVX2, you could use […]. Or how about this?

```c
typedef struct {
    float  d0;       // delta
    float  d1;       // delta
    int8_t qs0[QK];  // quants
    int8_t qs1[QK];  // quants
} block_q8_0;
```

Depending on your register width, you could choose to do just one at a time.
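A hedged sketch of how such a two-block layout could be consumed on AVX2 (assuming QK is 32, so each qs array fills exactly one 256-bit register; the helper below is illustrative and not part of this PR):

```c
#include <immintrin.h>
#include <stdint.h>

#define QK 32  // assumed block size

typedef struct {
    float  d0;       // delta for the first sub-block
    float  d1;       // delta for the second sub-block
    int8_t qs0[QK];  // quants for the first sub-block
    int8_t qs1[QK];  // quants for the second sub-block
} block_q8_0;

// Illustrative only: each 32-byte quant array maps to a single ymm load,
// and both scales sit next to each other at the front of the struct.
static inline void load_block_q8_0(const block_q8_0 *b, __m256i *q0, __m256i *q1) {
    *q0 = _mm256_loadu_si256((const __m256i *)b->qs0);
    *q1 = _mm256_loadu_si256((const __m256i *)b->qs1);
}
```

With a narrower register width, the same struct still allows processing one sub-block (d0/qs0) at a time, as suggested above.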
Great! If we change the layout of q8_0, I would look at also changing q4_0 to "all the nibbles followed by all the scales". Need to verify for other architectures, but I think this would make the SIMD implementations more straightforward in most cases, and shouldn't hurt anything. The AVX2 implementation of […]
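A purely illustrative sketch of the kind of "nibbles first, scales after" grouping being discussed (the struct name, the grouping factor of 8 blocks, and the field layout are assumptions for the example, not something this thread settled on):

```c
#include <stdint.h>

#define QK 32  // assumed block size

// Hypothetical reordered q4_0 "superblock": the packed 4-bit quants of
// several blocks come first, followed by their scales, so the scales can
// be loaded together as one contiguous f32 vector.
typedef struct {
    uint8_t qs[8][QK / 2];  // 8 blocks of packed nibbles (16 bytes each)
    float   d[8];           // the 8 corresponding deltas, contiguous
} block_q4_0_reordered;
```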
A few thoughts on the data formats:

[…]
This could make a non-negligible difference, but if you change the data ordering row-wise (as @unbounded suggested), you have to somehow ensure that the length of each row in bytes is also divisible by 32? E.g., two […]

I personally will try to avoid changing formats in the AVX-512 implementation unless absolutely necessary (for the reasons @ggerganov provided above).
I agree with the hypothesis that reordering the q4_0 format would be beneficial. On an AVX2 CPU, the current format (f32, 32xq4) makes it hard for any implementation to load the f32's as a vector. A scheme that would enable vector-based loading would provide more flexibility.

Caveat: I experimented with a "reordered" byte sequence based on a "superblock" of 8 q4_0 blocks about 2–3 weeks ago, and there was a slight speedup. It was, however, as far as I remember, not a major step change, especially since you need to divide the single-thread improvements by the number of threads you are going to use.
👍 I will probably do some experiments with continuous layouts - I understand that we would need to see a significant benefit to change the format. It will be hard to speed things up much if we're already close to memory bandwidth, of course.
ref #909
This is an implementation of mode (E) from the referenced issue.
Basically, we quantize the intermediate results to 8 bits instead of 4 bits, to gain accuracy without any performance degradation.
As a positive side-effect, we will also have full 8-bit quantization support, although I don't think it will be significantly better than the proposed 4-bit quantization with 8-bit intermediate results.
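For readers unfamiliar with the scheme, here is a minimal sketch of blockwise 8-bit (absmax) quantization of intermediate results, assuming a block size QK of 32; the struct layout and function name are illustrative and not necessarily the exact ones in this PR:

```c
#include <math.h>
#include <stdint.h>

#define QK 32  // assumed block size

typedef struct {
    float  d;       // scale (delta) for the block
    int8_t qs[QK];  // 8-bit quantized values
} block_q8_0;

// Quantize one block of QK floats with an absmax scale: the largest
// magnitude in the block maps to 127, everything else scales linearly.
static void quantize_block_q8_0(const float *x, block_q8_0 *y) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        const float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }

    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    y->d = d;
    for (int i = 0; i < QK; i++) {
        y->qs[i] = (int8_t)roundf(x[i] * id);
    }
}
```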
Currently:

[…]
PRs are welcome into this PR to implement the missing SIMD routines
Perplexity results

| Quantization | Machine / setup | Perplexity | Note |
| --- | --- | --- | --- |
| Q4_0 | M1 Pro (with BLAS) | [655] 6.2838 | reference |
| Q4_0 | M1 Pro (without BLAS) | [655] 6.2897 | impl 1 |
| Q4_0 | M1 Pro (without BLAS) | [655] 6.2895 | impl 2 |