Support BitNet b1.58 ternary models #5761

igorbarshteyn · 2024-02-28T09:41:38Z

New paper just dropped on Arxiv describing a way to train models in 1.58 bits (with ternary values: 1,0,-1). Paper shows performance increases from equivalently-sized fp16 models, and perplexity nearly equal to fp16 models. Authors state that their test model is built on LLaMA architecture and can be easily adapted to llama.cpp.

[Edited to add: Further reading into it by fellow Redditors shows that we can't use this to quantize existing models trained to fp16. They'd have to be trained in this ternary mode from the start. But I think it would still be something that we should implement, because models of that flavor will be coming soon.]

This is all over Reddit /LocalLLaMA right now:

https://www.reddit.com/r/LocalLLaMA/comments/1b21bbx/this_is_pretty_revolutionary_for_the_local_llm/

I think, if my napkin math is right, it would let us run something like 120B models in 24 GB VRAM, or 30B in... 8 GB?

Please implement @ggerganov and friends!

https://arxiv.org/abs/2402.17764

Dampfinchen · 2024-02-28T11:02:50Z

Wow, that is indeed very promising. And quite different from the current quant approaches too. Seems like instead of quantizing models post-training it quants them during training. I am sure though if this approach proves to be sucessful, model trainers like Jon Durbin, Teknium and Eric Hartford will jump in quickly.

Aside from obivious benefits during inference, In theory that could also allow much higher quality Lora training at less memory costs? You could theoretically train on GGUF models but that is generally not recommended as quality suffers too much from it compared to a fp16 model, so it seems this approach would help in that regard as well.

@ikawrakow What do you think about this paper?

ikawrakow · 2024-02-28T12:15:29Z

Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

Having said that, I stopped reading this particular paper at Table 1. In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity. They say that a 70B model will be 4.1X smaller than fp16, so the dream of running 120B models on 24GB GPU's is not quite there yet. Forgive me if I'm missing something, but I have become allergic to LLM revolutions and new eras proclaimed every other day on arxiv (or HF), so find it very hard to make myself read these revolutionary papers more carefully.

igorbarshteyn · 2024-02-28T12:44:13Z

Let's wait till they post the actual code up... then maybe it will be more clear :)

ggerganov · 2024-02-28T12:50:15Z

Please implement @ggerganov and friends!

Is there something to implement on the inference side? Seems like it's just the training method that is different. The produced model (be it 1-bit, 2-bit or N-bit) should be possible to infer as usual, correct?

But I share @ikawrakow's sentiment - let's wait and see first

igorbarshteyn · 2024-02-28T12:51:41Z

:) Yes. Let's wait till the authors' code is up. Really hoping this is going to be the way of the future :)

Dampfinchen · 2024-02-28T13:06:14Z

Having said that, I stopped reading this particular paper at Table 1. In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity. They say that a 70B model will be 4.1X smaller than fp16, so the dream of running 120B models on 24GB GPU's is not quite there yet. Forgive me if I'm missing something, but I have become allergic to LLM revolutions and new eras proclaimed every other day on arxiv (or HF), so find it very hard to make myself read these revolutionary papers more carefully.

As I've understand it, the figures in that table are not meant to represent the model size, but the actual GPU memory usage during inference. So those 2.22 GB include the KV-Cache. Given it's llama without GQA I would imagine it being quite big.

ikawrakow · 2024-02-28T13:45:13Z

The IQ1_S quantization uses exactly that: tertiary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new, if necessary.

If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

ggerganov · 2024-02-28T13:47:35Z

Let's pray they used 256 block size 😄

ikawrakow · 2024-02-28T14:03:52Z

No, I'm actually hoping the hidden dimension is from the Fibonacci sequence. So we finally get a ggml that does not use blocks 😄

jetro30087 · 2024-02-28T16:54:08Z

Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

Having said that, I stopped reading this particular paper at Table 1. In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity. They say that a 70B model will be 4.1X smaller than fp16, so the dream of running 120B models on 24GB GPU's is not quite there yet. Forgive me if I'm missing something, but I have become allergic to LLM revolutions and new eras proclaimed every other day on arxiv (or HF), so find it very hard to make myself read these revolutionary papers more carefully.

The models presented in these papers are not quantized. They are using ternary parameters (-1, 0, 1) not quantization, so it's a full-sized model. So, I don't think expectations for the size of quantized models would apply in this case. Either way, we'll know when they release code.

netrunnereve · 2024-02-28T17:35:46Z

Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

I think the reason for that is because the Nvidia GPUs that all those companies are using are designed and intended for native fp16 operations. I mean it was fp32 before, then it was discovered that fp16 has neglegible performance loss so they started using that. Now they're working on fp8 as well.

Also our quants do some sort of scaling operation to convert the compressed integer weights back into their original floats, with the same scaling factor applied across a group of weights to save space. It's easy to compute that going from a fp16 model into a q4_0, but I'm not really sure how to optimally do this backwards in the training phase. This paper seems to be not quantizing/dequantizing the model during training, but rather the model itself is literally built with ternary weights and 8 bit activations. And nothing's stopping companies from making a special AI processor that does the calculations using this approach if it works well.

cztomsik · 2024-02-28T18:37:54Z

Well, I have been wondering for a while why nobody is training quantized models directly

Same, and I was also hoping that given 3-4bit weights, it might reduce the solution surface so dramatically that we might even drop the backprop nonsense entirely and use something else... (for the pretraining, because for fine-tuning, it kinda makes sense because you want just little nudge, not dramatical change)

If 1-2 bit is feasible, than this might again change the problem space and maybe we could go straight to random evolutionary algorithms or something like that. I wonder why nobody tried that (and I hope it's not because I'm idiot)

errorsandwarnings · 2024-02-28T18:38:34Z

Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

I think the reason for that is because the Nvidia GPUs that all those companies are using are designed and intended for native fp16 operations. I mean it was fp32 before, then it was discovered that fp16 has neglegible performance loss so they started using that. Now they're working on fp8 as well.

Also our quants do some sort of scaling operation to convert the compressed integer weights back into their original floats, with the same scaling factor applied across a group of weights to save space. It's easy to compute that going from a fp16 model into a q4_0, but I'm not really sure how to optimally do this backwards in the training phase. This paper seems to be not quantizing/dequantizing the model during training, but rather the model itself is literally built with 3 bit weights and 8 bit activations. And nothing's stopping companies from making a special AI processor that does the calculations using this approach if it works well.

If this pans out, we should see everyone switching to it and throwing 10 times more parameters in the model. Plus NVIDIA should take notice of this.

Gobz · 2024-02-28T18:55:44Z

Designing hardware around pure adders seems so damn juicy, god damn that would be so insanely fast.

Gobz · 2024-02-28T20:44:14Z

Did some simple linear regression from the data in the paper, I hope their data is legit

EwoutH · 2024-02-28T21:57:06Z

Those addition-only matrix operations are brilliant. This could be so fast in the future with dedicated ASICs.

@igorbarshteyn could you clean up the title of this issue a bit though? Maybe just something like:

Support BitNet b1.58 ternary models

igorbarshteyn · 2024-02-28T21:58:25Z

Done @EwoutH

Dampfinchen · 2024-02-28T22:09:09Z

Did some simple linear regression from the data in the paper, I hope their data is legit

Nice table, thank you for the demonstration. The cool thing is that these figures are during inference of outdated non-GQA models. So with modern GQA models, the VRAM usage would be even smaller than what's listed here.

igorbarshteyn · 2024-02-29T03:28:32Z

Code will be populated here when they are ready:

https://github.com/microsoft/unilm/tree/master/bitnet

kinchahoy · 2024-02-29T04:35:50Z

The IQ1_S quantization uses exactly that: tertiary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new, if necessary.

If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

sorasoras · 2024-02-29T05:43:33Z

The IQ1_S quantization uses exactly that: tertiary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new, if necessary.
If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

What are you talking about? This is not a quants. It's a model with 1.58bit instead of FP16.

kinchahoy · 2024-02-29T05:58:51Z

The IQ1_S quantization uses exactly that: tertiary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new, if necessary.
If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

What are you talking about? This is not a quants. It's a model with 1.58bit instead of FP16.

If you actually read the bit I quoted, you'd realize that the amazing ikawrakow notes that we have a ternary quantization implementation (IQ1_S) and I was asking him what the results look like for (as I put it) "after the fact" quantization (which is obviously different from this paper). I was also asking if there are more sophisticated quantization methods available that might help low bit quantizations work better.

Obviously, this particular model is trained as a ternary model, but if it's possible for a ternary model to succeed from scratch, then it's not unreasonable to think that there should be better 1.6 bit quanitzations possible for existing models via distillation techniques.

For what it's worth, I did find some benchmarks, and they are shockingly bad at 1.6 bit ... so, yeah I'm very interested in helping explore distillation methods to improve quantization of existing models, while we all eagerly await this new from scratch model.

sorasoras · 2024-02-29T06:26:30Z

The IQ1_S quantization uses exactly that: tertiary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new, if necessary.
If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

What are you talking about? This is not a quants. It's a model with 1.58bit instead of FP16.

If you actually read the bit I quoted, you'd realize that the amazing ikawrakow notes that we have a ternary quantization implementation (IQ1_S) and I was asking him what the results look like for (as I put it) "after the fact" quantization (which is obviously different from this paper). I was also asking if there are more sophisticated quantization methods available that might help low bit quantizations work better.

Obviously, this particular model is trained as a ternary model, but if it's possible for a ternary model to succeed from scratch, then it's not unreasonable to think that there should be better 1.6 bit quanitzations possible for existing models via distillation techniques.

For what it's worth, I did find some benchmarks, and they are shockingly bad at 1.6 bit ... so, yeah I'm very interested in helping explore distillation methods to improve quantization of existing models, while we all eagerly await this new from scratch model.

it's probably easier to train model with data output from better F16 model.
SPIN
I imagine it would be difficult to do distillation via quant because Imatrix is already sort like a distillation
it's hard to get better for now.

kinchahoy · 2024-02-29T06:46:09Z

The IQ1_S quantization uses exactly that: tertiary values -1, 0, 1, so yes, it shouldn't be hard to adapt the existing code or implement new, if necessary.
If we do get meaningful trained quantized models, I would be finally able to retire from contributing quantization methods to llama.cpp :-)

Have you run any benchmarks? Obviously after the fact 1-2bit quantization will be terrible, but I'm curious. I'm also interested in any methods folks have to "improve" the quantized model after generation. Some sort of student-teacher distillation should be possible right?

What are you talking about? This is not a quants. It's a model with 1.58bit instead of FP16.

If you actually read the bit I quoted, you'd realize that the amazing ikawrakow notes that we have a ternary quantization implementation (IQ1_S) and I was asking him what the results look like for (as I put it) "after the fact" quantization (which is obviously different from this paper). I was also asking if there are more sophisticated quantization methods available that might help low bit quantizations work better.
Obviously, this particular model is trained as a ternary model, but if it's possible for a ternary model to succeed from scratch, then it's not unreasonable to think that there should be better 1.6 bit quanitzations possible for existing models via distillation techniques.
For what it's worth, I did find some benchmarks, and they are shockingly bad at 1.6 bit ... so, yeah I'm very interested in helping explore distillation methods to improve quantization of existing models, while we all eagerly await this new from scratch model.

it's probably easier to train model with data output from better F16 model. SPIN I imagine it would be difficult to do distillation via quant because Imatrix is already sort like a distillation it's hard to get better for now.

Yeah I agree! Though broadly I think of SPIN as one of the class of teacher student distillation techniques. Either way - this should be possible, and has incredible potential. I really don't see the community investing in training cutting edge 60B+ parameter <2 bit models, so we really need to find clever ways to extract the right weights starting from successful fp16 models.

WebsiteInc · 2024-02-29T06:52:46Z

These papers might be a practical approach for existing model conversion:
[https://openreview.net/forum?id=FUnEkOkodU](Token-Scaled Logit Distillation for Ternary Weight Generative Language Models)
[https://huggingface.co/papers/2306.01841](Binary and Ternary Natural Language Generation)

tuyen-huynh · 2024-02-29T11:55:48Z

The 1-bit idea in the Bitnet paper (https://arxiv.org/abs/2310.11453) has been adopted in this recent 1-bit quantization paper (https://arxiv.org/abs/2402.11295).

sorasoras · 2024-04-02T14:04:10Z

I was just curious how fast it could run on CPU. The answer is not very fast compared to GPUs sadly, even for 3B models. The best I could do was upper-bound about 40 tokens/second on Xeon workstation processors. I mean it's like 5x faster than Gemma 3B but nothing stellar in the grand scheme of things.

BitNet speedup on AVX2: lithium0003 on Github suggested using the _mm256_sign_epi8() intrinsic to greatly speed up the AVX2 version. It's now running at 28 tokens/second using AVX2 on Intel 12th gen CPU and 50 tokens/second using AMD Ryzen 9 7950X with AVX-512. Code is checked in

It may not be GPU fast, but it's decently fast. Inference speeds scale up pretty linearly so you'll get something like this:
Model Size Speed (t/s)
3B 50
7B 25
13B 12.5
30B 6
70B 3.5

Keep in mind that Bitnet is supposed to have F16 equivalent performance, so it may might more sense to compare those numbers against the F16 or perhaps Q8_0 equivalent. People who want something faster can always run on GPU and I'm sure someone will create good kernels for Bitnet.

what cpu is that?

netrunnereve · 2024-04-02T21:33:46Z

I literally just got the numbers from @catid's AMD 7950X 3B example and extrapolated them up. From my own experience the inference speed for non Bitnet models is inversely proportional to its size provided you have enough memory to hold everything.

EwoutH · 2024-04-03T11:48:51Z

A new implementation, BitMat, just got open-sourced.

BitMat: Improving Ternary Matrix Multiplication with Triton

BitMat is a Python package designed to optimize matrix multiplication operations by utilizing custom kernels written in Triton. Our package leverages the principles outlined in the "1bit-LLM Era" paper, specifically utilizing packed int8 data to enhance computational efficiency and performance in deep learning and numerical computing tasks.

Some more context on Reddit:

During the training phase, we implement a custom forward and backward propagation mechanism that calculates the gradient as outlined in the paper. This is done in FP16 precision using the pre-quantized weights (W) and inputs (X). For inference, we've optimized storage by packing the model weights into int8 format, achieving a 4x reduction in size. Our kernels then unpack these values during computation, perform the necessary multiplications, and accumulate the results. This approach significantly reduces the memory footprint required for storing and operating the model, aligning with the paper's findings.

ExeVirus · 2024-04-19T14:28:23Z

Thank you all for the interest in our BitNet b1.58 work! I believe that it will be a great benefit for the community to implement it in the awesome llama.cpp.

I noticed that someone had open-sourced some models (https://huggingface.co/1bitLLM) that reproduced the results in our paper. The implementation and the results look good to me. I think we should be able to get started from it.

Please let me know if I can help with the implementation, thank you!

@shumingma

Sorry to reach out again, but looking over the repository you point to from the quote above, those model weights are all fp32.

Now, I don't have the expertise to go in and look at those weights, perhaps they are all just 1.000000, 0.00000, -1.000000, but have you actually examined those models to confirm that is really what bitnet is using for weights (i.e. fp32 for each ternary weight)? I would have expected some other container than fp32 weights for storing these ternary values (some sort of int8 construct).

I don't think llama.cpp can start working on an implementation for supporting that model type as it is, because those model weights aren't in a format we'd expect anyone to actually load. If you could share possibly a method to convert those weights to the correct storage format, we could start working from there.

Thanks in advance, doesn't seem to be much open progress yet on ternary bitnet(s) despite the obvious benefits.

paperdev-code · 2024-04-21T11:14:19Z

those model weights are all fp32.

@ExeVirus I agree that it's likely because of a lack of support for loading ternary models. I think there's nothing more to add to this discussion until @shumingma finishes training their models. Perhaps this isn't just a llama.cpp issue, but more of a ggml/gguf file format support issue?

ExeVirus · 2024-04-21T11:53:58Z

Sounds like it, but if that's true, we already have an open ternary weight model available, it's just in fp32 per weight. That sounds relatively straightforward to at least get a CPU only version working and doing some bit packing, since it's just llama with bit linear. At the very least, isn't Q2-Q3 post-quantization enough to represent -1, 0, 1?

EwoutH · 2024-04-22T10:08:59Z

Two more pre-trained models:

It now seems we have enough models to test on, and inference implementations (like bitnet_cpu and BitMat) to be inspired by.

What would be a good approach to implement support for ternary models in llama.cpp, and how can we move that forward?

netrunnereve · 2024-04-23T01:59:27Z

Sounds like it, but if that's true, we already have an open ternary weight model available, it's just in fp32 per weight. That sounds relatively straightforward to at least get a CPU only version working and doing some bit packing, since it's just llama with bit linear. At the very least, isn't Q2-Q3 post-quantization enough to represent -1, 0, 1?

What would be a good approach to implement support for ternary models in llama.cpp, and how can we move that forward?

My guess is that we'll need a new quantization type for this, say QB with one sign bit and one data bit for a total of two bits per weight. Activations will be 8 bits as specified in the paper. We shouldn't use an existing quant for this as those group the weights together with a scaling factor and we don't need that for Bitnet. This quant will be lossless!

Assuming that the models come in f32 as -1.00, 0.00, 1.00 we would need to create the new QB quant and have some check in quantize to make sure that all the f32 weights are in the correct tenary format before quantizing. Looking into the K-quants PRs should give some guidance on how to create a new quant, which also requires you to write the math kernels for the actual calculations. Now this may be sound straightforward for someone who's done this before but it'll probably be really tricky in practice 😉.

nickovs · 2024-04-23T03:11:52Z

My guess is that we'll need a new quantization type for this, say QB with one sign bit and one data bit for a total of two bits per weight.

Personally I'd vote for a two bit ones' complement format, which would mean that you'd have one bit set for the value +1, the other bit set for the value -1 and neither bit set for the value 0. I believe that this will lead to the most efficient kernels using SIMD mask-and-add on both Intel and ARM CPUs.

danpovey · 2024-05-23T06:57:33Z

Well, I have been wondering for a while why nobody is training quantized models directly. Given how close we can come to the performance of the fp16 model with relatively simple means, it is kind of obvious that one should be able to get the same performance as fp16 if one trained a quantized model directly.

Having said that, I stopped reading this particular paper at Table 1. In what sense is a 2.22 GB model a 1-bit version of 3B parameters? 2.22 GB is larger than a 2-bit quantized 7B LLaMA, not to mention the much higher perplexity. They say that a 70B model will be 4.1X smaller than fp16, so the dream of running 120B models on 24GB GPU's is not quite there yet. Forgive me if I'm missing something, but I have become allergic to LLM revolutions and new eras proclaimed every other day on arxiv (or HF), so find it very hard to make myself read these revolutionary papers more carefully.

The 2.22GB is "memory use" which probably means for training. That would presumably include the un-quantized fp16 version of the parameters, needed during training. (And possibly the optimizer state, which for Adam would be two fp32's per parameter, although I doubt they included that).

ExeVirus · 2024-06-09T12:06:22Z

Update

Microsoft quietly released BitBLAS: github.com/microsoft/BitBLAS
This is a release of their specialized kernel creator for quantized bit inference.

What's more important, is they used this BitBLAS with the open-source-reproduced 3B ternary bitnet model.

They released all that code for testing and inference here: github.com/microsoft/BitBLAS/tree/main/integration/BitNet.

From my quick reading - that's more than enough information to write the kernels for inference and probably make GGUFs for the 3B/1.3B/700M models.

[Everything is MIT licensed]

netrunnereve · 2024-06-10T00:07:12Z

What's more important, is they used this BitBLAS with the open-source-reproduced 3B ternary bitnet model.

If their own researchers are using the 1bitLLM reproduced model then from the looks of it Microsoft is never going to release the original BitNet paper models 😞.

From my quick reading - that's more than enough information to write the kernels for inference and probably make GGUFs for the 3B/1.3B/700M models.

The issue with that is that those are toy models solely for testing BitNet and they aren't really suitable for actual chatting. What will get a dev onboard (maybe even me if I have the time!) is for some company to release a fully trained BitNet model of Llama quality for us to play with. Otherwise the implementation will remain as an experiment and it'll probably have few users and little support.

Now of course this is also a chicken and egg problem since if we can demonstrate how effective BitNet inference is with llama.cpp then some company may be incentivized to train some proper models.

ExeVirus · 2024-06-10T01:20:47Z

Exactly. I'm actually having one of the researchers in the RWVK community run some tests where they replace their weights with teranry weights.

Seeing some initial success with that test, albeit it trains more slowly than they are used to for their architecture.

If they do start that work, I'll be sure to report.

Update on RWVK: they weren't satisfied with such a low learning rate compared to their normal training speeds (too expensive too give a full run). If someone has the experience with hyper parameter tunning and the hardware to test, reach out to me and I can get you set up with what they would need to prove to themselves that ternary is worth it.

netrunnereve · 2024-06-14T16:57:12Z

Haven't looked into it in detail yet but someone just submitted a BitNet PR! The new format uses 2 bits per weight.

#7931

netrunnereve · 2024-06-25T17:46:22Z

Another paper has been released that builds on BitNet with ternary weights. What's interesting here is that they made a FPGA implementation designed for ternary math.

https://arxiv.org/pdf/2406.02528

RonanKMcGovern · 2024-06-29T10:11:11Z

I suppose the issue is that FPGAs just don't have the raw FLOPS that GPUs have - so even if you can program them to run more efficiently, they'll be much slower?

…

On Tue, Jun 25, 2024 at 6:47 PM Eve ***@***.***> wrote: Another paper has been released that builds on BitNet with ternary weights. What's interesting here is that they made a FPGA implementation designed for ternary math. image.png (view on web) <https://github.com/ggerganov/llama.cpp/assets/139727413/6ea1ab3d-70d2-4001-af4c-101192fb471b> https://arxiv.org/pdf/2406.02528 — Reply to this email directly, view it on GitHub <#5761 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ASVG6CTCMGUX6JJPQ72PSXLZJGUJVAVCNFSM6AAAAABD5VLW76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBZGYYDGMZXGM> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

ExeVirus · 2024-06-29T11:47:56Z

Kinda, FPGAs let you lay out the gates exactly to match the algorithm you want to operate. With something like bitnet, that means you only need to design matmul matrices for f8, and ternary 'multipliers' can be laid out in more optimal ways so that more operations are done per 'FLOP' than in a GPU. With a ton of optimization, FPGAs can get much more out of their FLOPs than GPUs. The issue is if you have that much time to optimize, then an ASIC might have made sense in the first place. If only there were machine learning FPGA algorithm optimizers.... There was that one with alpha zero, but it was very manual: https://deepmind.google/discover/blog/alphadev-discovers-faster-sorting-algorithms/ On Sat, Jun 29, 2024, 6:11 AM Ronan McGovern ***@***.***> wrote:

…

I suppose the issue is that FPGAs just don't have the raw FLOPS that GPUs have - so even if you can program them to run more efficiently, they'll be much slower? On Tue, Jun 25, 2024 at 6:47 PM Eve ***@***.***> wrote: > Another paper has been released that builds on BitNet with ternary > weights. What's interesting here is that they made a FPGA implementation > designed for ternary math. > > image.png (view on web) > < https://github.com/ggerganov/llama.cpp/assets/139727413/6ea1ab3d-70d2-4001-af4c-101192fb471b> > > https://arxiv.org/pdf/2406.02528 > > — > Reply to this email directly, view it on GitHub > < #5761 (comment)>, > or unsubscribe > < https://github.com/notifications/unsubscribe-auth/ASVG6CTCMGUX6JJPQ72PSXLZJGUJVAVCNFSM6AAAAABD5VLW76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBZGYYDGMZXGM> > . > You are receiving this because you are subscribed to this thread.Message > ID: ***@***.***> > — Reply to this email directly, view it on GitHub <#5761 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AKT7N2XR6SL5Z3B3SXCX773ZJ2B6DAVCNFSM6AAAAABD5VLW76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJYGA4DQOBYHE> . You are receiving this because you were mentioned.Message ID: ***@***.***>

RonanKMcGovern · 2024-06-29T14:31:26Z

yeah, to be more concrete. With FPGA, the total number of logic gate operations required can be ~8x smaller (if 2bit vs 16 bit) or ~5x (if ternary vs 16 bit). However, FPGAs will just have far lower operations per second versus GPUs right? (at least for current FPGAs and GPUs OR, said differently, the cost per FLOP of GPUs is just very low and you can't make up for that just by being more efficient on more primitive operations with FPGAs, or probably ASICs...unless mass manufactured)

…

On Sat, Jun 29, 2024 at 12:48 PM ExeVirus ***@***.***> wrote: Kinda, FPGAs let you lay out the gates exactly to match the algorithm you want to operate. With something like bitnet, that means you only need to design matmul matrices for f8, and ternary 'multipliers' can be laid out in more optimal ways so that more operations are done per 'FLOP' than in a GPU. With a ton of optimization, FPGAs can get much more out of their FLOPs than GPUs. The issue is if you have that much time to optimize, then an ASIC might have made sense in the first place. If only there were machine learning FPGA algorithm optimizers.... There was that one with alpha zero, but it was very manual: https://deepmind.google/discover/blog/alphadev-discovers-faster-sorting-algorithms/ On Sat, Jun 29, 2024, 6:11 AM Ronan McGovern ***@***.***> wrote: > I suppose the issue is that FPGAs just don't have the raw FLOPS that GPUs > have - so even if you can program them to run more efficiently, they'll be > much slower? > > On Tue, Jun 25, 2024 at 6:47 PM Eve ***@***.***> wrote: > > > Another paper has been released that builds on BitNet with ternary > > weights. What's interesting here is that they made a FPGA implementation > > designed for ternary math. > > > > image.png (view on web) > > < > https://github.com/ggerganov/llama.cpp/assets/139727413/6ea1ab3d-70d2-4001-af4c-101192fb471b> > > > > > https://arxiv.org/pdf/2406.02528 > > > > — > > Reply to this email directly, view it on GitHub > > < > #5761 (comment)>, > > > or unsubscribe > > < > https://github.com/notifications/unsubscribe-auth/ASVG6CTCMGUX6JJPQ72PSXLZJGUJVAVCNFSM6AAAAABD5VLW76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBZGYYDGMZXGM> > > > . > > You are receiving this because you are subscribed to this thread.Message > > ID: ***@***.***> > > > > — > Reply to this email directly, view it on GitHub > < #5761 (comment)>, > or unsubscribe > < https://github.com/notifications/unsubscribe-auth/AKT7N2XR6SL5Z3B3SXCX773ZJ2B6DAVCNFSM6AAAAABD5VLW76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJYGA4DQOBYHE> > . > You are receiving this because you were mentioned.Message ID: > ***@***.***> > — Reply to this email directly, view it on GitHub <#5761 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ASVG6CTDV3QGCE6O3XN2PNTZJ2NJBAVCNFSM6AAAAABD5VLW76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJYGEZDGMRVHA> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

ExeVirus · 2024-06-29T14:33:47Z

The less power used, the faster you can run the clock, for the most part. On Sat, Jun 29, 2024, 10:32 AM Ronan McGovern ***@***.***> wrote:

…

yeah, to be more concrete. With FPGA, the total number of logic gate operations required can be ~8x smaller (if 2bit vs 16 bit) or ~5x (if ternary vs 16 bit). However, FPGAs will just have far lower operations per second versus GPUs right? (at least for current FPGAs and GPUs OR, said differently, the cost per FLOP of GPUs is just very low and you can't make up for that just by being more efficient on more primitive operations with FPGAs, or probably ASICs...unless mass manufactured) On Sat, Jun 29, 2024 at 12:48 PM ExeVirus ***@***.***> wrote: > Kinda, FPGAs let you lay out the gates exactly to match the algorithm you > want to operate. > > With something like bitnet, that means you only need to design matmul > matrices for f8, and ternary 'multipliers' can be laid out in more optimal > ways so that more operations are done per 'FLOP' than in a GPU. > > With a ton of optimization, FPGAs can get much more out of their FLOPs > than > GPUs. The issue is if you have that much time to optimize, then an ASIC > might have made sense in the first place. > > If only there were machine learning FPGA algorithm optimizers.... There > was > that one with alpha zero, but it was very manual: > > https://deepmind.google/discover/blog/alphadev-discovers-faster-sorting-algorithms/ > > On Sat, Jun 29, 2024, 6:11 AM Ronan McGovern ***@***.***> > wrote: > > > I suppose the issue is that FPGAs just don't have the raw FLOPS that > GPUs > > have - so even if you can program them to run more efficiently, they'll > be > > much slower? > > > > On Tue, Jun 25, 2024 at 6:47 PM Eve ***@***.***> wrote: > > > > > Another paper has been released that builds on BitNet with ternary > > > weights. What's interesting here is that they made a FPGA > implementation > > > designed for ternary math. > > > > > > image.png (view on web) > > > < > > > https://github.com/ggerganov/llama.cpp/assets/139727413/6ea1ab3d-70d2-4001-af4c-101192fb471b> > > > > > > > > > https://arxiv.org/pdf/2406.02528 > > > > > > — > > > Reply to this email directly, view it on GitHub > > > < > > > #5761 (comment)>, > > > > > > or unsubscribe > > > < > > > https://github.com/notifications/unsubscribe-auth/ASVG6CTCMGUX6JJPQ72PSXLZJGUJVAVCNFSM6AAAAABD5VLW76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBZGYYDGMZXGM> > > > > > > . > > > You are receiving this because you are subscribed to this > thread.Message > > > ID: ***@***.***> > > > > > > > — > > Reply to this email directly, view it on GitHub > > < > #5761 (comment)>, > > > or unsubscribe > > < > https://github.com/notifications/unsubscribe-auth/AKT7N2XR6SL5Z3B3SXCX773ZJ2B6DAVCNFSM6AAAAABD5VLW76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJYGA4DQOBYHE> > > > . > > You are receiving this because you were mentioned.Message ID: > > ***@***.***> > > > > — > Reply to this email directly, view it on GitHub > < #5761 (comment)>, > or unsubscribe > < https://github.com/notifications/unsubscribe-auth/ASVG6CTDV3QGCE6O3XN2PNTZJ2NJBAVCNFSM6AAAAABD5VLW76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJYGEZDGMRVHA> > . > You are receiving this because you are subscribed to this thread.Message > ID: ***@***.***> > — Reply to this email directly, view it on GitHub <#5761 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AKT7N2VIVOS6OY362AK6DC3ZJ3AN7AVCNFSM6AAAAABD5VLW76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJYGIYTINZRGU> . You are receiving this because you were mentioned.Message ID: ***@***.***>

netrunnereve · 2024-06-29T17:02:47Z

However, FPGAs
will just have far lower operations per second versus GPUs right? (at least
for current FPGAs and GPUs OR, said differently, the cost per FLOP of GPUs
is just very low and you can't make up for that just by being more
efficient on more primitive operations with FPGAs, or probably
ASICs...unless mass manufactured)

I think of the FPGA implementation as more of a way to show companies that, hey, it's possible to run BitNet very efficiently on custom hardware and if it all pans out then it might be worth having special ternary units inside future CUDA cores or maybe even have a special chip just for BitNet. I don't think we'll see people installing FPGA cards in their computers for running LLMs.

paperdev-code · 2024-06-29T22:25:56Z

maybe even have a special chip just for BitNet.

That would be nice, however bitnet is being held back by the undertrained models currently available. If the original authors got to release their 7B, 13B, and up, it would go a long way in convincing people this model architecture really does scale as promised.

nonetrix · 2024-06-30T00:09:23Z

I feel like gap would only get smaller compared to full fat FP16 as it scales, but we will have to see in reality. I wonder who will be first to train a bigger one, I think someone could do better than we got right now by a bit but we need a company willing to take a risk for even bigger ones. Maybe Jamba people would be interested, they took a risk with trying Mamba

Green-Sky · 2024-08-04T07:39:09Z

The development was continued in #8151

github-actions · 2024-09-18T01:07:17Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

igorbarshteyn added the enhancement New feature or request label Feb 28, 2024

igorbarshteyn changed the title ~~This new model training method (BitNet b1.58) is revolutionary - and according to this new paper, support can be easily built into llama.cpp~~ Support BitNet b1.58 ternary models Feb 28, 2024

mofosyne added the Tensor Encoding Scheme https://github.com/ggerganov/llama.cpp/wiki/Tensor-Encoding-Schemes label Jun 15, 2024

compilade mentioned this issue Jun 29, 2024

ggml-quants : ternary packing for TriLMs and BitNet b1.58 #8151

Merged

15 tasks

github-actions bot added the stale label Jul 30, 2024

Green-Sky removed the stale label Aug 4, 2024

github-actions bot added the stale label Sep 4, 2024

github-actions bot closed this as completed Sep 18, 2024

Support BitNet b1.58 ternary models #5761

Support BitNet b1.58 ternary models #5761

Comments

igorbarshteyn commented Feb 28, 2024 • edited Loading

Dampfinchen commented Feb 28, 2024

ikawrakow commented Feb 28, 2024

igorbarshteyn commented Feb 28, 2024

ggerganov commented Feb 28, 2024

igorbarshteyn commented Feb 28, 2024

Dampfinchen commented Feb 28, 2024 • edited Loading

ikawrakow commented Feb 28, 2024

ggerganov commented Feb 28, 2024

ikawrakow commented Feb 28, 2024

jetro30087 commented Feb 28, 2024

netrunnereve commented Feb 28, 2024 • edited Loading

cztomsik commented Feb 28, 2024

errorsandwarnings commented Feb 28, 2024

Gobz commented Feb 28, 2024

Gobz commented Feb 28, 2024

EwoutH commented Feb 28, 2024 • edited Loading

igorbarshteyn commented Feb 28, 2024

Dampfinchen commented Feb 28, 2024

igorbarshteyn commented Feb 29, 2024

kinchahoy commented Feb 29, 2024

sorasoras commented Feb 29, 2024

kinchahoy commented Feb 29, 2024

sorasoras commented Feb 29, 2024

kinchahoy commented Feb 29, 2024

WebsiteInc commented Feb 29, 2024 • edited Loading

tuyen-huynh commented Feb 29, 2024

sorasoras commented Apr 2, 2024

netrunnereve commented Apr 2, 2024 • edited Loading

EwoutH commented Apr 3, 2024 • edited Loading

BitMat: Improving Ternary Matrix Multiplication with Triton

ExeVirus commented Apr 19, 2024

paperdev-code commented Apr 21, 2024 • edited Loading

ExeVirus commented Apr 21, 2024

EwoutH commented Apr 22, 2024

netrunnereve commented Apr 23, 2024

nickovs commented Apr 23, 2024

danpovey commented May 23, 2024

ExeVirus commented Jun 9, 2024 • edited Loading

netrunnereve commented Jun 10, 2024

ExeVirus commented Jun 10, 2024 • edited Loading

netrunnereve commented Jun 14, 2024

netrunnereve commented Jun 25, 2024

RonanKMcGovern commented Jun 29, 2024 via email

ExeVirus commented Jun 29, 2024 via email

RonanKMcGovern commented Jun 29, 2024 via email

ExeVirus commented Jun 29, 2024 via email

netrunnereve commented Jun 29, 2024

paperdev-code commented Jun 29, 2024 • edited Loading

nonetrix commented Jun 30, 2024

Green-Sky commented Aug 4, 2024

github-actions bot commented Sep 18, 2024

igorbarshteyn commented Feb 28, 2024 •

edited

Loading

Dampfinchen commented Feb 28, 2024 •

edited

Loading

netrunnereve commented Feb 28, 2024 •

edited

Loading

EwoutH commented Feb 28, 2024 •

edited

Loading

WebsiteInc commented Feb 29, 2024 •

edited

Loading

netrunnereve commented Apr 2, 2024 •

edited

Loading

EwoutH commented Apr 3, 2024 •

edited

Loading

paperdev-code commented Apr 21, 2024 •

edited

Loading

ExeVirus commented Jun 9, 2024 •

edited

Loading

ExeVirus commented Jun 10, 2024 •

edited

Loading

paperdev-code commented Jun 29, 2024 •

edited

Loading