Batch size affects model's output #249
Comments
There is most definitely something wrong in the way the prompt is fed into the program. Changes in the batch size can affect the output. Examples:
$ ./main -s 1000 -m models/7B/ggml-model-q4_0.bin --top_k 1 -b 1 -p "This is a sample prompt that expects continuation:"
I was experimenting with changing the word selection by forcing the most likely output every time, using --top_k 1 to eliminate next-word sampling. My guess is that batch size = 1 will give the more "correct" behaviour of the model.
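One way to make the comparison above repeatable is to script it. The following is a minimal sketch, not part of the project: it assumes the `./main` binary and model path from the command above, uses only flags that appear in this thread (`-s`, `-m`, `--top_k`, `-b`, `-n`, `-p`), and the helper names (`build_command`, `generate`, `first_divergence`) are hypothetical.

```python
import subprocess

def build_command(batch_size, model="models/7B/ggml-model-q4_0.bin",
                  prompt="This is a sample prompt that expects continuation:",
                  seed=1000, n_predict=64):
    """Build the argument list for one run; only the batch size should vary."""
    return ["./main", "-s", str(seed), "-m", model, "--top_k", "1",
            "-b", str(batch_size), "-n", str(n_predict), "-p", prompt]

def generate(batch_size):
    """Run the binary once and return whatever it printed to stdout."""
    result = subprocess.run(build_command(batch_size),
                            capture_output=True, text=True, check=True)
    return result.stdout

def first_divergence(a, b):
    """Index of the first differing character, or None if the texts are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))
```

With the binary and model in place, `first_divergence(generate(1), generate(8))` returning anything other than `None` would indicate that the two batch sizes produced different text for the same seed and prompt.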
Hm, yes. I agree. I have an interactive assistant working with a prompt, and if I set the batch size to 100, or to whatever value eats the entire prompt and all my conversation turns at once, the model is continuously rather confused and keeps making mistakes in reading what I write. I think it is pretty obvious that higher batch sizes do not work correctly at present.
I also think the defaults are not too good, which is somewhat of an issue for this less scientific, unquantified approach. To be honest, I can't get the regular --top_p 0.9 to stay on topic, and repeat_last_n must be lowered considerably for chat mode, or the AI generates the end-of-chat token instead of replying a lot of the time. My guess is that the likelihood of generating the "Bob:" text token is lowered too much, so it tends to choose to end the discussion instead. I am currently using the following parameters with the Bob-like assistant string:
./main -m ./models/7B/ggml-model-q4_0.bin -b 1 --ctx_size 2048 --temp 1.0 --top_k 100 --top_p 0.7 --repeat_last_n 20 --repeat_penalty 1.2 -n 2048 --color -i -r "User:" -p "Transcript of dialog between blah blah blah"
With the lower top_p value, at least, the model is usually quite coherent, and I can have long chats with it. After enjoying the fairly coherent chat this setup affords, it is very obvious that increasing the batch size makes the model barely understand what I am saying to it.
I'm not sure I understand enough of the code to draw this conclusion, but I think the whole deal with batching is sacrificing quality to get speed. This mainly applies to optimizing by crossing the CPU<->GPU boundary fewer times. But in CPU inference, I'm not even sure batching is significantly faster. It hasn't felt faster in my somewhat unscientific tests. OTOH, to compute the attention mechanism for a token, you need the data for previous tokens to be there, because every token in the input must attend to every previous token. So if you batch them in groups of N, I can imagine there being a downgrade in quality, unless someone has made sure that the matrix multiplications are done in such a way that the data for tokens 0..n-1 is available when computing the data for token n.
That's incorrect, and batching shouldn't sacrifice anything. It should also be faster on CPU. All the PyTorch transformers I have had to run on CPU were significantly faster at reading prompts than at generating text. The transformer architecture allows computing the activations of a single layer for a whole batch in one go. Under the hood there are actually three steps:
This is done for each layer, one by one: batch goes in, batch goes out. A huge part of why transformers overtook RNNs is this property, which allows training on whole chunks of data in one pass.
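The claim that batching should not change results can be illustrated with a toy model. The following is a minimal sketch under stated assumptions, not llama.cpp's code: a single attention head in pure Python, where each token vector is reused as its own query, key and value (a stand-in for the real projections), and keys and values are appended to a shared cache before any query in the batch attends. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_batch(queries, keys, values, cache_k, cache_v):
    """Append the batch's keys/values to the cache, then let each query
    attend to every cached position up to and including its own."""
    start = len(cache_k)                 # absolute position of the batch's first token
    cache_k.extend(keys)
    cache_v.extend(values)
    dim = len(cache_v[0])
    outputs = []
    for i, q in enumerate(queries):
        last = start + i                 # causal mask: positions 0..last only
        weights = softmax([dot(q, k) for k in cache_k[: last + 1]])
        visible = cache_v[: last + 1]
        outputs.append([sum(w * v[d] for w, v in zip(weights, visible))
                        for d in range(dim)])
    return outputs

def run(tokens, batch_size):
    """Feed the tokens through the toy layer in chunks of batch_size."""
    cache_k, cache_v, out = [], [], []
    for s in range(0, len(tokens), batch_size):
        chunk = tokens[s : s + batch_size]
        # Toy assumption: each token vector acts as its own Q, K and V.
        out.extend(attend_batch(chunk, chunk, chunk, cache_k, cache_v))
    return out

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.3, -0.2], [0.9, 0.1]]
one_by_one = run(tokens, 1)
batched = run(tokens, 4)
```

In this sketch `one_by_one` and `batched` agree, because each query sees exactly the same cached keys in the same order regardless of how the input was chunked. Real implementations use optimized matrix multiplications, whose internal summation order may differ between batch sizes, so this does not rule out small numerical differences there.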
@jarcen Ok, yeah, this makes sense in general terms and you seem to know more about it than me. Sorry I added noise to the discussion. One question though, because I'd like to make sure I understood your point correctly:
When batching inputs, the "tokens that are placed before it" are part of the batch and are being computed at the same time, isn't that correct? Or do you mean that the data dependency only goes to previous tokens in previous layers? And leaving that aside, I (and others) have clearly observed output quality differences when varying the batch size. So if this is not an issue, theoretically speaking, then it may be a bug in the implementation?
They are not being computed at the same time. Computations in one layer are separated into the three steps I listed above. Step 2 operates on Query-Key-Value matrices which were already created in step 1. Key-Value matrices are no longer part of the batch but part of the hidden state. Each Query vector with associated position N looks for Key vectors at positions N, N-1, N-2, N-3, etc. That includes the K vectors that already existed and the ones that were just added from the batch in step 1. If there are four Q vectors, then four threads can do that in parallel; there's no data dependency between these threads. (Note that I'm seemingly using Vector and Matrix interchangeably, but matrices are essentially how batching is implemented: each row is a vector. So, individual per-token operations are explained with vectors.) Example code expressing the idea of self-attention with string operations:
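The snippet originally posted here is not preserved in this copy of the thread. What follows is a hypothetical Python sketch of the same idea, not the commenter's code: each position in a string scans backward for the nearest earlier occurrence of its own character, standing in for a Query looking over earlier Keys. As a simplification, the scan starts at N-1 rather than N so that a position does not trivially match itself. The names `attend` and `attend_parallel` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def attend(keys: str, n: int) -> int:
    """Scan backward from position n-1 for the nearest earlier occurrence
    of the character at position n; return its index, or -1 if none."""
    query = keys[n]
    for pos in range(n - 1, -1, -1):
        if keys[pos] == query:
            return pos
    return -1

def attend_parallel(keys: str, start: int, end: int) -> list:
    """Handle positions start..end-1 as one batch, one task per position.
    The tasks only read `keys`, so they cannot interfere with each other."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda n: attend(keys, n), range(start, end)))

text = "abcabca"
sequential = [attend(text, n) for n in range(len(text))]
batched = attend_parallel(text, 0, len(text))
```

Here `sequential` and `batched` contain the same indices, since every position only reads the shared string and writes its own result.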
This code can be run in parallel in multiple threads. One thread might start at position 5, another at 6, 7, 8 and so on. They do not conflict in any way. That is what essentially happens at step 2, except characters are Key vectors and
Now for the quality. Yes, it must be a bug somewhere. I read
Can you guys give a test with latest
Latest
I also have this problem. Is this a limitation of llama.cpp? Why is this thread closed?
I was tinkering with the code and made the following change at line 977 of main.cpp (as it seemed wrong to me):
from
to
The model's (13B) outputs suddenly changed. I reverted the change and tried playing with the batch_size parameter; it really does affect the output.
Not sure if this is expected behaviour. As far as I understand, it shouldn't be the case. A bug? Do different batch sizes give different evaluation results (rounding error)?
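On the rounding-error question: floating-point addition is not associative, so the same numbers summed in a different order can give results that differ in the last bits. The sketch below only demonstrates that general fact in pure Python, along with how a last-bit difference can flip a greedy (top_k 1 style) choice between two nearly tied candidates. It is an illustration with made-up numbers and hypothetical function names, and says nothing about whether this is what actually happens in main.cpp.

```python
def sum_forward(values):
    """Accumulate left to right: ((v0 + v1) + v2) + ..."""
    total = 0.0
    for v in values:
        total += v
    return total

def sum_backward(values):
    """Accumulate right to left: ((vN + vN-1) + vN-2) + ..."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total

def argmax(logits):
    """Index of the largest value; ties go to the lowest index."""
    return max(range(len(logits)), key=logits.__getitem__)

terms = [0.1, 0.2, 0.3]
forward = sum_forward(terms)      # (0.1 + 0.2) + 0.3
backward = sum_backward(terms)    # (0.3 + 0.2) + 0.1

# Two candidate "logits": a fixed rival and the same sum computed two ways.
rival = 0.6
pick_forward = argmax([rival, forward])
pick_backward = argmax([rival, backward])
```

In this example `forward` and `backward` differ by roughly one unit in the last place, and that is enough for `pick_forward` and `pick_backward` to select different candidates. Once one token differs, everything generated after it can diverge.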