-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
llama : default sampling changes + greedy update #9897
Conversation
I've proposed the following change of the temperature parameter and sampler: 78118c4. This should avoid hijacking the sampler chain at In the meantime will continue to refactor |
I think this behavior for temperature 0 is reasonable, and should be fairly intuitive. The behavior for negative values will still be hard (or impossible) to predict without reading the documentation, and while it may serve as a convenient way to enable greedy sampling, I am not convinced that it is worth keeping. There is a simple and clear way to enable greedy sampling by using top-k 1 that does not require having any special values. Regardless, at least for the OAI endpoints we should try to replicate the behavior of the OAI API. Unfortunately their API documentation is not very clear about what happens with temperature 0, although it lists it as a valid value:
|
60d8af4
to
e5150b1
Compare
Yes, seems like there isn't much benefit for this extra "greedy" logic at |
For t <= 0.0f, keep the max logit intact and set the rest to -inf
top-k == 1 is the same
baf3d37
to
cd97850
Compare
Do I understand it correctly that with these changes "greedy sampling" would require either temperature or top-k in the queue? Is it intended? The name of this PR doesn't reflect that, and it seems to be a significant change. UPD: wouldn't other samplers, included by default, affect the results of new "greedy" sampling? Seems like a user would have to deliberately switch all other samplers off in order to get the same type of deterministic results (so, either |
After these changes you would need to explicitly set a greedy chain that picks the token with the highest logit. For example with:
If you use the default sampling chain, then simply Note that if you set I think the new way makes more sense than what we have on |
I agree, but maybe it is worth renaming this PR to mention that greedy sampling logics was changed too (at least to avoid questions about it later). |
Yes, will update it. I also spotted some additional problems that will fix later today. |
@MaggotHATE I've updated the description. PTAL and any testing and feedback is appreciated. |
@ggerganov Do you want to keep |
|
Tested with Mistral-Nemo-Instruct-2407.q5_k_l, seed 404. temp = 0
top_k = 1
Looks good! |
Setting all tokens except the first one to UPD. Tokens are not sorted, that's the problem. Am I missing anything? Results are reproduced with the same seed, so it should work. static void llama_sampler_temp_impl(llama_token_data_array * cur_p, float temp) {
if (temp <= 0.0f) {
// find the position of a token with the highest logit and cut candidates to have only that token
int max_pos = 0;
float max_logit = cur_p->data[0].logit;
for (size_t i = 1; i < cur_p->size; ++i) {
if (cur_p->data[i].logit > max_logit) {
max_pos = i;
max_logit = cur_p->data[i].logit;
}
}
cur_p->data += max_pos;
cur_p->size = 1;
return;
}
for (size_t i = 0; i < cur_p->size; ++i) {
cur_p->data[i].logit /= temp;
}
} |
Yes, I didn't want to change this behavior because normally the temperature sampler does not resize the candidates. Is the overhead from the extra loop measurable in your tests? |
At least we can remove the second loop: static void llama_sampler_temp_impl(llama_token_data_array * cur_p, float temp) {
if (temp <= 0.0f) {
// find the token with the highest logit and set the rest to -inf
int max_pos = 0;
float max_logit = cur_p->data[0].logit;
for (size_t i = 1; i < cur_p->size; ++i) {
if (cur_p->data[i].logit > max_logit) {
cur_p->data[max_pos].logit = -INFINITY;
max_pos = i;
max_logit = cur_p->data[i].logit;
} else cur_p->data[i].logit = -INFINITY;
}
return;
}
for (size_t i = 0; i < cur_p->size; ++i) {
cur_p->data[i].logit /= temp;
}
} However, it will still go through The original greedy sampling has 1 loop over all candidates and that's all. It is sorted only if I will test for performance more later (with less background load for cleaner results), but so far it looks like there will be some performance degradation. UPD: not clean, but more or less the same background load - ~2.8 t/s original greedy inference vs ~2.6 t/s new one. Mistral-Nemo-Instruct-2407.q5_k_l, CPU only, OpenMP and SGEMM. UPD2: Maybe initialize |
Tested with less background load. Greedy sampler: 3.15 t/s The difference is small, but noticeable from the start of generation. However, the ability to combine greedy sampling with other samplers might be useful - I'm getting quite interesting results with XTC (recommended settings) -> greedy. |
The more I think about it, the less logical it seems: previously we had a clear contrast between greedy sampling and distance sampling. With this change we force greedy sampling to work through distance sampling, which brings unnecessary operations and seems wrong. UPD2: I've missed that This will still require users to have temperature in the queue, but will not force greedy sampling through softmax and distance: static void llama_sampler_greedy_impl(llama_token_data_array * cur_p) {
cur_p->selected = 0;
for (size_t i = 1; i < cur_p->size; ++i) {
if (cur_p->data[i].logit > cur_p->data[cur_p->selected].logit) {
cur_p->selected = i;
}
}
}
static void llama_sampler_temp_impl(llama_token_data_array * cur_p, float temp) {
if (temp <= 0.0f) {
llama_sampler_greedy_impl(cur_p);
return;
}
for (size_t i = 0; i < cur_p->size; ++i) {
cur_p->data[i].logit /= temp;
}
}
...
static void llama_sampler_dist_apply(struct llama_sampler * smpl, llama_token_data_array * cur_p) {
auto * ctx = (llama_sampler_dist *) smpl->ctx;
if (cur_p->selected != -1) return;
llama_sampler_softmax_impl(cur_p);
cur_p->selected = llama_sample_dist(cur_p, ctx->rng);
} In future this would allow other samplers that might work similar way to choose the final token without unnecessary operations. |
@ggerganov Apologies for the delay, but I've committed proposed changes in a separate fork. What's done:
With this, greedy sampling will avoid unnecessary operations like Like I've mentioned previously, there is a potential in greedy sampling being used outside of strict tasks due to XTC: partial removal of top tokens will give variety in responses even with greedy sampling. |
Hm, I think we are overthinking this. |
Isn't it a bit unexpected for a user? I thought greedy sampling is associated with |
I think it is more confusing to print a sampling chain, and then ignore it when temp is 0. In this way, temp 0 effectively causes all tokens except than the top token to be removed at the point in the chain where the temperature sampler is. |
I just realized that we've lost the "obtain the probabilities of the top tokens" feature that was previously available in greedy sampling. I noticed it while testing another sampler in server UI. It kind of works with Not sure how to fix it for now. |
I think there are two kinds of token probabilities that may be of interest:
Currently, we only report the second option, which IMO is the least useful, especially in cases where the samplers completely remove tokens from the list of candidates, which is what seems to be happening here. I think it would be more useful to return the raw probabilities from the model, before they are modified by the samplers. |
I see now that the problem I've encountered happens only if there's no recalculation of probabilities (only sorting, for example). I think I will try fixing this later and add both options you've described. The reasons to have the second option are both XTC (cuts out top tokens, so we only need to see those left after it) and K-Shift I'm working on right now (shifts top token once per dialog, still testing). In both cases greedy sampling can be useful, but looking in the depth of candidates may give an idea if you need to cut out more top tokens or not (especially for the first token and with fixed seed). |
* llama : deprecate softmax sampler + fix dist sampler ggml-ci * tests : replace macros with functions ggml-ci * sampling : change temperature sampler logic For t <= 0.0f, keep the max logit intact and set the rest to -inf * cont : no need for special "greedy" logic top-k == 1 is the same * tests : init prob correctly * llama : handle temp <= 0.0 in the temp_ext sampler too ggml-ci * cont : avoid extra loop in temperature sampler for sub-zero temp ggml-ci
* llama : deprecate softmax sampler + fix dist sampler ggml-ci * tests : replace macros with functions ggml-ci * sampling : change temperature sampler logic For t <= 0.0f, keep the max logit intact and set the rest to -inf * cont : no need for special "greedy" logic top-k == 1 is the same * tests : init prob correctly * llama : handle temp <= 0.0 in the temp_ext sampler too ggml-ci * cont : avoid extra loop in temperature sampler for sub-zero temp ggml-ci
* llama : deprecate softmax sampler + fix dist sampler ggml-ci * tests : replace macros with functions ggml-ci * sampling : change temperature sampler logic For t <= 0.0f, keep the max logit intact and set the rest to -inf * cont : no need for special "greedy" logic top-k == 1 is the same * tests : init prob correctly * llama : handle temp <= 0.0 in the temp_ext sampler too ggml-ci * cont : avoid extra loop in temperature sampler for sub-zero temp ggml-ci
ref #9896 (comment)
dist
sampler now applies softmaxllama_sampler_init_softmax()
temperature <= 0.0f
(see below for details)Changes for
temperature <= 0.0f
and the "Greedy" modeBefore this change, the sampling logic implemented in
common/sampling.cpp
implemented 2 main branches for the sampling chain:temp > 0.0f
- construct default sampling chain involving most of the available samplers, such as Top-K, Top-P, and Temperature samplertemp <= 0.0f
- construct a "greedy" sampling chain that contains only a single "greedy" samplerllama.cpp/common/sampling.cpp
Lines 174 to 228 in 8901755
This caused a relatively hidden and less-known "discontinuity" of the sampling behavior at
temp = 0.0f
because the chain would switch from one mode to the other and this has been causing some confusion.With the changes in this PR, we now always construct the same default sampling chain, regardless of the temperature value. The temperature sampler is modified to also work for
temp <= 0.0f
by simply finding the token with highest logit and setting all other logits to -INF:llama.cpp/src/llama-sampling.cpp
Lines 66 to 91 in 4a5b587
Here are is how to enable Greedy mode in the examples, before and after this change:
--temp 0.0
--sampling-seq k --top-k 1
--temp -1.0
--top-k 1
Note that with the new logic simply setting
--temp 0.0
will be different compared tomaster
. The result will still be deterministic, but instead of picking the token with the maximum logit (i.e. greedy mode) it will apply the full default sampling chain which as of today is:Therefore, at
temp <= 0.0
the token with highest probability entering the temperature sampler will be picked and in some cases this might not be the token with highest probability at the start of the sampling chain.