Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

llama : default sampling changes + greedy update #9897

Merged
merged 7 commits into from
Oct 21, 2024

Conversation

ggerganov
Copy link
Owner

@ggerganov ggerganov commented Oct 15, 2024

ref #9896 (comment)

  • The dist sampler now applies softmax
  • Deprecate llama_sampler_init_softmax()
  • Change examples behavior for temperature <= 0.0f (see below for details)
  • Refactor tests (minor)

Changes for temperature <= 0.0f and the "Greedy" mode

Before this change, the sampling logic implemented in common/sampling.cpp implemented 2 main branches for the sampling chain:

  • For temp > 0.0f - construct default sampling chain involving most of the available samplers, such as Top-K, Top-P, and Temperature sampler
  • For temp <= 0.0f - construct a "greedy" sampling chain that contains only a single "greedy" sampler

if (params.temp > 0.0f) {
if (params.mirostat == 0) {
for (const auto & cnstr : params.samplers) {
switch (cnstr) {
case COMMON_SAMPLER_TYPE_TOP_K:
llama_sampler_chain_add(result->chain, llama_sampler_init_top_k (params.top_k));
break;
case COMMON_SAMPLER_TYPE_TOP_P:
llama_sampler_chain_add(result->chain, llama_sampler_init_top_p (params.top_p, params.min_keep));
break;
case COMMON_SAMPLER_TYPE_MIN_P:
llama_sampler_chain_add(result->chain, llama_sampler_init_min_p (params.min_p, params.min_keep));
break;
case COMMON_SAMPLER_TYPE_XTC:
llama_sampler_chain_add(result->chain, llama_sampler_init_xtc (params.xtc_probability, params.xtc_threshold, params.min_keep, params.seed));
break;
case COMMON_SAMPLER_TYPE_TFS_Z:
llama_sampler_chain_add(result->chain, llama_sampler_init_tail_free(params.tfs_z, params.min_keep));
break;
case COMMON_SAMPLER_TYPE_TYPICAL_P:
llama_sampler_chain_add(result->chain, llama_sampler_init_typical (params.typ_p, params.min_keep));
break;
case COMMON_SAMPLER_TYPE_TEMPERATURE:
llama_sampler_chain_add(result->chain, llama_sampler_init_temp_ext (params.temp, params.dynatemp_range, params.dynatemp_exponent));
break;
case COMMON_SAMPLER_TYPE_INFILL:
llama_sampler_chain_add(result->chain, llama_sampler_init_infill (model));
break;
default:
GGML_ASSERT(false && "unknown sampler type");
}
}
llama_sampler_chain_add(result->chain, llama_sampler_init_softmax());
llama_sampler_chain_add(result->chain, llama_sampler_init_dist(params.seed));
} else if (params.mirostat == 1) {
llama_sampler_chain_add(result->chain, llama_sampler_init_temp(params.temp));
llama_sampler_chain_add(result->chain, llama_sampler_init_mirostat(llama_n_vocab(model), params.seed, params.mirostat_tau, params.mirostat_eta, 100));
} else if (params.mirostat == 2) {
llama_sampler_chain_add(result->chain, llama_sampler_init_temp(params.temp));
llama_sampler_chain_add(result->chain, llama_sampler_init_mirostat_v2(params.seed, params.mirostat_tau, params.mirostat_eta));
} else {
GGML_ASSERT(false && "unknown mirostat version");
}
} else {
if (params.n_probs > 0) {
// some use cases require to sample greedily, but still obtain the probabilities of the top tokens
// ref: https://github.com/ggerganov/llama.cpp/pull/9605
//
// the following will not produce exactly the same probs as applyging softmax to the full vocabulary, but
// it is much faster, since we avoid sorting all tokens and should give a good approximation
llama_sampler_chain_add(result->chain, llama_sampler_init_top_k(params.n_probs));
llama_sampler_chain_add(result->chain, llama_sampler_init_softmax());
}
llama_sampler_chain_add(result->chain, llama_sampler_init_greedy());
}

This caused a relatively hidden and less-known "discontinuity" of the sampling behavior at temp = 0.0f because the chain would switch from one mode to the other and this has been causing some confusion.

With the changes in this PR, we now always construct the same default sampling chain, regardless of the temperature value. The temperature sampler is modified to also work for temp <= 0.0f by simply finding the token with highest logit and setting all other logits to -INF:

static void llama_sampler_temp_impl(llama_token_data_array * cur_p, float temp) {
if (temp <= 0.0f) {
// find the token with the highest logit and set the rest to -inf
llama_token max_id = cur_p->data[0].id;
float max_logit = cur_p->data[0].logit;
for (size_t i = 1; i < cur_p->size; ++i) {
if (cur_p->data[i].logit > max_logit) {
max_id = cur_p->data[i].id;
max_logit = cur_p->data[i].logit;
}
}
for (size_t i = 0; i < cur_p->size; ++i) {
if (cur_p->data[i].id != max_id) {
cur_p->data[i].logit = -INFINITY;
}
}
return;
}
for (size_t i = 0; i < cur_p->size; ++i) {
cur_p->data[i].logit /= temp;
}
}

Here are is how to enable Greedy mode in the examples, before and after this change:

before after
--temp 0.0 --sampling-seq k --top-k 1
--temp -1.0 --top-k 1

Note that with the new logic simply setting --temp 0.0 will be different compared to master. The result will still be deterministic, but instead of picking the token with the maximum logit (i.e. greedy mode) it will apply the full default sampling chain which as of today is:

sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 

Therefore, at temp <= 0.0 the token with highest probability entering the temperature sampler will be picked and in some cases this might not be the token with highest probability at the start of the sampling chain.

@github-actions github-actions bot added testing Everything test related examples labels Oct 15, 2024
@ggerganov
Copy link
Owner Author

ggerganov commented Oct 17, 2024

I've proposed the following change of the temperature parameter and sampler: 78118c4. This should avoid hijacking the sampler chain at temp == 0.0f and greedy sampling will be triggered at temp < 0.0f. @slaren Let me know if you think this makes sense.

In the meantime will continue to refactor test-sampling.cpp a bit to avoid macro usage.

@slaren
Copy link
Collaborator

slaren commented Oct 17, 2024

I think this behavior for temperature 0 is reasonable, and should be fairly intuitive. The behavior for negative values will still be hard (or impossible) to predict without reading the documentation, and while it may serve as a convenient way to enable greedy sampling, I am not convinced that it is worth keeping. There is a simple and clear way to enable greedy sampling by using top-k 1 that does not require having any special values.

Regardless, at least for the OAI endpoints we should try to replicate the behavior of the OAI API. Unfortunately their API documentation is not very clear about what happens with temperature 0, although it lists it as a valid value:

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

@ggerganov ggerganov force-pushed the gg/llama-remove-softmax branch from 60d8af4 to e5150b1 Compare October 17, 2024 15:16
@ggerganov
Copy link
Owner Author

ggerganov commented Oct 17, 2024

Yes, seems like there isn't much benefit for this extra "greedy" logic at temp < 0.0f. I removed it in e5150b1. We still accept negative temperatures from the CLI and they should behave the same as temp == 0.0f

@ggerganov ggerganov force-pushed the gg/llama-remove-softmax branch from baf3d37 to cd97850 Compare October 17, 2024 15:25
@ggerganov ggerganov marked this pull request as ready for review October 17, 2024 15:25
@MaggotHATE
Copy link
Contributor

MaggotHATE commented Oct 17, 2024

Do I understand it correctly that with these changes "greedy sampling" would require either temperature or top-k in the queue? Is it intended? The name of this PR doesn't reflect that, and it seems to be a significant change.

UPD: wouldn't other samplers, included by default, affect the results of new "greedy" sampling? Seems like a user would have to deliberately switch all other samplers off in order to get the same type of deterministic results (so, either temp only, or use top-k instead...but then all other samplers would do nothing).

@ggerganov
Copy link
Owner Author

After these changes you would need to explicitly set a greedy chain that picks the token with the highest logit. For example with:

  • --sampling-seq k --top-k 1
  • --sampling-seq t --temp 0.0

If you use the default sampling chain, then simply --top-k 1 would do the job since Top-K is at the start of the chain.

Note that if you set temp <= 0.0f and keep the default samplers, you are still getting a deterministic result. It might not be the token with the highest logit, but it is still deterministic.

I think the new way makes more sense than what we have on master.

@MaggotHATE
Copy link
Contributor

I think the new way makes more sense than what we have on master.

I agree, but maybe it is worth renaming this PR to mention that greedy sampling logics was changed too (at least to avoid questions about it later).

@ggerganov
Copy link
Owner Author

Yes, will update it. I also spotted some additional problems that will fix later today.

@ggerganov ggerganov changed the title llama : deprecate softmax sampler + fix dist sampler llama : default sampling changes Oct 18, 2024
@ggerganov
Copy link
Owner Author

@MaggotHATE I've updated the description. PTAL and any testing and feedback is appreciated.

@MaggotHATE
Copy link
Contributor

@ggerganov Do you want to keep _temp along with _temp_ext? I see that _temp is used with Mirostat/v2 only - would it be detrimental to have dynatemp there?

@ggerganov
Copy link
Owner Author

_temp is indeed a subset of _temp_ext. I guess the former can be removed in order to simplify things. We can do this in a separate PR to avoid increasing the scope here.

@MaggotHATE
Copy link
Contributor

Tested with Mistral-Nemo-Instruct-2407.q5_k_l, seed 404.

temp = 0

[INST]What's a paladin?[/INST]A paladin is a type of character class or character in fantasy literature, video games, and other media. Paladins are typically:

1. **Lawful Good**: They follow a strict moral code and are dedicated to protecting the innocent and upholding justice.

2. **Holy Warriors**: They are often associated with a deity or a divine cause, and their powers come from divine sources. They are usually skilled in combat and have healing abilities.

3. **Knights**: They often have a knightly or chivalrous background, with a strong sense of honor and duty.

4. **Protector**: They are often protectors of the weak, and may have a special bond with a specific group or individual.

Here are a few examples of paladins from popular media:

- **Dungeons & Dragons**: The Paladin is a class known for their divine magic and combat abilities.
- **World of Warcraft**: The Paladin is a class that combines melee combat and healing abilities.
- **The Legend of Zelda**: Princess Zelda is often depicted as a paladin, with her royal duties and divine powers.
- **The Elder Scrolls**: The Paladin is a rank in the Order of the Silver Dawn, a knightly order dedicated to the defense of Cyrodiil.[INST]

top_k = 1

[INST]What's a paladin?[/INST]A paladin is a type of character class or character in fantasy literature, video games, and other media. Paladins are typically:

1. **Lawful Good**: They follow a strict moral code and are dedicated to protecting the innocent and upholding justice.

2. **Holy Warriors**: They are often associated with a deity or a divine cause, and their powers come from divine sources. They are usually skilled in combat and have healing abilities.

3. **Knights**: They often have a knightly or chivalrous background, with a strong sense of honor and duty.

4. **Protector**: They are often protectors of the weak, and may have a special bond with a specific group or individual.

Here are a few examples of paladins from popular media:

- **Dungeons & Dragons**: The Paladin is a class known for their divine magic and combat abilities.
- **World of Warcraft**: The Paladin is a class that combines melee combat and healing abilities.
- **The Legend of Zelda**: Princess Zelda is often depicted as a paladin, with her royal duties and divine powers.
- **The Elder Scrolls**: The Paladin is a rank in the Order of the Silver Dawn, a knightly order dedicated to the defense of Cyrodiil.[INST]

Looks good!

@ggerganov ggerganov changed the title llama : default sampling changes llama : default sampling changes + greedy update Oct 18, 2024
@MaggotHATE
Copy link
Contributor

MaggotHATE commented Oct 18, 2024

Setting all tokens except the first one to -INFINITY instead of resizing is done because temperature, as a sampling technique, should not resize? Seems expensive if used as a greedy sampler (i.e. only temperature in queue), Nemo gives more that a hundred thousand tokens each time, for example.

UPD. Tokens are not sorted, that's the problem.

Am I missing anything? Results are reproduced with the same seed, so it should work.

static void llama_sampler_temp_impl(llama_token_data_array * cur_p, float temp) {
    if (temp <= 0.0f) {
        // find the position of a token with the highest logit and cut candidates to have only that token
        int   max_pos = 0;
        float max_logit = cur_p->data[0].logit;

        for (size_t i = 1; i < cur_p->size; ++i) {
            if (cur_p->data[i].logit > max_logit) {
                max_pos = i;
                max_logit = cur_p->data[i].logit;
            }
        }

        cur_p->data += max_pos;
        cur_p->size = 1;

        return;
    }

    for (size_t i = 0; i < cur_p->size; ++i) {
        cur_p->data[i].logit /= temp;
    }
}

@ggerganov
Copy link
Owner Author

Setting all tokens except the first one to -INFINITY instead of resizing is done because temperature, as a sampling technique, should not resize?

Yes, I didn't want to change this behavior because normally the temperature sampler does not resize the candidates. Is the overhead from the extra loop measurable in your tests?

@MaggotHATE
Copy link
Contributor

MaggotHATE commented Oct 18, 2024

At least we can remove the second loop:

static void llama_sampler_temp_impl(llama_token_data_array * cur_p, float temp) {
    if (temp <= 0.0f) {
        // find the token with the highest logit and set the rest to -inf
        int max_pos = 0;
        float max_logit = cur_p->data[0].logit;

        for (size_t i = 1; i < cur_p->size; ++i) {
            if (cur_p->data[i].logit > max_logit) {
                cur_p->data[max_pos].logit = -INFINITY;
                max_pos   = i;
                max_logit = cur_p->data[i].logit;
            } else cur_p->data[i].logit = -INFINITY;
        }

        return;
    }

    for (size_t i = 0; i < cur_p->size; ++i) {
        cur_p->data[i].logit /= temp;
    }
}

However, it will still go through llama_sampler_dist, which now has llama_sampler_softmax_impl in it. This means that we will sort and recalculate all the candidates if temperature is used for greedy sampling. Even if we use top_k for that, it will still sort candidates, then llama_sampler_softmax_impl will recalculate all probabilities anyway (which we probably don't want in greedy sampling).

The original greedy sampling has 1 loop over all candidates and that's all. It is sorted only if n_probs > 0 and top_k is activated.

I will test for performance more later (with less background load for cleaner results), but so far it looks like there will be some performance degradation.

UPD: not clean, but more or less the same background load - ~2.8 t/s original greedy inference vs ~2.6 t/s new one. Mistral-Nemo-Instruct-2407.q5_k_l, CPU only, OpenMP and SGEMM.

UPD2: Maybe initialize selected in llama_token_data_array to -1 by default as a way to check against it and skip llama_sampler_softmax_impl and llama_sampler_dist_apply?

@MaggotHATE
Copy link
Contributor

Tested with less background load.
Windows 10, i7 8700, "n_threads": 5, "n_threads_batch": 10.
Mistral-Nemo-Instruct-2407.q5_k_l, CPU only, OpenMP and SGEMM.

Greedy sampler: 3.15 t/s
Temperature sampler: 3.14t/s

The difference is small, but noticeable from the start of generation. However, the ability to combine greedy sampling with other samplers might be useful - I'm getting quite interesting results with XTC (recommended settings) -> greedy.

@MaggotHATE
Copy link
Contributor

MaggotHATE commented Oct 18, 2024

The more I think about it, the less logical it seems: previously we had a clear contrast between greedy sampling and distance sampling. With this change we force greedy sampling to work through distance sampling, which brings unnecessary operations and seems wrong.

UPD2: I've missed that selected is already tracked and can be used. I think it would be better to refactor greedy into llama_sampler_greedy_impl and use it in temperature sampler when temp <= 0. Then we can track cur_p->selected != -1 in all other samplers from the main queue and in dist, skipping any other changes since we've already chosen a token greedily.

This will still require users to have temperature in the queue, but will not force greedy sampling through softmax and distance:

static void llama_sampler_greedy_impl(llama_token_data_array * cur_p) {
    cur_p->selected = 0;
    for (size_t i = 1; i < cur_p->size; ++i) {
        if (cur_p->data[i].logit > cur_p->data[cur_p->selected].logit) {
            cur_p->selected = i;
        }
    }
}

static void llama_sampler_temp_impl(llama_token_data_array * cur_p, float temp) {
    if (temp <= 0.0f) {
        llama_sampler_greedy_impl(cur_p);
        return;
    }

    for (size_t i = 0; i < cur_p->size; ++i) {
        cur_p->data[i].logit /= temp;
    }
}

...

static void llama_sampler_dist_apply(struct llama_sampler * smpl, llama_token_data_array * cur_p) {
    auto * ctx = (llama_sampler_dist *) smpl->ctx;

    if (cur_p->selected != -1) return;

    llama_sampler_softmax_impl(cur_p);

    cur_p->selected = llama_sample_dist(cur_p, ctx->rng);
}

In future this would allow other samplers that might work similar way to choose the final token without unnecessary operations.

@MaggotHATE
Copy link
Contributor

MaggotHATE commented Oct 20, 2024

@ggerganov Apologies for the delay, but I've committed proposed changes in a separate fork. What's done:

  1. Since there is no longer a distinction between greedy and distance sampling in common_sampler_init, I renamed llama_sampler_init_dist to llama_sampler_init_post, as in "post-sampling". This is up to discussion because I'm not sure it won't cause confusion in other, simpler examples. At the same time, it may house more post-sampling techniques in future.
  2. Added llama_sampler_greedy_impl to be used in llama_sampler_temp_impl and the separate greedy sampler (again, not sure if it's worth keeping, but gritlm example still uses llama_sampler_init_greedy).
  3. Added checks whether cur_p->selected was set or not into each sampler that may appear in sampler queue. This should ensure that the order of samplers will never change after a greedy choice.
  4. Added ctx->temp > 0 check to llama_sampler_temp_ext: dynatemp isn't needed in greedy sampling.
  5. And in the end, added the required logics into llama_sampler_post_apply: either skip everything due to prior greedy choice, or use llama_sample_dist.

With this, greedy sampling will avoid unnecessary operations like llama_sampler_softmax_impl, or all other samplers after it. At the same time it will still require temp to be in the sampling queue.

Like I've mentioned previously, there is a potential in greedy sampling being used outside of strict tasks due to XTC: partial removal of top tokens will give variety in responses even with greedy sampling.

@ggerganov
Copy link
Owner Author

Hm, I think we are overthinking this. --sampling-seq k --top-k 1 should be pretty much indistinguishable from the old greedy in terms of performance, so this should become the recommended way to do it. For now, I don't plan to do more of the discussed changes in this PR and will likely merge tomorrow if no objections.

@MaggotHATE
Copy link
Contributor

MaggotHATE commented Oct 20, 2024

--sampling-seq k --top-k 1

Isn't it a bit unexpected for a user? I thought greedy sampling is associated with temp 0. If it's not important, then yes, top_k is the easiest way.

@slaren
Copy link
Collaborator

slaren commented Oct 20, 2024

I think it is more confusing to print a sampling chain, and then ignore it when temp is 0. In this way, temp 0 effectively causes all tokens except than the top token to be removed at the point in the chain where the temperature sampler is.

@ggerganov ggerganov merged commit 55e4778 into master Oct 21, 2024
60 checks passed
@ggerganov ggerganov deleted the gg/llama-remove-softmax branch October 21, 2024 06:46
@MaggotHATE
Copy link
Contributor

MaggotHATE commented Oct 25, 2024

I just realized that we've lost the "obtain the probabilities of the top tokens" feature that was previously available in greedy sampling. I noticed it while testing another sampler in server UI.

It kind of works with temp-based greedy solution (which is slower), not the top_k one, but it seems to be pretty much useless with the temperature since logits are changed to -infinity.

Not sure how to fix it for now.

@slaren
Copy link
Collaborator

slaren commented Oct 25, 2024

I think there are two kinds of token probabilities that may be of interest:

  • As generated from the model, without modifications, to understand what tokens the model wants to generate and how it differs from the token that was sampled
  • Before the last sampling step, to understand the effects of the samplers

Currently, we only report the second option, which IMO is the least useful, especially in cases where the samplers completely remove tokens from the list of candidates, which is what seems to be happening here. I think it would be more useful to return the raw probabilities from the model, before they are modified by the samplers.

@MaggotHATE
Copy link
Contributor

MaggotHATE commented Oct 25, 2024

I think it would be more useful to return the raw probabilities from the model, before they are modified by the samplers.

I see now that the problem I've encountered happens only if there's no recalculation of probabilities (only sorting, for example). I think I will try fixing this later and add both options you've described.

The reasons to have the second option are both XTC (cuts out top tokens, so we only need to see those left after it) and K-Shift I'm working on right now (shifts top token once per dialog, still testing). In both cases greedy sampling can be useful, but looking in the depth of candidates may give an idea if you need to cut out more top tokens or not (especially for the first token and with fixed seed).

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* llama : deprecate softmax sampler + fix dist sampler

ggml-ci

* tests : replace macros with functions

ggml-ci

* sampling : change temperature sampler logic

For t <= 0.0f, keep the max logit intact and set the rest to -inf

* cont : no need for special "greedy" logic

top-k == 1 is the same

* tests : init prob correctly

* llama : handle temp <= 0.0 in the temp_ext sampler too

ggml-ci

* cont : avoid extra loop in temperature sampler for sub-zero temp

ggml-ci
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* llama : deprecate softmax sampler + fix dist sampler

ggml-ci

* tests : replace macros with functions

ggml-ci

* sampling : change temperature sampler logic

For t <= 0.0f, keep the max logit intact and set the rest to -inf

* cont : no need for special "greedy" logic

top-k == 1 is the same

* tests : init prob correctly

* llama : handle temp <= 0.0 in the temp_ext sampler too

ggml-ci

* cont : avoid extra loop in temperature sampler for sub-zero temp

ggml-ci
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* llama : deprecate softmax sampler + fix dist sampler

ggml-ci

* tests : replace macros with functions

ggml-ci

* sampling : change temperature sampler logic

For t <= 0.0f, keep the max logit intact and set the rest to -inf

* cont : no need for special "greedy" logic

top-k == 1 is the same

* tests : init prob correctly

* llama : handle temp <= 0.0 in the temp_ext sampler too

ggml-ci

* cont : avoid extra loop in temperature sampler for sub-zero temp

ggml-ci
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
examples testing Everything test related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants