sampling : add XTC sampler #9742

Merged (49 commits) on Oct 15, 2024
Conversation

@MaggotHATE (Contributor) commented Oct 4, 2024

This is my implementation of the XTC sampler as described by @p-e-w in oobabooga/text-generation-webui#6335; the main difference is the threshold_max parameter, which seems to help with finetunes. An additional "thank you" goes to @LostRuins, who implemented XTC a bit earlier than me (the probabilities + logits system is a bit confusing at first).

XTC is a novel sampler that turns truncation on its head: Instead of pruning the least likely tokens, under certain circumstances, it removes the most likely tokens from consideration.

Technically, this sampler:

  • rolls a random check against probability xtc_p (if the check fails, nothing is removed)
  • searches for all tokens with probabilities above threshold xtc_t and below upper threshold xtc_t_max
  • penalizes and removes all such tokens besides the least probable of them.

This means that at least one "most probable" token is always left untouched, which should keep things stable. For more details and explanations it is better to read @p-e-w's original PR, as it is very thorough and comprehensive.
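As a rough illustration of this logic, here is a standalone sketch (not the code from this PR; the `candidate` struct, function name, and RNG handling are simplified placeholders, and candidates are assumed to be sorted by probability in descending order):

    #include <random>
    #include <vector>

    struct candidate {
        int   id; // token id
        float p;  // probability after softmax
    };

    // Sketch of the XTC step described above.
    //   xtc_p     - chance that the sampler acts at all (rolled once per step)
    //   xtc_t     - lower probability threshold
    //   xtc_t_max - upper probability threshold (1.0 reproduces the original XTC)
    static void xtc_sketch(std::vector<candidate> & cands, float xtc_p, float xtc_t, float xtc_t_max, std::mt19937 & rng) {
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        if (dist(rng) >= xtc_p) {
            return; // chance check failed: leave the distribution untouched
        }

        // count candidates whose probability falls inside [xtc_t, xtc_t_max]
        size_t in_range = 0;
        for (const auto & c : cands) {
            if (c.p >= xtc_t && c.p <= xtc_t_max) {
                ++in_range;
            }
        }
        if (in_range < 2) {
            return; // nothing to do: at least one token in range must survive
        }

        // remove all in-range candidates except the least probable one;
        // iterating from the front removes the most probable ones first
        size_t removed = 0;
        for (size_t i = 0; i < cands.size() && removed + 1 < in_range; ) {
            if (cands[i].p >= xtc_t && cands[i].p <= xtc_t_max) {
                cands.erase(cands.begin() + i);
                ++removed;
            } else {
                ++i;
            }
        }
    }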

Additionally, I have not encountered the problems with EOS tokens that were described in discussions of the original implementation, so I have not added any checks for them.

Being a unique sampler, XTC is not included in the queue by default (like the Mirostat samplers).

In my experience, the effect varies between models and even between sizes of the same model. XTC improves creativity greatly, but may break models in specific cases (for example, the multi-language capabilities of Nemo can be slightly degraded for inflective languages). With larger models XTC may require a very careful choice of parameters - I have seen more refusals in Mistral Large 123b with default XTC settings, for example (although not with my implementation, but through Kobold.cpp).

In general, however, XTC is a really interesting idea and a sampler worth testing.

Adds the XTC sampler: not activated by default, but with recommended settings as defaults.
To be more in line with the original implementation, chance is calculated once at the beginning.
@pnb (Contributor) left a comment

I'm not very familiar with the code base, but XTC is a very interesting idea to me so I took a read through to see how you implemented it, and left a few comments in case they are helpful. In case not, feel free to ignore. :)

@MaggotHATE marked this pull request as draft October 4, 2024 15:19
@David-AU-github

I have tested this with a number of models of different sizes and types via Text Generation WebUI (oobabooga) on the day of release.
This works very well and is a strong addition to llama.cpp.

@p-e-w commented Oct 5, 2024

What is the theory behind the xtc_t_max parameter? It seems to me that in many cases, that parameter would do the opposite of what XTC is supposed to do: Reduce creativity compared to not using XTC at all.

Here's an example: Imagine the input is ... shiver down his, and the predictions from the model are

spine 0.65
back 0.15
throat 0.12
...

With the original XTC algorithm and xtc_t set to 0.1, spine and back will be eliminated, leaving throat and the rest of the distribution. The behavior is more creative because the most clichéd continuations are gone.

Now let's say you set xtc_t to 0.1 again, and you also set the xtc_t_max parameter from this PR to 0.6. In that case, the algorithm will eliminate only back, leaving spine because it is above the upper threshold. The probability mass of back is distributed to the remaining tokens after renormalization, and the most clichéd continuation (spine), the one we want to avoid above all else, is now even more likely than it would have been if we had done nothing at all.
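(Checking the arithmetic on those example numbers: once back's 0.15 is removed and the remaining mass renormalized,

$$P(\text{spine}) = \frac{0.65}{1 - 0.15} \approx 0.76 > 0.65,$$

so the most clichéd continuation actually gains probability.)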

That doesn't seem right.

@MaggotHATE (Contributor, Author)

> Here's an example: Imagine the input is ... shiver down his

This example and explanation are true for baseline models, but may not hold for finetuned models that were trained against clichéd phrases.

Consider a model which, as a result of finetuning, is now less likely to output the phrase in your example: less likely doesn't mean "erased entirely", and such models may still be able to generate it, but with lower probability. In such a case XTC will remove a more probable (supposedly less clichéd) variant and increase the chance of choosing a previously less probable cliché. In the long run it can steer such a finetuned model back to its baseline, "undoing" the effect of finetuning to some extent.

At the same time XTC may still be useful for such a model if a middle range is carefully set. I would assume that it is possible to cut out a range where penalized clichés may still appear, further improving the model's capabilities. In practice, I have tested only a few such models, but I was getting better (as in, less typical) responses with a lowered upper threshold.

You can look at various benchmarks that estimate creativity and, more relevant in our case, so-called "slop", which is exactly the type of phrasing we want to avoid by using XTC. It's worth testing XTC with such models too.

As a note, while testing the new fixed random seeding, I've encountered refusals on mini-magnum-12b-v1.1.Q8_0 where they never happened even on baseline Nemo Instruct, with parameters M=0.020->xtc=0.20-0.99 (99% chance) (and for statistics I was counting tokens that came through the XTC sampler and were removed: 611/37275 (1.64%)). While this is an extreme example (99% chance), it shows that it is possible to get an undesirable result in some corner cases. The addition of the upper threshold simply makes XTC a bit more flexible without breaking it (the default value is still 1.0).

Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.
@p-e-w commented Oct 6, 2024

> Consider a model which, as a result of finetuning, is now less likely to output the phrase in your example: less likely doesn't mean "erased entirely", and such models may still be able to generate it, but with lower probability. In such a case XTC will remove a more probable (supposedly less clichéd) variant and increase the chance of choosing a previously less probable cliché. In the long run it can steer such a finetuned model back to its baseline, "undoing" the effect of finetuning to some extent.

So the purpose of xtc_t_max is to act as a countermeasure so that the finetune's anti-cliché countermeasures don't get overridden by XTC, which is itself an anti-cliché countermeasure? That's highly convoluted.

I suspect that this parameter is useless in most cases, actively harmful unless extremely carefully calibrated for each model and input, and above all else, confusing and very difficult to interpret for most users. The standard XTC threshold gives you a straightforward knob to tweak, with a clearly defined meaning: The probability that you consider "viable". By contrast, with xtc_t_max, you have to guess to what degree the finetune has already suppressed clichés, and that degree might not even be uniform across all clichés. Essentially, xtc_t_max means nothing at all. It's just a value that you can fiddle with in the hope that you will like the output better.

IMO, xtc_t_max is a philosophical mismatch for XTC, injecting complexity and volatility into an otherwise simple and straightforward sampler. I considered several additional parameters while designing XTC, but in the end decided against including any of them, because simplicity is king, and interpretability is what makes parameters useful in the first place.

@MaggotHATE (Contributor, Author)

> So the purpose of xtc_t_max is to act as a countermeasure so that the finetune's anti-cliché countermeasures don't get overridden by XTC, which is itself an anti-cliché countermeasure?

No, it is a measure of a model's perceived creativity. If a model is already creative enough, clichés will be less probable, and the range of tokens we perceive as undesirable will have a lower ceiling.

Look at it from a broader perspective: every model needs to be steered toward the output a user wants. To achieve that, we can do internal work and external work on that model.

Internal work is training/finetuning: it affects the internal data of the model, and the results of such work are "baked" into the model itself. Although it is not usually run this way, any well-made model can produce coherent output (almost) without samplers.

External work covers sampling algorithms, RAG, etc. They are used to redirect the model towards a more desirable output for every input a user might have. We can also say that prompt engineering falls into this category (as opposed to creating training data pairs, for example, which would be internal work).

Now, let's look at XTC again: the initial idea was to remove the most "boring", repetitive tokens from consideration. This means we consider the internal work done on the model "incomplete" with respect to what we see as the model's "creativity". To make it "complete", we use XTC and get exactly (or almost exactly) what we need.

However, if the internal work done on a model has already taken that incomplete creativity into account, and further training was performed, that model by itself will be closer to what we want to achieve. We see the base output of such a model as "more creative" in comparison to another model (usually the base/instruct version). This also means that undesirable tokens will have lower probabilities during sampling.

With the original XTC, however, we will still penalize top tokens only. There is no mechanism to control its behavior toward the uppermost tokens, because raising the bar affects the lower top tokens, and changing the probabilities affects all top tokens. This means that we don't differentiate between models that are more creative and models that are less creative: the lower threshold cannot reflect that, as it controls the lower top tokens, not the upper ones.

By introducing xtc_t_max we give a user more control over the perceived creativity, making it easier to find per-model settings. This has always been the case with presets, which may be sufficient for one model but not for another. Considering the large number of models nowadays, xtc_t_max is an important addition and a useful piece of flexibility.

> because simplicity is king

I would argue that functionality is king: giving users more options and more control over the tools they use is more important than keeping algorithms simple. KISS is important, but it should not hurt usability. In this case: the original XTC works only with a range of tokens at the top; with xtc_t_max it can work with any range, at the top or in the middle. Users who don't need to fiddle with it will keep it at the default, while those who want to experiment will be able to try middle ranges. The possibilities are doubled.

1. Scan tokens from the top until the first non-penalizable one
2. Remove the last captured token (the least probable above the threshold)
3. Shift all tokens to override the remaining penalizable ones
4. Penalize them and put them at the bottom.
@MaggotHATE marked this pull request as ready for review October 6, 2024 11:27
@MaggotHATE (Contributor, Author)
I have reworked the algorithm, given that std::sort is no longer used. It should be ready now.

@MaggotHATE requested a review from slaren October 6, 2024 11:31
@MaggotHATE (Contributor, Author) commented Oct 7, 2024

After tinkering with the algorithm more I've tested two options:

The more elegant one is to iterate through all tokens from back to front, which makes capturing the last top token easier:

    std::vector<llama_token_data> tokens_left;
    bool last_captured = false;

    for (int i = (cur_p->size - 1); i >= 0; --i) {
        if (cur_p->data[i].p < ctx->threshold || cur_p->data[i].p > ctx->threshold_max) {
            tokens_left.emplace_back(cur_p->data[i]);
        } else if (!last_captured) {
            tokens_left.emplace_back(cur_p->data[i]);
            last_captured = true;
        }
    }

    if (last_captured) {
        int to_remove = cur_p->size - tokens_left.size();

        if (to_remove >= 1 && tokens_left.size() >= ctx->min_keep) {
            memcpy(cur_p->data, tokens_left.data(), tokens_left.size()*sizeof(llama_token_data));

            cur_p->size = tokens_left.size();
            // since tokens are in reverse order, we'll need to sort them later
            cur_p->sorted = false;
        }
    }

The problem is that it requires sorting afterwards, which will be performed by softmax. That is expensive, and we can avoid it by going from the front:

    std::vector<llama_token_data> tokens_left;
    // must start as false: otherwise the last token above the threshold is never kept below
    bool last_captured = false;

    for (size_t i = 0; i < cur_p->size; ++i) {
        if (cur_p->data[i].p < ctx->threshold || cur_p->data[i].p > ctx->threshold_max) {
            // capturing the last top token
            if (!last_captured && i > 0 && cur_p->data[i-1].p >= ctx->threshold) {
                tokens_left.emplace_back(cur_p->data[i-1]);
                last_captured = true;
            }

            tokens_left.emplace_back(cur_p->data[i]);
        }
    }

    // in case all candidates are penalizable
    if (cur_p->data[cur_p->size - 1].p >= ctx->threshold) tokens_left.emplace_back(cur_p->data[cur_p->size - 1]);

    if (tokens_left.size() < ctx->min_keep) return;

    size_t to_remove = cur_p->size - tokens_left.size();

    if (to_remove >= 1) {
        memcpy(cur_p->data, tokens_left.data(), tokens_left.size()*sizeof(llama_token_data));
        cur_p->size = tokens_left.size();
    }

However, the second one feels bulky. Which one is more acceptable? @slaren can I request your help here?

@slaren (Collaborator) commented Oct 7, 2024

If I understand the algorithm correctly, this implementation seems too complicated. Since the tokens are sorted by probability, you should only need to first find the index of the last token to remove, and then move the remaining tokens to the top of the list. The implementation in textgen-webui also seems to completely disable itself if any special tokens would be removed, which I imagine is important.
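A minimal sketch of that suggestion (ignoring the upper threshold and the probability roll, and assuming the same `cur_p`/`ctx` fields used in the snippets above; candidates are sorted by probability in descending order):

    // 1) find the index of the last token at or above the threshold
    size_t pos_last = 0;
    for (size_t i = 0; i < cur_p->size; ++i) {
        if (cur_p->data[i].p >= ctx->threshold) {
            pos_last = i;
        } else {
            break;
        }
    }

    // 2) drop everything before that index, i.e. remove all "top" tokens
    //    except the least probable one, simply by advancing the data pointer
    if (pos_last > 0 && cur_p->size - pos_last >= ctx->min_keep) {
        cur_p->data += pos_last;
        cur_p->size -= pos_last;
    }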

@MaggotHATE (Contributor, Author) commented Oct 7, 2024

> you should only need to first find the index of the last token to remove, and then move the remaining tokens to the top of the list

It will be slightly more complex due to the addition of the max threshold, but yes, I understand. I have also added XTC to test-sampling as a self-check.

> disable itself if any special tokens would be removed

It's not a part of the original idea, and I haven't encountered any problems with special tokens in llama.cpp specifically - that's why I left it out for now. If any problems are found later, I'll definitely add it. Thank you!

@pnb (Contributor) commented Oct 7, 2024

> ... The implementation in textgen-webui also seems to completely disable itself if any special tokens would be removed, which I imagine is important.

Wouldn't doing so bias the responses to be shorter, on average? If EOT is a special token and can be elevated in probability by XTC when other (more likely) tokens are removed, but never decreased in probability by itself being removed, it seems like responses will be shorter.

I've tried this implementation a bit now and it does occasionally ramble a bit at the end of a response, especially when generating code (not a great use for XTC but entertaining at least). That might be caused by removing EOT, so perhaps there's good reason to avoid it. But I would definitely prefer an occasional ramble over shortened responses.

@p-e-w commented Oct 12, 2024

@slaren:

> I don't think it is strictly necessary to clone the exact same behavior as the original implementation, the idea can be improved, but in that case out of respect for the original author and to avoid confusion, it may be better to use a different name for the sampler. Unless that's ok with @p-e-w.

As I have stated above, the plain implementation from this PR (without special treatment for \n and EOS) is the original design. The special tokens hack was added by the maintainer against my objections because some users were complaining loudly about the behavior with certain large finetunes. As far as I'm concerned, that hack doesn't belong in XTC, and the other implementations (Kobold and ExLlamaV2) thankfully don't have it.

What is not part of the original XTC design is the xtc_t_max parameter introduced by this PR. No other implementation has that parameter, and as I have explained in detail above, I consider adding that parameter a bad idea because it complicates the user interface with an extremely capricious knob that can actually do the opposite of what XTC is designed for.

@MaggotHATE (Contributor, Author)

> What is not part of the original XTC design is the xtc_t_max parameter introduced by this PR. No other implementation has that parameter, and as I have explained in detail above, I consider adding that parameter a bad idea because it complicates the user interface with an extremely capricious knob that can actually do the opposite of what XTC is designed for.

If a frontend doesn't implement xtc_t_max in its UI, nothing will break, though. XTC will work exactly as the original if the upper threshold is left untouched.

@p-e-w commented Oct 12, 2024

If frontends don't expose xtc_t_max then the parameter is useless. If frontends do expose it, then users will play with it, the output will become less creative rather than more creative, and people will conclude "XTC doesn't work".

I maintain that xtc_t_max doesn't belong in XTC. If you believe otherwise, there is a very simple way to resolve this: Submit that parameter in a separate PR, so that it may be evaluated on its own merit.

But please don't use the popularity that XTC has gained to try to shoehorn your own ideas into the sampler by coupling the two, and then submitting them under the title "add XTC sampler". There is nothing wrong in principle with adding parameters to a sampler, including XTC, but that is not the way to do it.

@MaggotHATE (Contributor, Author) commented Oct 12, 2024

@p-e-w

> to try to shoehorn your own ideas into the sampler

That was never the intention; I simply wanted it to be more flexible after experiencing more clichéd responses and refusals on finetuned models, that is all. But I understand what you mean.

> There is nothing wrong in principle with adding parameters to a sampler, including XTC

On that note, as I was testing this XTC implementation in the server UI, I looked at candidates and saw no consistency in the result any range has on a finetuned model (still using mini-magnum for testing). At times XTC pulls out clichés on default settings, but different clichés appear with different ranges. "Dimly lit" is probably the most elusive one from my current tests. However, tokens that usually lead to clichéd phrases have lower probabilities most of the time and, more importantly, they are more mixed with the tokens that would give less clichéd results.

As such, while I'm still sure that there should be a mechanism for more control, I am no longer sure that the top range is the best one. Alternatives I tried in the past were:

1. Penalizing each token above threshold regardless of their number (i.e. even if there's only one of them)
2. Calculating probability for each token

UPD: After testing more, I think I've found what to work with: in mini-magnum at least, penalizing tokens at around 50 to 60% results in more clichéd phrases, and the whole style becomes more of a typical "purple prose". In the original XTC this is partly offset by chance and by penalizing tokens with higher probabilities. At the same time, penalizing all tokens above ~70% gives less clichéd writing overall. This can be done by penalizing exactly one top token if the threshold is above 0.5; I will add it as a separate PR once this gets merged.

@p-e-w commented Oct 14, 2024

> This can be done by penalizing exactly one top token if the threshold is above 0.5; I will add it as a separate PR once this gets merged.

That's an interesting proposal, but it sounds like a completely new sampler, rather than an addition to XTC. I recommend submitting that mechanism as a separate sampler under a new name. It's pretty much the inverse of XTC in the way it removes tokens.

@MaggotHATE (Contributor, Author) commented Oct 14, 2024

> I recommend submitting that mechanism as a separate sampler under a new name.

Yes, I tested more in the meantime and see potential in two ideas:

  1. A simple range-based removal would allow penalizing from 1.0 down to 0.8, for example, doing exactly that.
  2. A "top tokens disposition" would check whether the difference between the first two tokens is higher than X (in this case, for example, 0.65) and, if true, remove the top token. This would allow removing only the most "dominating" tokens while keeping those with more variety (for example, candidates at 40%, 40%, 10%, etc.) - or vice versa if, again, a range is defined (a rough sketch of this check follows below).
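To make the second idea concrete, a minimal sketch (illustrative only; `X` is a placeholder for the difference threshold mentioned above, and `cur_p` is assumed to be sorted by probability in descending order as elsewhere in this PR):

    // idea 2: remove the single top token only when it "dominates" the distribution,
    // i.e. when the gap between the first two candidates exceeds X
    if (cur_p->size >= 2 && cur_p->data[0].p - cur_p->data[1].p > X) {
        cur_p->data += 1;
        cur_p->size -= 1;
    }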


By removing top tokens XTC can improve the variety of answers, break writing clichés and inhibit repetition, since clichés and repeated phrases are usually more likely to appear. By keeping the last token above the threshold, XTC ensures that the answer is still coherent. XTC is meant to be used for creative tasks, but feel free to experiment with different settings for different models.

Being experimental and unique, XTC is disabled by default. The recommended combination of samplers is Min-P followed by XTC on its default settings: `--sampling-seq mx --min-p 0.02 -xtc-p 0.5`.
Review comment:
Is it still necessary to pass an explicit sampler chain via --sampling-seq in order to activate XTC? I thought that it is now in the sampler chain by default, and disabled by having xtc_probability set to 0.

@MaggotHATE (Contributor, Author):

@p-e-w While XTC is included in the sampler queue by default, it is placed after all other truncating samplers. As such, the recommended combination of samplers, as per your words in oobabooga/text-generation-webui#6335, requires passing the sampler chain explicitly.

Review comment:

I don't get it. Seems like the order is correct by default (XTC after truncation, which Min-P is). And all samplers are set to neutral (off) by default, right? So what does --sampling-seq mx do that wouldn't happen otherwise?

@MaggotHATE (Contributor, Author):

@p-e-w Not all samplers - Top K is 40 by default and Top P is 0.95. So, either they need to be set to 0 and 1.0 respectively, or (which is easier and more logical) the sampling queue should be limited to only the samplers we need.

Review comment:

😮 I had no idea! So llama.cpp users are getting crap samplers from the Stone Age without even realizing it. That's terrible. I would have expected a clean slate that samples from the raw model distribution unless parameters are set explicitly.

Anyway, this means your example command is correct, of course.

@MaggotHATE (Contributor, Author):

If I remember correctly, this topic has already been brought up some time ago in other PRs. However, since in most cases llama.cpp is used as a library through another app or as a server, this issue is mostly related to llama-cli users. You can look at index-new.html (new server UI): it has different "default" values with top_k and top_p turned off, and I assume any other frontend will have a payload with all parameters set as needed.

But yes, this is an issue.

@strawberrymelonpanda (Contributor) commented Oct 15, 2024

Llama-cli user here via scripts... thanks for bringing this up at any rate; it improved my (custom) benchmark scores just by adjusting settings to disable Top K and Top P.

I've seen the default params adjusted a few times for llama-cli; feels like this would be a good change.

src/llama-sampling.cpp:
delete (llama_sampler_xtc *) smpl->ctx;
}

static void llama_sampler_xtc_reset(struct llama_sampler * smpl) {
Review comment:

What is the purpose of this function?

@MaggotHATE (Contributor, Author):

AFAIK this is necessary to properly reset the seed and maintain repeatability, as recommended by @slaren earlier.

src/llama-sampling.cpp:
    int pos_last = 0;

    for (size_t i = 0; i < cur_p->size; ++i) {
        if (cur_p->data[i].p - ctx->threshold >= -1e-5) {
Review comment:

Why this epsilon instead of a regular comparison?

@MaggotHATE (Contributor, Author):

I added it after running tests - they were failing due to a precision problem.

Review comment:

Ugh, no. Never change correct code to satisfy tests. If the code is semantically correct, and the tests don't pass, the tests need to be adapted to account for things such as floating point shenanigans. But changing the code itself is always wrong, unless of course the code has a bug.

The correct way to express this condition is

if (cur_p->data[i].p >= ctx->threshold) {

Nothing else will do. And the tests need to work with that, or the tests are wrong.

@MaggotHATE (Contributor, Author):

I'm not sure it should be considered a problem with the tests (the precision problem is a wider topic), but alright.

Review comment:

My basic point is that the tests serve the code, not the other way round. Code is only ever changed in response to failing tests if the code is found to have a bug.

I had a similar problem with tests for an experimental sampler I wrote for llama.cpp a while ago, and I was able to work around it by using a different set of token probabilities in the tests that I constructed specifically so that after probability renormalization, the resulting values were exactly representable in floating point.
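For example (a hypothetical illustration of the idea, ignoring any rounding introduced by the softmax itself): if the test distribution consists of negative powers of two and the sampler removes the top token,

$$\{0.5,\ 0.25,\ 0.125,\ 0.125\} \;\longrightarrow\; \{0.5,\ 0.25,\ 0.25\}\ \text{after renormalization},$$

so every expected value is an exact binary fraction and strict comparisons can hold.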

@p-e-w left a comment

I did another full review of the code and while some parts are quite subtle, I'm now pretty confident that the implementation is correct (though I did not run it myself yet).

@slaren: This is ready to go as far as I'm concerned.

src/llama-sampling.cpp:
    }

    if (cur_p->size - pos_last >= ctx->min_keep && pos_last > 0) {
        cur_p->data += pos_last;
Review comment (Collaborator):

This may potentially break 3rd party code that expects this pointer to be unchanged (eg. to free it after sampling). I don't think this is necessarily a problem, but we should make it clear that this pointer may be changed by the samplers, and applications should not rely on it being unchanged.
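A sketch of the caller-side pattern this implies (an illustration, assuming the current `llama_token_data_array` layout and `llama_sampler_apply()` from llama.h; `build_candidates`, `logits`, `n_vocab`, and `chain` are hypothetical placeholders):

    // the application owns the candidate buffer
    std::vector<llama_token_data> candidates = build_candidates(logits, n_vocab); // hypothetical helper

    llama_token_data_array cur_p = {
        /* .data     = */ candidates.data(),
        /* .size     = */ candidates.size(),
        /* .selected = */ -1,
        /* .sorted   = */ false,
    };

    llama_sampler_apply(chain, &cur_p); // samplers such as XTC may advance cur_p.data

    // read results through cur_p (data/size may have changed), but keep ownership
    // tied to `candidates`; never free or reallocate through cur_p.data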

Labels: examples, server, testing

9 participants