server: fix the disappearance of the end of the text when streaming with stop strings #9867

Merged
3 commits merged into ggerganov:master on Oct 16, 2024

Conversation

z80maniac
Contributor


The problem: the end of the generated text may not be pushed to the API client when streaming with stop strings enabled.

For example, let's test the following prompt:

Alice: Ask me any question.
Bob: What color is the sky on Mars

The stop string will be \n (stopping at the end of a dialog line). The next predicted character should be the question mark (it actually depends on the model and the probabilities, but in this case, let's assume that the seed is chosen so that the next character is ?).

The bug depends heavily on the model's tokenizer. In my case it can be reproduced with the Llama-3.2-3B-Instruct-Q8_0.gguf model.

First, let's try it without streaming:

curl -Ss --data '{"prompt": "Alice: Ask me any question.\nBob: What color is the sky on Mars", "n_predict": 8, "cache_prompt": true, "stop": ["\n"], "seed": 42}' http://127.0.0.1:8080/completion | jq .content

It gives the correct result:

"?"

Now let's try it with streaming:

curl -Ss --data '{"prompt": "Alice: Ask me any question.\nBob: What color is the sky on Mars", "n_predict": 8, "cache_prompt": true, "stop": ["\n"], "seed": 42, "stream": true}' http://127.0.0.1:8080/completion
data: {"content":"","stop":false,"id_slot":0,"multimodal":false,"index":0}

data: {"content":"","id_slot":0,"stop":true,"model":"/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf","tokens_predicted":1,"tokens_evaluated":17,"generation_settings":{"n_ctx":8192,"n_predict":-1,"model":"/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf","seed":42,"seed_cur":42,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":["\n"],"max_tokens":8,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typ_p","top_p","min_p","temperature"]},"prompt":"Alice: Ask me any question.\nBob: What color is the sky on Mars","truncated":false,"stopped_eos":false,"stopped_word":true,"stopped_limit":false,"stopping_word":"\n","tokens_cached":17,"timings":{"prompt_n":1,"prompt_ms":30.498,"prompt_per_token_ms":30.498,"prompt_per_second":32.7890353465801,"predicted_n":1,"predicted_ms":0.028,"predicted_per_token_ms":0.028,"predicted_per_second":35714.28571428571},"index":0}

The question mark disappeared, and it does not appear in stopping_word either, so it is completely lost and the API client has no way to restore it.

This happens because the next token returned by the model contains both the question mark and the stop string: ?\n\n. The current code skips sending the token text entirely if it contains the stop string.
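
To make the failure mode concrete, here is a minimal, self-contained C++ sketch (the variable names are illustrative, not the server's actual identifiers) of what happens when the freshly decoded token text is ?\n\n and the stop string is \n:

#include <cassert>
#include <string>

int main() {
    // text accumulated for the current token: "?" followed by the stop string
    std::string pending = "?\n\n";
    const std::string stop = "\n";

    // a full stop-string match is found at position 1 ...
    const size_t stop_pos = pending.find(stop);
    assert(stop_pos == 1);

    // ... so the stop string and everything after it are truncated
    pending.erase(stop_pos);
    assert(pending == "?");

    // before this PR: because a full stop match was found in this token,
    // no chunk is streamed for it at all, so the remaining "?" is lost
    return 0;
}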

The change in this PR sends the remainder of the token to the API client in this case. When is_stop_full=true it is safe to send the response, because the stop string and everything after it have already been truncated from generated_text at that point.

The response with this PR applied:

data: {"content":"?","stop":false,"id_slot":0,"multimodal":false,"index":0}

data: {"content":"","id_slot":0,"stop":true,"model":"/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf","tokens_predicted":1,"tokens_evaluated":17,"generation_settings":{"n_ctx":8192,"n_predict":-1,"model":"/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf","seed":42,"seed_cur":42,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":["\n"],"max_tokens":8,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typ_p","top_p","min_p","temperature"]},"prompt":"Alice: Ask me any question.\nBob: What color is the sky on Mars","truncated":false,"stopped_eos":false,"stopped_word":true,"stopped_limit":false,"stopping_word":"\n","tokens_cached":17,"timings":{"prompt_n":1,"prompt_ms":17.814,"prompt_per_token_ms":17.814,"prompt_per_second":56.13562366677894,"predicted_n":1,"predicted_ms":0.055,"predicted_per_token_ms":0.055,"predicted_per_second":18181.81818181818},"index":0}

@@ -1083,7 +1083,7 @@ struct server_context {
}

// check if there is any token to predict
-        if (stop_pos == std::string::npos || (!slot.has_next_token && !is_stop_full && stop_pos > 0)) {
+        if (stop_pos == std::string::npos || is_stop_full || (!slot.has_next_token && !is_stop_full && stop_pos > 0)) {
ggerganov (Owner)


I think this if will always evaluate to true:

  • if stop_pos == std::string::npos -> true
  • if stop_pos != std::string::npos, then is_stop_full == true due to line 1075 -> true

z80maniac (Contributor, Author)


I have attempted to simplify this logic in the new commit. The text is now withheld only when a partial match is found, and a partial match is searched for only when the current token is not the last one.
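
As a hedged illustration of that simplification (the function and variable names are hypothetical, not a verbatim copy of server.cpp, and only a single stop string is handled for brevity):

#include <string>

// decide whether the text accumulated for the current token may be streamed;
// generated_text is modified in place when a full stop match is found
bool should_send_text(std::string & generated_text, const std::string & stop, bool has_next_token) {
    bool send_text = true;

    const size_t full_pos = generated_text.find(stop);
    if (full_pos != std::string::npos) {
        // full match: erase the stop string and everything after it;
        // whatever precedes it (the "?" in the example) is still sent
        generated_text.erase(full_pos);
    } else if (has_next_token) {
        // only look for a partial match when more tokens may follow:
        // hold the text back if its tail could be the start of the stop string
        for (size_t n = stop.size() - 1; n > 0; --n) {
            if (generated_text.size() >= n &&
                generated_text.compare(generated_text.size() - n, n, stop, 0, n) == 0) {
                send_text = false;
                break;
            }
        }
    }

    return send_text;
}

For the last token there is nothing left to wait for, so the partial-match check is skipped and the text is flushed as is.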

ggerganov merged commit 1f66b69 into ggerganov:master on Oct 16, 2024
53 checks passed