server: fix the disappearance of the end of the text when streaming with stop strings #9867

z80maniac · 2024-10-12T15:50:22Z

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

The problem: the end of the text may not be pushed to the API-client when streaming with stop strings enabled.

For example, let's test the following prompt:

Alice: Ask me any question.
Bob: What color is the sky on Mars

The stop string will be \n (stopping at the end of a dialog line). The next predicted character should be the question mark (it actually depends on the model and the probabilities, but in this case, let's assume that the seed is chosen so that the next character is ?).

The bug depends heavily on the model's tokenizer. In my case the bug can be seen on Llama-3.2-3B-Instruct-Q8_0.gguf model.

First, let's try it without streaming:

curl -Ss --data '{"prompt": "Alice: Ask me any question.\nBob: What color is the sky on Mars", "n_predict": 8, "cache_prompt": true, "stop": ["\n"], "seed": 42}' http://127.0.0.1:8080/completion | jq .content

It gives the correct result:

"?"

Now let's try it with streaming:

curl -Ss --data '{"prompt": "Alice: Ask me any question.\nBob: What color is the sky on Mars", "n_predict": 8, "cache_prompt": true, "stop": ["\n"], "seed": 42, "stream": true}' http://127.0.0.1:8080/completion

data: {"content":"","stop":false,"id_slot":0,"multimodal":false,"index":0}

data: {"content":"","id_slot":0,"stop":true,"model":"/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf","tokens_predicted":1,"tokens_evaluated":17,"generation_settings":{"n_ctx":8192,"n_predict":-1,"model":"/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf","seed":42,"seed_cur":42,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":["\n"],"max_tokens":8,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typ_p","top_p","min_p","temperature"]},"prompt":"Alice: Ask me any question.\nBob: What color is the sky on Mars","truncated":false,"stopped_eos":false,"stopped_word":true,"stopped_limit":false,"stopping_word":"\n","tokens_cached":17,"timings":{"prompt_n":1,"prompt_ms":30.498,"prompt_per_token_ms":30.498,"prompt_per_second":32.7890353465801,"predicted_n":1,"predicted_ms":0.028,"predicted_per_token_ms":0.028,"predicted_per_second":35714.28571428571},"index":0}

The question mark disappeared. And it's also not featured in the stopping_word. So it's completely lost and the API-client won't be able to restore it.

It happens because the next token returned by the model contains both a question mark and the stop string: ?\n\n. And the current code skips sending the current token completely if it contains the stop string.

The change in this PR sends the remainder of the token to the API-client in this case. When is_stop_full=true it's safe to send the response, because the stop string and everything after it will already be truncated from the generated_text at this point.

The response with this PR applied:

data: {"content":"?","stop":false,"id_slot":0,"multimodal":false,"index":0}

data: {"content":"","id_slot":0,"stop":true,"model":"/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf","tokens_predicted":1,"tokens_evaluated":17,"generation_settings":{"n_ctx":8192,"n_predict":-1,"model":"/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf","seed":42,"seed_cur":42,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":["\n"],"max_tokens":8,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typ_p","top_p","min_p","temperature"]},"prompt":"Alice: Ask me any question.\nBob: What color is the sky on Mars","truncated":false,"stopped_eos":false,"stopped_word":true,"stopped_limit":false,"stopping_word":"\n","tokens_cached":17,"timings":{"prompt_n":1,"prompt_ms":17.814,"prompt_per_token_ms":17.814,"prompt_per_second":56.13562366677894,"predicted_n":1,"predicted_ms":0.055,"predicted_per_token_ms":0.055,"predicted_per_second":18181.81818181818},"index":0}

…ith stop strings

ggerganov · 2024-10-14T07:17:34Z

examples/server/server.cpp

@@ -1083,7 +1083,7 @@ struct server_context {
            }

            // check if there is any token to predict
-            if (stop_pos == std::string::npos || (!slot.has_next_token && !is_stop_full && stop_pos > 0)) {
+            if (stop_pos == std::string::npos || is_stop_full || (!slot.has_next_token && !is_stop_full && stop_pos > 0)) {


I think this if will always evaluate to true:

if stop_pos == std::string::npos -> true

if stop_pos != std::string::npos, then is_stop_full == true due to line 1075 -> true

I have attempted to simplify this logic in the new commit. The text will not be sent only when the partial match is found. And partial match will be searched for only when the current token is not the last one.

server: fix the disappearance of the end of the text when streaming w…

ae996a2

…ith stop strings

github-actions bot added examples server labels Oct 12, 2024

ggerganov reviewed Oct 14, 2024

View reviewed changes

z80maniac added 2 commits October 14, 2024 21:53

simplify "send text" checks

40a68f4

Merge remote-tracking branch 'upstream/master' into fix-stop-trim

6e13d3f

ggerganov approved these changes Oct 16, 2024

View reviewed changes

ggerganov merged commit 1f66b69 into ggerganov:master Oct 16, 2024
53 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: fix the disappearance of the end of the text when streaming with stop strings #9867

server: fix the disappearance of the end of the text when streaming with stop strings #9867

z80maniac commented Oct 12, 2024

ggerganov Oct 14, 2024

z80maniac Oct 14, 2024

server: fix the disappearance of the end of the text when streaming with stop strings #9867

server: fix the disappearance of the end of the text when streaming with stop strings #9867

Conversation

z80maniac commented Oct 12, 2024

ggerganov Oct 14, 2024

Choose a reason for hiding this comment

z80maniac Oct 14, 2024

Choose a reason for hiding this comment