
Server.cpp loadPrompt(): Fix segfault when prompt length exceeds ctx size #3639

Closed
coezbek wants to merge 3 commits

Conversation

coezbek (Contributor) commented Oct 15, 2023

  • Remove the broken erasedBlocks logic and apply the same logic as in nextToken: discard the half of the ctx window that is not protected by n_keep (see the sketch below).
  • Properly initialize n_last_tokens after the prompt has been adjusted.
  • Adjust num_prompt_tokens accordingly.
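
A minimal sketch of the truncation described above, for readers who want the gist without opening the diff. It is a paraphrase under assumed names (truncate_prompt and prompt_tokens are illustrative), not the actual server.cpp code from this PR:

#include <cstdint>
#include <utility>
#include <vector>

using llama_token = int32_t; // stand-in for the llama.cpp token type

// If the prompt does not fit into the context, keep the first n_keep tokens plus only
// the newest half of the remaining window, mirroring the context shift in nextToken().
static void truncate_prompt(std::vector<llama_token> & prompt_tokens, int n_ctx, int n_keep) {
    if ((int) prompt_tokens.size() < n_ctx) {
        return; // prompt fits, nothing to do
    }

    const int n_left = (n_ctx - n_keep) / 2; // half of the ctx window not blocked by n_keep

    std::vector<llama_token> new_tokens(prompt_tokens.begin(), prompt_tokens.begin() + n_keep);
    new_tokens.insert(new_tokens.end(), prompt_tokens.end() - n_left, prompt_tokens.end());

    prompt_tokens = std::move(new_tokens);
    // num_prompt_tokens and last_n_tokens must then be re-derived from the adjusted prompt.
}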

FSSRepo (Collaborator) commented Oct 15, 2023

I will apply this change in my PR #3589. Can you confirm that these changes fix your issue?

FSSRepo added a commit to FSSRepo/llama.cpp that referenced this pull request Oct 15, 2023
coezbek (Contributor, Author) commented Oct 16, 2023

For me it prevents the segfault. I would suggest waiting a bit to see if anybody else from the original issue report #3550 can confirm that it works for them.

Also, a note: I didn't fix the similar code in the /infill endpoint.

ggerganov (Owner) left a comment

The change looks correct - let's apply it in the other place and merge

z80maniac (Contributor) commented Oct 17, 2023

I think I discovered a huge performance regression with this patch. It happens with a context size bigger than 2048.

For example, I have two requests: request1.json, which contains this:

{"n_predict":8,"prompt":"<a very big prompt, much more than 4096 tokens>"}

and request2.json:

{"n_predict":8,"prompt":"<the same prompt + a couple of words>"}

The files:
request1.json
request2.json

I start the server like this:

./server -m /opt/models/text/llama-2-13b-chat.Q4_K_M.gguf -c 4096 -ngl 40 --host 0.0.0.0

Now I'm running request1.json and request2.json against commit 8402566 (unpatched):

curl -sS --data '@request1.json' -H 'Content-Type: application/json' http://127.0.0.1:8080/completion
curl -sS --data '@request2.json' -H 'Content-Type: application/json' http://127.0.0.1:8080/completion

The results are:

llama_print_timings:        load time =    1251.42 ms
llama_print_timings:      sample time =       2.80 ms /     8 runs   (    0.35 ms per token,  2854.08 tokens per second)
llama_print_timings: prompt eval time =   12263.82 ms /  2686 tokens (    4.57 ms per token,   219.02 tokens per second)
llama_print_timings:        eval time =     570.18 ms /     7 runs   (   81.45 ms per token,    12.28 tokens per second)
llama_print_timings:       total time =   12840.68 ms

llama_print_timings:        load time =    1251.42 ms
llama_print_timings:      sample time =       2.85 ms /     8 runs   (    0.36 ms per token,  2809.98 tokens per second)
llama_print_timings: prompt eval time =     220.41 ms /     3 tokens (   73.47 ms per token,    13.61 tokens per second)
llama_print_timings:        eval time =     572.14 ms /     7 runs   (   81.73 ms per token,    12.23 tokens per second)
llama_print_timings:       total time =     798.83 ms

As you can see, the second time the prompt is not evaluated again but taken from the cache. That is correct, because the second request only adds a couple of words at the end of the prompt.
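
For context, the saving comes from prefix reuse: the server only needs to evaluate the part of the new prompt that is not already covered by the previous evaluation. A minimal sketch of that comparison, with illustrative names (cached_tokens, common_prefix) rather than the actual server.cpp code:

#include <cstdint>
#include <vector>

using llama_token = int32_t; // stand-in for the llama.cpp token type

// Count how many leading tokens of the new prompt match what was already evaluated;
// prompt evaluation can then resume at that position instead of starting from zero.
static size_t common_prefix(const std::vector<llama_token> & cached_tokens,
                            const std::vector<llama_token> & prompt_tokens) {
    size_t n = 0;
    while (n < cached_tokens.size() && n < prompt_tokens.size() &&
           cached_tokens[n] == prompt_tokens[n]) {
        n++;
    }
    return n;
}

In the first run above nothing is cached, so all 2686 tokens go through prompt eval; in the second run everything except the last few tokens matches, hence the 3-token prompt eval.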

However, after applying the patch, I get this:

llama_print_timings:        load time =    1240.96 ms
llama_print_timings:      sample time =       3.54 ms /     8 runs   (    0.44 ms per token,  2262.44 tokens per second)
llama_print_timings: prompt eval time =    8446.67 ms /  2048 tokens (    4.12 ms per token,   242.46 tokens per second)
llama_print_timings:        eval time =     503.59 ms /     7 runs   (   71.94 ms per token,    13.90 tokens per second)
llama_print_timings:       total time =    8958.22 ms

llama_print_timings:        load time =    1240.96 ms
llama_print_timings:      sample time =       3.34 ms /     8 runs   (    0.42 ms per token,  2395.21 tokens per second)
llama_print_timings: prompt eval time =    8466.13 ms /  2048 tokens (    4.13 ms per token,   241.91 tokens per second)
llama_print_timings:        eval time =     506.15 ms /     7 runs   (   72.31 ms per token,    13.83 tokens per second)
llama_print_timings:       total time =    8979.22 ms

As you can see, the prompt is evaluated again even in the second request.

ggerganov (Owner) commented

Ah interesting - I didn't realize that the old implementation was handling this situation via the erased_blocks trick.

Btw, what is the use case for submitting prompts that are larger than the context? Everything before the last n_ctx tokens is just ignored in this case.
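
For readers following the thread: as far as the removed code can be reconstructed here, the erased_blocks trick dropped an integral number of n_left-sized blocks right after the n_keep prefix. Because truncation happens in whole blocks, two requests that differ only at the tail are usually cut at the same boundary, so their evaluated prefix - and with it the prompt cache - stays shared. A rough paraphrase (a reconstruction, not the exact original code):

#include <cstdint>
#include <utility>
#include <vector>

using llama_token = int32_t; // stand-in for the llama.cpp token type

static void truncate_with_erased_blocks(std::vector<llama_token> & prompt_tokens, int n_ctx, int n_keep) {
    const int num_prompt_tokens = (int) prompt_tokens.size();
    if (num_prompt_tokens < n_ctx) {
        return;
    }

    const int n_left        = (n_ctx - n_keep) / 2;
    const int erased_blocks = (num_prompt_tokens - n_keep - n_left - 1) / n_left;

    // keep the n_keep prefix, drop erased_blocks whole blocks, keep the rest of the prompt
    std::vector<llama_token> new_tokens(prompt_tokens.begin(), prompt_tokens.begin() + n_keep);
    new_tokens.insert(new_tokens.end(),
                      prompt_tokens.begin() + n_keep + erased_blocks * n_left,
                      prompt_tokens.end());

    prompt_tokens = std::move(new_tokens);
}

The downside, discussed further below, is that in the worst case the truncated prompt still fills the entire context, which is the crash this PR set out to fix.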

FSSRepo (Collaborator) commented Oct 17, 2023

@ggerganov I'm not in favor of this being merged into master, as I've already finished my pull request #3589 and it's ready to be merged; this change is already applied there.

FSSRepo (Collaborator) commented Oct 17, 2023

@z80maniac I don't see the point in making requests that exceed the context size: if part of the prompt is going to be ignored anyway, it's better to increase the context size directly so that overly long prompts remain useful.

z80maniac (Contributor) commented Oct 17, 2023

Everything else before the last n_ctx tokens is just ignored in this case.

Yes, and that's exactly what I want - to automatically erase everything at the start. Also, I suppose that when n_keep is set, some tokens at the start will be kept too? It would be pretty inconvenient to do all of this manually.

But even worse, if I manually trim the prompt, then the "prompt cache" or whatever it's called will be invalidated and the server will waste a lot of time re-parsing the prompt with each request.

it's better to directly increase the context size

Not if you're already at the context size limit, e.g. 4096 in this case.

EDIT:

I should probably clarify my use case. Yes, for one-shot prompts it does not make sense to submit a prompt longer than the context. But imagine, for example, a chat that grows and grows in size. Its size starts within the context size, but soon it will overflow. Now, as I said before, if I manually trim the prompt, the cache is reset, and each time I make a request (and add a new chat line) the prompt needs to be re-evaluated from scratch (for 70B models on my machine that takes around 30 seconds). But if I let the server trim the prompt automatically, the cache is not reset, and a lot of time is saved (the time is only spent on generating the new tokens, not on re-parsing the entire prompt).
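
A schematic illustration of that difference (made-up tokens, not data from this thread):

cache from the previous request:   [A B C D E F G H]
client trims the front itself:     [C D E F G H I J]      -> first token already differs, prefix match = 0, full re-eval
client sends the full prompt:      [A B C D E F G H I J]  -> long shared prefix, only the new tail I J is evaluated

Any client-side trimming shifts every token position relative to the cache, while letting the server trim keeps the already evaluated part aligned with it.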

coezbek (Contributor, Author) commented Oct 17, 2023

@z80maniac If you submitted two prompts both longer than n_ctx and didn't use n_keep, it shouldn't have cached anything before this commit, unless I am mistaken about the code.

Edit: Thinking more about it, I now think I understand how the erased blocks were meant to work and where they went wrong. What is happening is that the erase_block logic fills the whole context in the worst case.

I think the fix is simply to remove the -1, so that at most n_ctx - 1 tokens are filled with the new prompt, while the erase-block logic stays. I can propose a fix tomorrow.
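
To make that worst case concrete, here is a worked example using the reconstructed erased_blocks formula sketched earlier in the thread (the numbers are illustrative):

n_ctx = 4096, n_keep = 0        ->  n_left = (4096 - 0) / 2 = 2048
prompt of 6144 tokens           ->  erased_blocks = (6144 - 0 - 2048 - 1) / 2048 = 1
tokens left after truncation    ->  6144 - 1 * 2048 = 4096 = n_ctx   (context completely full)
without the -1                  ->  erased_blocks = 2, tokens left = 2048, and in general at most n_ctx - 1

With the full context occupied by the prompt there is no room left for even a single generated token, which is consistent with the segfault reported in #3550.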

coezbek closed this by deleting the head repository on Oct 18, 2023
z80maniac (Contributor) commented

If you submitted two prompts both longer than n_ctx and didn't use n_keep, it shouldn't have cached anything

Yeah, I'm also surprised at how it can preserve the cache while moving the context window. I don't know if it works correctly, but it works.

Disclaimer: I have almost no idea how all of this works under the hood, I'm just describing everything from a user perspective.

The bottom line is this:

  • I use 70B model to generate a continuous chat, adding new chat lines.
  • The chat becomes much bigger than the context size.
  • Without the patch in this PR I can generate a new chat line in 10 seconds or so.
  • When this patch is applied, it takes almost 60 seconds to generate a new line.
  • The result: this patch makes the inference 6 times slower.

ggerganov (Owner) commented

@z80maniac Thank you for the explanation - I think I understand the problem and will try to resolve it when merging #3589

coezbek (Contributor, Author) commented Oct 18, 2023

I made another attempt at fixing this in #3661, after @z80maniac found a performance regression in my patch #3639.

ggerganov added a commit that referenced this pull request Oct 22, 2023
* implementing parallel decoding in server example

* crash fixed

* save dev progress

* refactored sampling function

* completion endpoint working

* multiple client support

* grammar + no stream completion

* cached prompt support

* chat.mjs support cached prompt + some fixes

* server ui now support multiple clients

* unused change reverted

* fixed timings per slot

* add context swap

* add changes to README.md

* llava multimodal integration

* fixed tokens probs

* add multimodal input - alpha

* refactor code + remove unused comments + improved README.md

* fix compilation errors with llvm

* notify the user from server ui that multimodality is unavailable

* some ci fixes

* fix ci make build undefined ref errors

* fix prompt longer than ctx, as proposed in #3639

* fixed premature end due to stop word

* context shift fixed

* fix llava implementation

* sync README.md changes

* readme change

* update api like OpenAI

* multimodal support enabled by default

* fix make build errors

* fix multiple clients

* fix zig build

* new sampling API

* latest changes of sampling API

* server : coding-style normalization

* server : coding-style normalization (part 2)

* server : remove beam-search functionality

* server : bug fix in ingest_images

n_tokens is incremented internally by llama_batch_add

* server : use refs + use llama_batch_clear()

* server : snake case

* server : minor sync

* added thread safe pipeline

* server : batch has to be allocated for n_parallel sequences

* server : no need for atomic int - already using mutex

* server : logs + minor code style

* server : fix multibyte handle in partial response (#3706)

* fix image load + view image in chat

* make : silence stb warnings

* clip : link to ggml, not to llama

* server : fix switch fallthrough

* server : fix crash in Debug on macOS (I have no idea why this fixes it!?)

* server : refactor ctx_sampling init + n_ctx + names

* server : bug fix for prompt caching

* Do not save/load image_data to localStorage

* editorconfig : new line in index.html

* server : completion requests remember slot_id

* Update readme to document multimodal in server

* server : minor style

* Update readme to document multimodal in server

* server : hide ctx_sampling->prev behind API (#3696)

* server : apply fix from #3722

* server : fix slot reuse

* server : add comment about changing slot_state to bool

---------

Co-authored-by: FSSRepo <[email protected]>
Co-authored-by: Damian Stewart <[email protected]>
Co-authored-by: Steward Garcia <[email protected]>
Co-authored-by: Jhen-Jie Hong <[email protected]>
Co-authored-by: M. Yusuf Sarıgöz <[email protected]>