Server.cpp loadPrompt(): Fix segfault when prompt length exceeds ctx size #3639
Conversation
I will apply this change in my PR #3589. Can you confirm that this change fixes your issue?
For me it prevents the segfault. I would suggest waiting a bit to see if anybody else from the original issue report #3550 can confirm that it works for them. Also a note: I didn't fix the similar code in the /infill endpoint.
The change looks correct - let's apply it in the other place and merge
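For context, the kind of guard being discussed can be pictured as clamping an over-long tokenized prompt to the context window before it reaches the evaluation buffer. The sketch below only illustrates that idea and is not the actual server.cpp patch; `truncate_prompt` is a hypothetical helper and plain `int` tokens stand in for `llama_token`.

```cpp
// Hypothetical sketch: clamp an over-long tokenized prompt to the context
// window before it is fed to the model, keeping the first n_keep tokens
// and the most recent tail. Not the actual server.cpp implementation.
#include <cstdio>
#include <vector>

std::vector<int> truncate_prompt(const std::vector<int> & prompt_tokens,
                                 int n_ctx, int n_keep) {
    if ((int) prompt_tokens.size() < n_ctx) {
        return prompt_tokens;  // fits as-is, nothing to do
    }
    // keep the first n_keep tokens ...
    std::vector<int> out(prompt_tokens.begin(), prompt_tokens.begin() + n_keep);
    // ... and fill the rest of the context with the newest tokens
    const int n_tail = n_ctx - n_keep;
    out.insert(out.end(), prompt_tokens.end() - n_tail, prompt_tokens.end());
    return out;
}

int main() {
    std::vector<int> prompt(5000);                 // pretend 5000-token prompt
    for (int i = 0; i < (int) prompt.size(); ++i) prompt[i] = i;
    auto clamped = truncate_prompt(prompt, /*n_ctx=*/4096, /*n_keep=*/64);
    printf("clamped size: %zu\n", clamped.size()); // 4096
    return 0;
}
```

The point is simply that no more than n_ctx tokens should ever reach the evaluation buffer.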
I think I discovered a huge performance regression with this patch. It happens with a context size bigger than 2048. For example, I have two requests:
and
The files: I start the server like this:
Now, I'm running
The results are:
As you can see, the second time, the prompt is not parsed, but is taken from the cache. And it's correct because the second request only adds a couple of words at the end of the prompt. However, after applying the patch, I get this:
As you can see, the prompt is evaluated again even in the second request.
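For what it's worth, the caching behaviour described above can be pictured as a longest-common-prefix match between the previously evaluated tokens and the new prompt: only the non-matching tail needs to be evaluated again. A minimal sketch of that idea (not the server's actual code; `common_prefix` is a hypothetical helper and `int` stands in for `llama_token`):

```cpp
// Hypothetical sketch of prompt caching via longest common prefix:
// n_past counts tokens already in the KV cache that match the new prompt,
// so only prompt.size() - n_past tokens need re-evaluation.
#include <cstddef>
#include <cstdio>
#include <vector>

size_t common_prefix(const std::vector<int> & cached,
                     const std::vector<int> & prompt) {
    size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n]) {
        ++n;
    }
    return n;
}

int main() {
    std::vector<int> cached = {1, 2, 3, 4, 5};       // tokens from request 1
    std::vector<int> prompt = {1, 2, 3, 4, 5, 6, 7}; // request 2 adds two tokens
    const size_t n_past = common_prefix(cached, prompt);
    printf("reuse %zu tokens, evaluate %zu new ones\n",
           n_past, prompt.size() - n_past);          // reuse 5, evaluate 2
    return 0;
}
```

If an over-long prompt is truncated from the front before this comparison, the cached prefix no longer lines up with the new tokens, which would be consistent with the full re-evaluation shown above.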
Ah interesting - I didn't realize that the old implementation was handling this situation that way. Btw, what is the use case for submitting prompts that are larger than the context? Everything else before the last part of the prompt would be ignored anyway, no?
@ggerganov I'm not in favor of merging this into master, as I've already finished my pull request #3589 and it's ready to be merged; this change is already applied there.
@z80maniac I don't see the point in making requests that exceed the context size: if part of the prompt will be ignored in the end anyway, it's better to increase the context size directly so that overly long prompts remain useful.
Yes, and that's exactly what I want - to automatically erase everything at the start. But even worse, if I manually trim the prompt, then the "prompt cache" or whatever it's called will be invalidated and the server will waste a lot of time re-parsing the prompt with each request.
Not if you're already at the context size limit, e.g. 4096 in this case. EDIT: I should probably clarify my use case. Yes, for one-shot prompts it does not make sense to submit a prompt longer than the context. But imagine, for example, a chat that grows and grows in size. Its size starts within the context size, but soon it will overflow. Now, as I said before, if I manually trim the prompt, the cache will be reset, and each time I make a request (and add a new chat line) the prompt needs to be re-parsed again (for 70B models on my machine it takes around 30 seconds). But if I let the server trim the prompt automatically, the cache is not reset, and a lot of time is saved (the time is only spent on generating the new tokens, not on re-parsing the entire prompt from scratch).
@z80maniac If you submitted two prompts both longer than n_ctx and didn't use n_keep, it shouldn't have cached anything before the commit, unless I am mistaken about the code. Edit: Thinking more about it, I now think I understand how the erased-blocks were meant to work and where they went wrong. What is happening is that the erase_block logic is filling the whole context in the worst case. I think to fix it we just need to remove the erasedBlocks and apply the same logic as during nextToken (discard half of the ctx window not blocked by n_keep).
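A rough sketch of the "discard half of the ctx window not blocked by n_keep" idea, assuming it mirrors the shift applied during generation (hypothetical `shift_prompt` helper, `int` tokens as stand-ins; not the actual nextToken code):

```cpp
// Hypothetical sketch: when the prompt overflows n_ctx, keep the first
// n_keep tokens and drop half-window blocks of the remainder until the
// tail fits, mirroring the shift used during generation.
#include <cstdio>
#include <vector>

std::vector<int> shift_prompt(const std::vector<int> & prompt_tokens,
                              int n_ctx, int n_keep) {
    const int n_left  = n_ctx - n_keep;  // movable part of the window
    const int n_block = n_left / 2;      // drop in half-window blocks
    int n_skip = 0;                      // tokens removed right after n_keep
    while ((int) prompt_tokens.size() - n_skip > n_ctx) {
        n_skip += n_block;
    }
    std::vector<int> out(prompt_tokens.begin(), prompt_tokens.begin() + n_keep);
    out.insert(out.end(),
               prompt_tokens.begin() + n_keep + n_skip, prompt_tokens.end());
    return out;
}

int main() {
    std::vector<int> prompt(5000);
    for (int i = 0; i < (int) prompt.size(); ++i) prompt[i] = i;
    auto shifted = shift_prompt(prompt, /*n_ctx=*/4096, /*n_keep=*/64);
    printf("shifted size: %zu\n", shifted.size()); // <= n_ctx
    return 0;
}
```

Because tokens are dropped in fixed half-window blocks, the kept suffix only changes when a new block boundary is crossed, so consecutive requests that merely append a few tokens can still hit the prefix cache.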
Yeah, I'm also surprised at how it can preserve the cache while moving the context window. I don't know if it works correctly, but it works. Disclaimer: I have almost no idea how all of this works under the hood, I'm just describing everything from a user perspective. The bottom line is this:
@z80maniac Thank you for the explanation - I think I understand the problem and will try to resolve it when merging #3589.
I made another attempt at fixing this in #3661 after @z80maniac found a performance regression in my patch #3639.
A later commit that referenced this pull request included the following change list:

* implementing parallel decoding in server example
* crash fixed
* save dev progress
* refactored sampling function
* completion endpoint working
* multiple client support
* grammar + no stream completion
* cached prompt support
* chat.mjs support cached prompt + some fixes
* server ui now support multiple clients
* unused change reverted
* fixed timings per slot
* add context swap
* add changes to README.md
* llava multimodal integration
* fixed tokens probs
* add multimodal input - alfa
* refactor code + remove unused comments + improved README.md
* fix compilation errors with llvm
* notify the user from server ui that multimodality is unavialable
* some ci fixes
* fix ci make build undefined ref errors
* fix long prompt than ctx proposed in #3639
* fixed premature end due stop word
* context shift fixed
* fix llava implementation
* sync README.md changes
* readme change
* update api like OpenAI
* multimodal support enabled by default
* fix make bui;d errors
* fix multiple clients
* fix zig build
* new sampling API
* latest changes of sampling API
* server : coding-style normalization
* server : coding-style normalization (part 2)
* server : remove beam-search functionality
* server : bug fix in ingest_images n_tokens is incremented internally by llama_batch_add
* server : use refs + use llama_batch_clear()
* server : snake case
* server : minor sync
* added thread safe pipeline
* server : bach has to be allocated for n_parallel sequences
* server : no need for atomic int - already using mutex
* server : logs + minor code style
* server : fix multibyte handle in partial response (#3706)
* fix image load + view image in chat
* make : silence stb warnings
* clip : link to ggml, not to llama
* server : fix switch fallthrough
* server : fix crash in Debug on macOS (I have no idea why this fixes it!?)
* server : refactor ctx_sampling init + n_ctx + names
* server : bug fix for prompt caching
* Do not save/load image_data to localStorage
* editorconfig : new line in index.html
* server : completion requests remember slot_id
* Update readme to document multimodal in server
* server : minor style
* Update readme to document multimodal in server
* server : hide ctx_sampling->prev behind API (#3696)
* server : apply fix from #3722
* server : fix slot reuse
* server : add comment about changing slot_state to bool

---------

Co-authored-by: FSSRepo <[email protected]>
Co-authored-by: Damian Stewart <[email protected]>
Co-authored-by: Steward Garcia <[email protected]>
Co-authored-by: Jhen-Jie Hong <[email protected]>
Co-authored-by: M. Yusuf Sarıgöz <[email protected]>