
Server.cpp loadPrompt(): Fix segfault when prompt length exceeds ctx size #3639

Closed
coezbek wants to merge 3 commits

Conversation

coezbek (Contributor) commented Oct 15, 2023

  • Remove the broken erasedBlocks logic and apply the same logic as in nextToken: discard the half of the ctx window that is not protected by n_keep (see the sketch below).
  • Properly initialize n_last_tokens after the prompt has been adjusted.
  • Adjust num_prompt_tokens accordingly.
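
A minimal sketch of the truncation described above, for readers who want the gist without opening the diff. It is a paraphrase under assumed names (truncate_prompt and prompt_tokens are illustrative), not the actual server.cpp code from this PR:

#include <cstdint>
#include <utility>
#include <vector>

using llama_token = int32_t; // stand-in for the llama.cpp token type

// If the prompt does not fit into the context, keep the first n_keep tokens plus only
// the newest half of the remaining window, mirroring the context shift in nextToken().
static void truncate_prompt(std::vector<llama_token> & prompt_tokens, int n_ctx, int n_keep) {
    if ((int) prompt_tokens.size() < n_ctx) {
        return; // prompt fits, nothing to do
    }

    const int n_left = (n_ctx - n_keep) / 2; // half of the ctx window not blocked by n_keep

    std::vector<llama_token> new_tokens(prompt_tokens.begin(), prompt_tokens.begin() + n_keep);
    new_tokens.insert(new_tokens.end(), prompt_tokens.end() - n_left, prompt_tokens.end());

    prompt_tokens = std::move(new_tokens);
    // num_prompt_tokens and last_n_tokens must then be re-derived from the adjusted prompt.
}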

FSSRepo (Collaborator) commented Oct 15, 2023

I will apply this change in my PR #3589. Can you confirm that these changes fix your issue?

FSSRepo added a commit to FSSRepo/llama.cpp that referenced this pull request Oct 15, 2023
coezbek (Contributor, Author) commented Oct 16, 2023

For me it prevents the segfault. I would suggest waiting a bit to see if anybody else from the original issue report #3550 can confirm that it works for them.

Also, a note: I didn't fix the similar code in the /infill endpoint.

ggerganov (Owner) left a comment

The change looks correct - let's apply it in the other place and merge

z80maniac (Contributor) commented Oct 17, 2023

I think I discovered a huge performance regression with this patch. It happens with a context size bigger than 2048.

For example, I have two requests: request1.json, which contains this:

{"n_predict":8,"prompt":"<a very big prompt, much more than 4096 tokens>"}

and request2.json:

{"n_predict":8,"prompt":"<the same prompt + a couple of words>"}

The files:
request1.json
request2.json

I start the server like this:

./server -m /opt/models/text/llama-2-13b-chat.Q4_K_M.gguf -c 4096 -ngl 40 --host 0.0.0.0

Now I'm running request1.json and request2.json against commit 8402566 (unpatched):

curl -sS --data '@request1.json' -H 'Content-Type: application/json' http://127.0.0.1:8080/completion
curl -sS --data '@request2.json' -H 'Content-Type: application/json' http://127.0.0.1:8080/completion

The results are:

llama_print_timings:        load time =    1251.42 ms
llama_print_timings:      sample time =       2.80 ms /     8 runs   (    0.35 ms per token,  2854.08 tokens per second)
llama_print_timings: prompt eval time =   12263.82 ms /  2686 tokens (    4.57 ms per token,   219.02 tokens per second)
llama_print_timings:        eval time =     570.18 ms /     7 runs   (   81.45 ms per token,    12.28 tokens per second)
llama_print_timings:       total time =   12840.68 ms

llama_print_timings:        load time =    1251.42 ms
llama_print_timings:      sample time =       2.85 ms /     8 runs   (    0.36 ms per token,  2809.98 tokens per second)
llama_print_timings: prompt eval time =     220.41 ms /     3 tokens (   73.47 ms per token,    13.61 tokens per second)
llama_print_timings:        eval time =     572.14 ms /     7 runs   (   81.73 ms per token,    12.23 tokens per second)
llama_print_timings:       total time =     798.83 ms

As you can see, the second time the prompt is not evaluated again but taken from the cache. That is correct, because the second request only adds a couple of words at the end of the prompt.
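
For context, the saving comes from prefix reuse: the server only needs to evaluate the part of the new prompt that is not already covered by the previous evaluation. A minimal sketch of that comparison, with illustrative names (cached_tokens, common_prefix) rather than the actual server.cpp code:

#include <cstdint>
#include <vector>

using llama_token = int32_t; // stand-in for the llama.cpp token type

// Count how many leading tokens of the new prompt match what was already evaluated;
// prompt evaluation can then resume at that position instead of starting from zero.
static size_t common_prefix(const std::vector<llama_token> & cached_tokens,
                            const std::vector<llama_token> & prompt_tokens) {
    size_t n = 0;
    while (n < cached_tokens.size() && n < prompt_tokens.size() &&
           cached_tokens[n] == prompt_tokens[n]) {
        n++;
    }
    return n;
}

In the first run above nothing is cached, so all 2686 tokens go through prompt eval; in the second run everything except the last few tokens matches, hence the 3-token prompt eval.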

However, after applying the patch, I get this:

llama_print_timings:        load time =    1240.96 ms
llama_print_timings:      sample time =       3.54 ms /     8 runs   (    0.44 ms per token,  2262.44 tokens per second)
llama_print_timings: prompt eval time =    8446.67 ms /  2048 tokens (    4.12 ms per token,   242.46 tokens per second)
llama_print_timings:        eval time =     503.59 ms /     7 runs   (   71.94 ms per token,    13.90 tokens per second)
llama_print_timings:       total time =    8958.22 ms

llama_print_timings:        load time =    1240.96 ms
llama_print_timings:      sample time =       3.34 ms /     8 runs   (    0.42 ms per token,  2395.21 tokens per second)
llama_print_timings: prompt eval time =    8466.13 ms /  2048 tokens (    4.13 ms per token,   241.91 tokens per second)
llama_print_timings:        eval time =     506.15 ms /     7 runs   (   72.31 ms per token,    13.83 tokens per second)
llama_print_timings:       total time =    8979.22 ms

As you can see, the prompt is evaluated again even in the second request.

ggerganov (Owner) commented

Ah interesting - I didn't realize that the old implementation was handling this situation via the erased_blocks trick.

Btw, what is the use case for submitting prompts that are larger than the context? Everything before the last n_ctx tokens is just ignored in this case.
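
For readers following the thread: as far as the removed code can be reconstructed here, the erased_blocks trick dropped an integral number of n_left-sized blocks right after the n_keep prefix. Because truncation happens in whole blocks, two requests that differ only at the tail are usually cut at the same boundary, so their evaluated prefix - and with it the prompt cache - stays shared. A rough paraphrase (a reconstruction, not the exact original code):

#include <cstdint>
#include <utility>
#include <vector>

using llama_token = int32_t; // stand-in for the llama.cpp token type

static void truncate_with_erased_blocks(std::vector<llama_token> & prompt_tokens, int n_ctx, int n_keep) {
    const int num_prompt_tokens = (int) prompt_tokens.size();
    if (num_prompt_tokens < n_ctx) {
        return;
    }

    const int n_left        = (n_ctx - n_keep) / 2;
    const int erased_blocks = (num_prompt_tokens - n_keep - n_left - 1) / n_left;

    // keep the n_keep prefix, drop erased_blocks whole blocks, keep the rest of the prompt
    std::vector<llama_token> new_tokens(prompt_tokens.begin(), prompt_tokens.begin() + n_keep);
    new_tokens.insert(new_tokens.end(),
                      prompt_tokens.begin() + n_keep + erased_blocks * n_left,
                      prompt_tokens.end());

    prompt_tokens = std::move(new_tokens);
}

The downside, discussed further below, is that in the worst case the truncated prompt still fills the entire context, which is the crash this PR set out to fix.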

FSSRepo (Collaborator) commented Oct 17, 2023

@ggerganov I'm not in favor of this being merged into master, as I've already finished my pull request #3589 and it's ready to be merged; this change is already applied there.

FSSRepo (Collaborator) commented Oct 17, 2023

@z80maniac I don't see the point in making requests that exceed the context size: if part of the prompt is going to be ignored anyway, it's better to increase the context size directly so that overly long prompts remain useful.

z80maniac (Contributor) commented Oct 17, 2023

Everything else before the last n_ctx tokens is just ignored in this case.

Yes, and that's exactly what I want - to automatically erase everything at the start. Also, I suppose that when n_keep is set, some tokens at the start will be kept too? It would be pretty inconvenient to do all of this manually.

But even worse, if I manually trim the prompt, then the "prompt cache" or whatever it's called will be invalidated and the server will waste a lot of time re-parsing the prompt with each request.

it's better to directly increase the context size

Not if you're already at the context size limit, e.g. 4096 in this case.

EDIT:

I should probably clarify my use case. Yes, for one-shot prompts it does not make sense to submit a prompt longer than the context. But imagine, for example, a chat that grows and grows in size. Its size starts within the context size, but soon it will overflow. Now, as I said before, if I manually trim the prompt, the cache is reset, and each time I make a request (and add a new chat line) the prompt needs to be re-evaluated from scratch (for 70B models on my machine that takes around 30 seconds). But if I let the server trim the prompt automatically, the cache is not reset, and a lot of time is saved (the time is only spent on generating the new tokens, not on re-parsing the entire prompt).
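
A schematic illustration of that difference (made-up tokens, not data from this thread):

cache from the previous request:   [A B C D E F G H]
client trims the front itself:     [C D E F G H I J]      -> first token already differs, prefix match = 0, full re-eval
client sends the full prompt:      [A B C D E F G H I J]  -> long shared prefix, only the new tail I J is evaluated

Any client-side trimming shifts every token position relative to the cache, while letting the server trim keeps the already evaluated part aligned with it.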

coezbek (Contributor, Author) commented Oct 17, 2023

@z80maniac If you submitted two prompts both longer than n_ctx and didn't use n_keep, it shouldn't have cached anything before this commit, unless I am mistaken about the code.

Edit: Thinking more about it, I now think I understand how the erased blocks were meant to work and where they went wrong. What is happening is that the erase_block logic fills the whole context in the worst case.

I think the fix is simply to remove the -1, so that at most n_ctx - 1 tokens are filled with the new prompt, while the erase-block logic stays. I can propose a fix tomorrow.
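
To make that worst case concrete, here is a worked example using the reconstructed erased_blocks formula sketched earlier in the thread (the numbers are illustrative):

n_ctx = 4096, n_keep = 0        ->  n_left = (4096 - 0) / 2 = 2048
prompt of 6144 tokens           ->  erased_blocks = (6144 - 0 - 2048 - 1) / 2048 = 1
tokens left after truncation    ->  6144 - 1 * 2048 = 4096 = n_ctx   (context completely full)
without the -1                  ->  erased_blocks = 2, tokens left = 2048, and in general at most n_ctx - 1

With the full context occupied by the prompt there is no room left for even a single generated token, which is consistent with the segfault reported in #3550.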

coezbek closed this by deleting the head repository on Oct 18, 2023
z80maniac (Contributor) commented

If you submitted two prompts both longer than n_ctx and didn't use n_keep, it shouldn't have cached anything

Yeah, I'm also surprised at how it can preserve the cache while moving the context window. I don't know if it works correctly, but it works.

Disclaimer: I have almost no idea how all of this works under the hood, I'm just describing everything from a user perspective.

The bottom line is this:

  • I use 70B model to generate a continuous chat, adding new chat lines.
  • The chat becomes much bigger than the context size.
  • Without the patch in this PR I can generate a new chat line in 10 seconds or so.
  • When this patch is applied, it takes almost 60 seconds to generate a new line.
  • The result: this patch makes the inference 6 times slower.

ggerganov (Owner) commented

@z80maniac Thank you for the explanation - I think I understand the problem and will try to resolve it when merging #3589

coezbek (Contributor, Author) commented Oct 18, 2023

I made another attempt at fixing this in #3661, after @z80maniac found a performance regression in my patch #3639.

ggerganov added a commit that referenced this pull request Oct 22, 2023
* implementing parallel decoding in server example

* crash fixed

* save dev progress

* refactored sampling function

* completion endpoint working

* multiple client support

* grammar + no stream completion

* cached prompt support

* chat.mjs support cached prompt + some fixes

* server ui now support multiple clients

* unused change reverted

* fixed timings per slot

* add context swap

* add changes to README.md

* llava multimodal integration

* fixed tokens probs

* add multimodal input - alpha

* refactor code + remove unused comments + improved README.md

* fix compilation errors with llvm

* notify the user from server ui that multimodality is unavailable

* some ci fixes

* fix ci make build undefined ref errors

* fix prompt longer than ctx, as proposed in #3639

* fixed premature end due to stop word

* context shift fixed

* fix llava implementation

* sync README.md changes

* readme change

* update api like OpenAI

* multimodal support enabled by default

* fix make build errors

* fix multiple clients

* fix zig build

* new sampling API

* latest changes of sampling API

* server : coding-style normalization

* server : coding-style normalization (part 2)

* server : remove beam-search functionality

* server : bug fix in ingest_images

n_tokens is incremented internally by llama_batch_add

* server : use refs + use llama_batch_clear()

* server : snake case

* server : minor sync

* added thread safe pipeline

* server : batch has to be allocated for n_parallel sequences

* server : no need for atomic int - already using mutex

* server : logs + minor code style

* server : fix multibyte handle in partial response (#3706)

* fix image load + view image in chat

* make : silence stb warnings

* clip : link to ggml, not to llama

* server : fix switch fallthrough

* server : fix crash in Debug on macOS (I have no idea why this fixes it!?)

* server : refactor ctx_sampling init + n_ctx + names

* server : bug fix for prompt caching

* Do not save/load image_data to localStorage

* editorconfig : new line in index.html

* server : completion requests remember slot_id

* Update readme to document multimodal in server

* server : minor style

* Update readme to document multimodal in server

* server : hide ctx_sampling->prev behind API (#3696)

* server : apply fix from #3722

* server : fix slot reuse

* server : add comment about changing slot_state to bool

---------

Co-authored-by: FSSRepo <[email protected]>
Co-authored-by: Damian Stewart <[email protected]>
Co-authored-by: Steward Garcia <[email protected]>
Co-authored-by: Jhen-Jie Hong <[email protected]>
Co-authored-by: M. Yusuf Sarıgöz <[email protected]>