-
-
Notifications
You must be signed in to change notification settings - Fork 5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
YaRN support implementation #1264
Conversation
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Use |
Tested it, but got the following error with model NousResearch/Yarn-Llama-2-7b-64k:
|
@viktor-ferenczi can you post the full stack trace? |
@viktor-ferenczi were you loading from a directory or from HF model id? I think it may not work in the former case as HF Transformers will use the built-in LlamaConfig class, which causes validation to fail. It should work if you specify the HF model id instead, as that should use the configuration present in the repository. This is a HF Transformers issue. |
Thanks for the diagnosis. We have to pass In if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
...
if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]: Now it can load the model and produces reasonable output. I go ahead testing YaRN itself. |
Tested OK with the |
Attempted to test at full 64k context length:
Ran into this error:
The check is:
Variables:
From this padded_max_seq_len must be less than or equal to 25344, therefore max_model_len <= 25329 Retrying the test with Result: The first generation gets frozen with 100% CPU core load (1 core) without using the GPU at all. The model is loaded into the GPU based on VRAM usage, it is just not used. It is frozen in while self.llm_engine.has_unfinished_requests():
step_outputs = self.llm_engine.step()
for output in step_outputs:
if output.finished:
outputs.append(output)
if use_tqdm:
pbar.update(1)
|
More test results: With one 4090 GPU (24GB VRAM):
With two 4090 GPUs (2x24GB VRAM),
Crash at 25328:
In |
I think it might be frozen because you do not have enough KV cache memory to support the entire context length. This is orthogonal to the error raised about shared memory. The only "fix" here would be to use a GPU with more memory (and perhaps raise a warning in the scheduler if a new request cannot be scheduled due to lack of KV cache. That would be best done in a separate PR). I think quantization would also help. |
I'm going to try on both 4090s then with Another way is to increase |
Tested OK on two 4090 GPUs up to 25088 context size. (Not exhaustively, only at various points, see above.) I cannot go much higher due to various vLLM limitations (see above). |
@casper-hansen We got progress with YaRN here, but facing problems/limitations with vLLM at longer context sizes. Could you please suggest? Thanks! - See #1264 (comment) |
I think memory related context length limitations should not block this PR and can be fixed independently. I know that the vLLM team is investigating ways to improve shared memory usage, for example. I'll apply the suggestions you have made, @viktor-ferenczi. |
I agree about not blocking this PR because of unrelated context length limitations. I'm not sure how #555 was tested, cannot find its test case in the source code. Wanted to reuse that test case for testing this PR as well, since they are related. (I could test manually with the pass key retrieval test I added, but it is not streamlined, not added to automation.) |
I would advise to test on A100s to make sure it’s not a limit on the memory side. Other than that, it’s probably some niche area of vLLM that I have little insight into. |
@Yard1 Would it be possible to first add Yarn without the tests? We also need to add tests for other RoPe scaling methods and it seems the Yarn test code will conflict with that. |
@Yard1 I got this error when tried the
|
@WoosukKwon did you use Actually, we should probably make that clearer... maybe catch this exception and raise our own telling the user to set it to true. |
@Yard1 Got it. It works after adding However, I got the following error during the initial memory profiling:
However, the error disappeared when I fixed We saw the same kind of error when using Mistral model with AWQ (#1236). The error happened because there was integer overflow in pointer arithmetic inside the AWQ GEMM kernel, when the number of input tokens is too large. We fixed this in #1295 by changing some |
Yes, I think the default transformers input len is way too big. I got it to work successfully with 24K on an A100. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the configs of Yarn models break
Line 354 in 9738b84
def _get_and_verify_max_len( |
For example, vLLM infers that the maximum length of NousResearch/Yarn-Llama-2-7b-64k
is 1M rather than 64K because the max_position_embeddings
in the model config is 64K and the rope scaling factor is 16.
On the other hand, for NousResearch/Yarn-Mistral-7b-128k
model, vLLM regards the maximum length is 512K because max_position_embeddings
in its config is 32K (which is the original model's maximum length after 4x RoPE scaling).
In conclusion, I feel we need a separate logic for Yarn models in the _get_and_verify_max_len
function. WDYT?
@WoosukKwon Updated, PTAL! |
I discovered some sort of a correctness error with the probability tensor getting a non-finite value, marking as draft until investigation is complete. |
Fixed, the int in kernel was overflowing. |
Hi @Yard1, just informing you that there is now a Mistral Yarn model out. Perhaps a good idea to test if this PR works with it. |
Good catch with the activation kernel @WoosukKwon ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yard1 LGTM. Thanks for submitting the PR! FYI, I also fixed the same kind of overflow error in activation_kernels.cu
.
BTw whats the minimal resourses to deploy this 64k model ? Would vllm support offload method in the future?> |
Signed-off-by: Antoni Baum <[email protected]> Co-authored-by: Viktor Ferenczi <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]>
Signed-off-by: Antoni Baum <[email protected]> Co-authored-by: Viktor Ferenczi <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]>
Supersedes #1161, thank you @viktor-ferenczi for laying down the groundwork :)
This PR implements support for YaRN models.
YaRN paper: https://arxiv.org/abs/2309.00071
YaRN repository: https://github.com/jquesnelle/yarn
Smallest model to test with: https://huggingface.co/NousResearch/Yarn-Llama-2-7b-64k
Closes #980