Not able to load llama 3 70b on inf2.24xlarge instance #92
Comments
Hi @sangraamp, thanks for filing the issue. The example is verified to work with TP=24. For TP=12, while the formula can give a rough back-of-envelope estimate, there is additional memory usage from the runtime and compiler that pushes the total above just the parameters and KV cache. You can try limiting the number of buckets by setting n_positions to an explicit list (bucketing info). Instead of passing a scalar n_positions:
Making n_positions a list will force a single bucket to be compiled:
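Roughly, as a sketch (the loader call, checkpoint path, and sequence length below are placeholders, not your exact script):

```python
from transformers_neuronx import LlamaForSampling

model_path = "Meta-Llama-3-70B-Instruct-split"  # placeholder checkpoint directory

# Instead of a scalar n_positions, which lets the compiler build several
# buckets up to that maximum, e.g.:
#   neuron_model = LlamaForSampling.from_pretrained(
#       model_path, batch_size=1, tp_degree=12, amp='f16', n_positions=8192)
#
# pass an explicit single-element list so only one bucket is compiled:
neuron_model = LlamaForSampling.from_pretrained(
    model_path,
    batch_size=1,
    tp_degree=12,
    amp='f16',
    n_positions=[8192],  # explicit bucket list -> single bucket
)
neuron_model.to_neuron()
```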
@jeffhataws Thanks a lot for the workaround! I can successfully load the model now, so the original loading issue is resolved. The problem I'm facing now, though, is quite weird.
Here is the output/gibberish I get:
When I run the same code with Llama 3 8B Instruct, it gives coherent outputs:
I can't seem to figure out what is going wrong here.
@sangraamp thanks for reporting back, and glad that you have made progress. I agree that the Llama 3 70B output is not expected and will investigate. One thing you can try is to match the flag recommended in the example by just having:
@jeffhataws Thanks, I tried removing the flag but am still facing the same issue.
Thank you @sangraamp. We have reproduced the issue with TP=12 and will be looking into it.
That's great, @jeffhataws! Hoping to hear from you soon regarding any steps I'll need to take, if necessary, to avoid this error.
Here is the code I am using, taken directly from the aws-neuron-samples repository:
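(The snippet itself did not come through above; what follows is a rough sketch of that notebook's sampling flow with tp_degree set to 12, where the checkpoint path and generation parameters are placeholders rather than my exact values.)

```python
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx import LlamaForSampling

model_path = "Meta-Llama-3-70B-Instruct"  # placeholder local checkpoint directory

# Shard the float16 weights across 12 NeuronCores and compile.
neuron_model = LlamaForSampling.from_pretrained(
    model_path, batch_size=1, tp_degree=12, amp="f16"
)
neuron_model.to_neuron()

tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode("Hello, I'm a language model,", return_tensors="pt")

# Generate with top-k sampling on the NeuronCores.
with torch.inference_mode():
    start = time.time()
    generated = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
    print(f"generated in {time.time() - start:.1f}s")

print(tokenizer.decode(generated[0]))
```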
And this is the error I am getting:
However, as mentioned in this notebook:
The memory required to host any model can be computed with:

total memory = bytes per parameter * number of parameters

When using float16-casted weights for a 70 billion parameter model, this works out to 2 * 70B, or ~140GB of weights. In reality, the total space required is often greater than just the number of parameters due to caching attention layer projections (KV caching). This caching mechanism grows memory allocations linearly with sequence length and batch size.

To get very large language models to fit on Inf2 & Trn1, tensor parallelism is used to split weights, data, and compute across multiple NeuronCores. The number of NeuronCores that the weights are split across can be controlled by setting the tp_degree parameter. This parallelism degree must be chosen to ensure that the memory usage per NeuronCore will be less than the physical 16GB limit. When configuring tensor parallelism, the memory per NeuronCore can be computed with:

memory per core = (bytes per parameter * number of parameters) / tp_degree

This can be used to compute the minimum instance sizing by ensuring that the value selected for tp_degree results in less than 16GB allocated per NeuronCore.

Using this formula and plugging in the tp_degree I am using (12), I get a per-core memory of 2 * 70B / 12 = 140GB / 12, which is roughly 11.7GB per core, well under the 16GB limit. I cannot figure out why the cores are running out of memory in this case.