
feat: NVIDIA NIM on EKS Pattern #565

Merged: 17 commits into awslabs:main, Jul 11, 2024

Conversation

hustshawn (Contributor)

What does this PR do?

🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull-requests.

Motivation

This PR is to address #560

More

  • Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
  • Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E Test successfully complete before merge?

Additional Notes

@hustshawn changed the title from "NVIDIA NIM on EKS Pattern" to "feat: NVIDIA NIM on EKS Pattern" on Jun 30, 2024
@askulkarni2 (Collaborator) left a comment

Can we deploy this using the existing Triton + vLLM stack? From the NIM docs, it seems NIM can detect TensorRT-LLM or vLLM profiles for optimizations. Also, can we include an observability section?
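
On the observability point, one way to wire it up is through the addons module's kube-prometheus-stack. A minimal sketch, assuming the aws-ia/eks-blueprints-addons module already used in this repository (the inputs shown are illustrative, not the blueprint's exact configuration):

```hcl
module "eks_blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.2"

  # ...cluster name, endpoint, version, and OIDC provider inputs go here...

  # Deploys Prometheus and Grafana, which NIM/Triton dashboards can build on.
  enable_kube_prometheus_stack = true
}
```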

@vara-bonthu (Collaborator) left a comment

+1 to @askulkarni2. I had a conversation with @hustshawn and explained the necessary changes to reuse the existing blueprint, as mentioned in the issue here.

1/ Use the existing NVIDIA Triton blueprint.
2/ Create a new file called nvidia-nim.tf that adds the EFS resources and the NIM Helm chart, gated by a new variable enable_nvidia_nim=true (see the sketch after this list).
3/ Add a new variable enable_nvidia_triton_server=true as the default, so the Triton server can be disabled when NIM is enabled.
4/ Add labels and tolerations to the NVIDIA NIM Helm chart values.
5/ Show observability dashboards using Grafana in the website docs.
6/ Create a Python client that reads prompts from a file and generates responses from the hosted NIM model.
7/ Create a comparison between NVIDIA Triton with vLLM and NVIDIA NIM using the same model, and publish this as part of the website docs.
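
To make items 2/ through 4/ concrete, here is a minimal sketch of what nvidia-nim.tf could look like. The variable names follow the list above, but the EFS resource, chart name, and values keys are illustrative assumptions rather than the merged implementation:

```hcl
variable "enable_nvidia_nim" {
  description = "Deploy the NVIDIA NIM Helm chart and its supporting EFS resources"
  type        = bool
  default     = false
}

variable "enable_nvidia_triton_server" {
  description = "Deploy the NVIDIA Triton server; typically disabled when NIM is enabled"
  type        = bool
  default     = true
}

# EFS file system assumed to back the NIM model cache.
resource "aws_efs_file_system" "nim_model_cache" {
  count     = var.enable_nvidia_nim ? 1 : 0
  encrypted = true

  tags = {
    Name = "nvidia-nim-model-cache"
  }
}

# NIM Helm release pinned to GPU nodes via nodeSelector and tolerations
# (chart name and values keys are placeholders; consult the NIM Helm chart docs).
resource "helm_release" "nvidia_nim" {
  count            = var.enable_nvidia_nim ? 1 : 0
  name             = "nvidia-nim"
  namespace        = "nim"
  create_namespace = true
  chart            = "nim-llm"

  values = [yamlencode({
    nodeSelector = { "nvidia.com/gpu.present" = "true" }
    tolerations = [{
      key      = "nvidia.com/gpu"
      operator = "Exists"
      effect   = "NoSchedule"
    }]
  })]
}
```

Gating both the EFS resources and the Helm release on the same enable_nvidia_nim flag keeps the NIM pieces fully optional without touching the Triton path.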

@hustshawn (Contributor, Author)

Sure, that makes sense to me. I will make changes accordingly and update you once it's ready for your review again.

@vara-bonthu added the "gen-ai pattern" label (Distributed Training and Inference Patterns for Various Generative AI Large Language Models (LLMs)) on Jul 3, 2024
@hustshawn (Contributor, Author)

7/ Create a comparison between NVIDIA Triton with vLLM and NVIDIA NIM using the same model, and publish this as part of the website docs.

Hi @vara-bonthu, in terms of the comparison, there are some limitations here:

  1. The current triton_vllm pattern only provides Llama2 and Mistral-7B models, while the NIM pattern uses Llama3 models.
  2. NIM has built-in vLLM and TRT-LLM. Should I just compare Llama3-8B vs Llama3-70B on NIM, so that it is an apples-to-apples comparison?
  3. I created a script similar to triton_client, but the results differ significantly, so I don't think it would be a valid comparison with the existing one mentioned in the Triton doc.
Loading inputs from `prompts.txt`...
Model meta/llama3-8b-instruct - Request 14: 4877.96 ms
Model meta/llama3-8b-instruct - Request 10: 6582.67 ms
Model meta/llama3-8b-instruct - Request 3: 7919.11 ms
Model meta/llama3-8b-instruct - Request 15: 7972.63 ms
Model meta/llama3-8b-instruct - Request 1: 8646.89 ms
Model meta/llama3-8b-instruct - Request 5: 8933.86 ms
Model meta/llama3-8b-instruct - Request 12: 9068.71 ms
Model meta/llama3-8b-instruct - Request 18: 9932.78 ms
Model meta/llama3-8b-instruct - Request 0: 10393.83 ms
Model meta/llama3-8b-instruct - Request 6: 10416.12 ms
Model meta/llama3-8b-instruct - Request 16: 10688.25 ms
Model meta/llama3-8b-instruct - Request 4: 10735.83 ms
Model meta/llama3-8b-instruct - Request 11: 10938.80 ms
Model meta/llama3-8b-instruct - Request 8: 11117.52 ms
Model meta/llama3-8b-instruct - Request 17: 12112.03 ms
Model meta/llama3-8b-instruct - Request 2: 12302.77 ms
Model meta/llama3-8b-instruct - Request 19: 12796.39 ms
Model meta/llama3-8b-instruct - Request 9: 13636.33 ms
Model meta/llama3-8b-instruct - Request 13: 13731.89 ms
Model meta/llama3-8b-instruct - Request 7: 15027.77 ms
Storing results into `results.txt`...
Total time for all requests: 207.83 seconds (207832.13 milliseconds)
PASS: NVIDIA NIM example

@hustshawn (Contributor, Author)

Hi @vara-bonthu , please review again. Thanks.

@vara-bonthu (Collaborator) commented on Jul 7, 2024

Thanks for updating the PR @hustshawn! I will test your PR and update accordingly.

@askulkarni2 @ratnopamc please review

@vara-bonthu (Collaborator) left a comment

A couple of minor comments on the PR.

@@ -140,6 +145,20 @@ module "eks_blueprints_addons" {
],
}

helm_releases = {
Collaborator, commenting on the hunk above:

Nice!
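
For context on the hunk above: the aws-ia/eks-blueprints-addons module exposes a helm_releases map and creates one helm_release resource per entry. A minimal sketch of the pattern (the entry shown is a generic example, not the exact release added in this PR):

```hcl
module "eks_blueprints_addons" {
  source = "aws-ia/eks-blueprints-addons/aws"

  # ...cluster inputs as already set earlier in this module block...

  # Each map entry below becomes a managed helm_release resource.
  helm_releases = {
    nvidia-device-plugin = {
      description      = "Exposes node GPUs to Kubernetes as schedulable resources"
      chart            = "nvidia-device-plugin"
      repository       = "https://nvidia.github.io/k8s-device-plugin"
      chart_version    = "0.14.5"
      namespace        = "nvidia-device-plugin"
      create_namespace = true
    }
  }
}
```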

Resolved review threads:
  • ai-ml/nvidia-triton-server/nvidia-nim.tf (two threads, resolved)
  • ai-ml/nvidia-triton-server/variables.tf (outdated, resolved)
@hustshawn (Contributor, Author)

@vara-bonthu updated based on our discussion and the latest comment. Please review again. Thanks.

@vara-bonthu (Collaborator) left a comment

LGTM! Thanks for the updates 🙌🏼

@vara-bonthu (Collaborator)

@askulkarni2 please review

@ratnopamc (Collaborator) left a comment

@hustshawn thanks for the PR. Could you please review and address the comments?

@askulkarni2 (Collaborator) left a comment

LGTM!

@vara-bonthu merged commit 78d4e0f into awslabs:main on Jul 11, 2024 (36 of 37 checks passed).
ovaleanu pushed a commit to ovaleanu/data-on-eks that referenced this pull request Aug 10, 2024