Pre-Training LLama3.1 on AWS Trainium using Ray and PyTorch Lightning #725

sindhupalakodety · 2025-01-15T04:56:30Z

What does this PR do?

This PR adds an example that combines Ray, PyTorch Lightning (PTL), and AWS Neuron to pre-train the Llama 3.1 model on Trn1 instances. Multiple customers requested this example.

The integration brings together three pieces: PTL's model development API, Ray Train's distributed execution for scaling across multiple nodes, and AWS Neuron's hardware optimization for Trainium. Together they simplify the setup and management of distributed training environments for large-scale, computationally intensive workloads such as large language models.

Motivation

Issue: #724

More

[x] Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
[x] Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
[x] Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
Yes, I ran pre-commit run -a with this PR.

For Moderators

[x] E2E Test successfully complete before merge?

Additional Notes

We tested this for a customer use case and demoed the solution to the customer, who was impressed with the results.

sindhupalakodety added 4 commits January 14, 2025 19:49

code for the issue 724 for data-on-eks

6b03601

Merge branch 'awslabs:main' into main

eedcded

commiting the code for issue 724

694b777

Merge branch 'main' of github.com:sindhupalakodety/data-on-eks

b69c12f
