Multi-node without hosts or hostfile, using torch.distributed.launch #5728
casper-hansen asked this question in Community | Q&A
Hi ColossalAI community. I have a simple question about how to use ColossalAI in my multi-node environment. I am looking to train Llama 3 and other similar models, but I could not find an example or documentation that fully answers my question.
I do not have a hostfile available at training time, unlike what is showcased for Llama 3 70B. I only have environment variables such as `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`, `LOCAL_RANK`, and `NODE_RANK`, which are set at runtime by the cluster (AzureML in this case). Is it possible to use ColossalAI for multi-node training, e.g. by launching with a command along the following lines?
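Here is a minimal sketch of the kind of launch command I mean, assuming `torchrun` (the modern entry point for `torch.distributed.launch`); `NUM_NODES` and `GPUS_PER_NODE` are placeholders for the cluster shape:

```bash
# Run on every node; AzureML provides NODE_RANK, MASTER_ADDR and MASTER_PORT.
# torchrun then exports RANK, LOCAL_RANK and WORLD_SIZE to each worker process.
torchrun \
  --nnodes=$NUM_NODES \
  --nproc_per_node=$GPUS_PER_NODE \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  train.py
```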
And then calling `colossalai.launch_from_torch` inside `train.py` to wire torch and ColossalAI together for multi-node training? Can you provide an example of how to do this with the Llama 3 example? That would be helpful!
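For concreteness, this is roughly the shape of `train.py` I have in mind: a sketch assuming the `Booster` API, with a toy model and `TorchDDPPlugin` standing in for the real Llama 3 model and parallelism plugin:

```python
# train.py: launched once per GPU by torchrun, which exports RANK,
# LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for each process.
import colossalai
import torch.nn as nn
from torch.optim import Adam
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

# launch_from_torch reads the torchrun environment variables to initialize
# the distributed backend. Note: older ColossalAI releases take a config
# dict as the first argument, e.g. colossalai.launch_from_torch(config={}).
colossalai.launch_from_torch()

# A toy model stands in for the Llama 3 model from the examples.
model = nn.Linear(16, 16)
optimizer = Adam(model.parameters(), lr=1e-3)

# TorchDDPPlugin is a placeholder; the Llama 3 example uses plugins such as
# HybridParallelPlugin or GeminiPlugin for large models.
plugin = TorchDDPPlugin()
booster = Booster(plugin=plugin)
model, optimizer, *_ = booster.boost(model, optimizer)
```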