Error When Reproducing Nvidia's LLama2-70b-LoRA Results #5
Comments
Are you using slurm+enroot+pyxis? Or docker? EDIT: I see "docker exec". You need slurm+enroot+pyxis to enable TP_COMM_OVERLAP and get the perf.
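For context, a single-node slurm+enroot+pyxis launch generally looks like the sketch below; the sqsh image path, mounts, and launcher script name are illustrative assumptions, not values from NVIDIA's submission.

```sh
# Hedged sketch of a slurm+enroot+pyxis launch; all paths are placeholders.
srun --nodes=1 --ntasks-per-node=8 \
     --container-image=./llama2_70b_lora.sqsh \
     --container-mounts=/raid/data:/data \
     --container-workdir=/workspace \
     bash ./run_and_time.sh
```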
How did you even get the docker run? We deprecated that file and did not include it in our submission. Did you dig it up from our previous submissions?
Hi, we have a single-node system, so we looked at Dell's submission, which has directions for single-node execution using docker. My assumption is that they worked with Nvidia.
Ok, I see. Dell also ran with the overlap off, and it indeed cost them ~2 minutes. We are pretty busy with our new submission, so I don't think we have the time to chase it that late.
The docker approach won't work. The docker approach in the Dell 4.0 submission is most likely leftover code from a failed attempt to bypass slurm and run the same training code with docker instead. However, when we tried to run the submitted code with slurm (slurm+enroot+pyxis), we ran into some permission issues. @mmarcinkiewicz could you please check the logs? Maybe we made some trivial mistake.
Hi @mmarcinkiewicz,
Hi @mmarcinkiewicz, To be able to try slurm + enroot + pyxis, we had to make some changes to the submission:
After these modifications we ran into another Python error that we cannot figure out. Could you please take a look at it?
Here's how to build a working container (tested on our side): replace the upstream NeMo
with
(please mind that the fork won't be there forever). Also, please go to https://github.com/mlcommons/training_results_v4.0/blob/main/NVIDIA/benchmarks/llama2_70b_lora/implementations/nemo/requirements.txt
Sorry, I know it's a hassle; we've fixed that in 4.1.
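The exact repository references did not survive extraction here. As a hedged sketch of what the swap looks like in the Dockerfile (the checkout refs are placeholders; only the fork owner "ggruza" is visible later in this thread):

```dockerfile
# Hedged sketch of the NeMo source swap described above. <upstream-ref> and
# <fork-ref> are placeholders, not the actual refs from NVIDIA's recipe.
# Before:
#   RUN git clone https://github.com/NVIDIA/NeMo.git && cd NeMo && git checkout <upstream-ref>
# After:
RUN git clone https://github.com/ggruza/NeMo.git && \
    cd NeMo && git checkout <fork-ref>
```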
@blevai I see
/usr/bin/ld: cannot open output file helpers.cpython-310-x86_64-linux-gnu.so: Read-only file system
in your log. Can you make /usr/bin writable?
@balazslevai-htec please build a new container according to the recipe provided above.
Note that the only change between https://github.com/NVIDIA/[email protected] and [email protected]:ggruza/[email protected]_modified is to pin the versions in the requirements/requirements_nlp.txt file:
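The pinned list itself was not captured in this thread. Purely as a hypothetical illustration of what pinning means, a floating specifier is replaced with an exact version:

```
# Hypothetical illustration only; the real package names and versions pinned
# in the fork's requirements_nlp.txt were not captured in this thread.
# before:
#   transformers>=4.36.0
# after:
transformers==4.40.2
```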
Hi @matthew-frank and @mmarcinkiewicz, thank you for the support. Regarding the error message /usr/bin/ld: cannot open output file helpers.cpython-310-x86_64-linux-gnu.so: Read-only file system: /usr/bin/ld is the C++ linker, and the permission issue is in NeMo/nemo/collections/nlp/data/language_modeling/megatron, which has no write permission after cloning; the training tries to compile helpers.cpp there at runtime. I added this compilation to the Dockerfile, so that's not a problem anymore. Besides the above, I followed the docker recipe modifications to the letter but received the same error message, only in a different format:
The complete log is log-236.txt
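The Dockerfile workaround mentioned above (compiling the megatron helpers at build time, while the image layer is still writable) might look like this sketch; the NeMo checkout path is an assumption, and it assumes the directory ships the Makefile that NeMo's runtime compile step would otherwise invoke:

```dockerfile
# Hedged sketch: pre-build the megatron dataset helpers (helpers.cpp) during
# the image build instead of at runtime. /workspace/NeMo is an assumed path.
RUN cd /workspace/NeMo/nemo/collections/nlp/data/language_modeling/megatron && \
    make
```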
Can you dump printenv and attach it as a file?
Hi @mmarcinkiewicz, Here is the printenv dump: Thanks.
Any suggestions, @matthew-frank @mmarcinkiewicz?
I don't see anything suspicious. Is there a way you can share the container with us, either pushed to dockerhub or as a sqsh file? Also, a random idea: does your node have python installed? Sometimes enroot for some reason uses python from the host root instead of the container. Adding
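The specific addition was cut off above. As a generic sanity check along those lines (not the commenter's truncated suggestion), one can confirm which python the containerized job resolves; the image path is a placeholder:

```sh
# Hedged check: print the python actually resolved inside the container.
# If this prints a host path, the environment is leaking the host interpreter.
srun --container-image=./llama2_70b_lora.sqsh which python3
```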
@mmarcinkiewicz: I have sent a container location to @ShriyaPalsamudram over email; we do not want to share it publicly and I do not have your email. Please share internally and let us know.
@mmarcinkiewicz: Any update?
We were able to repro. Trying to understand what the difference is.
those
@mmarcinkiewicz: I get the same error; new issue: #6
@mrmhodak @blevai @zhenghuanbo it seems that TE had submodules that were not frozen. Here's the recipe to fix it. Please modify the TE install block in the dockerfile to the following:
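The block that followed was not captured in this thread. As a hedged sketch of what freezing TransformerEngine's submodules in a Dockerfile generally looks like (<TE_COMMIT> is a placeholder, not NVIDIA's actual pin):

```dockerfile
# Hedged sketch: pin TransformerEngine to an exact commit and check out its
# submodules at that commit, so later upstream changes cannot leak in.
RUN git clone https://github.com/NVIDIA/TransformerEngine.git /opt/TransformerEngine && \
    cd /opt/TransformerEngine && \
    git checkout <TE_COMMIT> && \
    git submodule update --init --recursive && \
    NVTE_FRAMEWORK=pytorch pip install --no-build-isolation /opt/TransformerEngine
```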
Please retry.
@mmarcinkiewicz: This works for us, thanks! One more thing @matthew-frank pointed out: we still have errors with "gdrcopy open failed". We have gdrcopy installed; any ideas what is going on there?
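"gdrcopy open failed" usually points at the host-side gdrdrv kernel module or its device node rather than the userspace library. Generic checks (not from this thread) look like:

```sh
# Generic diagnostics for "gdrcopy open failed":
lsmod | grep gdrdrv   # is the gdrcopy kernel module loaded on the host?
ls -l /dev/gdrdrv     # does the device node exist, and is it visible in the container?
```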
@mmarcinkiewicz Thank you very much, the error is resolved. |
Hello,
when trying to reproduce Nvidia's results on DGX H100, the code cannot be executed and results in a Segmentation Fault; see the attached file.
We have found that the error disappears when TP_COMM_OVERLAP is set to FALSE, but then the run takes 38 minutes instead of ~28 minutes.
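A minimal sketch of the switch we mean, assuming the flag is consumed as an environment variable by the run config (the exact consumption mechanism is an assumption):

```sh
# Hedged sketch: disable tensor-parallel communication overlap. The variable
# name is as used in this thread; how the scripts read it is assumed.
export TP_COMM_OVERLAP=False
```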
Please help us resolve this.
mpi_error_message_1 2.txt