Error When Reproducing Nvidia's LLama2-70b-LoRA Results #5

Open
mrmhodak opened this issue Aug 26, 2024 · 24 comments
@mrmhodak

Hello,

When trying to reproduce Nvidia's results on a DGX H100, the code cannot be executed and fails with a segmentation fault - see the attached file.

We have found that the error disappears when TP_COMM_OVERLAP is set to FALSE, but then the run takes 38 minutes instead of the expected ~28 minutes.

Please help us resolve this.
mpi_error_message_1 2.txt
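
For context, a minimal sketch of the workaround described above, assuming TP_COMM_OVERLAP is read from the environment via the benchmark's config_*.sh file (the exact file and mechanism are an assumption, not confirmed in this thread):

# Workaround sketch: disabling the tensor-parallel communication overlap
# avoids the segfault at the cost of the slower ~38-minute run noted above.
export TP_COMM_OVERLAP=False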

@mmarcinkiewicz

mmarcinkiewicz commented Aug 26, 2024

Are you using slurm+enroot+pyxis? Or docker?

EDIT: I see "docker exec" in the log. You need slurm+enroot+pyxis to enable TP_COMM_OVERLAP and get the expected performance.
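
For reference, a hedged sketch of what a slurm+enroot+pyxis launch can look like; the image name, mount paths, task count, and launch script below are placeholders, not taken from the submission:

# Build a squashfs image from a registry image, then launch through pyxis.
enroot import -o llama2_70b_lora.sqsh docker://<registry>/<image>:<tag>
srun --ntasks-per-node=8 \
     --container-image=./llama2_70b_lora.sqsh \
     --container-mounts=/raid/data:/data \
     ./run_and_time.sh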

@mmarcinkiewicz

mmarcinkiewicz commented Aug 26, 2024

How did you even get the docker run path? We deprecated that file and did not include it in our submission. Did you dig it up from one of our previous submissions?

@mrmhodak
Author

Hi, we have a single-node system, so we looked at Dell's submission, which has directions for single-node execution using docker:
https://github.com/mlcommons/training_results_v4.0/tree/main/Dell/benchmarks/llama2_70b_lora/implementations/pytorch

My assumption is that they worked with Nvidia.

@mmarcinkiewicz

Ok, I see. Dell also ran with the overlap off, and it indeed cost them ~2 minutes.

We are pretty busy with our new submission, so I don't think we have the time to chase this down this late.
Two quick questions:

  1. Is the data on raid? Is the raid array configured as RAID0? (A quick way to check is sketched after this list.)
  2. How did you get the container running? There were issues with floating (unpinned) libraries in one of our dependencies. We'll sort it out for the next submission, but sadly it's non-trivial to reproduce old results. Did you perhaps make some changes there to make it work?
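
For question 1, a quick way to confirm the array layout with standard Linux tools (nothing benchmark-specific; hardware RAID controllers won't show up in /proc/mdstat):

cat /proc/mdstat                      # software (md) arrays and their RAID level
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT    # block device layout, including raid members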

@blevai

blevai commented Aug 27, 2024

The docker approach won't work. The docker path in the Dell 4.0 submission is most likely leftover code from a failed attempt to bypass the slurm setup and run the same training code with docker instead.

However, when we tried to run the submitted code with slurm (slurm+enroot+pyxis), we ran into some permission issues.

slurm-32-1.txt

@mmarcinkiewicz could you please check the logs? Maybe we made some trivial mistake.

@nehmathe

Hi @mmarcinkiewicz,

  1. Yes, the data is on raid and configured as RAID0.

@balazslevai-htec

balazslevai-htec commented Sep 6, 2024

Hi @mmarcinkiewicz,

So, to be able to try slurm + enroot + pyxis, we had to make some changes to the submission (both are sketched after this list):

  • downgrade the transformers and huggingface_hub libs (huggingface_hub==0.23.2, transformers==4.40.2), because the versions included in the docker image built from the submission's Dockerfile were missing some constants needed at the very start of the training
  • compile NeMo/nemo/collections/nlp/data/language_modeling/megatron/helpers.cpp in advance, in the Dockerfile, because during training that compilation failed due to permission issues
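
A hedged sketch of these two changes as Dockerfile lines. The NeMo clone location (/workspace/ft-llm/NeMo) is an assumption based on paths that appear later in this thread, and the g++ invocation is copied from the compile command shown in the log excerpt further down:

# 1) Pin the two libraries that otherwise come in too new
RUN pip install huggingface_hub==0.23.2 transformers==4.40.2

# 2) Pre-build the megatron dataset helpers so nothing is compiled at run time
#    (clone path is an assumption about the container layout)
RUN cd /workspace/ft-llm/NeMo/nemo/collections/nlp/data/language_modeling/megatron && \
    g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color \
        -I/usr/include/python3.10 \
        -I/usr/local/lib/python3.10/dist-packages/pybind11/include \
        helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so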

After these modifications, we ran into another Python error that we cannot figure out:

slurm-219.txt

Could you please take a look at it?

@mmarcinkiewicz

mmarcinkiewicz commented Sep 9, 2024

Here's how to build a working container (tested on our side):

Replace the upstream NeMo
https://github.com/mlcommons/training_results_v4.0/blob/main/NVIDIA/benchmarks/llama2_70b_lora/implementations/nemo/Dockerfile#L25

RUN git clone https://github.com/NVIDIA/NeMo.git && \
    cd NeMo && \
    echo NEMO_REVISION=${NEMO_REVISION} && \
    git checkout ${NEMO_REVISION} && \
    echo NEMO_COMMIT_HASH=$(git rev-parse HEAD) && \
    pip install --no-build-isolation -e ".[nlp]"

with

RUN git clone https://github.com/ggruza/NeMo.git && \
    cd NeMo && \
    echo NEMO_REVISION=${NEMO_REVISION} && \
    git checkout v2.0.0.rc0.beta_modified && \
    pip install --no-build-isolation -e ".[nlp]"

(please mind that the fork won't be there forever)
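
After editing the Dockerfile, the image has to be rebuilt and re-imported before the slurm run. A hedged sketch, assuming your enroot version can import from the local Docker daemon; the image tag and output file name are placeholders:

docker build -t mlperf-llama2-lora:nemo-pinned .
enroot import -o llama2_70b_lora.sqsh dockerd://mlperf-llama2-lora:nemo-pinned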

Also, please go to https://github.com/mlcommons/training_results_v4.0/blob/main/NVIDIA/benchmarks/llama2_70b_lora/implementations/nemo/requirements.txt
and add the following pinned versions:

botocore==1.34.104
datasets==2.19.1
huggingface-hub==0.23.0
inflect==7.2.1
more-itertools==10.2.0
numcodecs==0.12.1
portalocker==2.8.2
pretty-errors==1.2.25
pytorch-lightning==2.2.4
requests==2.31.0
s3transfer==0.10.1
safetensors==0.4.3
sentry-sdk==2.1.1
torchmetrics==1.4.0
tqdm==4.66.2
transformers==4.40.2
typeguard==4.2.1
wandb==0.17.0

Sorry, I know it's a hassle; we've fixed that in 4.1.
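
A quick, hedged way to confirm inside the built container that the pins took effect (the grep pattern only samples a few of the packages above):

pip list 2>/dev/null | grep -Ei 'transformers|huggingface|pytorch-lightning|torchmetrics|wandb'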

@mmarcinkiewicz

@blevai
there's

0: g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.10 -I/usr/local/lib/python3.10/dist-packages/pybind11/include helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so
0: /usr/bin/ld: cannot open output file helpers.cpython-310-x86_64-linux-gnu.so: Read-only file system
0: collect2: error: ld returned 1 exit status

in your log. Can you make /usr/bin writable?

@mmarcinkiewicz

@balazslevai-htec please build a new container according to the recipe provided above

@matthew-frank

Note that the only change between https://github.com/NVIDIA/[email protected] and [email protected]:ggruza/[email protected]_modified is to pin the versions in the requirements/requirements_nlp.txt file:

$ diff -r NeMo/ ggruza-NeMo/
diff -r NeMo/requirements/requirements_nlp.txt ggruza-NeMo/requirements/requirements_nlp.txt
1,12c1,12
< boto3
< einops
< faiss-cpu
< fasttext
< flask_restful
< ftfy
< gdown
< h5py
< ijson
< jieba
< markdown2
< matplotlib>=3.3.2
---
> boto3==1.34.104
> einops==0.7.0
> faiss-cpu==1.8.0
> fasttext==0.9.2
> flask_restful==0.3.10
> ftfy==6.2.0
> gdown==5.2.0
> h5py==3.11.0
> ijson==3.2.3
> jieba==0.42.1
> markdown2==2.4.13
> matplotlib==3.8.4
14,22c14,22
< nltk>=3.6.5
< opencc<1.1.7
< pangu
< rapidfuzz
< rouge_score
< sacrebleu  # manually install sacrebleu[ja] for Japanese support; MeCab is unsupported in Python 3.11+
< sentence_transformers
< tensorstore<0.1.46
< zarr
---
> nltk==3.8.1
> opencc==1.1.6
> pangu==4.0.6.1
> rapidfuzz==3.9.0
> rouge_score==0.1.2
> sacrebleu==2.4.2
> sentence_transformers==2.7.0
> tensorstore==0.1.45
> zarr==2.18.0

@balazslevai-htec

Hi @matthew-frank and @mmarcinkiewicz,

thank you for the support. Regarding the error message "/usr/bin/ld: cannot open output file helpers.cpython-310-x86_64-linux-gnu.so: Read-only file system": /usr/bin/ld is the C++ linker; the actual permission issue is in NeMo/nemo/collections/nlp/data/language_modeling/megatron, which has no write permission after cloning and is where the training tries to compile helpers.cpp at runtime. In any case, I added that compilation to the Dockerfile, so it is not a problem anymore.

Besides the above, I followed the docker recipe modifications to the letter, but received the same error message, only in a different format:

 0: attention.py 2399 forward
 0: out_fp8, aux_ctx_tensors = fused_attn_fwd(
 0: 
 0: fused_attn.py 853 fused_attn_fwd
 0: output_tensors = tex.fused_attn_fwd(
 0: 
 0: RuntimeError:
 0: /workspace/ft-llm/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_fp8.cu:2066 in function operator(): cuDNN Error: Tensor 'sdpa_fp8::Amax_O' strides not set.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.

The complete log is log-236.txt
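
As the error message itself suggests, more detail can be obtained from cuDNN by exporting the two variables it names before launching:

export CUDNN_LOGERR_DBG=1
export CUDNN_LOGDEST_DBG=stderr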

@mmarcinkiewicz

Can you dump printenv and attach it as a file?
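
For example (the output file name below just matches the attachment in the next comment):

printenv | sort > env_output.txt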

@nehmathe

Hi @mmarcinkiewicz,

Here is the printenv dump:
env_output.txt

Thanks.

@mrmhodak
Author

Any suggestions: @matthew-frank @mmarcinkiewicz ?

@mmarcinkiewicz

I don't see anything suspicious. Is there a way you can share the container with us? Either push it to dockerhub or share it as a sqsh file?

Also, a random idea - does your node have python installed? Sometimes enroot, for some reason, picks up a python from outside the container instead of the one inside it. Adding --no-container-mount-home to your srun command sometimes helps.
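
For reference, the flag goes directly on the srun line; the other arguments below are placeholders carried over from the earlier sketch:

srun --container-image=./llama2_70b_lora.sqsh \
     --container-mounts=/raid/data:/data \
     --no-container-mount-home \
     ./run_and_time.sh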

@mrmhodak
Author

@mmarcinkiewicz: I have sent the container location to @ShriyaPalsamudram over email - we do not want to share it publicly and I do not have your email address. Please share it internally and let us know.

@mrmhodak
Author

@mmarcinkiewicz : Any update?

@mmarcinkiewicz

We were able to repro. We're trying to understand what the difference is.

@matthew-frank

Those "gdrcopy open failed" lines are really suspicious. I have no idea where those are coming from.

@zhenghuanbo

zhenghuanbo commented Sep 19, 2024

@mmarcinkiewicz: I get the same error:
/workspace/ft-llm/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_fp8.cu:2066 in function operator(): cuDNN Error: Tensor 'sdpa_fp8::Amax_O' strides not set.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.

New issue: #6

@mmarcinkiewicz

@mrmhodak @blevai @zhenghuanbo it seems that TE had submodules that had not been frozen. Here's the recipe to fix it. Please modify the TE install block in the Dockerfile to the following:

ARG TE_REVISION=v1.6rc2
ENV CUSTOM_TE_REVISION ${TE_REVISION}

ARG CUDNN_FRONTEND_REVISION=1b0b5eac540b7f8fd19b18f1e6b8427c95503348
ENV CUSTOM_CUDNN_FRONTEND_REVISION ${CUDNN_FRONTEND_REVISION}

ARG GTEST_REVISION=f8d7d77c06936315286eb55f8de22cd23c188571
ENV CUSTOM_GTEST_REVISION ${GTEST_REVISION}

RUN if [ "${TE_REVISION}" != SKIP ]; then \
      git clone https://github.com/NVIDIA/TransformerEngine.git && \
      cd TransformerEngine && \
      git submodule init && git submodule update && \
      echo TE_REVISION=${TE_REVISION} && \
      git checkout ${CUSTOM_TE_REVISION} && \
      # Checkout specific commit for cudnn-frontend submodule
      cd 3rdparty/cudnn-frontend && \
      git checkout ${CUSTOM_CUDNN_FRONTEND_REVISION} && \
      echo CUDNN_FRONTEND_COMMIT_HASH=$(git rev-parse HEAD) && \
      cd - && \
      # Checkout specific commit for googletest submodule
      cd 3rdparty/googletest && \
      git checkout ${CUSTOM_GTEST_REVISION} && \
      echo GTEST_COMMIT_HASH=$(git rev-parse HEAD) && \
      cd - && \
      echo TE_COMMIT_HASH=$(git rev-parse HEAD) && \
      NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install --force-reinstall --no-deps . \
    ; fi

please retry
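
For completeness, a hedged way to verify inside the built container that the submodules actually ended up on the pinned commits (run from the directory the Dockerfile cloned into):

cd TransformerEngine/3rdparty/cudnn-frontend && git rev-parse HEAD   # expect 1b0b5eac...
cd ../googletest && git rev-parse HEAD                               # expect f8d7d77c...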

@mrmhodak
Author

@mmarcinkiewicz: This works for us - thanks!

One more thing: @matthew-frank pointed out that we still see errors with "gdrcopy open failed". We have gdrcopy installed - any ideas what is going on there?
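
For reference, two generic gdrcopy sanity checks; they assume only that gdrcopy's kernel module is gdrdrv and that it exposes /dev/gdrdrv, nothing specific to this benchmark:

lsmod | grep gdrdrv    # is the gdrcopy kernel module loaded on the host?
ls -l /dev/gdrdrv      # is the device node present (and mounted into the container)?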

@zhenghuanbo

@mmarcinkiewicz Thank you very much, the error is resolved.
