Error When Reproducing Nvidia's LLama2-70b-LoRA Results #5

Open
mrmhodak opened this issue Aug 26, 2024 · 24 comments
@mrmhodak

Hello,

When trying to reproduce Nvidia's results on a DGX H100, the code cannot be executed and fails with a segmentation fault - see the attached file.

We have found that the error disappears when TP_COMM_OVERLAP is set to FALSE, but then the run takes 38 minutes instead of the expected ~28 minutes.

Please help us resolve this.
mpi_error_message_1 2.txt
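
For context, a minimal sketch of the workaround described above, assuming TP_COMM_OVERLAP is read from the environment via the benchmark's config_*.sh file (the exact file and mechanism are an assumption, not confirmed in this thread):

# Workaround sketch: disabling the tensor-parallel communication overlap
# avoids the segfault at the cost of the slower ~38-minute run noted above.
export TP_COMM_OVERLAP=False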

@mmarcinkiewicz

mmarcinkiewicz commented Aug 26, 2024

Are you using slurm+enroot+pyxis? Or docker?

EDIT: I see "docker exec" in the log. You need slurm+enroot+pyxis to enable TP_COMM_OVERLAP and get the expected performance.
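
For reference, a hedged sketch of what a slurm+enroot+pyxis launch can look like; the image name, mount paths, task count, and launch script below are placeholders, not taken from the submission:

# Build a squashfs image from a registry image, then launch through pyxis.
enroot import -o llama2_70b_lora.sqsh docker://<registry>/<image>:<tag>
srun --ntasks-per-node=8 \
     --container-image=./llama2_70b_lora.sqsh \
     --container-mounts=/raid/data:/data \
     ./run_and_time.sh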

@mmarcinkiewicz

mmarcinkiewicz commented Aug 26, 2024

How did you even get the docker run path? We deprecated that file and did not include it in our submission. Did you dig it up from one of our previous submissions?

@mrmhodak
Author

Hi, we have a single-node system, so we looked at Dell's submission, which has directions for single-node execution using docker:
https://github.com/mlcommons/training_results_v4.0/tree/main/Dell/benchmarks/llama2_70b_lora/implementations/pytorch

My assumption is that they worked with Nvidia.

@mmarcinkiewicz

Ok, I see. Dell also ran with the overlap off, and it indeed cost them ~2 minutes.

We are pretty busy with our new submission, so I don't think we have the time to chase this down this late.
Two quick questions:

  1. Is the data on raid? Is the raid array configured as RAID0? (A quick way to check is sketched after this list.)
  2. How did you get the container running? There were issues with floating (unpinned) libraries in one of our dependencies. We'll sort it out for the next submission, but sadly it's non-trivial to reproduce old results. Did you perhaps make some changes there to make it work?
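
For question 1, a quick way to confirm the array layout with standard Linux tools (nothing benchmark-specific; hardware RAID controllers won't show up in /proc/mdstat):

cat /proc/mdstat                      # software (md) arrays and their RAID level
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT    # block device layout, including raid members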

@blevai

blevai commented Aug 27, 2024

The docker approach won't work. The docker path in the Dell 4.0 submission is most likely leftover code from a failed attempt to bypass the slurm setup and run the same training code with docker instead.

However, when we tried to run the submitted code with slurm (slurm+enroot+pyxis), we ran into some permission issues.

slurm-32-1.txt

@mmarcinkiewicz could you please check the logs? Maybe we made some trivial mistake.

@nehmathe

Hi @mmarcinkiewicz,

  1. Yes, the data is on raid and configured as RAID0.

@balazslevai-htec

balazslevai-htec commented Sep 6, 2024

Hi @mmarcinkiewicz,

So, to be able to try slurm + enroot + pyxis, we had to make some changes to the submission (both are sketched after this list):

  • downgrade the transformers and huggingface_hub libs (huggingface_hub==0.23.2, transformers==4.40.2), because the versions included in the docker image built from the submission's Dockerfile were missing some constants needed at the very start of the training
  • compile NeMo/nemo/collections/nlp/data/language_modeling/megatron/helpers.cpp in advance, in the Dockerfile, because during training that compilation failed due to permission issues
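
A hedged sketch of these two changes as Dockerfile lines. The NeMo clone location (/workspace/ft-llm/NeMo) is an assumption based on paths that appear later in this thread, and the g++ invocation is copied from the compile command shown in the log excerpt further down:

# 1) Pin the two libraries that otherwise come in too new
RUN pip install huggingface_hub==0.23.2 transformers==4.40.2

# 2) Pre-build the megatron dataset helpers so nothing is compiled at run time
#    (clone path is an assumption about the container layout)
RUN cd /workspace/ft-llm/NeMo/nemo/collections/nlp/data/language_modeling/megatron && \
    g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color \
        -I/usr/include/python3.10 \
        -I/usr/local/lib/python3.10/dist-packages/pybind11/include \
        helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so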

After these modifications, we ran into another Python error that we cannot figure out:

slurm-219.txt

Could you please take a look at it?

@mmarcinkiewicz

mmarcinkiewicz commented Sep 9, 2024

Here's how to build a working container (tested on our side):

Replace the upstream NeMo
https://github.com/mlcommons/training_results_v4.0/blob/main/NVIDIA/benchmarks/llama2_70b_lora/implementations/nemo/Dockerfile#L25

RUN git clone https://github.com/NVIDIA/NeMo.git && \
    cd NeMo && \
    echo NEMO_REVISION=${NEMO_REVISION} && \
    git checkout ${NEMO_REVISION} && \
    echo NEMO_COMMIT_HASH=$(git rev-parse HEAD) && \
    pip install --no-build-isolation -e ".[nlp]"

with

RUN git clone https://github.com/ggruza/NeMo.git && \
    cd NeMo && \
    echo NEMO_REVISION=${NEMO_REVISION} && \
    git checkout v2.0.0.rc0.beta_modified && \
    pip install --no-build-isolation -e ".[nlp]"

(please mind that the fork won't be there forever)
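
After editing the Dockerfile, the image has to be rebuilt and re-imported before the slurm run. A hedged sketch, assuming your enroot version can import from the local Docker daemon; the image tag and output file name are placeholders:

docker build -t mlperf-llama2-lora:nemo-pinned .
enroot import -o llama2_70b_lora.sqsh dockerd://mlperf-llama2-lora:nemo-pinned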

Also, please go to https://github.com/mlcommons/training_results_v4.0/blob/main/NVIDIA/benchmarks/llama2_70b_lora/implementations/nemo/requirements.txt
and add the following pinned versions:

botocore==1.34.104
datasets==2.19.1
huggingface-hub==0.23.0
inflect==7.2.1
more-itertools==10.2.0
numcodecs==0.12.1
portalocker==2.8.2
pretty-errors==1.2.25
pytorch-lightning==2.2.4
requests==2.31.0
s3transfer==0.10.1
safetensors==0.4.3
sentry-sdk==2.1.1
torchmetrics==1.4.0
tqdm==4.66.2
transformers==4.40.2
typeguard==4.2.1
wandb==0.17.0

Sorry, I know it's a hassle; we've fixed that in 4.1.
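
A quick, hedged way to confirm inside the built container that the pins took effect (the grep pattern only samples a few of the packages above):

pip list 2>/dev/null | grep -Ei 'transformers|huggingface|pytorch-lightning|torchmetrics|wandb'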

@mmarcinkiewicz

@blevai
there's

0: g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.10 -I/usr/local/lib/python3.10/dist-packages/pybind11/include helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so
0: /usr/bin/ld: cannot open output file helpers.cpython-310-x86_64-linux-gnu.so: Read-only file system
0: collect2: error: ld returned 1 exit status

in your log. Can you make /usr/bin writable?

@mmarcinkiewicz

@balazslevai-htec please build a new container according to the recipe provided above

@matthew-frank

Note that the only change between https://github.com/NVIDIA/[email protected] and [email protected]:ggruza/[email protected]_modified is to pin the versions in the requirements/requirements_nlp.txt file:

$ diff -r NeMo/ ggruza-NeMo/
diff -r NeMo/requirements/requirements_nlp.txt ggruza-NeMo/requirements/requirements_nlp.txt
1,12c1,12
< boto3
< einops
< faiss-cpu
< fasttext
< flask_restful
< ftfy
< gdown
< h5py
< ijson
< jieba
< markdown2
< matplotlib>=3.3.2
---
> boto3==1.34.104
> einops==0.7.0
> faiss-cpu==1.8.0
> fasttext==0.9.2
> flask_restful==0.3.10
> ftfy==6.2.0
> gdown==5.2.0
> h5py==3.11.0
> ijson==3.2.3
> jieba==0.42.1
> markdown2==2.4.13
> matplotlib==3.8.4
14,22c14,22
< nltk>=3.6.5
< opencc<1.1.7
< pangu
< rapidfuzz
< rouge_score
< sacrebleu  # manually install sacrebleu[ja] for Japanese support; MeCab is unsupported in Python 3.11+
< sentence_transformers
< tensorstore<0.1.46
< zarr
---
> nltk==3.8.1
> opencc==1.1.6
> pangu==4.0.6.1
> rapidfuzz==3.9.0
> rouge_score==0.1.2
> sacrebleu==2.4.2
> sentence_transformers==2.7.0
> tensorstore==0.1.45
> zarr==2.18.0

@balazslevai-htec

Hi @matthew-frank and @mmarcinkiewicz,

thank you for the support. Regarding the error message "/usr/bin/ld: cannot open output file helpers.cpython-310-x86_64-linux-gnu.so: Read-only file system": /usr/bin/ld is the C++ linker; the actual permission issue is in NeMo/nemo/collections/nlp/data/language_modeling/megatron, which has no write permission after cloning and is where the training tries to compile helpers.cpp at runtime. In any case, I added that compilation to the Dockerfile, so it is not a problem anymore.

Besides the above, I followed the docker recipe modifications to the letter, but received the same error message, only in a different format:

 0: attention.py 2399 forward
 0: out_fp8, aux_ctx_tensors = fused_attn_fwd(
 0: 
 0: fused_attn.py 853 fused_attn_fwd
 0: output_tensors = tex.fused_attn_fwd(
 0: 
 0: RuntimeError:
 0: /workspace/ft-llm/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_fp8.cu:2066 in function operator(): cuDNN Error: Tensor 'sdpa_fp8::Amax_O' strides not set.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.

The complete log is log-236.txt
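
As the error message itself suggests, more detail can be obtained from cuDNN by exporting the two variables it names before launching:

export CUDNN_LOGERR_DBG=1
export CUDNN_LOGDEST_DBG=stderr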

@mmarcinkiewicz

Can you dump printenv and attach it as a file?
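
For example (the output file name below just matches the attachment in the next comment):

printenv | sort > env_output.txt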

@nehmathe

Hi @mmarcinkiewicz,

Here is the printenv dump:
env_output.txt

Thanks.

@mrmhodak
Author

Any suggestions: @matthew-frank @mmarcinkiewicz ?

@mmarcinkiewicz

I don't see anything suspicious. Is there a way you can share the container with us? Either push it to dockerhub or share it as a sqsh file?

Also, a random idea - does your node have python installed? Sometimes enroot, for some reason, picks up a python from outside the container instead of the one inside it. Adding --no-container-mount-home to your srun command sometimes helps.
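
For reference, the flag goes directly on the srun line; the other arguments below are placeholders carried over from the earlier sketch:

srun --container-image=./llama2_70b_lora.sqsh \
     --container-mounts=/raid/data:/data \
     --no-container-mount-home \
     ./run_and_time.sh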

@mrmhodak
Author

@mmarcinkiewicz: I have sent the container location to @ShriyaPalsamudram over email - we do not want to share it publicly and I do not have your email address. Please share it internally and let us know.

@mrmhodak
Author

@mmarcinkiewicz : Any update?

@mmarcinkiewicz

We were able to repro. We're trying to understand what the difference is.

@matthew-frank

Those "gdrcopy open failed" lines are really suspicious. I have no idea where those are coming from.

@zhenghuanbo

zhenghuanbo commented Sep 19, 2024

@mmarcinkiewicz: I get the same error:
/workspace/ft-llm/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_fp8.cu:2066 in function operator(): cuDNN Error: Tensor 'sdpa_fp8::Amax_O' strides not set.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.

New issue: #6

@mmarcinkiewicz

@mrmhodak @blevai @zhenghuanbo it seems that TE had submodules that had not been frozen. Here's the recipe to fix it. Please modify the TE install block in the Dockerfile to the following:

ARG TE_REVISION=v1.6rc2
ENV CUSTOM_TE_REVISION ${TE_REVISION}

ARG CUDNN_FRONTEND_REVISION=1b0b5eac540b7f8fd19b18f1e6b8427c95503348
ENV CUSTOM_CUDNN_FRONTEND_REVISION ${CUDNN_FRONTEND_REVISION}

ARG GTEST_REVISION=f8d7d77c06936315286eb55f8de22cd23c188571
ENV CUSTOM_GTEST_REVISION ${GTEST_REVISION}

RUN if [ "${TE_REVISION}" != SKIP ]; then \
      git clone https://github.com/NVIDIA/TransformerEngine.git && \
      cd TransformerEngine && \
      git submodule init && git submodule update && \
      echo TE_REVISION=${TE_REVISION} && \
      git checkout ${CUSTOM_TE_REVISION} && \
      # Checkout specific commit for cudnn-frontend submodule
      cd 3rdparty/cudnn-frontend && \
      git checkout ${CUSTOM_CUDNN_FRONTEND_REVISION} && \
      echo CUDNN_FRONTEND_COMMIT_HASH=$(git rev-parse HEAD) && \
      cd - && \
      # Checkout specific commit for googletest submodule
      cd 3rdparty/googletest && \
      git checkout ${CUSTOM_GTEST_REVISION} && \
      echo GTEST_COMMIT_HASH=$(git rev-parse HEAD) && \
      cd - && \
      echo TE_COMMIT_HASH=$(git rev-parse HEAD) && \
      NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install --force-reinstall --no-deps . \
    ; fi

please retry
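
For completeness, a hedged way to verify inside the built container that the submodules actually ended up on the pinned commits (run from the directory the Dockerfile cloned into):

cd TransformerEngine/3rdparty/cudnn-frontend && git rev-parse HEAD   # expect 1b0b5eac...
cd ../googletest && git rev-parse HEAD                               # expect f8d7d77c...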

@mrmhodak
Author

@mmarcinkiewicz: This works for us - thanks!

One more thing: @matthew-frank pointed out that we still see errors with "gdrcopy open failed". We have gdrcopy installed - any ideas what is going on there?
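
For reference, two generic gdrcopy sanity checks; they assume only that gdrcopy's kernel module is gdrdrv and that it exposes /dev/gdrdrv, nothing specific to this benchmark:

lsmod | grep gdrdrv    # is the gdrcopy kernel module loaded on the host?
ls -l /dev/gdrdrv      # is the device node present (and mounted into the container)?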

@zhenghuanbo

@mmarcinkiewicz Thank you very much, the error is resolved.
