Enable pytorch nightly CI #17335

Merged: 3 commits merged into main from enable_pytorch_nightly_ci on Jun 17, 2022

Conversation

@ydshieh ydshieh (Collaborator) commented May 18, 2022

What does this PR do?

  • Make necessary changes to build huggingface/transformers-pytorch-nightly-gpu
  • Update self-nightly-scheduled.yml to run PyTorch nightly build CI (almost a copy from scheduled CI)

test workflow run here.
docker build run

Print versions

Screenshot 2022-05-19 214304

@HuggingFaceDocBuilderDev commented May 18, 2022

The documentation is not available anymore as the PR was closed or merged.

@ydshieh ydshieh force-pushed the enable_pytorch_nightly_ci branch from 51f3a91 to b605c9c Compare May 19, 2022 12:04
@ydshieh ydshieh (Collaborator, Author) commented May 19, 2022

@stas00, @LysandreJik

I ran the docker image transformers-pytorch-deepspeed-latest-gpu and found that its PyTorch version is 1.9.0.
This image is based on nvcr.io/nvidia/pytorch:21.03-py3, which is used in the job Test Torch CUDA Extensions for both daily scheduled CI and push CI.

We can discuss this, but for now I won't include the DeepSpeed tests with nightly-built PyTorch.

>>> import torch
>>> torch.__version__
'1.9.0a0+df837d0'
>>> exit()

@stas00 stas00 (Contributor) commented May 19, 2022

I ran the docker image transformers-pytorch-deepspeed-latest-gpu and found that its PyTorch version is 1.9.0.
This image is based on nvcr.io/nvidia/pytorch:21.03-py3, which is used in the job Test Torch CUDA Extensions for both daily scheduled CI and push CI.

But that's not nightly. I don't think you can rely on any pre-made docker image to run nightly; it has to be installed manually since there is a new version every day. You probably want to switch to 22.04 (the latest at the moment) and then update it to the actual nightly. Does that make sense?
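
For reference, a minimal sketch of what that manual install could look like inside the image (the cu113 channel is an assumption; the Dockerfile in this PR parameterizes it via $CUDA):

# Install the current PyTorch nightly build; the index channel must match the image's CUDA version.
python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cu113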

I also don't understand why deepspeed tests were removed. It's critical that we run deepspeed tests on nightly.

@ydshieh ydshieh (Collaborator, Author) commented May 19, 2022

@stas00

The non-DeepSpeed tests added in this PR use pytorch-nightly. You can verify this on the run page https://github.com/huggingface/transformers/runs/6507879699?check_suite_focus=true: click Echo versions and you will see Pytorch Version: 1.12.0.dev20220519+cu102 (it will change every day).
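
(Roughly speaking, such a version-echo step boils down to something like the lines below; the exact commands in the workflow may differ.)

# Print the Python and PyTorch versions picked up inside the CI container.
python3 --version
python3 -c "import torch; print('PyTorch version:', torch.__version__)"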

DeepSpeed

Regarding this, we have something to discuss:

  • For the push and scheduled CIs currently running, the DeepSpeed tests are run with PyTorch 1.9.0.

    • Before moving to nightly PyTorch with DeepSpeed, it might be better to decide what we should test in the push and scheduled CIs for DeepSpeed. Should we use the latest stable PyTorch instead?
  • I don't mean to remove the DeepSpeed tests with nightly PyTorch; the reason is that I am not able to build a docker image with PyTorch nightly + DeepSpeed. I even tried with PyTorch stable + DeepSpeed and that docker build also fails.

Details

In the Dockerfile transformers-pytorch-deepspeed-latest-gpu:

RUN python3 -m pip install --no-cache-dir -e ./transformers[deepspeed-testing]

# This fails if we install PyTorch nightly or stable above.
RUN git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build && \
    DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1

@stas00 stas00 (Contributor) commented May 19, 2022

Should we use the latest stable Pytorch instead?

Yes, please.

I think we should use the latest stable pytorch for all our tests, unless we explicitly test older pytorch versions every few days or so, and ensure we update to the new stable release once it comes out.

I even tried with PyTorch Stable + DeepSpeed and the docker image also fails.

Could you please point me to the actual issues and I will help to sort them out?

We can discuss it on slack if it's easier.

@ydshieh ydshieh marked this pull request as draft May 19, 2022 16:40
@ydshieh ydshieh marked this pull request as ready for review May 19, 2022 19:46
@stas00 stas00 (Contributor) left a comment

Looks good, thank you, @ydshieh!

.github/workflows/self-nightly-scheduled.yml
@LysandreJik LysandreJik (Member) left a comment

Looks good to me, @ydshieh! Thanks for taking care of that.

# this line must be added in order for python to be aware of transformers.
RUN cd transformers && python3 setup.py develop

RUN python3 -c "from deepspeed.launcher.runner import main"
Member

Nice to test it!

@ydshieh ydshieh (Collaborator, Author) left a comment

Hi @LysandreJik @stas00

After the recent changes regarding tests, I have to update this PR. In particular, I would like to make sure the goals of the scheduled nightly tests are the following (a rough sketch of the corresponding installs is below the list):

  • tests with nightly torch + latest TF (+ stable Flax)
  • tests with latest torch + nightly TF (+ stable Flax)
  • tests with latest torch + latest TF (+ nightly Flax)
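
As a rough sketch (the package names/channels here are assumptions; the actual pins live in the Dockerfiles), those combinations map to installs along these lines:

# nightly torch (the cu113 channel is an example; it must match the image's CUDA version)
python3 -m pip install -U --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cu113
# nightly TF
python3 -m pip install -U tf-nightly
# stable Flax
python3 -m pip install -U flax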

Here is a summary of the latest changes in this PR:

  • build a new docker image with nightly torch + latest stable TF
  • make self-nightly-scheduled the same as self-scheduled, except:
    • only keep the following tests (so far), run with docker images that have nightly torch and/or DeepSpeed master + the stable TF release
      • run_tests_single_gpu
      • run_tests_multi_gpu
      • run_all_tests_torch_cuda_extensions_gpu
    • (we could add run_examples_gpu and run_pipelines_torch_gpu in a follow-up PR if necessary)

@LysandreJik LysandreJik (Member)

I don't think we really care about nightly flax right now, so I would keep the following:

  • tests with nightly torch + latest TF
  • tests with latest torch + nightly TF

The rest sounds good to me!

@ydshieh ydshieh (Collaborator, Author) left a comment

Hey, @LysandreJik @stas00 !

I have updated this PR. I think the best way to save your reviewing time is to check my comments below.

If everything looks fine, I will run some tests to make sure before merging.

context: ./docker/transformers-all-latest-gpu
build-args: |
  REF=main
  PYTORCH=pre
@ydshieh ydshieh (Collaborator, Author)

To install the nightly-built torch.
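
For a local check, the image could presumably be built with the same arguments along these lines (the image tag is just an example name):

# Build the CI image with the nightly-torch build argument used by this workflow.
docker build --build-arg REF=main --build-arg PYTORCH=pre -t transformers-all-latest-gpu:nightly-torch ./docker/transformers-all-latest-gpu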

@@ -1,250 +1,235 @@
-name: Self-hosted runner; Nightly (scheduled)
+name: Self-hosted runner (nightly)
@ydshieh ydshieh (Collaborator, Author)

Basically a copy of self-scheduled.yml, with the necessary changes to use nightly-built torch & DeepSpeed master.

@ydshieh ydshieh (Collaborator, Author) Jun 13, 2022

Well, I can change this back if preferred - I didn't pay attention here.

RUN_SLOW: yes
SIGOPT_API_TOKEN: ${{ secrets.SIGOPT_API_TOKEN }}
TF_FORCE_GPU_ALLOW_GROWTH: true
RUN_PT_TF_CROSS_TESTS: 1
@ydshieh ydshieh (Collaborator, Author) Jun 13, 2022

Keep the values from self-scheduled.yml instead of the values in the old self-nightly-scheduled.yml - I hope this makes sense.

@@ -3,6 +3,9 @@ LABEL maintainer="Hugging Face"

ARG DEBIAN_FRONTEND=noninteractive

# Used to read variables from `~/.profile` (to pass dynamically created variables between RUN commands)
SHELL ["sh", "-lc"]
@ydshieh ydshieh (Collaborator, Author) Jun 13, 2022

Dockerfile syntax is quite limited, and I really need to pass some variables between RUN commands.

@@ -21,11 +24,20 @@ ARG REF=main
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime]

RUN python3 -m pip install --no-cache-dir -U torch==$PYTORCH torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA
# TODO: Handle these in a python utility script
RUN [ ${#PYTORCH} -gt 0 -a "$PYTORCH" != "pre" ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile
@ydshieh ydshieh (Collaborator, Author)

This is where we need SHELL ["sh", "-lc"], so that VERSION can be passed to the next RUN command (via ~/.profile).
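
As a minimal illustration of the pattern (the variable value here is just an example): each RUN step starts a fresh shell, so the value is written to ~/.profile and is available again in the next step because the login shell (sh -l) sources that file.

# RUN step 1 (login shell): compute a value and persist it for later steps.
echo "export VERSION='torch'" >> ~/.profile
# RUN step 2 (a new login shell): ~/.profile is sourced at startup, so $VERSION is available again.
python3 -m pip install --no-cache-dir -U $VERSION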

# `torchvision` and `torchaudio` should be installed along with `torch`, especially for nightly build.
# Currently, let's just use their latest releases (when `torch` is installed with a release version)
# TODO: We might need to specify proper versions that work with a specific torch version (especially for past CI).
RUN [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA || python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA
@ydshieh ydshieh (Collaborator, Author)

I really think we can make this easier. I have a POC, and can do it in a follow-up PR.

# This has to be run inside the GPU VMs running the tests. (So far, it fails here due to GPU checks during compilation.)
# Issue: https://github.com/microsoft/DeepSpeed/issues/2010
# RUN git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build && \
# DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
Contributor

I think we should be able to fix this one quickly on the DeepSpeed side - let's see what the maintainers think. I can make a PR so that it doesn't fail so easily.

@ydshieh ydshieh (Collaborator, Author) Jun 13, 2022

There is no rush, as pre-building works inside docker run. We can always change it back (to pre-building DeepSpeed in the Dockerfile) once the issue is fixed on the DeepSpeed side.

I will double-check that issue before merging this PR (and do the cleanup if the fix is done by then).

# This has to be run (again) inside the GPU VMs running the tests.
# The installation works here, but some tests fail if we don't pre-build deepspeed again in the VMs running the tests.
# TODO: Find out why the tests fail.
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
@ydshieh ydshieh (Collaborator, Author)

We need to pre-build again in the workflow file (i.e. when the docker container is launched) - until a fix lands in DeepSpeed.

@stas00 stas00 (Contributor) commented Jun 13, 2022

Looks good to me, @ydshieh! Thank you!

python3 -m pip uninstall -y deepspeed
rm -rf DeepSpeed
git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build
DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
@ydshieh ydshieh (Collaborator, Author)

Pre-build DeepSpeed (master version), which is currently necessary.
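
A quick sanity check after this rebuild could be something along these lines (a suggestion, not part of the workflow):

# Verify the freshly built DeepSpeed is importable and print its version.
python3 -c "import deepspeed; print('DeepSpeed version:', deepspeed.__version__)"
python3 -c "from deepspeed.launcher.runner import main"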

@ydshieh ydshieh (Collaborator, Author) commented Jun 14, 2022

A full workflow run is here.

@ydshieh ydshieh merged commit ca169db into main Jun 17, 2022
@ydshieh ydshieh deleted the enable_pytorch_nightly_ci branch June 17, 2022 14:42