Enable pytorch nightly CI #17335
Conversation
The documentation is not available anymore as the PR was closed or merged.
Force-pushed from 51f3a91 to b605c9c.
I run the docker image. We can discuss this, but so far I won't include the DeepSpeed tests with nightly-built PyTorch.
But that's not nightly. I don't think you can rely on any pre-made docker image to run nightly: it has to be installed manually, since there is a new version every day. You probably want to switch to 22.04 (the latest at the moment) and then update it to the actual nightly. Does that make sense? I also don't understand why the DeepSpeed tests were removed; it's critical that we run the DeepSpeed tests on nightly.
Regarding the non-DeepSpeed tests, we have something to discuss. Details: in the Dockerfile ...
Yes, please. I think we should use the latest stable PyTorch for all our tests unless we explicitly test older PyTorch versions every few days or so, and we should make sure we update to the new stable release once it comes out.
Could you please point me to the actual issues? I will help to sort them out. We can discuss it on Slack if that's easier.
Looks good, thank you, @ydshieh!
Looks good to me, @ydshieh! Thanks for taking care of that.
# this line must be added in order for python to be aware of transformers.
RUN cd transformers && python3 setup.py develop

RUN python3 -c "from deepspeed.launcher.runner import main"
Nice to test it!
Force-pushed from 618ef3a to 5909e53.
After the recent changes regarding tests, I have to update this PR. In particular, I would like to make sure the goals of the scheduled nightly test(s) are:
- tests with nightly torch + latest TF (+ stable Flax)
- tests with latest torch + nightly TF (+ stable Flax)
- tests with latest torch + latest TF (+ nightly Flax)
Here is a summary of the latest changes in this PR:
- build a new docker image with nightly torch + latest stable TF
- make `self-nightly-scheduled` the same as `self-scheduled`, except:
  - only keep the following tests (so far), running with docker images having nightly torch and/or DeepSpeed + stable release TF:
    - `run_tests_single_gpu`
    - `run_tests_multi_gpu`
    - `run_all_tests_torch_cuda_extensions_gpu`
  - (we could add `run_examples_gpu` and `run_pipelines_torch_gpu` in a follow-up PR if necessary)
I don't think we really care about nightly Flax right now, so I would keep only the first two (nightly torch and nightly TF).
The rest sounds good to me!
Force-pushed from 5909e53 to 3096e2a.
Hey @LysandreJik, @stas00!
I have updated this PR. I think the best way to save your review time is to check my comments below.
If everything is fine, I will run some tests to make sure before merging.
context: ./docker/transformers-all-latest-gpu
build-args: |
  REF=main
  PYTORCH=pre
This is to install the nightly-built torch.
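For reference, a rough local equivalent of the CI step quoted above might look like the sketch below; the `-t` image tag is a hypothetical name for local testing only (the actual CI builds and pushes the image through its own docker build step). `PYTORCH=pre` is consumed by the Dockerfile (as the `$PYTORCH` build arg shown further down) and selects the nightly install path.

```sh
# Roughly equivalent local build of the CI step above, passing the same build args.
docker build \
  --build-arg REF=main \
  --build-arg PYTORCH=pre \
  -t transformers-nightly-torch-test \
  ./docker/transformers-all-latest-gpu
```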
@@ -1,250 +1,235 @@
-name: Self-hosted runner; Nightly (scheduled)
+name: Self-hosted runner (nightly)
Basically copied from `self-scheduled.yml`, with the necessary changes to use nightly-built torch & DeepSpeed master.
Well, I can change this back if preferred; I didn't pay attention here.
RUN_SLOW: yes
SIGOPT_API_TOKEN: ${{ secrets.SIGOPT_API_TOKEN }}
TF_FORCE_GPU_ALLOW_GROWTH: true
RUN_PT_TF_CROSS_TESTS: 1
Keep the values in `self-scheduled.yml` instead of the values in the old `self-nightly-scheduled.yml`; hope this makes sense.
@@ -3,6 +3,9 @@ LABEL maintainer="Hugging Face"

ARG DEBIAN_FRONTEND=noninteractive

# Used to read variables from `~/.profile` (to pass dynamically created variables between RUN commands)
SHELL ["sh", "-lc"]
Dockerfile syntax is quite limited, and I really need to pass some variables between RUN commands.
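For illustration, here is a minimal sketch of the trick (the base image and the variable name are placeholders, not what this PR uses): with `SHELL ["sh", "-lc"]`, every `RUN` executes through a login shell that sources `~/.profile`, so a value exported there by one `RUN` is visible to later `RUN` commands.

```dockerfile
FROM ubuntu:22.04

# Run each RUN command through a login shell so that ~/.profile is sourced.
SHELL ["sh", "-lc"]

# First RUN: compute a value and persist it for later build steps.
RUN echo "export MY_VAR='computed-at-build-time'" >> ~/.profile

# Later RUN: the login shell reads ~/.profile, so MY_VAR is available here.
RUN echo "MY_VAR is: $MY_VAR"
```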
@@ -21,11 +24,20 @@ ARG REF=main
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime]

RUN python3 -m pip install --no-cache-dir -U torch==$PYTORCH torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA
# TODO: Handle these in a python utility script
RUN [ ${#PYTORCH} -gt 0 -a "$PYTORCH" != "pre" ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile
This is where we need `SHELL ["sh", "-lc"]`, in order to pass `VERSION` to the next line.
# `torchvision` and `torchaudio` should be installed along with `torch`, especially for nightly build.
# Currently, let's just use their latest releases (when `torch` is installed with a release version)
# TODO: We might need to specify proper versions that work with a specific torch version (especially for past CI).
RUN [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA || python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA
I really think we can make things easier here. I have a POC and can do it in a follow-up PR.
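For readers, here is the selection logic condensed into the two `RUN` lines above, unrolled into plain shell for readability. This is only a sketch, not a drop-in replacement: in the Dockerfile the two steps are separate `RUN` commands, which is why `VERSION` has to be passed through `~/.profile`.

```sh
# $PYTORCH and $CUDA are the Dockerfile build args (e.g. PYTORCH=pre, PYTORCH=1.11, or empty).

if [ "$PYTORCH" != "pre" ]; then
  # Stable path: pin torch to the requested version if one was given,
  # otherwise just take the latest release.
  if [ -n "$PYTORCH" ]; then VERSION="torch==$PYTORCH.*"; else VERSION="torch"; fi
  python3 -m pip install --no-cache-dir -U "$VERSION" torchvision torchaudio \
    --extra-index-url "https://download.pytorch.org/whl/$CUDA"
else
  # PYTORCH=pre: install the nightly pre-release wheels from the nightly index.
  python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio \
    --extra-index-url "https://download.pytorch.org/whl/nightly/$CUDA"
fi
```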
# This has to be run inside the GPU VMs running the tests. (So far, it fails here due to GPU checks during compilation.)
# Issue: https://github.com/microsoft/DeepSpeed/issues/2010
# RUN git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build && \
#   DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
We should be able to fix this one quickly on the DeepSpeed side, I think. Let's see what the maintainers think; I can make a PR so that it doesn't fail so easily.
There is no rush, as the pre-build works inside `docker run`. We can always change it back (to pre-build DeepSpeed in the Dockerfile) once the issue is fixed on the DeepSpeed side.
I will double-check that issue before merging this PR (and do the cleanup if the fix has landed by then).
# This has to be run (again) inside the GPU VMs running the tests.
# The installation works here, but some tests fail if we don't pre-build deepspeed again in the VMs running the tests.
# TODO: Find out why the tests fail.
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
We need to pre-build again in the workflow file (i.e. when the docker container is launched), until a fix lands in DeepSpeed.
Looks good to me, @ydshieh! Thank you!
python3 -m pip uninstall -y deepspeed
rm -rf DeepSpeed
git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build
DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
Pre-build DeepSpeed (master version), which is currently necessary.
A full workflow run is here.
Force-pushed from 1decdc0 to 417c586.
What does this PR do?
- A new docker image: `huggingface/transformers-pytorch-nightly-gpu`
- A new workflow `self-nightly-scheduled.yml` to run the PyTorch nightly build CI (almost a copy of the scheduled CI); test workflow run here.
- docker build run
- Print versions
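To give a sense of the overall shape of such a nightly scheduled workflow, here is a hypothetical, heavily simplified sketch; the job name, cron time, image tag, and container options are illustrative only and do not reflect the actual contents of `self-nightly-scheduled.yml`:

```yaml
# Hypothetical sketch only; the real self-nightly-scheduled.yml differs in
# jobs, images, matrix setup, and reporting steps.
name: Self-hosted runner (nightly)

on:
  schedule:
    - cron: "0 2 * * *"   # run once a day (time is illustrative)

env:
  RUN_SLOW: yes
  TF_FORCE_GPU_ALLOW_GROWTH: true

jobs:
  run_tests_single_gpu:
    runs-on: [self-hosted, single-gpu]
    container:
      # The PR builds a dedicated image with nightly torch for jobs like this.
      image: huggingface/transformers-pytorch-nightly-gpu
      options: --gpus all
    steps:
      - name: Run slow tests
        run: python3 -m pytest -v tests/
```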