Upgrade to torch==2.2.1
#2804
Conversation
- `torch==2.1.1` -> `torch==2.2.0`
- `xformers==0.0.23.post1` -> `xformers==0.0.24`
- ROCm not updated because no `torch==2.2.0` containers have been published yet
The failing tests are all failing with errors similar to:
We're already using the latest versions of
We have seen similar errors and have come to the conclusion that any package containing C/C++ extensions can break at random if a base package such as torch is updated via pip. Normally, if torch is updated to 2.2.0, the dependent packages such as xformers/flash-attn should be recompiled, but they are not. Our course of fix is:
The pip packaging system is a huge minefield, in my view, when it comes to packages that contain JIT-compiled or precompiled C/C++ code.
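A minimal sketch of the symptom described above (assuming `torch` and `xformers` are both installed; this is illustrative, not code from this PR): after upgrading `torch` in place with pip, a package that ships precompiled C/C++ extensions can fail to import with an undefined-symbol `ImportError` until it is rebuilt against the new torch.

```python
# Hypothetical check, not part of vLLM: verify that a compiled extension
# still imports cleanly against the currently installed torch build.
import importlib

import torch

print(f"torch {torch.__version__}")

try:
    importlib.import_module("xformers.ops")
    print("xformers.ops imported OK")
except ImportError as err:
    # Typical symptom after an in-place torch upgrade:
    # "undefined symbol: ..." raised from xformers' prebuilt .so files.
    print(f"xformers appears incompatible with this torch build: {err}")
```

Reinstalling the extension (for example with `pip install --force-reinstall --no-cache-dir xformers`, or rebuilding it from source) is one way to realign it with the newly installed torch.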
After merging the latest changes from master, the import errors are gone and now we are only seeing OOM errors in the
However, these seem to be treated as soft fails by the CI.
What's missing for this PR to be merged? GCP A3 doesn't work with torch<2.2, so we can't use vLLM. Thank you!
Hi @stas00, I've just been waiting for a review. I've just merged the latest changes from master. @WoosukKwon, if you get a moment, could I get a review please? AMD still have not published any PyTorch 2.2.0 containers.
Thank you for leading this effort, @hmellor. Why does it have to be pinned to an exact version?
I've not used ROCm myself, but I think it's used with the container specified in https://github.com/vllm-project/vllm/blob/main/Dockerfile.rocm, which comes with a specific version of PyTorch installed, hence why it's pinned there.
Yes, but what I was trying to say is that by relaxing the restriction this project and its users will need a lot less maintenance, since torch==2.2.1 will be released soon, and so on.
I see your point, I'll wait for @WoosukKwon to comment in case he objects.
@stas00 Good point. Originally, we used
Thank you for clarifying, @WoosukKwon - there are several projects in this boat, and they have to be constantly asked to build wheels for new versions of PyTorch because of that. I think at the very least you shouldn't ask for the exact patch version. So how do we move forward - do you plan to make a torch==2.2.*-compatible binary wheel, or should we open a request? That would mean
but of course by default the latest build should work fine w/o
Can you set `torch==2.1.1` -> `torch>=2.2.0`?
@stas00 Thanks for the advice!
Could you let us know where you found this information? If this is guaranteed, I think we can be a bit more flexible like
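To illustrate the difference being discussed (a sketch using the `packaging` library; the specifiers shown are examples, not the ones in this PR): an exact pin rejects future patch releases, while a bounded range accepts them.

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

exact = SpecifierSet("==2.2.0")          # exact pin, as in the requirements file
flexible = SpecifierSet(">=2.2.0,<2.3")  # one possible relaxed specifier

for candidate in ("2.2.0", "2.2.1", "2.3.0"):
    v = Version(candidate)
    print(f"{candidate}: exact={v in exact}, flexible={v in flexible}")
```

With the relaxed specifier, 2.2.1 would be accepted as soon as it is released; with the exact pin, every patch release requires a requirements change.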
Co-authored-by: Woosuk Kwon <[email protected]>
I'm double-checking with the PyTorch developers here: https://pytorch.slack.com/archives/C3PDTEV8E/p1708624187379559 - will follow up once I get a yay or nay from there.
Hi @stas00, I don't believe it is a fair assumption that there is binary compatibility between PyTorch patch versions. I investigated this myself a few weeks ago and found it not to be true. In addition, there is this issue response from a PyTorch developer stating that it is not a priority: pytorch/pytorch#88301 (comment)
Thank you for finding that reply, @mgoin - my question on the PyTorch Slack got the same answer: we don't know - it might be compatible or it might not - it's not being validated or expected to be so. So I stand corrected; my suggestion that one could use a relaxed pin doesn't hold.
Sure, done
Thanks a lot, Harry - this branch can be built from source again.
PyTorch 2.2.1 does not work either. I tested it today.
Is there a merge plan?
I'm not sure why the Docker container is failing to build. Unhelpfully, the failing command
Which is a problem in its own way - GCP A3 (H100) instances require nccl-2.19.3+ for their custom TCPX networking to function. It can't work with the lower NCCL version you proposed, so if you force that NCCL version you will cut out GCP A3 users. To be exact, it would only impact use cases with more than one node, since a single node doesn't require TCPX.
@youkaichao based on your issue NVIDIA/nccl#1234, am I right in saying that the issue is that torch 2.2 defaults to a newer version of NCCL which uses more memory?
Yes. So we need to pin an NCCL version ourselves.
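For reference, a small sketch (assuming a CUDA build of torch; illustrative only) that prints the NCCL version bundled with the installed torch wheel, which is the component that changed between torch 2.1 and 2.2:

```python
import torch

print("torch:", torch.__version__)
if torch.cuda.is_available():
    # NCCL ships inside the CUDA torch wheel; this is the version that
    # differs between torch 2.1 and 2.2 and is tied to the memory regression.
    print("NCCL:", ".".join(str(part) for part in torch.cuda.nccl.version()))
else:
    print("No CUDA device available; NCCL version not reported here.")
```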
Closed via #3805.
Closes #2738
Closes #2722

- `torch==2.1.1` -> `torch==2.2.1`
- `xformers==0.0.23.post1` -> `xformers==0.0.25`
- ROCm not updated because no `torch==2.2` containers have been published yet