Add back the UCC backend for Bcast_sharded/PipelineTestTwoStages tests. #3124
Comments
The issue However, I am not able to reproduce the "hang" you are mentioning, on the setup mentioned above. |
It's interesting that we are seeing different symptoms. Anyhow, FWIW, the repro in OP gives me the following on
|
interesting! What is your |
Also, if I ran all tests matching
|
I'll get back to you. There are some personal touches (such as apt install more tools) in my Dockerfile. I don't think they are related but I'll double check on a clean |
FWIW, https://gitlab-master.nvidia.com/jingyuew/pjnl contains my Dockerfile, the build script and run. |
@samnordmann here's a repro with a clean image.
Same segfault as #3124 (comment). |
Thank you. I'm indeed able to reproduce, even with UCX's fix (openucx/ucx#10195) merged. My guess is that the original bug with I need to investigate and probably escalate to UCX |
Regardless of my last message, I realized that CI is running an old-ish stable version of ucx and not master nightly. IIUC, it takes what's provided by the pytorch stable container. I understand that stable releases are preferable for CI. But because of that, in the present case, we will see the bug in CI for as long as the ucx version doesn't change. |
cc @xwang233 to comment on the versions. I thought our CI (github or nightly) uses pjnl-latest so the versions should match. But apparently no? |
We don't modify HPCX in pjnl-latest image, which inherits the HPCX version from internal upstream base image. Also, pjnl-latest is the image that we use in CI. The versions in github (pjnl-latest) or nightly may mismatch for 1 day at most. Can you comment on a job log or a docker image where you see an old HPCX version? |
Right, sorry I probably got confused. For the record, in this log I see that the following image is used: Then doing
which points to commit 39c8f9b, which is the head of the v1.17.x tag and dates back to last July. pjnl-latest points to the same commit. So either we need to wait for the next HPCX release to see any change, or we move to nightly HPCX builds. |
Will follow up offline on the HPCX version update in our base image. |
These tests have been disabled by #2794 but should be fixed.
To reproduce,
The symptom appears to be non-deterministic. Sometimes the test hangs, sometimes it segfaults.
I ran into this on viking-prod-231. I'm unsure if it's machine or GPU dependent.
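Since the symptom alternates between a hang and a segfault, it may help to run the repro several times and tally the outcomes when comparing machines or images. Below is a minimal sketch, not part of the original report: the helper names `classify_run` and `tally` are hypothetical, the command in the `__main__` block is a placeholder to be replaced with the actual repro command, and the timeout is an arbitrary assumption. It also assumes the test binary is launched directly; under a launcher such as `mpirun`, a segfault in a rank would likely surface as a plain nonzero exit code ("fail") and child ranks may outlive the timeout.

```python
import signal
import subprocess
import sys
from collections import Counter


def classify_run(cmd, timeout_s):
    """Run cmd once and return 'pass', 'hang', 'segfault', or 'fail'."""
    try:
        proc = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child when the timeout expires.
        return "hang"
    if proc.returncode == 0:
        return "pass"
    # On POSIX, a negative return code means the child died from that signal.
    if proc.returncode == -signal.SIGSEGV:
        return "segfault"
    return "fail"


def tally(cmd, runs, timeout_s):
    """Repeat the command and count how often each outcome occurs."""
    return Counter(classify_run(cmd, timeout_s) for _ in range(runs))


if __name__ == "__main__":
    # Placeholder command (assumption): substitute the real repro command here.
    placeholder_cmd = [sys.executable, "-c", "pass"]
    print(dict(tally(placeholder_cmd, runs=5, timeout_s=60.0)))
```

Running the same tally on two machines would give a rough indication of whether the failure mode is machine or GPU dependent.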