{ai}[foss/2022b] PyTorch v2.1.2 w/ CUDA 12.0.0 #20155
base: develop
Conversation
Test report by @Flamefire |
easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
Test report by @Flamefire |
Test report by @casparvl |
Here and in #20156 |
Hm, the log contains a lot, it's a bit hard to read, but I think this is the relevant part: Error log
|
Error in Error log:
|
Yep that is a known issue: Reinstall your pybind11 with the latest EC |
Great, will do! Sorry, there are so many fixes that I often can't keep up and don't always rebuild stuff XD I'll send a new test report after the pybind11 rebuild. |
Yeah, I know that is annoying, but we can't do much better than updating the existing EC(s) for such major bugs. It came up recently with someone else too, so I remembered it. Side note: this is actually a good reason to run the PyTorch test suite and investigate errors: our pybind11 version isn't (wasn't) compatible with this PyTorch version, which would make it less usable, as this error is likely to pop up in user code using this module. |
Ok, I rebuild |
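For context, the fix suggested above amounts to rebuilding the existing pybind11 module in place from its updated easyconfig and then rebuilding PyTorch against it; with the standard EasyBuild workflow that would be roughly eb --rebuild pybind11-2.10.3-GCCcore-12.2.0.eb (the exact pybind11 easyconfig name for the 2022b generation is an assumption here), followed by re-running the PyTorch build.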
Test report by @sassy-crick:
Test report by @casparvl |
Failures are the same for
The only new one was a failure in
|
@boegelbot please test @ generoso |
@casparvl: Request for testing this PR well received on login1. PR test command '
Test results coming soon (I hope)... - notification for comment with ID 2016793002 processed. Message to humans: this is just bookkeeping information for me, |
Test report by @boegelbot |
@boegelbot please test @ jsc-zen3 |
@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de. PR test command '
Test results coming soon (I hope)... - notification for comment with ID 2020037044 processed. Message to humans: this is just bookkeeping information for me, |
Test report by @boegelbot |
easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
Test report by @casparvl |
You are missing the patches from #19666, which are in develop |
Ah, let me sync your branch with develop - I'm assuming you won't mind... :) |
Ok, rebuild started successfully now. The test report should be there sometime tonight. I'll trigger one more rebuild on one of the test clusters for good measure. Should be good to go afterwards... |
@boegelbot please test @ jsc-zen3 |
@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de. PR test command '
Test results coming soon (I hope)... - notification for comment with ID 2032212011 processed. Message to humans: this is just bookkeeping information for me, |
Test report by @boegelbot |
Test report by @casparvl |
…es: PyTorch-2.1.2_add-cuda-skip-markers.patch, PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch, PyTorch-2.1.2_fix-device-mesh-check.patch, PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch, PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch, PyTorch-2.1.2_fix-test_memory_profiler.patch, PyTorch-2.1.2_fix-test_torchinductor-rounding.patch, PyTorch-2.1.2_fix-vsx-vector-abs.patch, PyTorch-2.1.2_fix-vsx-vector-div.patch, PyTorch-2.1.2_fix-with_temp_dir-decorator.patch, PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch, PyTorch-2.1.2_relax-cuda-tolerances.patch, PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch, PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch, PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch, PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch, PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch
Force-pushed from 5564421 to 8bb0d57
Test report by @akesandgren |
Failing especially due to the seg faults:
|
Yeah, they are a bit weird. Can't see any reason for them to have happened... |
Check the logs to see if there is any more specific message for those crashing tests.
Didn't find anything useful, a bunch of tracebacks but they weren't very informative... |
@Flamefire This is what I got:
and
and
|
Ok, the segfault in test_wrap doesn't happen every time. A cleaner stack trace is:
|
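As a side note on reproducing such crashes: one way (not taken from this thread) to get a cleaner stack trace is to run the affected suite directly from the PyTorch source tree, with something like python test/distributed/fsdp/test_wrap.py, rather than digging it out of the combined EasyBuild test log.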
I also have a problem with test/distributed/test_c10d_nccl.py::NcclProcessGroupWithDispatchedCollectivesTests::test_allgather_base which also segfaults. But if I change to NCCL 2.18.3 (i.e. the one used in 2023a) that problem goes away. |
Using a newer NCCL also seems to eliminate the SEGV in distributed/fsdp/test_wrap. I'd say we should drop this one due to the NCCL version problem. Or add this to some more tests: |
Also note that .github/scripts/generate_binary_build_matrix.py hints at NCCL 2.18.1 being "required", or at least what they test with. |
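For illustration, here is a minimal sketch of what the NCCL bump discussed above could look like in the dependencies block of the easyconfig; the surrounding lines are assumptions based on how other PyTorch easyconfigs are laid out, not the actual contents of this file:

    # dependencies in PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb (sketch, not the real file)
    versionsuffix = '-CUDA-%(cudaver)s'

    dependencies = [
        ('CUDA', '12.0.0', '', SYSTEM),
        # swapping in the NCCL version used with 2023a (2.18.3) is what reportedly
        # makes the distributed/fsdp/test_wrap and test_c10d_nccl segfaults go away
        ('NCCL', '2.18.3', versionsuffix),
        # ...other dependencies unchanged...
    ]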
Those 3 suites are crashing in your report:
The test_c10d_nccl seems to be "just" a failure, or at least counted as such, and easy enough to skip if it fails consistently. This is the only PyTorch 2.x version in 2022b, so it might be worth keeping. So 3 options: |
I prefer the first option. |
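If the consistently failing NCCL suite were skipped rather than the easyconfig being dropped, the usual place would be the test exclusion list that PyTorch easyconfigs already carry; a hedged sketch follows (parameter layout assumed to match other PyTorch easyconfigs, test name taken from the comment above):

    # sketch: excluding the consistently failing suite from the test step
    excluded_tests = {
        '': [
            # segfaults with the NCCL version available in the 2022b generation,
            # see the discussion in this PR
            'distributed/test_c10d_nccl',
        ],
    }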
Test report by @Flamefire |
I ran some tests and I see random failures of whole suites with SIGSEGV (test_jit), and I opened another PR using the newer NCCL: #20520 |
Updated software
|
Test report by @Flamefire |
Test report by @Flamefire |
(created using eb --new-pr)
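For reference, eb --new-pr opens a pull request against easybuild-easyconfigs directly from local files; for this PR the invocation would have been along the lines of eb --new-pr PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb (filename taken from the file changed here), with any new patch files passed on the same command line.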