Describe the bug
Two different checkpointing tests are failing in Gitlab CI (see also #1895 in case they are related). For TestCheckpoint.test_checkpoint_1:
With Intel 19 on 1 or 2 ranks, it has failed with errors like below:
vt: [0] (t) phase: phase=0, duration=206e-3 s, rank_max_compute_time=665e-6 s, rank_avg_compute_time=665e-6 s, imbalance=0.000, grain_max_time=367e-6 s, migration count=0, lb_name=NoLB
unknown file: Failure
C++ exception with description "Unpacking wrong type, got=vt::tests::unit::TestCol (idx=3724549805), expected= (idx=0)
#0 vt::tests::unit::TestCol" thrown in the test body.
With Intel 18 on 4 ranks, it has failed with the error:
vt: [0] (t) phase: phase=0, duration=721e-3 s, rank_max_compute_time=3.33e-3 s, rank_avg_compute_time=1.61e-3 s, imbalance=1.064, grain_max_time=3.00e-3 s, migration count=0, lb_name=NoLB
unknown file: Failure
C++ exception with description "mmap64 failed for writing file: errno=22: Invalid argument" thrown in the test body.
This test has also resulted in SIGBUS on 4 ranks with Intel 19 or 1 rank with Intel 18.
And I just saw this error on 2 and 4 ranks with Intel 19, which may be a significant clue:
vt: [0] (t) phase: phase=0, duration=1.18e0 s, rank_max_compute_time=1.05e-3 s, rank_avg_compute_time=940e-6 s, imbalance=0.116, grain_max_time=750e-6 s, migration count=0, lb_name=NoLB
/ascldap/users/nlslatt/.jacamar-ci/builds/8YUycndo/001/darma-tasking/vt/tests/unit/collection/test_checkpoint.extended.cc:231: Failure
Expected equality of these values:
got_label
Which is: "test_checkpoint_in_place_2"
expected_label
Which is: "test_checkpoint_1"
I just looked over the source code and I think this may be a case of a race condition on the filesystem. These two unit tests use the same checkpoint filename. CTest is called differently on the Gitlab CI jobs so that they can be reported to CDash, so I think that some unit tests are running concurrently instead of one at a time.
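If the shared filename is indeed the cause, one possible mitigation is to derive the checkpoint filename from the test name so that concurrently running tests never touch the same file. The sketch below is only an illustration: `checkpointFileName`, the `.chk` suffix, and the `_np<ranks>` convention are hypothetical and not part of vt. The rank count is included on the assumption that CTest may also run the same test at several rank counts at once, and the name avoids anything process-specific (such as a PID) so that all ranks compute the same path.

```cpp
#include <string>

// Hypothetical helper (not part of vt): build a checkpoint filename that is
// unique per test and per rank count, so that tests running concurrently
// under CTest do not read or overwrite each other's files.
// Every rank computes the same string, since it depends only on the test
// name and the total number of ranks.
inline std::string checkpointFileName(
  std::string const& base_dir, std::string const& test_name, int num_ranks
) {
  return base_dir + "/" + test_name + "_np" + std::to_string(num_ranks) + ".chk";
}
```

In a GoogleTest body, the test name could be taken from `::testing::UnitTest::GetInstance()->current_test_info()->name()` rather than hard-coded, so each test picks up its own filename automatically.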