Describe the bug
Two different checkpointing tests are failing in GitLab CI (see also #1894, in case the failures are related). For TestCheckpoint.test_checkpoint_in_place_2:
With Intel 18 on 2 ranks, Intel 19 on 2 or 4 ranks, and ARM on 2 or 4 ranks, this test has failed with errors like the one below:
vt: [0] (t) phase: phase=4, duration=4.44e-3 s, rank_max_compute_time=1.15e-3 s, rank_avg_compute_time=987e-6 s, imbalance=0.166, grain_max_time=1.07e-3 s, migration count=18, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, expected execution should get a 1.07x speedup (or take 6.7% less time)
vt: [0] (t) phase: Due to the large object grain size, no further speedup is possible
vt: [0] (t) phase: With a smaller object grain size, up to 1.09x more speedup (or 8.1% decrease in execution time) might have been possible
vt: [1] (t) general: checkpointToFile
vt: [0] (t) general: checkpointToFile
vt: [0] (t) phase: phase=5, duration=1.05e0 s, rank_max_compute_time=1.21e-3 s, rank_avg_compute_time=1.06e-3 s, imbalance=0.141, grain_max_time=912e-6 s, migration count=26, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, expected execution should get a 1.14x speedup (or take 12.3% less time)
vt: [0] (t) general: restoreFromFileInPlace
vt: [1] (t) general: restoreFromFileInPlace
vt: [0] ------------------------------------------------------------------------------------------------------------------------
vt: [0] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [0] ------------------------------------------------ Fatal Error on Node 0 -------------------------------------------------
vt: [0] ------------------------------------------------------------------------------------------------------------------------
vt: [0]
vt: [0] Assertion failed: (proxy(idx).tryGetLocalPtr() != nullptr)
vt: [0] Node: 0
vt: [0] Num Nodes: 2
vt: [0] File: ../../src/vt/vrt/collection/manager.impl.h
vt: [0] Line: 2267
vt: [0] Function: restoreFromFileInPlace
vt: [0] Code: 1
vt: [0] Build SHA: a086f820b7d5723ec9f8ac6fd0981230824c2230
With Intel 18 on 4 ranks, it has also failed with:
vt: [0] (t) phase: phase=4, duration=4.32e-3 s, rank_max_compute_time=1.18e-3 s, rank_avg_compute_time=913e-6 s, imbalance=0.293, grain_max_time=1.18e-3 s, migration count=8, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, negligible or no speedup is expected
vt: [0] (t) phase: Due to the large object grain size, no further speedup is possible
vt: [0] (t) phase: With a smaller object grain size, up to 1.29x more speedup (or 22.7% decrease in execution time) might have been possible
vt: [3] (t) general: checkpointToFile
vt: [0] (t) general: checkpointToFile
vt: [1] (t) general: checkpointToFile
vt: [2] (t) general: checkpointToFile
vt: [0] (t) phase: phase=5, duration=2.18e0 s, rank_max_compute_time=1.20e-3 s, rank_avg_compute_time=982e-6 s, imbalance=0.220, grain_max_time=1.20e-3 s, migration count=4, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, negligible or no speedup is expected
vt: [0] (t) phase: Due to the large object grain size, no further speedup is possible
vt: [0] (t) phase: With a smaller object grain size, up to 1.22x more speedup (or 18.1% decrease in execution time) might have been possible
vt: [2] (t) general: restoreFromFileInPlace
vt: [1] (t) general: restoreFromFileInPlace
vt: [3] (t) general: restoreFromFileInPlace
vt: [0] (t) general: restoreFromFileInPlace
vt: [0] ------------------------------------------------------------------------------------------------------------------------
vt: [0] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [0] ------------------------------------------------ Fatal Error on Node 0 -------------------------------------------------
vt: [0] ------------------------------------------------------------------------------------------------------------------------
vt: [0]
vt: [0] Reason: This should have been a forwarding node
vt: [0] Assertion failed: (this_node != deliver_node)
vt: [0] Node: 0
vt: [0] Num Nodes: 4
vt: [0] File: /ascldap/users/nlslatt/.jacamar-ci/builds/8YUycndo/000/darma-tasking/vt/src/vt/topos/location/location.impl.h
vt: [0] Line: 438
vt: [0] Function: handleEagerUpdate
vt: [0] Code: 1
vt: [0] Build SHA: a086f820b7d5723ec9f8ac6fd0981230824c2230
With Intel 19 on 2 and 4 ranks, the following failure has also been seen, which suggests it may be related to the other issue mentioned above (#1894):
vt: [0] (t) phase: phase=5, duration=727e-3 s, rank_max_compute_time=972e-6 s, rank_avg_compute_time=807e-6 s, imbalance=0.205, grain_max_time=944e-6 s, migration count=7, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, expected execution should get a 1.03x speedup (or take 2.9% less time)
vt: [0] (t) phase: Due to the large object grain size, no further speedup is possible
vt: [0] (t) phase: With a smaller object grain size, up to 1.17x more speedup (or 14.5% decrease in execution time) might have been possible
vt: [1] (t) general: restoreFromFileInPlace
vt: [0] (t) general: restoreFromFileInPlace
unknown file: Failure
C++ exception with description "Unpacking wrong type, got=vt::vrt::collection::CollectionDirectory<vt::index::DenseIndexArray<int, (signed char)3> > (idx=3724550061), expected= (idx=0)
#0 vt::vrt::collection::CollectionDirectory<vt::index::DenseIndexArray<int, (signed char)3> >" thrown in the test body.
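I just looked over the source code, and I think this may be a race condition on the filesystem. These two unit tests use the same checkpoint filename. CTest is invoked differently in the GitLab CI jobs so that results can be reported to CDash, so I suspect some unit tests are running concurrently there instead of one at a time. If so, one test could overwrite or read the other's checkpoint file between checkpointToFile and restoreFromFileInPlace, which would be consistent with both the failed assertions and the "Unpacking wrong type" exception.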
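If the shared filename is indeed the cause, one possible mitigation is to give each test its own checkpoint name. The sketch below is only an illustration under that assumption: `makeCheckpointName` is a hypothetical helper that does not exist in vt, and the `_np<N>_checkpoint` naming scheme is made up for this example. It builds the name from the test name and the rank count, so all ranks in one run agree on the name without communicating.

```cpp
#include <string>

// Hypothetical helper (not part of vt): derive a checkpoint file name from
// the test name and the number of ranks, so that different tests, and the
// same test launched with different rank counts, never share a file.
// The result depends only on its arguments, so every rank computes the
// same name independently.
inline std::string makeCheckpointName(
  std::string const& test_name, int num_ranks
) {
  return test_name + "_np" + std::to_string(num_ranks) + "_checkpoint";
}
```

Each test would then use its own result wherever the shared filename is used today. This does not protect against two concurrent runs of the same test with the same rank count; that case would need a per-run component in the name or serialized test execution.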