TestCheckpoint.test_checkpoint_in_place_2 failures in Gitlab CI #1895

nlslatt · 2022-08-02T22:22:47Z

Describe the bug
Two different checkpointing tests are failing in GitLab CI (see also #1894, in case they are related). This issue covers TestCheckpoint.test_checkpoint_in_place_2.

With Intel 18 on 2 ranks, Intel 19 on 2 or 4 ranks, and ARM on 2 or 4 ranks, this test has failed with errors like the one below:

vt: [0] (t) phase: phase=4, duration=4.44e-3 s, rank_max_compute_time=1.15e-3 s, rank_avg_compute_time=987e-6 s, imbalance=0.166, grain_max_time=1.07e-3 s, migration count=18, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, expected execution should get a 1.07x speedup (or take 6.7% less time)
vt: [0] (t) phase: Due to the large object grain size, no further speedup is possible
vt: [0] (t) phase: With a smaller object grain size, up to 1.09x more speedup (or 8.1% decrease in execution time) might have been possible
vt: [1] (t) general: checkpointToFile
vt: [0] (t) general: checkpointToFile
vt: [0] (t) phase: phase=5, duration=1.05e0 s, rank_max_compute_time=1.21e-3 s, rank_avg_compute_time=1.06e-3 s, imbalance=0.141, grain_max_time=912e-6 s, migration count=26, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, expected execution should get a 1.14x speedup (or take 12.3% less time)
vt: [0] (t) general: restoreFromFileInPlace
vt: [1] (t) general: restoreFromFileInPlace
vt: [0] ------------------------------------------------------------------------------------------------------------------------
vt: [0] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [0] ------------------------------------------------ Fatal Error on Node 0 -------------------------------------------------
vt: [0] ------------------------------------------------------------------------------------------------------------------------
vt: [0] 
vt: [0]    Assertion failed: (proxy(idx).tryGetLocalPtr() != nullptr)
vt: [0]                Node: 0
vt: [0]           Num Nodes: 2
vt: [0]                File: ../../src/vt/vrt/collection/manager.impl.h
vt: [0]                Line: 2267
vt: [0]            Function: restoreFromFileInPlace
vt: [0]                Code: 1
vt: [0]           Build SHA: a086f820b7d5723ec9f8ac6fd0981230824c2230

With Intel 18 on 4 ranks, it has also failed with:

vt: [0] (t) phase: phase=4, duration=4.32e-3 s, rank_max_compute_time=1.18e-3 s, rank_avg_compute_time=913e-6 s, imbalance=0.293, grain_max_time=1.18e-3 s, migration count=8, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, negligible or no speedup is expected
vt: [0] (t) phase: Due to the large object grain size, no further speedup is possible
vt: [0] (t) phase: With a smaller object grain size, up to 1.29x more speedup (or 22.7% decrease in execution time) might have been possible
vt: [3] (t) general: checkpointToFile
vt: [0] (t) general: checkpointToFile
vt: [1] (t) general: checkpointToFile
vt: [2] (t) general: checkpointToFile
vt: [0] (t) phase: phase=5, duration=2.18e0 s, rank_max_compute_time=1.20e-3 s, rank_avg_compute_time=982e-6 s, imbalance=0.220, grain_max_time=1.20e-3 s, migration count=4, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, negligible or no speedup is expected
vt: [0] (t) phase: Due to the large object grain size, no further speedup is possible
vt: [0] (t) phase: With a smaller object grain size, up to 1.22x more speedup (or 18.1% decrease in execution time) might have been possible
vt: [2] (t) general: restoreFromFileInPlace
vt: [1] (t) general: restoreFromFileInPlace
vt: [3] (t) general: restoreFromFileInPlace
vt: [0] (t) general: restoreFromFileInPlace
vt: [0] ------------------------------------------------------------------------------------------------------------------------
vt: [0] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [0] ------------------------------------------------ Fatal Error on Node 0 -------------------------------------------------
vt: [0] ------------------------------------------------------------------------------------------------------------------------
vt: [0] 
vt: [0]              Reason: This should have been a forwarding node
vt: [0]    Assertion failed: (this_node != deliver_node)
vt: [0]                Node: 0
vt: [0]           Num Nodes: 4
vt: [0]                File: /ascldap/users/nlslatt/.jacamar-ci/builds/8YUycndo/000/darma-tasking/vt/src/vt/topos/location/location.impl.h
vt: [0]                Line: 438
vt: [0]            Function: handleEagerUpdate
vt: [0]                Code: 1
vt: [0]           Build SHA: a086f820b7d5723ec9f8ac6fd0981230824c2230

With Intel 19 on 2 and 4 ranks, the following failure has also been seen, which makes it seem related to the other issue mentioned above (#1894):

vt: [0] (t) phase: phase=5, duration=727e-3 s, rank_max_compute_time=972e-6 s, rank_avg_compute_time=807e-6 s, imbalance=0.205, grain_max_time=944e-6 s, migration count=7, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, expected execution should get a 1.03x speedup (or take 2.9% less time)
vt: [0] (t) phase: Due to the large object grain size, no further speedup is possible
vt: [0] (t) phase: With a smaller object grain size, up to 1.17x more speedup (or 14.5% decrease in execution time) might have been possible
vt: [1] (t) general: restoreFromFileInPlace
vt: [0] (t) general: restoreFromFileInPlace
unknown file: Failure
C++ exception with description "Unpacking wrong type, got=vt::vrt::collection::CollectionDirectory<vt::index::DenseIndexArray<int, (signed char)3> > (idx=3724550061), expected= (idx=0)
#0 vt::vrt::collection::CollectionDirectory<vt::index::DenseIndexArray<int, (signed char)3> >" thrown in the test body.


nlslatt · 2022-08-03T16:43:14Z

I just looked over the source code, and I think this may be a race condition on the filesystem. The two failing unit tests (this one and the one in #1894) use the same checkpoint filename. CTest is invoked differently in the GitLab CI jobs so that results can be reported to CDash, so I think some unit tests are running concurrently instead of one at a time.

nlslatt · 2022-08-03T16:44:27Z

I think that all filenames used by tests need to be unique both across tests and across test invocations with different numbers of ranks.
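
To make the idea concrete, here is a minimal sketch of how a test could derive its checkpoint file base from the test name and the rank count. The helper `makeCheckpointBase` and the `ckpt_` prefix are hypothetical and are not part of vt or its test harness. The name deliberately leaves out anything rank-specific or process-specific (such as a PID), on the assumption that every rank in one run has to agree on the same base name.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a vt API): build a checkpoint file base that
// differs per test and per rank count, but is identical on every rank of a
// single run so that all ranks read and write the same set of files.
std::string makeCheckpointBase(
  std::string const& suite, std::string const& test, int num_ranks
) {
  assert(num_ranks > 0 && "rank count must be positive");
  return "ckpt_" + suite + "_" + test + "_np" + std::to_string(num_ranks);
}
```

In a GoogleTest-based test, the suite and test names could probably be taken from `::testing::UnitTest::GetInstance()->current_test_info()`, and the rank count from the runtime or `MPI_Comm_size`. This only separates runs that differ in test name or rank count; two concurrent runs of the same test with the same rank count in the same directory would still collide and would need separate working directories.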

nlslatt · 2022-08-16T18:50:42Z

This appears to have been fixed by #1897.

nlslatt added the type: bug label Aug 2, 2022

nlslatt mentioned this issue Aug 2, 2022

TestCheckpoint.test_checkpoint_1 failures in Gitlab CI #1894

Closed

nlslatt self-assigned this Aug 3, 2022

nlslatt mentioned this issue Aug 3, 2022

Use only unique filenames in unit tests #1896

Closed

nmm0 mentioned this issue Aug 16, 2022

Meeting Agenda [do not close] #925

Open

nlslatt closed this as completed Sep 12, 2022
