TestCheckpoint.test_checkpoint_in_place_2 failures in Gitlab CI #1895

Closed
nlslatt opened this issue Aug 2, 2022 · 3 comments

@nlslatt
Collaborator

nlslatt commented Aug 2, 2022

Describe the bug
Two different checkpointing tests are failing in Gitlab CI (see also #1894 in case they are related). For TestCheckpoint.test_checkpoint_in_place_2:

With Intel 18 on 2 ranks, Intel 19 on 2 or 4 ranks, and ARM on 2 or 4 ranks, this test has failed with errors like the following:

vt: [0] (t) phase: phase=4, duration=4.44e-3 s, rank_max_compute_time=1.15e-3 s, rank_avg_compute_time=987e-6 s, imbalance=0.166, grain_max_time=1.07e-3 s, migration count=18, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, expected execution should get a 1.07x speedup (or take 6.7% less time)
vt: [0] (t) phase: Due to the large object grain size, no further speedup is possible
vt: [0] (t) phase: With a smaller object grain size, up to 1.09x more speedup (or 8.1% decrease in execution time) might have been possible
vt: [1] (t) general: checkpointToFile
vt: [0] (t) general: checkpointToFile
vt: [0] (t) phase: phase=5, duration=1.05e0 s, rank_max_compute_time=1.21e-3 s, rank_avg_compute_time=1.06e-3 s, imbalance=0.141, grain_max_time=912e-6 s, migration count=26, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, expected execution should get a 1.14x speedup (or take 12.3% less time)
vt: [0] (t) general: restoreFromFileInPlace
vt: [1] (t) general: restoreFromFileInPlace
vt: [0] ------------------------------------------------------------------------------------------------------------------------
vt: [0] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [0] ------------------------------------------------ Fatal Error on Node 0 -------------------------------------------------
vt: [0] ------------------------------------------------------------------------------------------------------------------------
vt: [0] 
vt: [0]    Assertion failed: (proxy(idx).tryGetLocalPtr() != nullptr)
vt: [0]                Node: 0
vt: [0]           Num Nodes: 2
vt: [0]                File: ../../src/vt/vrt/collection/manager.impl.h
vt: [0]                Line: 2267
vt: [0]            Function: restoreFromFileInPlace
vt: [0]                Code: 1
vt: [0]           Build SHA: a086f820b7d5723ec9f8ac6fd0981230824c2230

With Intel 18 on 4 ranks, it has also failed with:

vt: [0] (t) phase: phase=4, duration=4.32e-3 s, rank_max_compute_time=1.18e-3 s, rank_avg_compute_time=913e-6 s, imbalance=0.293, grain_max_time=1.18e-3 s, migration count=8, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, negligible or no speedup is expected
vt: [0] (t) phase: Due to the large object grain size, no further speedup is possible
vt: [0] (t) phase: With a smaller object grain size, up to 1.29x more speedup (or 22.7% decrease in execution time) might have been possible
vt: [3] (t) general: checkpointToFile
vt: [0] (t) general: checkpointToFile
vt: [1] (t) general: checkpointToFile
vt: [2] (t) general: checkpointToFile
vt: [0] (t) phase: phase=5, duration=2.18e0 s, rank_max_compute_time=1.20e-3 s, rank_avg_compute_time=982e-6 s, imbalance=0.220, grain_max_time=1.20e-3 s, migration count=4, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, negligible or no speedup is expected
vt: [0] (t) phase: Due to the large object grain size, no further speedup is possible
vt: [0] (t) phase: With a smaller object grain size, up to 1.22x more speedup (or 18.1% decrease in execution time) might have been possible
vt: [2] (t) general: restoreFromFileInPlace
vt: [1] (t) general: restoreFromFileInPlace
vt: [3] (t) general: restoreFromFileInPlace
vt: [0] (t) general: restoreFromFileInPlace
vt: [0] ------------------------------------------------------------------------------------------------------------------------
vt: [0] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [0] ------------------------------------------------ Fatal Error on Node 0 -------------------------------------------------
vt: [0] ------------------------------------------------------------------------------------------------------------------------
vt: [0] 
vt: [0]              Reason: This should have been a forwarding node
vt: [0]    Assertion failed: (this_node != deliver_node)
vt: [0]                Node: 0
vt: [0]           Num Nodes: 4
vt: [0]                File: /ascldap/users/nlslatt/.jacamar-ci/builds/8YUycndo/000/darma-tasking/vt/src/vt/topos/location/location.impl.h
vt: [0]                Line: 438
vt: [0]            Function: handleEagerUpdate
vt: [0]                Code: 1
vt: [0]           Build SHA: a086f820b7d5723ec9f8ac6fd0981230824c2230

With Intel 19 on 2 and 4 ranks, the following failure has also been seen, which suggests it is related to the other issue mentioned above:

vt: [0] (t) phase: phase=5, duration=727e-3 s, rank_max_compute_time=972e-6 s, rank_avg_compute_time=807e-6 s, imbalance=0.205, grain_max_time=944e-6 s, migration count=7, lb_name=TemperedLB
vt: [0] (t) phase: After load balancing, expected execution should get a 1.03x speedup (or take 2.9% less time)
vt: [0] (t) phase: Due to the large object grain size, no further speedup is possible
vt: [0] (t) phase: With a smaller object grain size, up to 1.17x more speedup (or 14.5% decrease in execution time) might have been possible
vt: [1] (t) general: restoreFromFileInPlace
vt: [0] (t) general: restoreFromFileInPlace
unknown file: Failure
C++ exception with description "Unpacking wrong type, got=vt::vrt::collection::CollectionDirectory<vt::index::DenseIndexArray<int, (signed char)3> > (idx=3724550061), expected= (idx=0)
#0 vt::vrt::collection::CollectionDirectory<vt::index::DenseIndexArray<int, (signed char)3> >" thrown in the test body.
@nlslatt
Collaborator Author

nlslatt commented Aug 3, 2022

I just looked over the source code, and I think this may be a race condition on the filesystem: these two unit tests use the same checkpoint filename. CTest is invoked differently in the GitLab CI jobs so that results can be reported to CDash, so I think some unit tests are running concurrently instead of one at a time.
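
To make the suspected failure mode concrete, here is a minimal sketch that replays one possible interleaving of two tests sharing a checkpoint file. It is not vt code: `writeCheckpoint`, `readCheckpoint`, `replaySharedFileInterleaving`, and the payload strings are all hypothetical stand-ins, and the real tests write serialized collection data rather than a line of text.

```cpp
#include <fstream>
#include <string>

// Hypothetical stand-in for a test writing its checkpoint:
// truncate `path` and store `payload` in it.
inline void writeCheckpoint(std::string const& path, std::string const& payload) {
  std::ofstream out(path, std::ios::trunc);
  out << payload;
}

// Hypothetical stand-in for a test restoring its checkpoint:
// read back the first line of `path`.
inline std::string readCheckpoint(std::string const& path) {
  std::ifstream in(path);
  std::string payload;
  std::getline(in, payload);
  return payload;
}

// Replays, sequentially, one interleaving that concurrent tests could hit
// when they share a file name. Returns what "test A" reads back.
inline std::string replaySharedFileInterleaving(std::string const& shared_path) {
  writeCheckpoint(shared_path, "state-of-test-A");  // test A checkpoints
  writeCheckpoint(shared_path, "state-of-test-B");  // test B overwrites the same file
  return readCheckpoint(shared_path);               // test A restores B's data
}
```

In this replay, test A restores the contents written by test B. If something similar happens in CI, it could plausibly account for a restore seeing data that does not match what the test expects.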

@nlslatt
Collaborator Author

nlslatt commented Aug 3, 2022

I think all filenames used by tests need to be unique, both across tests and across invocations of the same test with different numbers of ranks.
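
One way to get such names is sketched below, assuming the test knows its own suite name, test name, and rank count. `makeCheckpointName` and the `.np<N>.ckpt` naming scheme are hypothetical and are not taken from vt or from the eventual fix.

```cpp
#include <string>

// Hypothetical helper (not part of vt): build a checkpoint file name that
// differs between tests and between rank counts for the same test.
inline std::string makeCheckpointName(
  std::string const& suite, std::string const& test, int num_ranks
) {
  return suite + "." + test + ".np" + std::to_string(num_ranks) + ".ckpt";
}
```

For example, `makeCheckpointName("TestCheckpoint", "test_checkpoint_in_place_2", 2)` yields `TestCheckpoint.test_checkpoint_in_place_2.np2.ckpt`. This scheme does not separate two simultaneous runs of the same test with the same rank count; that case would need an extra component such as a per-run directory.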

@nlslatt
Collaborator Author

nlslatt commented Aug 16, 2022

This appears to have been fixed by #1897.

nlslatt closed this as completed Sep 12, 2022