TestCheckpoint.test_checkpoint_1 failures in Gitlab CI #1894

Closed
nlslatt opened this issue Aug 2, 2022 · 4 comments

@nlslatt
Collaborator

nlslatt commented Aug 2, 2022

Describe the bug
Two different checkpointing tests are failing in Gitlab CI (see also #1895 in case they are related). For TestCheckpoint.test_checkpoint_1:

With Intel 19 on 1 or 2 ranks, it has failed with errors like the one below:

vt: [0] (t) phase: phase=0, duration=206e-3 s, rank_max_compute_time=665e-6 s, rank_avg_compute_time=665e-6 s, imbalance=0.000, grain_max_time=367e-6 s, migration count=0, lb_name=NoLB
unknown file: Failure
C++ exception with description "Unpacking wrong type, got=vt::tests::unit::TestCol (idx=3724549805), expected= (idx=0)
#0 vt::tests::unit::TestCol" thrown in the test body.

With Intel 18 on 4 ranks, it has failed with the error:

vt: [0] (t) phase: phase=0, duration=721e-3 s, rank_max_compute_time=3.33e-3 s, rank_avg_compute_time=1.61e-3 s, imbalance=1.064, grain_max_time=3.00e-3 s, migration count=0, lb_name=NoLB
unknown file: Failure
C++ exception with description "mmap64 failed for writing file: errno=22: Invalid argument" thrown in the test body.

This test has also resulted in SIGBUS on 4 ranks with Intel 19 or 1 rank with Intel 18.

And I just saw this error on 2 and 4 ranks with Intel 19, which may be a significant clue:

vt: [0] (t) phase: phase=0, duration=1.18e0 s, rank_max_compute_time=1.05e-3 s, rank_avg_compute_time=940e-6 s, imbalance=0.116, grain_max_time=750e-6 s, migration count=0, lb_name=NoLB
/ascldap/users/nlslatt/.jacamar-ci/builds/8YUycndo/001/darma-tasking/vt/tests/unit/collection/test_checkpoint.extended.cc:231: Failure
Expected equality of these values:
  got_label
    Which is: "test_checkpoint_in_place_2"
  expected_label
    Which is: "test_checkpoint_1"
@PhilMiller
Member

I really hope we haven't found the hypothetical platform where the registry doesn't work

@nlslatt
Collaborator Author

nlslatt commented Aug 3, 2022

I just looked over the source code and I think this may be a race condition on the filesystem. These two unit tests use the same checkpoint filename. CTest is invoked differently in the Gitlab CI jobs so that the results can be reported to CDash, so I think that some unit tests are running concurrently instead of one at a time.
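
To make the suspected race concrete, here is a minimal sketch that replays one possible interleaving sequentially. It is not the vt test code: `writeLabel`, `readLabel`, and `replaySharedFileInterleaving` are hypothetical helpers, and the "checkpoint" is reduced to a single label line. Only the two label strings are taken from the failure output above.

```cpp
#include <fstream>
#include <string>

// Hypothetical stand-in for a test writing its checkpoint: the file holds
// only a label, and any previous contents are discarded.
void writeLabel(std::string const& path, std::string const& label) {
  std::ofstream out(path, std::ios::trunc);
  out << label << '\n';
}

// Hypothetical stand-in for a test restoring its checkpoint: returns the
// label found on the first line of the file.
std::string readLabel(std::string const& path) {
  std::ifstream in(path);
  std::string label;
  std::getline(in, label);
  return label;
}

// Replays, one step at a time, an ordering that two concurrently running
// tests could produce when both use the same file path.
std::string replaySharedFileInterleaving(std::string const& shared_path) {
  writeLabel(shared_path, "test_checkpoint_1");           // test A writes
  writeLabel(shared_path, "test_checkpoint_in_place_2");  // test B overwrites
  return readLabel(shared_path);                          // test A reads back
}
```

In this ordering the first test reads back the second test's label, which matches the `got_label` / `expected_label` mismatch reported above. A write that lands while the other test is still mapping or reading the file could plausibly account for the `mmap64` and SIGBUS failures as well, though that part is a guess.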

@nlslatt
Collaborator Author

nlslatt commented Aug 3, 2022

I think that all filenames used by tests need to be unique both across tests and across test invocations with different numbers of ranks.
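
One way to get that uniqueness is to derive the filename from the test name and the rank count. The following is only a sketch under assumptions: `makeCheckpointFileName`, the `_np` infix, and the `.ckpt` extension are made up for illustration and are not taken from vt or from #1897.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of vt): builds a checkpoint filename that
// differs for every (test name, rank count) pair.
std::string makeCheckpointFileName(
  std::string const& test_name, int num_ranks
) {
  assert(num_ranks > 0 && "rank count must be positive");

  // Replace characters that are awkward in filenames (for example the '/'
  // that appears in parameterized test names) with '_'.
  std::string safe_name;
  safe_name.reserve(test_name.size());
  for (unsigned char c : test_name) {
    safe_name.push_back(std::isalnum(c) ? static_cast<char>(c) : '_');
  }

  return safe_name + "_np" + std::to_string(num_ranks) + ".ckpt";
}
```

For example, `makeCheckpointFileName("test_checkpoint_1", 2)` and `makeCheckpointFileName("test_checkpoint_in_place_2", 2)` yield different names, as do the 2-rank and 4-rank runs of the same test. Two simultaneous runs of the same test with the same rank count would still collide, so a per-run directory or process ID would be needed if CI can schedule those together.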

@nlslatt
Collaborator Author

nlslatt commented Aug 16, 2022

This appears to have been fixed by #1897.

@nlslatt nlslatt closed this as completed Sep 12, 2022