Standalone tests can silently fail #12474
Comments
Ok, just checking and we already have
This is weird. I just checked, and the test is passing on the Lightning cluster.
Did you try using the script?
and running all standalone tests?
Thanks, @carmocca!! Now it's failing. Let me check the issue. Btw, how is this different from running the tests manually?
It manages calling multiple (or all) standalone tests with a printed report at the end.
Okay, but I'm still trying to understand why it failed when running with
@rohitgr7 As to why the CI doesn't pick it up, I guess it's because the test passed on rank 0 but not on rank 1. The error on rank 1 does not seem to get picked up even though it appears in the logs. If one added a barrier at the end of the test, we most likely would have seen a hang instead of a silent pass. This will happen again in the future, so the issue is not yet resolved. I vote for reopening this.
Yep, I mentioned this in the PR description. Thanks for reopening it :)
Is this still relevant?
Yes, especially since we no longer log the stdout and stderr for standalone tests. For DDP tests, rank 0 is the process that launches the tests, and if an error occurs there, we will see it raised. However, if a test fails only on a rank > 0, we won't see it. It is rare but has happened before.
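To illustrate the general pattern (this is a minimal sketch, not Lightning's launcher code): a parent process that starts children only reports their failures if it explicitly checks their exit codes. The helper names `launch_ranks` and `collect_failures` are hypothetical, and the children here are plain Python snippets standing in for work done on ranks > 0.

```python
import subprocess
import sys


def launch_ranks(snippets):
    """Start one child Python process per snippet (stand-ins for ranks > 0)."""
    return [subprocess.Popen([sys.executable, "-c", code]) for code in snippets]


def collect_failures(processes):
    """Wait for every child and return the indices whose exit code is non-zero."""
    failed = []
    for index, process in enumerate(processes):
        if process.wait() != 0:
            failed.append(index)
    return failed
```

If the parent never calls something like `collect_failures` and exits with status 0, the child's traceback still shows up in the logs, but nothing turns it into a failing run.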
Do you have any ideas about how we can avoid this issue?
The only way I can think of is to parse the stderr produced by pytest and look for any exceptions being printed. That would have to be done in the bash script that makes the call to pytest, but I'm not sure it's worth it; it seems complicated to me.
We already capture the stdout and stderr. We could easily grep for exceptions inside this function: https://github.com/Lightning-AI/lightning/blob/8af85eeaafc4fe4ef11098a27f795416fe608c6a/tests/tests_pytorch/run_standalone_tests.sh#L49-L54 and error out in that case. This might be a flaky heuristic though.
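A rough sketch of what such a check could look like, written in Python rather than bash for readability. The function name `find_exception_lines` and the marker pattern are assumptions, not something that exists in the repository, and as noted above the heuristic can be flaky: it misses tracebacks with rank prefixes and can misfire on output that merely mentions an exception name at the start of a line.

```python
import re
from typing import List

# Assumed markers: the standard traceback header, or a line that starts with
# an exception class name such as "AssertionError" or "RuntimeError: ...".
_EXCEPTION_PATTERN = re.compile(
    r"^(Traceback \(most recent call last\):|\w+(\.\w+)*(Error|Exception)\b.*)$"
)


def find_exception_lines(output: str) -> List[str]:
    """Return the lines of captured output that look like part of a traceback."""
    return [
        line
        for line in output.splitlines()
        if _EXCEPTION_PATTERN.match(line.strip())
    ]
```

A wrapper around the pytest call could fail the run whenever this returns a non-empty list, even if pytest itself exited with status 0.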
Maybe you closed this for a different reason, but just in case I wasn't clear with my previous comment:
This is not implemented and could be done.
@carmocca I don't exactly know what to do and I don't feel like working on this right now. However, if it helps anyone, I can provide a test that succeeds on rank 0 and fails on rank 1, which reproduces the silent behavior described here.

@RunIf(standalone=True, min_cuda_gpus=2)
def test_silent_standalone_failure_rank_1():
    """A test that will pass on rank 0, making the CI happy, but silently failing on rank 1
    with an error in the logs but no CI failing."""
    trainer = Trainer(strategy="ddp", accelerator="cuda", devices=2)
    trainer.fit(BoringModel())  # just to trigger launching processes
    if trainer.global_rank == 1:
        assert False
🐛 Bug
The CI passes with a green check mark, but the logs clearly show that the standalone tests fail.
To Reproduce
Can be seen on the latest commits on master.
At least two parametrizations of this test fail: test_progress_bar_max_val_check_interval
Expected behavior
The CI fails when a test fails.
Environment
PL master
Additional context
As I always say, don't trust the CI.
cc @tchaton @rohitgr7 @akihironitta @carmocca @Borda