
Standalone tests can silently fail #12474

Closed · awaelchli opened this issue Mar 27, 2022 · 16 comments · Fixed by #12493 or #15341
Labels: bug (Something isn't working), ci (Continuous Integration), help wanted (Open to be worked on)

Comments

awaelchli (Contributor) commented Mar 27, 2022

🐛 Bug

The CI passes with a green check mark, but the logs clearly show that the standalone tests fail.

To Reproduce

This can be seen on the latest commits on master.

At least two parametrizations of this test fail: test_progress_bar_max_val_check_interval

Expected behavior

The CI fails when a test fails.

Environment

PL master

Additional context

As I always say, don't trust the CI.

cc @tchaton @rohitgr7 @akihironitta @carmocca @Borda

awaelchli added the needs triage, bug, and ci labels Mar 27, 2022
awaelchli added this to the 1.6.x milestone Mar 27, 2022
awaelchli (Contributor, Author)

@rohitgr7 I'm not 100% sure, but it looks like this started happening with #11657.

carmocca added the priority: 0 (High priority task) label and removed the needs triage label Mar 28, 2022
Borda (Member) commented Mar 28, 2022

OK, just checked: we already have set -e, which should propagate the error upstream:
https://github.com/PyTorchLightning/pytorch-lightning/blob/2e5728a4841cb1d3b75123ac941cd36f681f9a11/.azure-pipelines/gpu-tests.yml#L112

rohitgr7 (Contributor)

This is weird. I just checked, and the test is passing on the Lightning cluster.

carmocca (Contributor)

Did you try using the script?

./tests/standalone_tests.sh -k test_progress_bar_max_val_check_interval

and running all standalone tests?

rohitgr7 (Contributor)

Thanks, @carmocca!! Now it's failing. Let me check the issue.

btw how is this different from running the tests manually?

rohitgr7 mentioned this issue Mar 28, 2022
carmocca (Contributor)

It manages calling multiple (or all) standalone tests with a printed report at the end.

rohitgr7 (Contributor)

Okay, but I'm still trying to understand why it failed when run with ./tests/standalone_tests.sh but not when run manually, even though the test had a bug.

carmocca modified the milestones: 1.6.x, 1.6 Mar 28, 2022
awaelchli (Contributor, Author)

@rohitgr7 As to why the CI doesn't pick it up, I guess it's because the test passed on rank 0 but not on rank 1. The error on rank 1 does not seem to get picked up even though it appears in the logs. If one added a barrier at the end of the test, we most likely would have seen a hang instead of a silent pass.
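
A minimal sketch of the barrier point, assuming torch.distributed is already initialized by the DDP test (the test body here is hypothetical):

import torch.distributed as dist

def _test_body(global_rank: int) -> None:
    if global_rank == 1:
        raise RuntimeError("fails only on rank 1")  # traceback lands in the logs
    # Without a trailing barrier, rank 0 finishes green and pytest exits 0.
    # With it, rank 0 blocks waiting for the dead rank 1, so the job hangs
    # (and eventually times out) instead of silently passing.
    dist.barrier()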

This will happen again in the future, so the issue is not yet resolved. I vote for reopening this.

awaelchli reopened this Mar 29, 2022
carmocca modified the milestones: 1.6, 1.6.x Mar 29, 2022
rohitgr7 (Contributor)

This will happen again in the future, so the issue is not yet resolved. I vote for reopening this.

yep, I mentioned this in the PR description. Thanks for reopening it :)

carmocca (Contributor)

Is this still relevant?

awaelchli (Contributor, Author)

Yes, especially since we no longer log the stdout and stderr for standalone tests.

For DDP tests, rank 0 is the process that launches the tests, and if an error occurs there we will see it raised. However, if a test fails only on a rank > 0, we won't see it. This is rare, but it has happened before.
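
To make the failure mode concrete, here is a toy sketch (assumed for illustration, not the actual launcher code) of how a parent process stays green while a worker dies:

import multiprocessing as mp

def _worker(rank: int) -> None:
    if rank > 0:
        raise RuntimeError("fails only on rank > 0")  # traceback goes to the logs

if __name__ == "__main__":
    procs = [mp.Process(target=_worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # p.exitcode is 1 for the failed worker, but unless the parent inspects
    # it and raises, the parent process (and thus pytest and the CI step)
    # still exits 0.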

carmocca (Contributor)

Do you have any ideas about how we can avoid this issue?

carmocca changed the title from "Standalone tests silently fail" to "Standalone tests can silently fail" Jul 28, 2022
carmocca removed the priority: 0 (High priority task) label Jul 28, 2022
carmocca modified the milestones: pl:1.6.x, pl:1.7.x Jul 28, 2022
awaelchli (Contributor, Author)

The only way I can think of is to parse the stderr produced by pytest and look for any exceptions being printed. That would have to be done in the bash script that calls pytest, but I'm not sure it's worth it; it seems complicated to me.

carmocca (Contributor) commented Aug 3, 2022

We already capture the stdout and stderr (&>>) from pytest: https://github.com/Lightning-AI/lightning/blob/8af85eeaafc4fe4ef11098a27f795416fe608c6a/tests/tests_pytorch/run_standalone_tests.sh#L69

We could easily grep for exceptions inside this function: https://github.com/Lightning-AI/lightning/blob/8af85eeaafc4fe4ef11098a27f795416fe608c6a/tests/tests_pytorch/run_standalone_tests.sh#L49-L54 and error out in that case. This might be a flaky heuristic, though.
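
A rough sketch of that heuristic, rendered in Python for illustration (the actual change would be a grep inside run_standalone_tests.sh; the capture filename here is hypothetical):

import re
import sys
from pathlib import Path

output = Path("standalone_test_output.txt").read_text()  # hypothetical capture file
# Scan the captured pytest output for traceback headers and fail the step if
# any are found. As noted above, the heuristic can be flaky: tests that
# deliberately print or log exceptions would also trip it.
if re.search(r"Traceback \(most recent call last\)", output):
    sys.stderr.write("exception found in standalone test output\n")
    raise SystemExit(1)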

carmocca added the help wanted (Open to be worked on) label Aug 3, 2022
carmocca modified the milestones: pl:1.7.x, v1.8.x Oct 13, 2022
carmocca (Contributor) commented Oct 26, 2022

Maybe you closed this for a different reason, but just in case I wasn't clear in my previous comment:

We could easily grep for exceptions inside this function:

This is not implemented yet and could still be done.

awaelchli (Contributor, Author) commented Oct 26, 2022

@carmocca I don't know exactly what to do here, and I don't feel like working on this right now. However, if it helps anyone, here is a test that succeeds on rank 0 and fails on rank 1, reproducing the silent behavior described in this issue:

# import paths assumed for the repo's tests_pytorch layout
from pytorch_lightning import Trainer
from pytorch_lightning.demos.boring_classes import BoringModel
from tests_pytorch.helpers.runif import RunIf


@RunIf(standalone=True, min_cuda_gpus=2)
def test_silent_standalone_failure_rank_1():
    """A test that passes on rank 0, keeping the CI green, but fails on rank 1,
    leaving the error in the logs without failing the CI."""
    trainer = Trainer(strategy="ddp", accelerator="cuda", devices=2, fast_dev_run=True)
    trainer.fit(BoringModel())  # just to trigger launching the processes; fast_dev_run keeps it short

    # rank 0 returns normally; rank 1 raises, but its exit status never
    # reaches the pytest process running on rank 0
    if trainer.global_rank == 1:
        assert False
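
Running it through the standalone test script with -k test_silent_standalone_failure_rank_1 should reproduce the behavior described above: a green run with the rank 1 error visible only in the logs.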
