
Standalone tests can silently fail #12474

Closed · awaelchli opened this issue Mar 27, 2022 · 16 comments · Fixed by #12493 or #15341
Labels: bug (Something isn't working), ci (Continuous Integration), help wanted (Open to be worked on)

Comments

awaelchli (Contributor) commented Mar 27, 2022

🐛 Bug

The CI passes with a green check mark, but the logs clearly show that the standalone tests fail.

To Reproduce

This can be seen on the latest commits on master.

At least two parametrizations of this test fail: test_progress_bar_max_val_check_interval

Expected behavior

The CI fails when a test fails.

Environment

PL master

Additional context

As I always say, don't trust the CI.

cc @tchaton @rohitgr7 @akihironitta @carmocca @Borda

awaelchli added the needs triage, bug, and ci labels Mar 27, 2022
awaelchli added this to the 1.6.x milestone Mar 27, 2022
awaelchli (Contributor, Author)

@rohitgr7 I'm not 100% sure, but it looks like this started happening with #11657.

carmocca added the priority: 0 (High priority task) label and removed the needs triage label Mar 28, 2022
Borda (Member) commented Mar 28, 2022

OK, just checked: we already have set -e, which should propagate the error upstream:
https://github.com/PyTorchLightning/pytorch-lightning/blob/2e5728a4841cb1d3b75123ac941cd36f681f9a11/.azure-pipelines/gpu-tests.yml#L112

rohitgr7 (Contributor)

This is weird. I just checked, and the test is passing on the Lightning cluster.

carmocca (Contributor)

Did you try using the script?

./tests/standalone_tests.sh -k test_progress_bar_max_val_check_interval

and running all standalone tests?

rohitgr7 (Contributor)

Thanks, @carmocca!! Now it's failing. Let me check the issue.

btw how is this different from running the tests manually?

rohitgr7 mentioned this issue Mar 28, 2022
carmocca (Contributor)

It manages calling multiple (or all) standalone tests with a printed report at the end.

rohitgr7 (Contributor)

Okay, but I'm still trying to understand why it failed when run with ./tests/standalone_tests.sh but not when run manually, even though the test had a bug.

carmocca modified the milestones: 1.6.x, 1.6 Mar 28, 2022
awaelchli (Contributor, Author)

@rohitgr7 As to why the CI doesn't pick it up, I guess it's because the test passed on rank 0 but not on rank 1. The error on rank 1 does not seem to get picked up even though it appears in the logs. If one added a barrier at the end of the test, we most likely would have seen a hang instead of a silent pass.
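
A minimal sketch of the barrier point, assuming torch.distributed is already initialized by the DDP test (the test body here is hypothetical):

import torch.distributed as dist

def _test_body(global_rank: int) -> None:
    if global_rank == 1:
        raise RuntimeError("fails only on rank 1")  # traceback lands in the logs
    # Without a trailing barrier, rank 0 finishes green and pytest exits 0.
    # With it, rank 0 blocks waiting for the dead rank 1, so the job hangs
    # (and eventually times out) instead of silently passing.
    dist.barrier()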

This will happen again in the future, so the issue is not yet resolved. I vote for reopening this.

awaelchli reopened this Mar 29, 2022
carmocca modified the milestones: 1.6, 1.6.x Mar 29, 2022
rohitgr7 (Contributor)

This will happen again in the future, so the issue is not yet resolved. I vote for reopening this.

yep, I mentioned this in the PR description. Thanks for reopening it :)

carmocca (Contributor)

Is this still relevant?

awaelchli (Contributor, Author)

Yes, especially since we no longer log the stdout and stderr for standalone tests.

For DDP tests, rank 0 is the process that launches the tests, and if an error occurs there we will see it raised. However, if a test fails only on a rank > 0, we won't see it. This is rare, but it has happened before.
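
To make the failure mode concrete, here is a toy sketch (assumed for illustration, not the actual launcher code) of how a parent process stays green while a worker dies:

import multiprocessing as mp

def _worker(rank: int) -> None:
    if rank > 0:
        raise RuntimeError("fails only on rank > 0")  # traceback goes to the logs

if __name__ == "__main__":
    procs = [mp.Process(target=_worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # p.exitcode is 1 for the failed worker, but unless the parent inspects
    # it and raises, the parent process (and thus pytest and the CI step)
    # still exits 0.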

carmocca (Contributor)

Do you have any ideas about how we can avoid this issue?

carmocca changed the title from "Standalone tests silently fail" to "Standalone tests can silently fail" Jul 28, 2022
carmocca removed the priority: 0 (High priority task) label Jul 28, 2022
carmocca modified the milestones: pl:1.6.x, pl:1.7.x Jul 28, 2022
awaelchli (Contributor, Author)

The only way I can think of is to parse the stderr produced by pytest and look for any exceptions being printed. That would have to be done in the bash script that calls pytest, but I'm not sure it's worth it; it seems complicated to me.

carmocca (Contributor) commented Aug 3, 2022

We already capture the stdout and stderr (&>>) from pytest: https://github.com/Lightning-AI/lightning/blob/8af85eeaafc4fe4ef11098a27f795416fe608c6a/tests/tests_pytorch/run_standalone_tests.sh#L69

We could easily grep for exceptions inside this function: https://github.com/Lightning-AI/lightning/blob/8af85eeaafc4fe4ef11098a27f795416fe608c6a/tests/tests_pytorch/run_standalone_tests.sh#L49-L54 and error out in that case. This might be a flaky heuristic, though.
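
A rough sketch of that heuristic, rendered in Python for illustration (the actual change would be a grep inside run_standalone_tests.sh; the capture filename here is hypothetical):

import re
import sys
from pathlib import Path

output = Path("standalone_test_output.txt").read_text()  # hypothetical capture file
# Scan the captured pytest output for traceback headers and fail the step if
# any are found. As noted above, the heuristic can be flaky: tests that
# deliberately print or log exceptions would also trip it.
if re.search(r"Traceback \(most recent call last\)", output):
    sys.stderr.write("exception found in standalone test output\n")
    raise SystemExit(1)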

carmocca added the help wanted (Open to be worked on) label Aug 3, 2022
carmocca modified the milestones: pl:1.7.x, v1.8.x Oct 13, 2022
carmocca (Contributor) commented Oct 26, 2022

Maybe you closed this for a different reason, but just in case I wasn't clear in my previous comment:

We could easily grep for exceptions inside this function:

This is not implemented yet and could still be done.

awaelchli (Contributor, Author) commented Oct 26, 2022

@carmocca I don't know exactly what to do here, and I don't feel like working on this right now. However, if it helps anyone, here is a test that succeeds on rank 0 and fails on rank 1, reproducing the silent behavior described in this issue:

# import paths assumed for the repo's tests_pytorch layout
from pytorch_lightning import Trainer
from pytorch_lightning.demos.boring_classes import BoringModel
from tests_pytorch.helpers.runif import RunIf


@RunIf(standalone=True, min_cuda_gpus=2)
def test_silent_standalone_failure_rank_1():
    """A test that passes on rank 0, keeping the CI green, but fails on rank 1,
    leaving the error in the logs without failing the CI."""
    trainer = Trainer(strategy="ddp", accelerator="cuda", devices=2, fast_dev_run=True)
    trainer.fit(BoringModel())  # just to trigger launching the processes; fast_dev_run keeps it short

    # rank 0 returns normally; rank 1 raises, but its exit status never
    # reaches the pytest process running on rank 0
    if trainer.global_rank == 1:
        assert False
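
Running it through the standalone test script with -k test_silent_standalone_failure_rank_1 should reproduce the behavior described above: a green run with the rank 1 error visible only in the logs.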
