Fix flaky tests that are recently popping up #4506

Closed · 14 tasks done
NicolasHug opened this issue Sep 30, 2021 · 11 comments · Fixed by #4766

Comments

@NicolasHug (Member) commented Sep 30, 2021

Since #4497 was merged, we're observing a few tests that start randomly failing.

Before #4497, these tests were almost always running with the same RNG state, which was set by a test that ran earlier in the suite. Now that all tests are properly independent and the RNG state doesn't leak, these tests get a fresh RNG state on each execution, so if they're unstable they may fail.

(Note: this is a good thing; it's better to find out that they fail now rather than when submitting an unrelated PR, which is what happened in #3032 (comment).)

For each of these tests we should find out whether the flakiness is severe or not. A simple approach is to parametrize the test over 100 or 1000 random seeds and check the failure rate, as sketched below. If the failure rate is reasonable we can just set a seed with torch.manual_seed(). If not, we should try to fix the test and make it more robust.
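
A minimal sketch of that seed sweep, assuming pytest parametrization; the assertion body below is a placeholder standing in for whatever the flaky test actually checks, not the real test code:

```python
import pytest
import torch


@pytest.mark.parametrize("seed", range(1000))
def test_flakiness_sweep(seed):
    # Seed the RNG, run the test body, and let pytest report how many of
    # the 1000 seeds fail. The check below is only a placeholder.
    torch.manual_seed(seed)
    sample = torch.rand(10_000)
    assert abs(sample.mean().item() - 0.5) < 0.01
```

Counting the failures out of the 1000 runs gives a rough failure rate to decide between pinning a seed and rewriting the test.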

The list of tests so far is:

cc @pmeier

@NicolasHug (Member Author)

As we previously discussed offline, I'll start looking into the tests that rely on p-value testing. test_random_apply above is one of them.

I'll change these to just test the cases p == 0 and p == 1 -- for all values in between, we'll just rely on PyTorch core tests for torch.rand(1).
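
Roughly what I have in mind for those endpoint checks, assuming torchvision.transforms.RandomApply wrapping a deterministic transform (a sketch only, not the final test):

```python
import torch
from torchvision import transforms


def test_random_apply_endpoints():
    img = torch.rand(3, 8, 8)
    flip = transforms.RandomHorizontalFlip(p=1.0)  # deterministic inner transform

    never = transforms.RandomApply([flip], p=0.0)
    always = transforms.RandomApply([flip], p=1.0)

    # p == 0: the wrapped transform should never run
    assert torch.equal(never(img), img)
    # p == 1: the wrapped transform should always run
    assert torch.equal(always(img), flip(img))
```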

@NicolasHug (Member Author)

Looking into test_randomperspective_fill right now.

@datumbox (Contributor)

investigating test_frozenbatchnorm2d_eps

@jdsgomes (Contributor)

looking at test_batched_nms_implementations

@NicolasHug (Member Author)

looking at the test_color_jitter_xyz ones

@datumbox (Contributor)

looking at test.test_ops.TestRoiPool

@NicolasHug (Member Author)

Looking into random_erasing, I think it's the last one

@datumbox (Contributor)

Looking back at this ticket, I think we all did a fine intern job. Good for us :P

@jdsgomes (Contributor)

In the case of test_batched_nms_implementations I observed 6 failures in the first 10k seeds.
I am fixing the seed for now in this PR, but I would like your opinion on whether I should dig deeper into the failing cases to see if there are any boundary conditions that might cause the difference between the two methods.
@NicolasHug @datumbox any thoughts?
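
For reference, a simplified sketch of this kind of seed sweep, comparing torchvision.ops.batched_nms against a naive per-class nms loop; the box generation and seed count below are illustrative, not the actual test code:

```python
import torch
from torchvision.ops import batched_nms, nms


def per_class_nms(boxes, scores, idxs, iou_threshold):
    # Reference implementation: run nms independently per class, then merge
    # and re-sort the kept indices by descending score, like batched_nms does.
    keep = []
    for cls in idxs.unique():
        mask = (idxs == cls).nonzero(as_tuple=True)[0]
        kept = nms(boxes[mask], scores[mask], iou_threshold)
        keep.append(mask[kept])
    keep = torch.cat(keep)
    return keep[scores[keep].argsort(descending=True)]


failures = 0
for seed in range(1000):
    torch.manual_seed(seed)
    boxes = torch.rand(100, 4) * 100
    boxes[:, 2:] += boxes[:, :2]  # ensure x2 > x1 and y2 > y1
    scores = torch.rand(100)
    idxs = torch.randint(0, 5, (100,))
    a = batched_nms(boxes, scores, idxs, iou_threshold=0.5)
    b = per_class_nms(boxes, scores, idxs, iou_threshold=0.5)
    failures += int(not torch.equal(a, b))
print(f"{failures} mismatching seeds out of 1000")
```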

@NicolasHug (Member Author)

It would be interesting to figure out whether the 6 failures correspond to a specific edge-case, but I wouldn't spend too much time on it either.

It could just be some ties in the sorting (which is not a stable sort)?

@datumbox (Contributor)

It's the unstable sort. See #4766 (comment)

I think it's worth understanding why the open-source contributor couldn't make the sort stable (they were hitting a segfault, if I remember correctly). Fixing the sort will fix a lot of instability in the detection models, so it's definitely worthwhile.
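
For anyone following along, a tiny illustration of the tie behaviour, assuming torch.sort's stable flag (available in recent PyTorch releases):

```python
import torch

# Three tied scores: the default (non-stable) sort may return their indices in
# any order, which is enough to change which box a downstream NMS keeps first.
scores = torch.tensor([0.5, 0.3, 0.5, 0.5])

_, default_order = torch.sort(scores, descending=True)
_, stable_order = torch.sort(scores, descending=True, stable=True)

print(default_order.tolist())  # tie order is backend-dependent
print(stable_order.tolist())   # ties keep their input order: [0, 2, 3, 1]
```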
