
Fix flakiness on detection tests #2966

Merged

merged 11 commits into pytorch:master from tests/fix_detection_flakiness on Nov 6, 2020

Conversation

@datumbox (Contributor) commented Nov 5, 2020

The detection model tests are extremely brittle and break regularly. From past investigations we know that this is caused by floating point errors and the unstable sort in NMS (see here for details). The problem is made worse by the fact that our tests use random input and random model weights.
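A small constructed example (not from the test suite) of why duplicate scores cause trouble: two boxes with identical scores can legitimately be kept in either order by NMS, so comparing indices element-wise can fail even though both results are correct.

import torch

scores = torch.tensor([0.9, 0.7, 0.7, 0.1])
order_a = torch.tensor([0, 1, 2, 3])  # one valid descending-score order
order_b = torch.tensor([0, 2, 1, 3])  # another, equally valid, order

print(torch.equal(order_a, order_b))                     # False: the indices differ
print(torch.allclose(scores[order_a], scores[order_b]))  # True: the kept scores agree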

As a result of the above, our unit-test code contains a large number of hacks/workarounds to handle special cases; it also turns off quite a few tests. This PR removes all previous workarounds in favour of a single one.

We follow the same approach as in the NMS tests:

vision/test/test_ops.py, lines 437 to 442 in dc5d055:

is_eq = torch.allclose(r_cpu, r_cuda.cpu())
if not is_eq:
    # if the indices are not the same, ensure that it's because the scores
    # are duplicate
    is_eq = torch.allclose(scores[r_cpu], scores[r_cuda.cpu()], rtol=tol, atol=tol)
self.assertTrue(is_eq, err_msg.format(iou))

More specifically, we first try to verify the entire output (full validation); if this fails, we check whether duplicate scores could have led to an unstable sort (partial validation). If all the full validations in a test pass, we mark it as a success. If any of the partial validations fails, we mark it as a failure. Otherwise we mark it as skipped, to indicate that the test could only be verified partially.
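As a rough sketch (with hypothetical helper names and checks, not the exact code in this PR), the decision logic looks like this:

import unittest
import warnings
import torch

def validate_outputs(test_case, outputs, expected, tol=1e-3):
    # Each output is either fully validated or, failing that, partially validated.
    full_validation = True
    for out, exp in zip(outputs, expected):
        try:
            # Full validation: the whole output must match the expected values.
            test_case.assertTrue(torch.allclose(out["boxes"], exp["boxes"], atol=tol))
            test_case.assertTrue(torch.allclose(out["scores"], exp["scores"], atol=tol))
        except AssertionError:
            # Partial validation: only accept the mismatch if there are
            # (near-)duplicate scores that could make the NMS sort unstable.
            s = torch.sort(out["scores"]).values
            has_duplicates = s.numel() > 1 and bool(torch.any(s[1:] - s[:-1] < tol))
            test_case.assertTrue(has_duplicates)
            full_validation = False

    # If every comparison passed fully, the test is green; if a partial check
    # failed, the AssertionError above already failed the test; otherwise skip.
    if not full_validation:
        msg = "Only partial validation was possible for this test."
        warnings.warn(msg)
        raise unittest.SkipTest(msg)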

@fmassa proposed an alternative, less aggressive approach, which is documented in the source code for future reference. Given that this technique is more complex and requires further investigation, I propose to merge this PR to unblock other pieces of work (such as #2954) and revisit it in the near future.

I left some comments in this review for clarification.

codecov bot commented Nov 5, 2020

Codecov Report

Merging #2966 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #2966   +/-   ##
=======================================
  Coverage   73.43%   73.43%           
=======================================
  Files          99       99           
  Lines        8813     8813           
  Branches     1391     1391           
=======================================
  Hits         6472     6472           
  Misses       1916     1916           
  Partials      425      425           

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4738267...496cf21.

if elements_per_sample > 30:
    return compute_mean_std(tensor)
else:
    return subsample_tensor(tensor)
@datumbox (Contributor Author):
Some tensors are too large to store as expected values (for example, masks). If the input has more than 30 elements per record, we assert on its mean and standard deviation instead.
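A minimal sketch of the idea (the function name follows the snippet above; the exact statistics stored are an assumption): instead of saving the full tensor, keep a small summary that can be compared later.

import torch

def compute_mean_std(tensor):
    # Summarize a large tensor with a few stable statistics instead of raw values.
    t = tensor.to(torch.float64)
    return {"mean": torch.mean(t), "std": torch.std(t)}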

     ith_index = num_elems // num_samples
-    return flat_tensor[ith_index - 1::ith_index]
+    return tensor[ith_index - 1::ith_index]
@datumbox (Contributor Author):

We no longer flatten the results of the sample. This can be useful if we want to maintain the structure of the boxes.
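A small constructed illustration of the difference: subsampling the original tensor keeps whole boxes, while subsampling a flattened view returns individual scalars.

import torch

boxes = torch.arange(40.0).reshape(10, 4)  # 10 boxes of [x1, y1, x2, y2]
flat = boxes.flatten()

print(boxes[4::5].shape)  # torch.Size([2, 4]): whole boxes are preserved
print(flat[4::5].shape)   # torch.Size([8]): the box structure is lost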

# and then using the Hungarian algorithm as in DETR to find the
# best match between output and expected boxes and eliminate some
# of the flakiness. Worth exploring.
return False # Partial validation performed
@datumbox (Contributor Author):

If we reached this point, we partially validated the results. Return False so that we can flag this accordingly.
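For reference, a rough sketch of the Hungarian-matching idea mentioned in the code comment above (it is not implemented in this PR; the helper below and its tolerance are assumptions for illustration): match predicted boxes to expected boxes by minimizing a pairwise cost, then compare the matched pairs instead of relying on the output order.

import torch
from scipy.optimize import linear_sum_assignment

def boxes_match(out_boxes, exp_boxes, atol=1e-3):
    # Hypothetical helper: solve an optimal assignment over the L1 box distances.
    if out_boxes.shape != exp_boxes.shape:
        return False
    cost = torch.cdist(out_boxes, exp_boxes, p=1)
    row_ind, col_ind = linear_sum_assignment(cost.numpy())
    row_ind, col_ind = torch.as_tensor(row_ind), torch.as_tensor(col_ind)
    return torch.allclose(out_boxes[row_ind], exp_boxes[col_ind], atol=atol)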

-    check_out(out)
+    full_validation &= check_out(out)

     if not full_validation:
@datumbox (Contributor Author):

If any of the validations was partial, flag the test as skipped and raise a warning. It's better to flag it as skipped than mark it as green.

     torch.save(output, expected_file)
     MAX_PICKLE_SIZE = 50 * 1000  # 50 KB
     binary_size = os.path.getsize(expected_file)
-    self.assertTrue(binary_size <= MAX_PICKLE_SIZE)
+    if binary_size > MAX_PICKLE_SIZE:
+        raise RuntimeError("The output for {}, is larger than 50kb".format(filename))
@datumbox (Contributor Author):

Do a manual check instead of an assertion. We don't want to throw an AssertionError that would be caught by the outer try.
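A minimal illustration (hypothetical structure, not the actual test code) of why a different exception type is needed here: the surrounding try block only treats AssertionError as a signal to fall back to partial validation, so an oversized expected file has to surface through another exception type.

def save_expected(binary_size, max_size=50 * 1000):
    # Hypothetical stand-in for the file-size check described above.
    if binary_size > max_size:
        raise RuntimeError("expected file is larger than 50kb")

try:
    try:
        save_expected(binary_size=80 * 1000)
    except AssertionError:
        print("fallback to partial validation")  # not reached: RuntimeError escapes
except RuntimeError as err:
    print("test fails as intended:", err)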

@@ -148,10 +145,9 @@ def _test_detection_model(self, name, dev):
     set_rng_seed(0)
     kwargs = {}
     if "retinanet" in name:
-        kwargs["score_thresh"] = 0.013
+        # Reduce the default threshold to ensure the returned boxes are not empty.
+        kwargs["score_thresh"] = 0.01
@datumbox (Contributor Author):

Use a threshold that will produce many more boxes and make things more interesting.
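For context, score_thresh is a constructor argument of the RetinaNet models in torchvision; a lower value keeps more low-confidence detections after the score filter. A small usage sketch (random weights and random input, so the exact box counts vary from run to run):

import torch
import torchvision

# Build RetinaNet with a lower score threshold so more detections survive.
model = torchvision.models.detection.retinanet_resnet50_fpn(
    pretrained=False, pretrained_backbone=False, num_classes=2, score_thresh=0.01)
model.eval()

with torch.no_grad():
    out = model([torch.rand(3, 300, 300)])
print(out[0]["boxes"].shape)  # usually many more boxes than with the 0.05 default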

@datumbox datumbox changed the title [WIP] Fix flakiness on detection tests Fix flakiness on detection tests Nov 5, 2020
@datumbox datumbox requested a review from fmassa November 5, 2020 18:14
@fmassa (Member) left a comment

Looks great to me, thanks a lot Vasilis!

# in NMS. If matching across all outputs fails, use the same approach
# as in NMSTester.test_nms_cuda to see if this is caused by duplicate
# scores.
expected_file = self._get_expected_file(strip_suffix=strip_suffix)
@fmassa (Member):

Note for the future: we might move away from our custom base Tester in favor of the PyTorch one.

@fmassa fmassa merged commit 7f7ff05 into pytorch:master Nov 6, 2020
@datumbox datumbox deleted the tests/fix_detection_flakiness branch November 6, 2020 10:49
bryant1410 pushed a commit to bryant1410/vision-1 that referenced this pull request Nov 22, 2020
* Simplify the ACCEPT=True logic in assertExpected().

* Separate the expected filename estimation from assertExpected

* Unflatten expected values.

* Assert for duplicate scores if primary check fails.

* Remove custom exceptions for algorithms and add a compact function for shrinking large outputs.

* Removing unused variables.

* Add warning and comments.

* Re-enable all autocast unit-tests for detection and mark the tests as skipped in partial validation.

* Move test skip at the end.

* Changing the warning message.
vfdev-5 pushed a commit to Quansight/vision that referenced this pull request Dec 4, 2020