Fix flakiness on detection tests #2966
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #2966   +/-   ##
=======================================
  Coverage   73.43%   73.43%
=======================================
  Files          99       99
  Lines        8813     8813
  Branches     1391     1391
=======================================
  Hits         6472     6472
  Misses       1916     1916
  Partials      425      425

Continue to review full report at Codecov.
if elements_per_sample > 30:
    return compute_mean_std(tensor)
else:
    return subsample_tensor(tensor)
Some tensors are too large to store in the expected files (for example masks). If the input has more than 30 elements per record, we compare its mean and standard deviation instead.
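A minimal sketch of what the compaction helper referenced above could look like; the exact return format is an assumption for illustration.

```python
import torch

def compute_mean_std(tensor):
    # Large outputs (e.g. masks) would bloat the expected files, so we only
    # keep and compare summary statistics for them.
    mean = torch.mean(tensor)
    std = torch.std(tensor)
    return {"mean": mean, "std": std}
```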
  ith_index = num_elems // num_samples
- return flat_tensor[ith_index - 1::ith_index]
+ return tensor[ith_index - 1::ith_index]
We no longer flatten the tensor before subsampling. This is useful when we want to maintain the structure of the boxes.
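For context, a sketch of a subsampling helper consistent with the diff above; the `num_samples` default and the early-return guard are assumptions not shown in the snippet.

```python
def subsample_tensor(tensor, num_samples=20):
    # Subsample along the first dimension without flattening, so structured
    # outputs such as boxes of shape [N, 4] keep their trailing dimensions.
    num_elems = tensor.size(0)
    if num_elems <= num_samples:
        return tensor
    ith_index = num_elems // num_samples
    return tensor[ith_index - 1::ith_index]
```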
# and then using the Hungarian algorithm as in DETR to find the
# best match between output and expected boxes and eliminate some
# of the flakiness. Worth exploring.
return False  # Partial validation performed
If we reached this point, we partially validated the results. Return False so that we can flag this accordingly.
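Schematically, the contract of `check_out` can be summarized as below. The signature and the use of `torch.testing.assert_allclose` are stand-ins for the actual expected-file comparison in the test suite; only the True/False return convention matters here.

```python
import torch

def check_out(out, expected, prec=1e-3):
    # Sketch of the validation contract: return True when the strict
    # comparison succeeded, False when the relaxed fallback had to be used.
    try:
        torch.testing.assert_allclose(out, expected, rtol=prec, atol=prec)
        return True  # Full validation succeeded
    except AssertionError:
        # The relaxed duplicate-score check would run here before deciding
        # whether the mismatch is acceptable.
        return False  # Partial validation performed
```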
- check_out(out)
+ full_validation &= check_out(out)

if not full_validation:
If any of the validations was only partial, flag the test as skipped and raise a warning. It's better to flag it as skipped than to mark it as green.
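A sketch of how that could look with unittest; the helper name and the warning text are illustrative, not the exact ones used in the PR.

```python
import unittest
import warnings

def handle_partial_validation(full_validation, test_name):
    # If any of the checks fell back to partial validation, warn and skip
    # the test instead of reporting it green.
    if not full_validation:
        msg = ("The output of {} could only be partially validated. "
               "This is likely due to unit-test flakiness.".format(test_name))
        warnings.warn(msg, RuntimeWarning)
        raise unittest.SkipTest(msg)
```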
  torch.save(output, expected_file)
  MAX_PICKLE_SIZE = 50 * 1000  # 50 KB
  binary_size = os.path.getsize(expected_file)
- self.assertTrue(binary_size <= MAX_PICKLE_SIZE)
+ if binary_size > MAX_PICKLE_SIZE:
+     raise RuntimeError("The output for {}, is larger than 50kb".format(filename))
Do a manual check instead of an assertion. We don't want to throw an AssertionError that will be caught by the surrounding try block.
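To make the reasoning concrete, here is a sketch with a hypothetical wrapper function around the snippet above:

```python
import os
import torch

MAX_PICKLE_SIZE = 50 * 1000  # 50 KB

def save_expected(output, expected_file, filename):
    # The size check raises a RuntimeError on purpose: an AssertionError
    # (e.g. from self.assertTrue) would be swallowed by the
    # `except AssertionError:` clause that wraps the expected-file
    # comparison, silently turning an oversized file into a "partial
    # validation" instead of a hard failure.
    torch.save(output, expected_file)
    binary_size = os.path.getsize(expected_file)
    if binary_size > MAX_PICKLE_SIZE:
        raise RuntimeError("The output for {}, is larger than 50kb".format(filename))
```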
@@ -148,10 +145,9 @@ def _test_detection_model(self, name, dev):
      set_rng_seed(0)
      kwargs = {}
      if "retinanet" in name:
-         kwargs["score_thresh"] = 0.013
+         # Reduce the default threshold to ensure the returned boxes are not empty.
+         kwargs["score_thresh"] = 0.01
Use a threshold that will produce many more boxes and make things more interesting.
Looks great to me, thanks a lot Vasilis!
# in NMS. If matching across all outputs fails, use the same approach
# as in NMSTester.test_nms_cuda to see if this is caused by duplicate
# scores.
expected_file = self._get_expected_file(strip_suffix=strip_suffix)
Note for the future: we might move away from our custom base Tester in favor of the PyTorch one.
* Simplify the ACCEPT=True logic in assertExpected().
* Separate the expected filename estimation from assertExpected.
* Unflatten expected values.
* Assert for duplicate scores if the primary check fails.
* Remove custom exceptions for algorithms and add a compact function for shrinking large outputs.
* Remove unused variables.
* Add warnings and comments.
* Re-enable all autocast unit-tests for detection and mark the tests as skipped on partial validation.
* Move the test skip to the end.
* Change the warning message.
The detection model tests are extremely brittle and break regularly. From past investigations we know that this is caused by floating point errors and the unstable sort in NMS (see here for details). The problem is made worse by the fact that our tests use random input and random model weights.
As a result of the above, our unit-test code contains a large number of hacks/workarounds to handle special cases; it also turns off quite a few tests. This PR removes all previous workarounds in favour of a single one.
We follow the same approach as in the NMS tests:

vision/test/test_ops.py, lines 437 to 442 in dc5d055
More specifically, we try to verify the entire output (full validation), but if this fails we check for duplicate scores that can lead to an unstable sort (partial validation). If all the full validations in a test pass, we mark it as a success. If any of the partial validations fails, we mark it as a failure. Otherwise we mark it as skipped to indicate that the test could only be verified partially.
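As a rough illustration, the partial check could look like the sketch below. The function name, the tolerance, and the exact quantities being compared are assumptions; the real test compares the full model output against the stored expected files and would also cover labels, scores and masks.

```python
import torch

def assert_partial_match(expected_boxes, output_boxes, output_scores, tol=1e-5):
    # Relaxed (partial) check: every box that does not match the expected
    # value must correspond to a tie in the scores, since duplicate scores
    # make the sort inside NMS unstable.
    mismatches = (~torch.isclose(expected_boxes, output_boxes)).any(dim=1).nonzero().flatten()
    for idx in mismatches:
        score = output_scores[idx]
        duplicates = (output_scores - score).abs() < tol
        assert duplicates.sum() > 1, "Mismatch not explained by duplicate scores"
```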
@fmassa proposed an alternative approach which is less aggressive; it's documented in the source code for future reference. Given that this technique is more complex and requires further investigation, I propose to merge this PR to unblock other pieces of work (such as #2954) and revisit it in the near future.
I left some comments in this review for clarification.