Fix flakiness on detection tests #2966
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #2966   +/-   ##
=======================================
  Coverage   73.43%   73.43%
=======================================
  Files          99       99
  Lines        8813     8813
  Branches     1391     1391
=======================================
  Hits         6472     6472
  Misses       1916     1916
  Partials      425      425

Continue to review full report at Codecov.
if elements_per_sample > 30:
    return compute_mean_std(tensor)
else:
    return subsample_tensor(tensor)
Some tensors are too large to store in the expected files (for example masks). If the input has more than 30 elements per record, we compare its mean and standard deviation instead.
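A minimal sketch of what the compaction helper referenced above could look like; the exact return format is an assumption for illustration.

```python
import torch

def compute_mean_std(tensor):
    # Large outputs (e.g. masks) would bloat the expected files, so we only
    # keep and compare summary statistics for them.
    mean = torch.mean(tensor)
    std = torch.std(tensor)
    return {"mean": mean, "std": std}
```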
  ith_index = num_elems // num_samples
- return flat_tensor[ith_index - 1::ith_index]
+ return tensor[ith_index - 1::ith_index]
We no longer flatten the tensor before subsampling. This is useful when we want to maintain the structure of the boxes.
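For context, a sketch of a subsampling helper consistent with the diff above; the `num_samples` default and the early-return guard are assumptions not shown in the snippet.

```python
def subsample_tensor(tensor, num_samples=20):
    # Subsample along the first dimension without flattening, so structured
    # outputs such as boxes of shape [N, 4] keep their trailing dimensions.
    num_elems = tensor.size(0)
    if num_elems <= num_samples:
        return tensor
    ith_index = num_elems // num_samples
    return tensor[ith_index - 1::ith_index]
```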
# and then using the Hungarian algorithm as in DETR to find the
# best match between output and expected boxes and eliminate some
# of the flakiness. Worth exploring.
return False  # Partial validation performed
If we reached this point, we partially validated the results. Return False so that we can flag this accordingly.
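Schematically, the contract of `check_out` can be summarized as below. The signature and the use of `torch.testing.assert_allclose` are stand-ins for the actual expected-file comparison in the test suite; only the True/False return convention matters here.

```python
import torch

def check_out(out, expected, prec=1e-3):
    # Sketch of the validation contract: return True when the strict
    # comparison succeeded, False when the relaxed fallback had to be used.
    try:
        torch.testing.assert_allclose(out, expected, rtol=prec, atol=prec)
        return True  # Full validation succeeded
    except AssertionError:
        # The relaxed duplicate-score check would run here before deciding
        # whether the mismatch is acceptable.
        return False  # Partial validation performed
```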
- check_out(out)
+ full_validation &= check_out(out)

if not full_validation:
If any of the validations was only partial, flag the test as skipped and raise a warning. It's better to flag it as skipped than to mark it as green.
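A sketch of how that could look with unittest; the helper name and the warning text are illustrative, not the exact ones used in the PR.

```python
import unittest
import warnings

def handle_partial_validation(full_validation, test_name):
    # If any of the checks fell back to partial validation, warn and skip
    # the test instead of reporting it green.
    if not full_validation:
        msg = ("The output of {} could only be partially validated. "
               "This is likely due to unit-test flakiness.".format(test_name))
        warnings.warn(msg, RuntimeWarning)
        raise unittest.SkipTest(msg)
```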
  torch.save(output, expected_file)
  MAX_PICKLE_SIZE = 50 * 1000  # 50 KB
  binary_size = os.path.getsize(expected_file)
- self.assertTrue(binary_size <= MAX_PICKLE_SIZE)
+ if binary_size > MAX_PICKLE_SIZE:
+     raise RuntimeError("The output for {}, is larger than 50kb".format(filename))
Do a manual check instead of an assertion. We don't want to throw an AssertionError that will be caught by the surrounding try block.
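To make the reasoning concrete, here is a sketch with a hypothetical wrapper function around the snippet above:

```python
import os
import torch

MAX_PICKLE_SIZE = 50 * 1000  # 50 KB

def save_expected(output, expected_file, filename):
    # The size check raises a RuntimeError on purpose: an AssertionError
    # (e.g. from self.assertTrue) would be swallowed by the
    # `except AssertionError:` clause that wraps the expected-file
    # comparison, silently turning an oversized file into a "partial
    # validation" instead of a hard failure.
    torch.save(output, expected_file)
    binary_size = os.path.getsize(expected_file)
    if binary_size > MAX_PICKLE_SIZE:
        raise RuntimeError("The output for {}, is larger than 50kb".format(filename))
```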
@@ -148,10 +145,9 @@ def _test_detection_model(self, name, dev):
      set_rng_seed(0)
      kwargs = {}
      if "retinanet" in name:
-         kwargs["score_thresh"] = 0.013
+         # Reduce the default threshold to ensure the returned boxes are not empty.
+         kwargs["score_thresh"] = 0.01
Use a threshold that will produce many more boxes and make things more interesting.
Looks great to me, thanks a lot Vasilis!
# in NMS. If matching across all outputs fails, use the same approach
# as in NMSTester.test_nms_cuda to see if this is caused by duplicate
# scores.
expected_file = self._get_expected_file(strip_suffix=strip_suffix)
Note for the future: we might move away from our custom base Tester in favor of the PyTorch one.
* Simplify the ACCEPT=True logic in assertExpected().
* Separate the expected filename estimation from assertExpected.
* Unflatten expected values.
* Assert for duplicate scores if the primary check fails.
* Remove custom exceptions for algorithms and add a compact function for shrinking large outputs.
* Remove unused variables.
* Add warnings and comments.
* Re-enable all autocast unit-tests for detection and mark the tests as skipped on partial validation.
* Move the test skip to the end.
* Change the warning message.
The detection model tests are extremely brittle and break regularly. From past investigations we know that this is caused by floating point errors and the unstable sort in NMS (see here for details). The problem is made worse by the fact that our tests use random input and random model weights.
As a result of the above, our unit-test code contains a large number of hacks/workarounds to handle special cases; it also turns off quite a few tests. This PR removes all previous workarounds in favour of a single one.
We follow the same approach as in the NMS tests:

vision/test/test_ops.py, lines 437 to 442 in dc5d055
More specifically, we try to verify the entire output (full validation), but if this fails we check for duplicate scores that can lead to an unstable sort (partial validation). If all the full validations in a test pass, we mark it as a success. If any of the partial validations fails, we mark it as a failure. Otherwise we mark it as skipped to indicate that the test could only be verified partially.
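As a rough illustration, the partial check could look like the sketch below. The function name, the tolerance, and the exact quantities being compared are assumptions; the real test compares the full model output against the stored expected files and would also cover labels, scores and masks.

```python
import torch

def assert_partial_match(expected_boxes, output_boxes, output_scores, tol=1e-5):
    # Relaxed (partial) check: every box that does not match the expected
    # value must correspond to a tie in the scores, since duplicate scores
    # make the sort inside NMS unstable.
    mismatches = (~torch.isclose(expected_boxes, output_boxes)).any(dim=1).nonzero().flatten()
    for idx in mismatches:
        score = output_scores[idx]
        duplicates = (output_scores - score).abs() < tol
        assert duplicates.sum() > 1, "Mismatch not explained by duplicate scores"
```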
@fmassa proposed an alternative approach which is less aggressive; it's documented in the source code for future reference. Given that this technique is more complex and requires further investigation, I propose to merge this PR to unblock other pieces of work (such as #2954) and revisit it in the near future.
I left some comments in this review for clarification.