Add more dataset format detectors #536
Conversation
> Requirement-placing methods that use this to verify their arguments should raise a FormatRequirementsUnmet rather than a "hard" error like AssertionError if False is returned. The reason is that the path passed by the detector might not have been hardcoded, and instead might have been acquired from another file in the dataset. In that case, an invalid pattern signifies a problem with the dataset, not with the detector.
Maybe we need to raise an invalid requirement error in such cases?
What do you mean by that? A new exception type? How would it be handled?
Raising `FormatRequirementsUnmet` doesn't seem correct when incorrect requirements are specified. But I agree that the whole detection process shouldn't be interrupted in such cases. So, probably, just another error can be introduced. I suppose we can come up with:

    DatasetDetectionError
    ^- FormatRequirementsUnmet
    ^- InvalidRequirement

And just catch and return `DatasetDetectionError`s.
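Roughly, something like this (just a sketch; the `run_detector` helper is only for illustration):

```python
class DatasetDetectionError(Exception):
    """Base class for errors raised while detecting a dataset's format."""


class FormatRequirementsUnmet(DatasetDetectionError):
    """The dataset does not satisfy the format's requirements."""


class InvalidRequirement(DatasetDetectionError):
    """The detector specified an ill-formed requirement."""


def run_detector(detector, context):
    try:
        return detector(context)
    except DatasetDetectionError as e:
        # Catching the common base lets detection of other formats continue.
        return e
```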
> So, probably, just another error can be introduced.

That will cause a problem later with the addition of alternatives. If one alternative raises a `FormatRequirementsUnmet`, and another raises an `InvalidRequirement`, then what exception should the detector as a whole raise?
Furthermore, I predict that "incorrect requirements" will, in practice, be caused by the dataset not meeting other format requirements (like in the scenario explained in the comment: when a value is read from a file that's supposed to be a path, but isn't), and therefore it's actually reasonable to report them as unmet requirements.
The only other possible cause that I can see would be an invalid requirement that is hardcoded into the detector, but such a detector will always fail, so an error like that would be caught in testing and corrected.
> That will cause a problem later with the addition of alternatives. If one alternative raises a `FormatRequirementsUnmet`, and another raises an `InvalidRequirement`, then what exception should the detector as a whole raise?
I don't think I understand how you want to implement alternatives, but until they are implemented, the question doesn't really apply. From my perspective, we would iterate over the alternatives and collect a list of errors; if one of them matches, its confidence is returned.
Maybe, if there are incorrect requirements, we should stop checking the format altogether: print a debug log message or a warning and return the NONE confidence. I expect such situations to happen only in custom plugins or during development, so maybe we can just fail, actually.
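To illustrate the iteration I have in mind (just a sketch, reusing the exception names from the hierarchy above; nothing here is final):

```python
# Stub definitions so the sketch is self-contained; see the hierarchy above.
class DatasetDetectionError(Exception): ...
class FormatRequirementsUnmet(DatasetDetectionError): ...


def try_alternatives(alternatives, context):
    errors = []
    for alternative in alternatives:
        try:
            # The first matching alternative determines the confidence.
            return alternative(context)
        except DatasetDetectionError as e:
            errors.append(e)
    # No alternative matched: report everything that was collected.
    raise FormatRequirementsUnmet(errors)
```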
Currently, such requirements are described by a single requirement string (the one specified in the call to `probe_text_file`). I considered making it so that you could specify a separate requirement string for each test done in the prober context (e.g. "must be an XML file"; "must have `annotations` as the root element"), but that seems cumbersome to use and not terribly important. If needed, this functionality could be added later (for example, we could add a method on the context that will tell it to use a more specific message for the next exception thrown).
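For reference, this is roughly how the single requirement string is used today (the file name below is a placeholder, not taken from a real detector):

```python
from xml.etree import ElementTree


def detect_example_xml_format(context):
    # One requirement string covers every check made inside the probe.
    with context.probe_text_file(
        'annotations.xml',
        "must be an XML file with an 'annotations' root element",
    ) as f:
        root = ElementTree.parse(f).getroot()
        if root.tag != 'annotations':
            # Any exception raised here is reported using the single
            # requirement string passed to probe_text_file above.
            raise Exception()
```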
Unfortunately, JSON files can't really be iteratively parsed (because object keys can be stored in any order), so when we need to probe the contents of such files, we have to parse the entire file.
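So probing a JSON file ends up looking something like this (the file name and key are placeholders):

```python
import json


def detect_example_json_format(context):
    with context.probe_text_file(
        'annotations.json',
        "must be a JSON file with a 'categories' key",
    ) as f:
        # No streaming shortcut: object keys can appear in any order, so
        # the whole file is parsed before the check.
        contents = json.load(f)
        if 'categories' not in contents:
            raise Exception()
```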
It's more descriptive.
Previously the subtests in `test_can_parse` checked what happens if a revpath with no format specified was ambiguous due to multiple detected formats. However, since the addition of a precise detector for the Datumaro format, that revpath is not ambiguous anymore. Add a separate test that uses a dataset that deliberately mixes annotations from different formats.
Let's merge after the mapillary problem is resolved.
Summary
Add detectors for a few more dataset formats:
- `ade20k2017`
- `ade20k2020`
- `cvat`
- `datumaro`
- `icdar_text_segmentation`
- `icdar_word_recognition`
- `kitti_raw`
- `label_me`
- `mot_seq`
- `yolo`
To support them, add a mechanism for placing requirements on file contents.
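A rough sketch of how detectors are meant to use that mechanism (`probe_text_file` and `FormatRequirementsUnmet` are the names discussed in the review; `require_file` and the bodies below are simplified illustrations, and the actual implementation may differ in details):

```python
import contextlib
import glob
import os


class FormatRequirementsUnmet(Exception):
    """Raised when the dataset does not satisfy a detector's requirements."""


class FormatDetectionContext:
    def __init__(self, root_path):
        self._root_path = root_path

    def fail(self, requirement_desc):
        # A "soft" failure: this format doesn't match, but detection of
        # other formats can continue.
        raise FormatRequirementsUnmet(requirement_desc)

    def require_file(self, pattern):
        # The pattern may have been read from another file in the dataset,
        # so a mismatch signals a problem with the dataset, not the detector.
        matches = glob.glob(os.path.join(glob.escape(self._root_path), pattern))
        if not matches:
            self.fail(f"a file matching '{pattern}' must exist")
        return matches[0]

    @contextlib.contextmanager
    def probe_text_file(self, path, requirement_desc):
        # Any exception raised while probing the file is reported as the
        # given requirement being unmet.
        try:
            with open(os.path.join(self._root_path, path), encoding='utf-8') as f:
                yield f
        except FormatRequirementsUnmet:
            raise
        except Exception:
            self.fail(f"{path}: {requirement_desc}")
```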
How to test
Checklist
- I submit my changes into the `develop` branch

License

- Feel free to contact the maintainers if that's a concern.