Add more dataset format detectors #536
Conversation
> Requirement-placing methods that use this to verify their arguments should raise a FormatRequirementsUnmet rather than a "hard" error like AssertionError if False is returned. The reason is that the path passed by the detector might not have been hardcoded, and instead might have been acquired from another file in the dataset. In that case, an invalid pattern signifies a problem with the dataset, not with the detector.
Maybe we need to raise an invalid requirement error in such cases?
What do you mean by that? A new exception type? How would it be handled?
Raising `FormatRequirementsUnmet` doesn't seem correct when incorrect requirements are specified. But I agree that the whole detection process shouldn't be interrupted in such cases. So, probably, just another error can be introduced. I suppose we can come up with:

    DatasetDetectionError
    ^- FormatRequirementsUnmet
    ^- InvalidRequirement

And just catch and return `DatasetDetectionError`s.
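Roughly, something like this (just a sketch; the `run_detector` helper is only for illustration):

```python
class DatasetDetectionError(Exception):
    """Base class for errors raised while detecting a dataset's format."""


class FormatRequirementsUnmet(DatasetDetectionError):
    """The dataset does not satisfy the format's requirements."""


class InvalidRequirement(DatasetDetectionError):
    """The detector specified an ill-formed requirement."""


def run_detector(detector, context):
    try:
        return detector(context)
    except DatasetDetectionError as e:
        # Catching the common base lets detection of other formats continue.
        return e
```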
> So, probably, just another error can be introduced.

That will cause a problem later with the addition of alternatives. If one alternative raises a `FormatRequirementsUnmet`, and another raises an `InvalidRequirement`, then what exception should the detector as a whole raise?
Furthermore, I predict that "incorrect requirements" will, in practice, be caused by the dataset not meeting other format requirements (like in the scenario explained in the comment: when a value is read from a file that's supposed to be a path, but isn't), and therefore it's actually reasonable to report them as unmet requirements.
The only other possible cause that I can see would be an invalid requirement that is hardcoded into the detector, but such a detector will always fail, so an error like that would be caught in testing and corrected.
> That will cause a problem later with the addition of alternatives. If one alternative raises a `FormatRequirementsUnmet`, and another raises an `InvalidRequirement`, then what exception should the detector as a whole raise?
I don't think I understand how you want to implement alternatives, but until they are implemented, the question doesn't really apply. From my perspective, we would iterate over the alternatives and collect a list of errors; if one of them matches, its confidence is returned.
Maybe, if there are incorrect requirements, we should stop checking the format altogether: print a debug log message or a warning and return the NONE confidence. I expect such situations to happen only in custom plugins or during development, so maybe we can just fail, actually.
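To illustrate the iteration I have in mind (just a sketch, reusing the exception names from the hierarchy above; nothing here is final):

```python
# Stub definitions so the sketch is self-contained; see the hierarchy above.
class DatasetDetectionError(Exception): ...
class FormatRequirementsUnmet(DatasetDetectionError): ...


def try_alternatives(alternatives, context):
    errors = []
    for alternative in alternatives:
        try:
            # The first matching alternative determines the confidence.
            return alternative(context)
        except DatasetDetectionError as e:
            errors.append(e)
    # No alternative matched: report everything that was collected.
    raise FormatRequirementsUnmet(errors)
```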
Currently, such requirements are described by a single requirement string (the one specified in the call to `probe_text_file`). I considered making it so that you could specify a separate requirement string for each test done in the prober context (e.g. "must be an XML file"; "must have `annotations` as the root element"), but that seems cumbersome to use and not terribly important. If needed, this functionality could be added later (for example, we could add a method on the context that will tell it to use a more specific message for the next exception thrown).
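For reference, this is roughly how the single requirement string is used today (the file name below is a placeholder, not taken from a real detector):

```python
from xml.etree import ElementTree


def detect_example_xml_format(context):
    # One requirement string covers every check made inside the probe.
    with context.probe_text_file(
        'annotations.xml',
        "must be an XML file with an 'annotations' root element",
    ) as f:
        root = ElementTree.parse(f).getroot()
        if root.tag != 'annotations':
            # Any exception raised here is reported using the single
            # requirement string passed to probe_text_file above.
            raise Exception()
```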
Unfortunately, JSON files can't really be iteratively parsed (because object keys can be stored in any order), so when we need to probe the contents of such files, we have to parse the entire file.
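So probing a JSON file ends up looking something like this (the file name and key are placeholders):

```python
import json


def detect_example_json_format(context):
    with context.probe_text_file(
        'annotations.json',
        "must be a JSON file with a 'categories' key",
    ) as f:
        # No streaming shortcut: object keys can appear in any order, so
        # the whole file is parsed before the check.
        contents = json.load(f)
        if 'categories' not in contents:
            raise Exception()
```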
It's more descriptive.
Previously the subtests in `test_can_parse` checked what happens if a revpath with no format specified was ambiguous due to multiple detected formats. However, since the addition of a precise detector for the Datumaro format, that revpath is not ambiguous anymore. Add a separate test that uses a dataset that deliberately mixes annotations from different formats.
Let's merge after the mapillary problem is resolved.
Summary
Add detectors for a few more dataset formats:
- `ade20k2017`
- `ade20k2020`
- `cvat`
- `datumaro`
- `icdar_text_segmentation`
- `icdar_word_recognition`
- `kitti_raw`
- `label_me`
- `mot_seq`
- `yolo`
To support them, add a mechanism for placing requirements on file contents.
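A rough sketch of how detectors are meant to use that mechanism (`probe_text_file` and `FormatRequirementsUnmet` are the names discussed in the review; `require_file` and the bodies below are simplified illustrations, and the actual implementation may differ in details):

```python
import contextlib
import glob
import os


class FormatRequirementsUnmet(Exception):
    """Raised when the dataset does not satisfy a detector's requirements."""


class FormatDetectionContext:
    def __init__(self, root_path):
        self._root_path = root_path

    def fail(self, requirement_desc):
        # A "soft" failure: this format doesn't match, but detection of
        # other formats can continue.
        raise FormatRequirementsUnmet(requirement_desc)

    def require_file(self, pattern):
        # The pattern may have been read from another file in the dataset,
        # so a mismatch signals a problem with the dataset, not the detector.
        matches = glob.glob(os.path.join(glob.escape(self._root_path), pattern))
        if not matches:
            self.fail(f"a file matching '{pattern}' must exist")
        return matches[0]

    @contextlib.contextmanager
    def probe_text_file(self, path, requirement_desc):
        # Any exception raised while probing the file is reported as the
        # given requirement being unmet.
        try:
            with open(os.path.join(self._root_path, path), encoding='utf-8') as f:
                yield f
        except FormatRequirementsUnmet:
            raise
        except Exception:
            self.fail(f"{path}: {requirement_desc}")
```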
How to test
Checklist
- I submit my changes into the `develop` branch

License

- Feel free to contact the maintainers if that's a concern.