Add support for metadata files to imagefolder
#4069
The documentation is not available anymore as the PR was closed or merged.
Love it ! +1 to using JSON Lines rather than CSV. I've also seen image datasets for which JSON Lines was used. You suggested naming the file infos.json - since we already have a datasets_infos.json file, maybe it would be nice to have a name for the metadata/annotations that doesn't contain "info" ? (e.g. metadata.json, annotations.json, labels.json)
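To make the JSON Lines suggestion concrete, here is a minimal sketch of reading such a metadata file with the standard library only. The `image_id` key follows the column name discussed in this PR; the `caption` field and the sample rows are made up for illustration and are not part of the PR.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical metadata rows: the "caption" field is illustrative only.
sample = io.StringIO(
    '{"image_id": "cat/1.jpg", "caption": "a cat, sitting"}\n'
    '{"image_id": "dog/2.jpg", "caption": "a dog"}\n'
)
records = list(read_jsonl(sample))
```

One practical difference from CSV is that values containing commas, or nested values, need no extra quoting rules, since every line is a complete JSON object.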
@lhoestq I've addressed your comments and my TODOs. Additionally, I've updated
Cool thanks !
Since the script's complexity increased, maybe you can try to improve readability here and there. I added a few suggestions in case they help
        downloaded_metadata_file,
    )
    for metadata_file, downloaded_metadata_file in metadata_files
    if metadata_file is None
If I understand correctly, checking that it is None here means that it's a single file, not coming from an archive.
I feel like it's a bit hard to understand that. Maybe separating single files from files from archives more explicitly would make the code more readable.
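One way to make that distinction explicit is sketched below. It assumes, following the reading in this comment, that each entry is a `(metadata_file, downloaded_metadata_file)` pair and that `metadata_file` is `None` for a standalone file; the helper name `split_metadata_files` and the sample paths are hypothetical and not taken from the PR.

```python
from typing import List, Optional, Tuple

# Assumption: (metadata_file, downloaded_metadata_file), where
# metadata_file is None when the file does not come from an archive.
MetadataEntry = Tuple[Optional[str], str]


def split_metadata_files(
    metadata_files: List[MetadataEntry],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Separate standalone metadata files from those found in archives."""
    standalone = [
        downloaded for original, downloaded in metadata_files if original is None
    ]
    from_archives = [
        (original, downloaded)
        for original, downloaded in metadata_files
        if original is not None
    ]
    return standalone, from_archives


# Hypothetical entries, for illustration only.
entries = [(None, "/cache/a/info.csv"), ("archive.zip", "/cache/b/info.csv")]
standalone, from_archives = split_metadata_files(entries)
```

With the two groups named, later code can branch on the group rather than on a `None` check whose meaning has to be inferred.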
Cool thanks !
I think we can add some tests to make sure that
- it fails when several metadata files don't match
- it maps the correct metadata to the correct file (in an archive or not)
- it has the correct behavior based on the directory structure, even if it has nested directories or archives, and depending on the location of the metadata files
I know you're also working on other things at the same time, so let me know if I can help writing the tests
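As a rough illustration of the second bullet, plus one related failure case, the sketch below tests a stand-in helper. `attach_metadata` is hypothetical and is not the loader's real code; the `image_id` key follows the column name used in this PR, and the tests are written as plain functions that pytest would collect.

```python
def attach_metadata(image_paths, metadata_rows, key="image_id"):
    """Hypothetical helper: pair each image path with its metadata row."""
    by_id = {row[key]: row for row in metadata_rows}
    missing = [path for path in image_paths if path not in by_id]
    if missing:
        raise ValueError(f"No metadata found for: {missing}")
    return [(path, by_id[path]) for path in image_paths]


def test_maps_metadata_to_matching_file():
    # Rows are deliberately out of order relative to the image paths.
    rows = [
        {"image_id": "b.jpg", "caption": "B"},
        {"image_id": "a.jpg", "caption": "A"},
    ]
    paired = attach_metadata(["a.jpg", "b.jpg"], rows)
    assert [row["caption"] for _, row in paired] == ["A", "B"]


def test_fails_when_metadata_is_missing():
    try:
        attach_metadata(["a.jpg"], [{"image_id": "z.jpg"}])
    except ValueError:
        return
    raise AssertionError("expected a ValueError for the unmatched image")
```

The real tests would go through the loader and cover nested directories and archives; this only shows the shape of the assertions.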
Co-authored-by: Quentin Lhoest <[email protected]>
@lhoestq Sure, feel free to add more tests if you have the time.
I created a dedicated test file for Let me know if the test looks ok to you. I'll add similar tests but with the other structures we support on Tuesday
@pytest.mark.parametrize("n_splits", [1, 2])
def test_data_files_with_metadata(
this is the test I added
Thanks a lot for working on this! The test looks great :).
Added a test for archives. Will also add a test when the metadata file is not named correctly, and see if we can raise an informative error
I'm done with the tests I wanted to add @mariosasko :)
Feel free to add more if you want, otherwise I think we can merge
This PR adds support for metadata files to `imagefolder`, adding the ability to specify image fields other than `image` and `label`, which are inferred from the directory structure in the loaded dataset.
To be parsed as an image metadata file, a file should be named `"info.csv"` and should have the following structure:
This is how the resolution works:
Open questions:
- `image_id` column, which contains image identifiers? Maybe `image_file` or `image_filename`?
- `with_metadata=True` the default behavior if the loaded repo/directory contains an `info.csv` file?
An example repository: https://huggingface.co/datasets/mariosasko/PetImages. It can be loaded by installing `datasets` from the PR branch and running `load_dataset("mariosasko/PetImages", with_metadata=True)`.
cc: @abhishekkrthakur (this PR should address https://huggingface.slack.com/archives/C02JB9L6JKF/p1645450017434029?thread_ts=1645157416.389499&cid=C02JB9L6JKF)
TODOs: