Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for metadata files to imagefolder #4069

Merged
merged 23 commits into from
May 3, 2022
Merged

Conversation

mariosasko
Copy link
Collaborator

@mariosasko mariosasko commented Mar 30, 2022

This PR adds support for metadata files to imagefolder to add an ability to specify image fields other than image and label, which are inferred from the directory structure in the loaded dataset.

To be parsed as an image metadata file, a file should be named "info.csv" and should have the following structure:

image_id,some_col1_name,some_col2_name
rel/path/to/image1.jpg,image1_col1_value,image1_col2_value
rel/path/to/image2.jpg,image2_col1_value,image2_col2_value 
...

This is how the resolution works:

- path/to/imagefolder/directory
  - info.csv
  - 10.jpg # referenced as 10.jpg in "info.csv"
  - Cat
    - 0.jpg  # referenced as Cat/0.jpg in "info.csv"
    - 1.jpg  # referenced as Cat/1.jpg in "info.csv"
  - Dog
    - 0.jpg  # referenced as Dog/0.jpg in "info.csv"
    - 1.jpg  # referenced as Dog/1.jpg in "info.csv"

Open questions:

  1. IMO it makes more sense to store image metadata as JSON Lines than CSV. CSV is sufficient for textual metadata but not the best for representing bounding boxes, for instance. Also, JSON Lines is more strict, which is good in this case (CSV supports various delimiters, the header line is optional, etc., so it's easier to enforce rules on JSON Lines that it's on CSV)
  2. A better name for the image_id column, which contains image identifiers? Maybe image_file or image_filename?
  3. WDYT about making with_metadata=True the default behavior if the loaded repo/directory contains an info.csv file?

An example repository: https://huggingface.co/datasets/mariosasko/PetImages. Can be loaded by installing datasets from the PR branch and running load_dataset("mariosasko/PetImages", with_metadata=True).

cc: @abhishekkrthakur (this PR should address https://huggingface.slack.com/archives/C02JB9L6JKF/p1645450017434029?thread_ts=1645157416.389499&cid=C02JB9L6JKF)

TODOs:

  • Test
  • Metadata file nesting
     - path/to/imagefolder/directory
       - info.csv
       - 10.jpg
       - Cat
         - info.csv  # should have higher precedence in this directory than the top-level info.csv, but we choose the first "eligible" metadata file currently
         - 0.jpg
         - 1.jpg
    

@mariosasko mariosasko marked this pull request as draft March 30, 2022 17:50
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Mar 30, 2022

The documentation is not available anymore as the PR was closed or merged.

@polinaeterna polinaeterna mentioned this pull request Mar 31, 2022
@lhoestq
Copy link
Member

lhoestq commented Apr 1, 2022

Love it !

+1 to using JSON Lines rather than CSV. I've also seen image datasets for which JSON Lines was used.

A file_name column sounds good as well, and it means we could reuse the same name for audio. And ok to check the metadata file by default :)

You suggested to name the file infos.json - since we already have a datasets_infos.json file, maybe it would be nice to have a name for the metadata/annotations that doesn't contain "info" ? (e.g. metadata.json, annotations.json, labels.json)

@mariosasko mariosasko marked this pull request as ready for review April 5, 2022 11:15
@mariosasko
Copy link
Collaborator Author

@lhoestq I've addressed your comments and my TODOs. Additionally, I've updated encode_nested_example/decode_nested_example to support null values in place of a dictionary (if it's not top-level) since JSON Lines also supports this.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool thanks !

Since the script's complexity increased, maybe you can try to improve readability here and there, I added a few suggestions if it can help

src/datasets/packaged_modules/imagefolder/imagefolder.py Outdated Show resolved Hide resolved
src/datasets/packaged_modules/imagefolder/imagefolder.py Outdated Show resolved Hide resolved
tests/test_packaged_modules.py Outdated Show resolved Hide resolved
src/datasets/packaged_modules/imagefolder/imagefolder.py Outdated Show resolved Hide resolved
downloaded_metadata_file,
)
for metadata_file, downloaded_metadata_file in metadata_files
if metadata_file is None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, checking that it is None here means that it's a single file, not coming from an archive.
I feel like it's a bit hard to understand that. Maybe separating single files from files from archives more explicitly would make the code more readable.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool thanks !

I think we can add some tests to make sure that

  • it fails when several metadata files don't match
  • it maps the correct metadata to the correct file (in an archive or not)
  • it has the correct behavior based on the directory structure, even if it has nested directories or archives, and depending on the location of the metadata files

I know you're also working on other things at the same time, so let me know if I can help writing the tests

src/datasets/packaged_modules/imagefolder/imagefolder.py Outdated Show resolved Hide resolved
@mariosasko
Copy link
Collaborator Author

@lhoestq Sure, feel free to add more tests if you have the time.

@lhoestq
Copy link
Member

lhoestq commented Apr 22, 2022

I created a dedicated test file for imagefolder, moved some existing tests there from test_packaged_modules.py, and added an end-to-end test of imagefolder with metadata. I tested for train split only, and for two splits train and test.

Let me know if the test looks ok to you. I'll add similar tests but with the other structures we support on tuesday



@pytest.mark.parametrize("n_splits", [1, 2])
def test_data_files_with_metadata(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the test I added

@mariosasko
Copy link
Collaborator Author

Thanks a lot for working on this! The test looks great :).

@lhoestq
Copy link
Member

lhoestq commented Apr 29, 2022

Added a test for archives. Will also add a test when the metadata file is not named correctly, and see if we can raise an informative error

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm done with the tests I wanted to add @mariosasko :)

Feel free to add more if you want, otherwise I think we can merge

@mariosasko mariosasko merged commit 7017b09 into master May 3, 2022
@mariosasko mariosasko deleted the imagefolder-metadata branch May 3, 2022 12:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants