Add support for metadata files to `imagefolder` #4069

mariosasko · 2022-03-30T17:47:51Z

This PR adds support for metadata files to imagefolder. A metadata file makes it possible to specify image fields in the loaded dataset other than image and label, which are inferred from the directory structure.

To be parsed as an image metadata file, a file should be named "info.csv" and should have the following structure:

image_id,some_col1_name,some_col2_name
rel/path/to/image1.jpg,image1_col1_value,image1_col2_value
rel/path/to/image2.jpg,image2_col1_value,image2_col2_value 
...

This is how the resolution works:

- path/to/imagefolder/directory
  - info.csv
  - 10.jpg # referenced as 10.jpg in "info.csv"
  - Cat
    - 0.jpg  # referenced as Cat/0.jpg in "info.csv"
    - 1.jpg  # referenced as Cat/1.jpg in "info.csv"
  - Dog
    - 0.jpg  # referenced as Dog/0.jpg in "info.csv"
    - 1.jpg  # referenced as Dog/1.jpg in "info.csv"
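A minimal sketch of the resolution described above, not the code in this PR: it parses an "info.csv" payload and looks up the row for an image by its path relative to the imagefolder root. The helper names `read_image_metadata` and `metadata_for` and the `caption` column are hypothetical and only serve the illustration.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical "info.csv" content; the `caption` column is made up for the example.
INFO_CSV = (
    "image_id,caption\n"
    "10.jpg,a stray image\n"
    "Cat/0.jpg,a sleeping cat\n"
    "Dog/1.jpg,a running dog\n"
)
ROOT = "path/to/imagefolder/directory"


def read_image_metadata(csv_text):
    """Map each relative image path (the `image_id` column) to its other columns."""
    metadata = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        image_id = row.pop("image_id")
        metadata[PurePosixPath(image_id).as_posix()] = row
    return metadata


def metadata_for(image_path, root, metadata):
    """Return the metadata row for an image, keyed by its path relative to `root`."""
    relative = PurePosixPath(image_path).relative_to(root).as_posix()
    return metadata.get(relative, {})


metadata = read_image_metadata(INFO_CSV)
print(metadata_for(f"{ROOT}/Cat/0.jpg", ROOT, metadata))
```

Images without a row in the file (Cat/1.jpg here) simply get an empty dictionary in this sketch; the actual loader may treat that case differently.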

Open questions:

- IMO it makes more sense to store image metadata as JSON Lines than as CSV. CSV is sufficient for textual metadata but not the best for representing bounding boxes, for instance. Also, JSON Lines is stricter, which is good in this case (CSV supports various delimiters, the header line is optional, etc., so it's easier to enforce rules on JSON Lines than on CSV).
- Is there a better name for the image_id column, which contains image identifiers? Maybe image_file or image_filename?
- WDYT about making with_metadata=True the default behavior if the loaded repo/directory contains an info.csv file?
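To make the first question concrete, here is a small sketch using only the standard library. It assumes a hypothetical `bboxes` field holding a list of boxes. With JSON Lines the nested list round-trips as-is, while with CSV it has to be serialized into a string cell and decoded again by the reader.

```python
import csv
import io
import json

# Hypothetical record: the `bboxes` field is made up for the example.
record = {"image_id": "Cat/0.jpg", "bboxes": [[10, 20, 110, 220], [5, 5, 50, 60]]}

# JSON Lines: one JSON object per line, nested values keep their structure.
jsonl_line = json.dumps(record)
decoded = json.loads(jsonl_line)

# CSV: every cell is a string, so the nested list must be encoded by hand.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["image_id", "bboxes"])
writer.writeheader()
writer.writerow({"image_id": record["image_id"], "bboxes": json.dumps(record["bboxes"])})
row = next(csv.DictReader(io.StringIO(buffer.getvalue())))

print(type(decoded["bboxes"]).__name__)  # nested list preserved
print(type(row["bboxes"]).__name__)      # plain string that still needs decoding
```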

An example repository: https://huggingface.co/datasets/mariosasko/PetImages. It can be loaded by installing datasets from the PR branch and running load_dataset("mariosasko/PetImages", with_metadata=True).

cc: @abhishekkrthakur (this PR should address https://huggingface.slack.com/archives/C02JB9L6JKF/p1645450017434029?thread_ts=1645157416.389499&cid=C02JB9L6JKF)

TODOs:

Test

Metadata file nesting

 - path/to/imagefolder/directory
   - info.csv
   - 10.jpg
   - Cat
     - info.csv  # should have higher precedence in this directory than the top-level info.csv, but we choose the first "eligible" metadata file currently
     - 0.jpg
     - 1.jpg
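One way to get the precedence described in this TODO is to pick, among the metadata files located in the image's directory or one of its ancestors, the one with the most path segments. The sketch below illustrates that idea under those assumptions; `closest_metadata_file` is a hypothetical helper, not the function used in imagefolder.py.

```python
from pathlib import PurePosixPath


def closest_metadata_file(image_path, metadata_paths):
    """Return the metadata file nearest to the image, or None if none is eligible."""
    image_dir = PurePosixPath(image_path).parent
    candidates = []
    for path in metadata_paths:
        metadata_dir = PurePosixPath(path).parent
        # Eligible only if it sits in the image's directory or one of its ancestors.
        if metadata_dir == image_dir or metadata_dir in image_dir.parents:
            candidates.append((len(metadata_dir.parts), path))
    if not candidates:
        return None
    # More path segments means a deeper, and therefore closer, directory.
    return max(candidates, key=lambda candidate: candidate[0])[1]


metadata_paths = ["info.csv", "Cat/info.csv"]
print(closest_metadata_file("Cat/0.jpg", metadata_paths))
print(closest_metadata_file("10.jpg", metadata_paths))
```

With this rule Cat/0.jpg resolves to Cat/info.csv, while 10.jpg falls back to the top-level info.csv.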

HuggingFaceDocBuilderDev · 2022-03-30T17:56:05Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq · 2022-04-01T12:09:08Z

Love it !

+1 to using JSON Lines rather than CSV. I've also seen image datasets for which JSON Lines was used.

A file_name column sounds good as well, and it means we could reuse the same name for audio. And ok to check the metadata file by default :)

You suggested to name the file infos.json - since we already have a datasets_infos.json file, maybe it would be nice to have a name for the metadata/annotations that doesn't contain "info" ? (e.g. metadata.json, annotations.json, labels.json)

mariosasko · 2022-04-05T11:24:53Z

@lhoestq I've addressed your comments and my TODOs. Additionally, I've updated encode_nested_example/decode_nested_example to support null values in place of a dictionary (if it's not top-level) since JSON Lines also supports this.
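To illustrate what a null in place of a nested dictionary looks like in JSON Lines, here is a rough sketch. It does not reproduce encode_nested_example/decode_nested_example; `cast_nested`, the schema layout, and the `bbox` field are hypothetical stand-ins, and the `file_name` column is assumed from the naming discussion above.

```python
import json

# Two hypothetical metadata lines; the second has null instead of a nested dict.
JSONL = "\n".join(
    [
        '{"file_name": "Cat/0.jpg", "bbox": {"x": 10, "y": 20}}',
        '{"file_name": "Cat/1.jpg", "bbox": null}',
    ]
)

# Hypothetical schema: leaves are callables that cast a raw value.
SCHEMA = {"file_name": str, "bbox": {"x": float, "y": float}}


def cast_nested(schema, value):
    """Cast `value` to the leaf types in `schema`, passing None through unchanged."""
    if value is None:
        return None
    if isinstance(schema, dict):
        return {key: cast_nested(sub_schema, value[key]) for key, sub_schema in schema.items()}
    return schema(value)


examples = [cast_nested(SCHEMA, json.loads(line)) for line in JSONL.splitlines()]
print(examples[1])
```

In this sketch the caller always passes a dictionary at the top level, so the None shortcut only ever applies to nested values.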

lhoestq

Cool thanks !

Since the script's complexity has increased, maybe you can try to improve readability here and there. I added a few suggestions in case they help.

src/datasets/packaged_modules/imagefolder/imagefolder.py

tests/test_packaged_modules.py

src/datasets/packaged_modules/imagefolder/imagefolder.py

lhoestq · 2022-04-08T13:11:17Z

src/datasets/packaged_modules/imagefolder/imagefolder.py

+                                        downloaded_metadata_file,
+                                    )
+                                    for metadata_file, downloaded_metadata_file in metadata_files
+                                    if metadata_file is None


If I understand correctly, checking that it is None here means that it's a single file, not coming from an archive.
I feel like it's a bit hard to understand that. Maybe separating single files from files from archives more explicitly would make the code more readable.
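A possible shape for that separation, sketched under the reading given in this comment: each entry is a `(metadata_file, downloaded_metadata_file)` pair, and `metadata_file` is None for a single file that does not come from an archive. The helper `split_by_origin` and the example paths are hypothetical.

```python
def split_by_origin(metadata_files):
    """Split (metadata_file, downloaded_metadata_file) pairs by where they come from."""
    single_files = []
    files_from_archives = []
    for metadata_file, downloaded_metadata_file in metadata_files:
        if metadata_file is None:
            # Assumption: None marks a standalone file, not a member of an archive.
            single_files.append(downloaded_metadata_file)
        else:
            files_from_archives.append((metadata_file, downloaded_metadata_file))
    return single_files, files_from_archives


# Hypothetical paths, only for illustration.
metadata_files = [
    (None, "/cache/downloads/abc123"),
    ("train/metadata.jsonl", "/cache/extracted/def456/train/metadata.jsonl"),
]
single_files, files_from_archives = split_by_origin(metadata_files)
print(single_files)
print(files_from_archives)
```

Naming the two groups up front keeps the `is None` check in one place instead of repeating it inside each comprehension.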

lhoestq

Cool thanks !

I think we can add some tests to make sure that

- it fails when several metadata files don't match
- it maps the correct metadata to the correct file (in an archive or not)
- it has the correct behavior based on the directory structure, even if it has nested directories or archives, and depending on the location of the metadata files

I know you're also working on other things at the same time, so let me know if I can help writing the tests

src/datasets/packaged_modules/imagefolder/imagefolder.py

Co-authored-by: Quentin Lhoest <[email protected]>

mariosasko · 2022-04-14T15:13:15Z

@lhoestq Sure, feel free to add more tests if you have the time.

lhoestq · 2022-04-22T16:16:25Z

I created a dedicated test file for imagefolder, moved some existing tests there from test_packaged_modules.py, and added an end-to-end test of imagefolder with metadata. I tested for train split only, and for two splits train and test.

Let me know if the test looks ok to you. I'll add similar tests for the other structures we support on Tuesday.

lhoestq · 2022-04-22T16:16:59Z

tests/packaged_modules/test_imagefolder.py

+
+
+@pytest.mark.parametrize("n_splits", [1, 2])
+def test_data_files_with_metadata(


this is the test I added

mariosasko · 2022-04-22T17:05:42Z

Thanks a lot for working on this! The test looks great :).

lhoestq · 2022-04-29T16:39:43Z

Added a test for archives. Will also add a test for when the metadata file is not named correctly, and see if we can raise an informative error.

lhoestq

I'm done with the tests I wanted to add @mariosasko :)

Feel free to add more if you want, otherwise I think we can merge

mariosasko added 3 commits March 30, 2022 18:47

Add support for metadata files to imagefolder

d895abf

Fix imagefolder drop_labels test

488eaa4

Merge branch 'master' of github.com:huggingface/datasets into imagefo…

0e20904

…lder-metadata

mariosasko marked this pull request as draft March 30, 2022 17:50

polinaeterna mentioned this pull request Mar 31, 2022

Add Audio Folder #3963

Closed

mariosasko added 7 commits April 4, 2022 14:29

Replace csv with jsonl

1a818b6

Add test

21c86b8

Merge branch 'master' of github.com:huggingface/datasets into imagefo…

f8a4ada

…lder-metadata

Correct resolution for nested metadata files

b7e17d4

Allow None as JSON Lines value

de58c5c

Add comments

67e5529

Count path segments

98e0a7e

mariosasko marked this pull request as ready for review April 5, 2022 11:15

lhoestq reviewed Apr 8, 2022

mariosasko added 3 commits April 8, 2022 17:53

Merge branch 'master' of github.com:huggingface/datasets into imagefo…

c094098

…lder-metadata

Address comments

99c9f48

Improve test

ceb4a2f

lhoestq reviewed Apr 11, 2022

This was referenced Apr 12, 2022

Fix splits in local packaged modules, local datasets without script and hub datasets without script #4144

Merged

load_dataset for winoground returning decoding error #4149

Closed

Update src/datasets/packaged_modules/imagefolder/imagefolder.py

8e7177f

Co-authored-by: Quentin Lhoest <[email protected]>

lhoestq added 2 commits April 22, 2022 17:16

Merge branch 'master' into imagefolder-metadata

4e6c8b7

test e2e imagefolder with metadata

d12dfb7

lhoestq reviewed Apr 22, 2022

lhoestq added 2 commits April 29, 2022 17:26

add test for zip archives

ca25da8

fix test

00c8f86

lhoestq added 4 commits May 2, 2022 12:17

add some debug logging to know which files are ignored

b391a44

add test for bad/malformed metadata file

010f798

revert use of posix path to fix windows tests

9aeb5a7

style

9ccf693

lhoestq approved these changes May 2, 2022

Refactor tests for packaged modules Text and Csv

3ac31ef

mariosasko merged commit 7017b09 into master May 3, 2022

mariosasko deleted the imagefolder-metadata branch May 3, 2022 12:42
