Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RVL-CDIP dataset #4050

Merged
merged 38 commits into from
Apr 21, 2022
Merged

Conversation

dnaveenr
Copy link
Contributor

Resolves #2762

Dataset Request : Add RVL-CDIP dataset #2762
This PR adds the RVL-CDIP dataset.

The dataset contains Google Drive link for download and wasn't getting downloaded automatically, so I have provided manual_download_instructions.

  • I have added the dummy_data.zip as well.

Needed inputs on how I can run the real data and the dummy data tests for datasets with manual download ?

Inputs and suggestions for improvement are welcome. Thank you.

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Mar 29, 2022

The documentation is not available anymore as the PR was closed or merged.

Copy link
Collaborator

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've left a few comments.

You can use this URL to avoid manual download: https://drive.google.com/uc?export=download&id=0Bz1dfcnrpXM-MUt4cHNzUEFXcmc

Also, since the data files are TAR archives, we need to implement the streamable (can't use os.path.join) and the non-streamable versions of _generate_examples (see this script for instance). The labels-only data file URL doesn't work for me, so feel free to ask the authors whether they are OK with us hosting the file on the Hub/S3 (to speed up the streamable version). Let me know if you need help implementing this part.

datasets/rvl_cdip/README.md Outdated Show resolved Hide resolved
datasets/rvl_cdip/README.md Outdated Show resolved Hide resolved
datasets/rvl_cdip/README.md Show resolved Hide resolved
datasets/rvl_cdip/README.md Outdated Show resolved Hide resolved
datasets/rvl_cdip/README.md Outdated Show resolved Hide resolved
datasets/rvl_cdip/README.md Outdated Show resolved Hide resolved
datasets/rvl_cdip/README.md Outdated Show resolved Hide resolved
datasets/rvl_cdip/README.md Outdated Show resolved Hide resolved
datasets/rvl_cdip/README.md Outdated Show resolved Hide resolved
datasets/rvl_cdip/README.md Outdated Show resolved Hide resolved
dnaveenr and others added 11 commits March 29, 2022 23:23
License as other for custom license.

Co-authored-by: Mario Šaško <[email protected]>
Add leaderboard details.

Co-authored-by: Mario Šaško <[email protected]>
Add data collection info.

Co-authored-by: Mario Šaško <[email protected]>
Fix links for the license.

Co-authored-by: Mario Šaško <[email protected]>
Add bibtex tag.

Co-authored-by: Mario Šaško <[email protected]>
@dnaveenr
Copy link
Contributor Author

dnaveenr commented Mar 29, 2022

Thanks a lot for inputs. I'll use the URL suggested and check.

we need to implement the streamable (can't use os.path.join) and the non-streamable versions of _generate_examples.

Sure. I will check the reference and try this out, will get back to you if I face any issues.

The labels-only data file URL doesn't work for me, so feel free to ask the authors whether they are OK with us hosting the file on the Hub/S3 (to speed up the streamable version)

Just checked. The author (Adam Harley) has responded positively and allowed us to host the file. Do I share the file with you for hosting it on Hub/S3 ?

@mariosasko
Copy link
Collaborator

Just checked. The author (Adam Harley) has responded positively and allowed us to host the file. Do I share the file with you for hosting it on Hub/S3 ?

Yes, feel free to e-mail me the file. Then I'll create a repo under my namespace and push the file there. We run a GH action on a GH dataset after merging to create its repo on the Hub, so after this PR is merged, I'll push the file to the "official" namespace and update the download link.

@dnaveenr
Copy link
Contributor Author

dnaveenr commented Mar 30, 2022

You can use this URL to avoid manual download: https://drive.google.com/uc?export=download&id=0Bz1dfcnrpXM-MUt4cHNzUEFXcmc

For some reason, the direct download doesn't seem to work for me even with this URL.

Downloading and preparing dataset rvl_cdip/default to ~/.cache/huggingface/datasets/rvl_cdip/default/1.0.0/ea152149e06310d60a9ef3c3020199dd4780bb952a773ba5aac6b57d59f12628...
Downloading data files: 100%|█████████████████████████████████████████████████████| 1/1 [00:00<00:00, 6307.22it/s]
{'rvl-cdip': '~/.cache/huggingface/datasets/downloads/07ef956a33750078d570d76fefe9fed49f7dc32ecf6e872d690de11e66bbe869'}

And this directory does not exist. Am I doing something wrong ?
To verify, I tried using gdown for the above URL, we get the following :

Access denied with the following error:

        Cannot retrieve the public link of the file. You may need to change
        the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

Yes, feel free to e-mail me the file. Then I'll create a repo under my namespace and push the file there. We run a GH action on a GH dataset after merging to create its repo on the Hub, so after this PR is merged, I'll push the file to the "official" namespace and update the download link.

Got it. I've sent you an email with the file. Thank you.

@dnaveenr
Copy link
Contributor Author

dnaveenr commented Apr 4, 2022

Actually this URL works for direct download :
https://drive.google.com/uc?export=download&confirm=pbef&id=0Bz1dfcnrpXM-MUt4cHNzUEFXcmc
Ref : wkentaro/gdown#146 (comment)

I'm working on the streamable versions of _generate_examples as well, will update you regarding this.

@mariosasko
Copy link
Collaborator

Google Drive is a tricky host, and it's easy to exceed daily download quota limits, so if we are allowed to host the rvl-cdip.tar.gz file, I can push it to the Hub.

@dnaveenr
Copy link
Contributor Author

dnaveenr commented Apr 4, 2022

Just checked, the authors have agreed. He mentioned that he had complaints about the GDrive link.
You can push it to the Hub and share the link. :)

@dnaveenr
Copy link
Contributor Author

dnaveenr commented Apr 5, 2022

I have added :

  • streaming support for rvl-cdip.tar.gz file. [ Need to test this ]

Is it possible for you to upload the train.txt, test.txt, val.txt files separately to the Hub instead of labels_only.tar.gz file.
Currently during the tests in stream mode, we get :
NotImplementedError: Extraction protocol for TAR archives like 'https://huggingface.co/datasets/mariosasko/rvl_cdip/resolve/main/labels_only.tar.gz' is not implemented in streaming mode. Please use dl_manager.iter_archive instead.
If the label files are present as .txt files then we can directly use dl_manager.download.

@mariosasko
Copy link
Collaborator

The rvl-cdip.tar.gz archive and txt files with the labels are on the Hub!

@dnaveenr
Copy link
Contributor Author

  • Added 🤗 Hub download links.
  • streamable and non-streamable versions of _generate_examples.
  • Updated dummy data, both real and dummy dataset tests have passed.

Copy link
Collaborator

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nits, and we can merge then.

datasets/rvl_cdip/README.md Outdated Show resolved Hide resolved
datasets/rvl_cdip/README.md Outdated Show resolved Hide resolved
datasets/rvl_cdip/rvl_cdip.py Outdated Show resolved Hide resolved
@dnaveenr
Copy link
Contributor Author

I've removed the extraction of the archive file locally as suggested. Let me know if any other changes are required. :)

@dnaveenr dnaveenr requested a review from lhoestq April 19, 2022 11:12
Copy link
Collaborator

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed minor fixes/improvements to the card and the script. It all looks good now. Good job!

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @dnaveenr ! And thanks @mariosasko for updating the URLs

It looks all good to me :)

feel free to merge when the CI is green

@mariosasko mariosasko merged commit 4134e89 into huggingface:master Apr 21, 2022
@dnaveenr dnaveenr deleted the add_rvl-cdip_dataset branch April 21, 2022 17:22
@dnaveenr
Copy link
Contributor Author

The check for Update Hub repositories / update-hub-repositories has failed.

https://github.com/huggingface/datasets/runs/6116502392?check_suite_focus=true

@lhoestq
Copy link
Member

lhoestq commented Apr 22, 2022

Hi ! Thanks for reporting ;) yes this CI job has been failing for a few days. I'm working on fixing it, and I'm manually running it on my side in the meantime

@dnaveenr
Copy link
Contributor Author

Great. :D Thank you @lhoestq

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add RVL-CDIP dataset
4 participants