Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add British Library books dataset #3603

Merged
merged 21 commits into from
Jan 31, 2022
Merged

Add British Library books dataset #3603

merged 21 commits into from
Jan 31, 2022

Conversation

davanstrien
Copy link
Member

@davanstrien davanstrien commented Jan 19, 2022

This pull request adds a dataset of text from digitised (primarily 19th Century) books from the British Library. This collection has previously been used for training language models, e.g. https://github.com/dbmdz/clef-hipe/blob/main/hlms.md. It would be nice to make this dataset more accessible for others to use through datasets.

This is still a WIP but I wanted to get some initial feedback in particular; I wanted to check:

  • I am handling the use of iter_archive correctly - I intend to ensure that dl_manager.download gets the complete list of URLs to download upfront, so the progress bar knows how much is left to download and then to pass through the gen_kwargs a list of downloaded zip archives wrapped in iter_archive. I am unsure if there is a more elegant approach for this?
  • the number of configs: I have aimed to keep this limited - there are a lot of URLs covering the entire dataset, but I have tried to base the configs on what I believe the majority of people will want to they are not presented with too many options - I am happy to hear suggestions for changing this

If there are other glaring omissions or mistakes, I'd be happy to hear them. If this approach seems sensible in general, I will finish all the remaining TODOs, generate dummy_data, etc.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice ! Thanks a lot for adding this dataset :)

The dataset card is awesome ! And you did well regarding iter_archive and the configurations, feel free to continue in this direction !

datasets/blbooks/README.md Outdated Show resolved Hide resolved
datasets/blbooks/README.md Outdated Show resolved Hide resolved
datasets/blbooks/blbooks.py Show resolved Hide resolved
datasets/blbooks/blbooks.py Outdated Show resolved Hide resolved
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thank you !

The only thing missing are the dummy data we use to test the dataset script regularly. You can use the datasets-cli dummy_data ./datasets/blbooks command to have the instructions to generate them.

Since the dataset has a very specific structure it might not be that easy so feel free to ping me if you have questions or if I can help !

@davanstrien
Copy link
Member Author

davanstrien commented Jan 27, 2022

Thanks for all the help and suggestions

Since the dataset has a very specific structure it might not be that easy so feel free to ping me if you have questions or if I can help !

I did get a little stuck here! So far I have created directories for each config i.e:

datasets/datasets/blbooks/dummy/1700_1799/1.0.2/dummy_data.zip

I have then added two examples of the jsonl.gz files that are in the underlying dataset to each dummy_data directory.This fails the test using local files.

Since

def _generate_examples(self, data_dirs):

takes as input data_dirs which is a list of iter_dirs do I need to put the dummy files inside another directory? i.e.

datasets/datasets/blbooks/dummy/1700_1799/1.0.2/dummy_data/1700/00.jsonl.gz

@lhoestq
Copy link
Member

lhoestq commented Jan 31, 2022

I think I managed to create the dummy data :)

I think everything is good now, if you don't have other changes to do, please mark your PR as "ready for review" and ping me!

@davanstrien davanstrien marked this pull request as ready for review January 31, 2022 16:03
@davanstrien
Copy link
Member Author

I think I managed to create the dummy data :)

Thanks so much for that!

I think everything is good now, if you don't have other changes to do, please mark your PR as "ready for review" and ping me!

Think it is ready to merge from my end @lhoestq.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thanks :)

datasets/blbooks/README.md Outdated Show resolved Hide resolved
@lhoestq
Copy link
Member

lhoestq commented Jan 31, 2022

The CI failure on windows is unrelated to your PR and fixed on master, we can ignore it

@lhoestq lhoestq merged commit 4c417d5 into huggingface:master Jan 31, 2022
@davanstrien davanstrien deleted the add-bl-books branch January 31, 2022 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants