Add British Library books dataset #3603

davanstrien · 2022-01-19T17:53:05Z

This pull request adds a dataset of text from digitised (primarily 19th Century) books from the British Library. This collection has previously been used for training language models, e.g. https://github.com/dbmdz/clef-hipe/blob/main/hlms.md. It would be nice to make this dataset more accessible for others to use through datasets.

This is still a WIP, but I wanted to get some initial feedback. In particular, I wanted to check:

- whether I am handling the use of iter_archive correctly: I intend to ensure that dl_manager.download gets the complete list of URLs to download upfront, so the progress bar knows how much is left to download, and then to pass a list of downloaded zip archives wrapped in iter_archive through gen_kwargs. Is there a more elegant approach for this?
- the number of configs: I have aimed to keep this limited. There are a lot of URLs covering the entire dataset, but I have tried to base the configs on what I believe the majority of people will want, so they are not presented with too many options. I am happy to hear suggestions for changing this.
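
The download pattern described in the first point might look roughly like the sketch below. It assumes the download manager's `download` call returns one local path per URL and that `iter_archive` lazily yields (path, file object) pairs for each archive. `StubDownloadManager`, `build_gen_kwargs` and the example URLs are hypothetical stand-ins so the snippet runs without network access; they are not code from this PR.

```python
from typing import Dict, Iterator, List, Tuple


class StubDownloadManager:
    """Hypothetical stand-in for the real download manager, for illustration only."""

    def download(self, urls: List[str]) -> List[str]:
        # The real manager fetches every URL and returns local paths;
        # this stub just derives a fake local file name from each URL.
        return [f"/tmp/cache/{url.rsplit('/', 1)[-1]}" for url in urls]

    def iter_archive(self, path: str) -> Iterator[Tuple[str, bytes]]:
        # The real method lazily yields (member_path, file_object) pairs;
        # this stub yields a single fake member per archive.
        yield (f"{path}!/00.jsonl.gz", b"")


def build_gen_kwargs(dl_manager, urls: List[str]) -> Dict[str, list]:
    # Hand over the complete URL list in one call so progress covers everything.
    archives = dl_manager.download(urls)
    # Wrap each downloaded archive lazily; nothing is read until iteration.
    data_dirs = [dl_manager.iter_archive(archive) for archive in archives]
    return {"data_dirs": data_dirs}


# Hypothetical URLs, used only to exercise the sketch.
example_urls = ["https://example.org/1700.zip", "https://example.org/1800.zip"]
gen_kwargs = build_gen_kwargs(StubDownloadManager(), example_urls)
```

In this stub the iterators are plain generators, so each one can be consumed only once; that is worth keeping in mind if the examples generator is ever run twice over the same `gen_kwargs`.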

If there are other glaring omissions or mistakes, I'd be happy to hear them. If this approach seems sensible in general, I will finish all the remaining TODOs, generate dummy_data, etc.

lhoestq

Nice ! Thanks a lot for adding this dataset :)

The dataset card is awesome ! And you did well regarding iter_archive and the configurations, feel free to continue in this direction !

lhoestq

Awesome thank you !

The only thing missing is the dummy data we use to test the dataset script regularly. You can run the datasets-cli dummy_data ./datasets/blbooks command to get instructions for generating it.

Since the dataset has a very specific structure it might not be that easy so feel free to ping me if you have questions or if I can help !

davanstrien · 2022-01-27T17:37:57Z

Thanks for all the help and suggestions

> Since the dataset has a very specific structure it might not be that easy so feel free to ping me if you have questions or if I can help !

I did get a little stuck here! So far I have created directories for each config, i.e.:

datasets/datasets/blbooks/dummy/1700_1799/1.0.2/dummy_data.zip

I have then added two examples of the jsonl.gz files that are in the underlying dataset to each dummy_data directory. This fails the test using local files.

Since _generate_examples(self, data_dirs) takes as input data_dirs, which is a list of iter_dirs, do I need to put the dummy files inside another directory? i.e.

datasets/datasets/blbooks/dummy/1700_1799/1.0.2/dummy_data/1700/00.jsonl.gz
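
To make the question concrete, here is a rough sketch of how a generator that receives a list of archive iterators could read `.jsonl.gz` members regardless of how deeply they are nested. It uses only the standard library. `iter_zip` is a hypothetical stand-in for `iter_archive`, and `generate_examples`, `make_dummy_zip` and the `record_id`/`text` field names are made up for illustration rather than taken from the loading script.

```python
import gzip
import io
import json
import zipfile
from typing import Dict, Iterator, List, Tuple


def iter_zip(zip_bytes: bytes) -> Iterator[Tuple[str, io.BufferedIOBase]]:
    """Hypothetical stand-in for iter_archive: yield (member_path, file_object)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith("/"):  # skip directory entries
                with archive.open(name) as member:
                    yield name, member


def generate_examples(data_dirs) -> Iterator[Tuple[int, Dict]]:
    key = 0
    for data_dir in data_dirs:
        for path, file_obj in data_dir:
            # Only the suffix is checked, so enclosing directories do not matter.
            if not path.endswith(".jsonl.gz"):
                continue
            with gzip.open(file_obj, mode="rt", encoding="utf-8") as lines:
                for line in lines:
                    yield key, json.loads(line)
                    key += 1


def make_dummy_zip(member_path: str, records: List[Dict]) -> bytes:
    # Build an in-memory zip holding one gzip-compressed JSON Lines file.
    payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(member_path, gzip.compress(payload))
    return buffer.getvalue()


dummy = make_dummy_zip(
    "dummy_data/1700/00.jsonl.gz",
    [{"record_id": "a", "text": "first"}, {"record_id": "b", "text": "second"}],
)
examples = list(generate_examples([iter_zip(dummy)]))
```

Because the filter looks only at the `.jsonl.gz` suffix of each member path, an extra enclosing directory in the dummy archive does not change what this sketch yields; whether the test harness treats the directory layout the same way is a separate question.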

lhoestq · 2022-01-31T13:58:48Z

I think I managed to create the dummy data :)

I think everything is good now, if you don't have other changes to do, please mark your PR as "ready for review" and ping me!

davanstrien · 2022-01-31T16:04:43Z

> I think I managed to create the dummy data :)

Thanks so much for that!

> I think everything is good now, if you don't have other changes to do, please mark your PR as "ready for review" and ping me!

Think it is ready to merge from my end @lhoestq.

lhoestq

Nice thanks :)

lhoestq · 2022-01-31T16:51:57Z

The CI failure on Windows is unrelated to your PR and is fixed on master, so we can ignore it.

davanstrien added 5 commits January 19, 2022 17:36

loading script draft

fe1f9e2

improve config naming

73d7786

move parsing code into function

78d2103

fix type hints

2b0735f

fix default config name

b8ae997

lhoestq reviewed Jan 24, 2022

datasets/blbooks/README.md

datasets/blbooks/blbooks.py

davanstrien and others added 13 commits January 24, 2022 18:04

fix typo

6d65aea

Co-authored-by: Quentin Lhoest <[email protected]>

add header

2c2f4d0

Co-authored-by: Quentin Lhoest <[email protected]>

remove readlines call

3b13ac7

Co-authored-by: Quentin Lhoest <[email protected]>

update copyright date

1f8188d

add citation to README

6a1dd85

update citation key

90020be

update citation key

ea2d3cd

add contact details

59ab737

add URLs to configs

38a0eea

add url

b8d470d

black formatting

a3d2652

add config options to readme

39f18fa

generate dataset_infos

17b507c

lhoestq reviewed Jan 27, 2022

lhoestq added 2 commits January 31, 2022 14:56

add dummy data

9796416

fix tags

93417de

davanstrien marked this pull request as ready for review January 31, 2022 16:03

lhoestq approved these changes Jan 31, 2022

Update datasets/blbooks/README.md

36ce4a4

lhoestq merged commit 4c417d5 into huggingface:master Jan 31, 2022

davanstrien deleted the add-bl-books branch January 31, 2022 17:22
