Change features vs schema logic #423

lhoestq · 2020-07-21T14:52:47Z

New logic for `nlp.Features` in datasets

Previously, `nlp.Dataset` exposed both `features` and pyarrow's `schema`, which was confusing.
`features` is supposed to be the front-facing object that defines the fields of a dataset, while `schema` is only used to write Arrow files.

Changes:

- Remove the `schema` field in `nlp.Dataset`
- Make `features` the source of truth to read/write examples
- `features` can no longer be `None` in `nlp.Dataset`
- Update `features` after each dataset transform such as `nlp.Dataset.map`

Todo: change the tests to take these changes into account
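
To make the changes listed above concrete, here is a minimal sketch in plain Python. It does not use the real `nlp` classes: `ToyDataset` and `infer_features` are hypothetical stand-ins, and the type names stored in `features` are just strings. It only illustrates the idea that `features` is mandatory, is the single description of the columns, and is recomputed after a `map`-style transform.

```python
from typing import Any, Callable, Dict, List, Optional


def infer_features(example: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: map each column name to the type name of its value."""
    return {name: type(value).__name__ for name, value in example.items()}


class ToyDataset:
    """Hypothetical stand-in for a dataset that keeps no separate schema."""

    def __init__(self, rows: List[Dict[str, Any]], features: Optional[Dict[str, str]]):
        if features is None:
            # Mirrors the rule that features can no longer be None.
            raise ValueError("features must be provided")
        self.rows = rows
        self.features = features  # single source of truth for the columns

    def map(self, function: Callable[[Dict[str, Any]], Dict[str, Any]]) -> "ToyDataset":
        # Merge the function output into each row, as a map-style transform would.
        new_rows = [{**row, **function(row)} for row in self.rows]
        # Start from the current features and update them from the transformed rows.
        new_features = dict(self.features)
        if new_rows:
            new_features.update(infer_features(new_rows[0]))
        return ToyDataset(new_rows, new_features)


dataset = ToyDataset([{"text": "hello"}, {"text": "hi there"}], {"text": "str"})
mapped = dataset.map(lambda row: {"length": len(row["text"])})
print(mapped.features)  # {'text': 'str', 'length': 'int'}
```

The original dataset keeps its own `features` untouched; only the object returned by `map` carries the updated description.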

lhoestq · 2020-07-22T16:43:15Z

I had to make SplitDict serializable to be able to copy DatasetInfo objects properly.
Serialization was also requested in #389.
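
For readers wondering why copying an info object depends on the split container being serializable, here is a rough sketch. It is not the code from this PR: `ToySplitInfo`, `ToySplitDict`, `ToyDatasetInfo`, `to_list` and `from_list` are all hypothetical names, and round-tripping through a list of plain dicts is just one possible approach.

```python
import copy
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ToySplitInfo:
    """Hypothetical per-split metadata."""
    name: str
    num_examples: int = 0


class ToySplitDict(dict):
    """Hypothetical mapping from split name to ToySplitInfo."""

    def add(self, split_info: ToySplitInfo) -> None:
        if split_info.name in self:
            raise ValueError(f"Split {split_info.name} already present")
        self[split_info.name] = split_info

    def to_list(self) -> List[dict]:
        # Plain dicts are trivially serializable (JSON, pickle, ...).
        return [asdict(info) for info in self.values()]

    @classmethod
    def from_list(cls, items: List[dict]) -> "ToySplitDict":
        split_dict = cls()
        for item in items:
            split_dict.add(ToySplitInfo(**item))
        return split_dict


@dataclass
class ToyDatasetInfo:
    """Hypothetical info object that owns a split dict."""
    description: str = ""
    splits: ToySplitDict = field(default_factory=ToySplitDict)


splits = ToySplitDict()
splits.add(ToySplitInfo("train", 100))
splits.add(ToySplitInfo("test", 20))
info = ToyDatasetInfo(description="demo", splits=splits)

# Round trip through JSON, then an independent deep copy of the whole info object.
restored = ToySplitDict.from_list(json.loads(json.dumps(info.splits.to_list())))
info_copy = copy.deepcopy(info)
info_copy.splits["train"].num_examples = 0  # must not affect the original
```

After this runs, `restored` equals the original splits and `info.splits["train"].num_examples` is still 100, which is the property a proper copy of the info object needs.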

thomwolf

Really cool!

thomwolf · 2020-07-25T09:08:34Z

One thing I forgot to say here is that we also want to use the `features` argument of `load_dataset` (which goes in the builder’s config) to override the default features of a dataset script.
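
A rough sketch of that override, in plain Python. None of this is the real `nlp` API: `ToyBuilderConfig`, `ToyBuilder`, `toy_load_dataset` and the string type names are hypothetical, and the only point is that a user-supplied `features` stored in the config takes precedence over the defaults declared by the script.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToyBuilderConfig:
    """Hypothetical config; `features` is the optional user override."""
    name: str = "default"
    features: Optional[Dict[str, str]] = None


class ToyBuilder:
    """Hypothetical dataset script with its own default features."""

    DEFAULT_FEATURES = {"text": "string", "label": "string"}

    def __init__(self, config: ToyBuilderConfig):
        self.config = config

    def resolved_features(self) -> Dict[str, str]:
        # The override from the config wins; otherwise use the script defaults.
        if self.config.features is not None:
            return dict(self.config.features)
        return dict(self.DEFAULT_FEATURES)


def toy_load_dataset(features: Optional[Dict[str, str]] = None) -> ToyBuilder:
    """Hypothetical entry point that forwards `features` into the builder config."""
    return ToyBuilder(ToyBuilderConfig(features=features))


default_builder = toy_load_dataset()
custom_builder = toy_load_dataset(features={"text": "string", "label": "int64"})
print(default_builder.resolved_features())
print(custom_builder.resolved_features())
```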

Change features vs schema logic

26962b5

This was linked to issues Jul 22, 2020

Features should be updated when map() changes schema #342

Closed

[Arrow writer, Trivia_qa] Could not convert TagMe with type str: converting to null type #211

Closed

lhoestq added 3 commits July 22, 2020 17:21

test output features of dataset transforms

9901138

style

e82c948

fix error msg in SplitInfo + make serializable

099f423

lhoestq marked this pull request as ready for review July 22, 2020 17:15

fix dictionary_encode_column

d1f3821

lhoestq force-pushed the change-features-vs-schema-logic branch from f754505 to d1f3821 July 22, 2020 17:33

thomwolf approved these changes Jul 23, 2020

lhoestq merged commit 8d828b9 into master Jul 23, 2020

lhoestq deleted the change-features-vs-schema-logic branch July 23, 2020 10:15

This was referenced Jul 23, 2020

Fix pickling of SplitDict #389

Closed

fix concatenate_datasets #428

Merged

Adding support for generic multi dimensional tensors and auxillary image data for multimodal datasets #363

Merged

jarednielsen mentioned this pull request Aug 11, 2020

Export to TFRecords Error aws-samples/deep-learning-models#30

Closed

This was referenced Feb 25, 2024

Update the print message for chunked_dataset in process.mdx gzbfgjf2/datasets#1

Open

Update the print message for chunked_dataset in process.mdx #6693

Merged
