to_tf_dataset rewrite #4170

Merged: 56 commits merged into master on Jun 6, 2022
Conversation

@Rocketknight1 (Member) commented Apr 14, 2022

This PR rewrites almost all of to_tf_dataset(), which makes it kind of hard to list all the changes, but the most critical ones are:

  • Much better stability and no more dropping unexpected column names (Sorry @NielsRogge)
  • Doesn't clobber custom transforms on the data (Sorry @NielsRogge again)
  • Much better handling of the situation when the collate_fn adds columns that aren't in the dataset.
  • Better inference of shapes and data types
  • Lots of hacky special-casing code removed
  • Can return string columns (as tf.string)
  • Most arguments now have default values, so calling the method should be much simpler (see the usage sketch below)
  • Can accept a model argument and only return columns that are valid inputs to that model
  • Drops the dummy_labels argument - this was a workaround for Keras issues that have been resolved by changes in transformers. It has also been removed from the tests and the Overview notebook.

I still have a couple of TODOs remaining and some testing to do, so don't merge yet, but it should be mostly ready for review at this point!
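
For illustration, here is a rough usage sketch of the simplified calling convention (not taken from this PR's diff); the toy dataset, column names, and exact defaults below are assumptions based on the bullet list above.

```python
from datasets import Dataset

# Toy dataset with pre-tokenized, fixed-length inputs (illustrative only)
ds = Dataset.from_dict(
    {"input_ids": [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]], "label": [0, 1, 0, 1]}
)

# With defaults for most arguments the call stays short: no dummy_labels,
# no custom collate_fn (the minimal default collator is used), and unrequested
# columns are simply left out rather than causing surprises.
tf_ds = ds.to_tf_dataset(
    columns=["input_ids"],
    label_cols=["label"],
    batch_size=2,
    shuffle=True,
)

for batch in tf_ds.take(1):
    print(batch)  # a (features, labels) pair
```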

@HuggingFaceDocBuilderDev commented Apr 14, 2022

The documentation is not available anymore as the PR was closed or merged.

@sgugger (Contributor) left a comment

The fact that your changes make this very entangled with Transformers code makes me think you are trying to do too much here. In particular, the code for guessing the labels or the numpy data collator from Transformers is very much aimed at Transformer models, while the method in Datasets would be expected to work on any model. Copy-pasting code from Transformers is kind of a red flag ;-)

I think we need to rethink the design and make the base to_tf_dataset more basic and less magic, then have another method in Transformers that will do the magic guessing of defaults using the model (and use a different default collate).

@Rocketknight1 (Member Author)

Magic is now banned by decree of @sgugger. This is honestly much cleaner, and the functionality will make much more sense in transformers anyway!

@sgugger (Contributor) left a comment

Better for me! Are there any real changes in the notebook? All I see in the diff is PyCharm adding useless metadata.

@gante (Member) left a comment

Looks good! Added a few minor comments, and I'd like to make a request as well: since we now have default arguments for everything, can we add a test case with no passed arguments?

(I don't want to approve because I don't have much datasets knowledge :) )

@Rocketknight1 (Member Author)

@gante I renamed the default collator to minimal_tf_collate_fn!
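
For readers following along, here is a rough sketch of what such a minimal default collator might do (an assumption about its behavior, not the code added in this PR): stack each feature across the examples in the batch as numpy arrays, assuming equal-length features and no padding.

```python
import numpy as np

def minimal_collate_sketch(features):
    """Hypothetical stand-in for the default collator: no padding, no renaming,
    just stack each column of the batch into a numpy array."""
    batch = {}
    for key in features[0]:
        batch[key] = np.stack([np.asarray(example[key]) for example in features])
    return batch
```

Anything that needs padding or renaming would still be handled by passing a custom collate_fn.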

@lhoestq (Member) left a comment

Awesome thank you !

I added a few comments, and also found a bug when you don't specify the columns explicitly (a None is passed to _get_output_signature).

Feel free to also add some tests to make sure that the new default behavior works as expected !

Quoted diff context:

    """
    # TODO Try an Image dataset and see if we can do the conversion
A Member commented on this snippet:
Image datasets return PIL.Image objects by default, which are not supported by TF.

However we can improve the .with_format("numpy") so that it returns a numpy array instead of the PIL Image, so that you don't have to deal with custom types by yourself in to_tf_dataset and always assume that you'll get numpy arrays that work with TF :)

cc @mariosasko

@Rocketknight1 (Member Author) replied:

Now that we're not clobbering with_transform, users can also just do that!
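
As a hedged illustration of that point (the dataset name and column names below are assumptions, not something from this PR), a user-supplied transform can convert PIL images to numpy arrays before to_tf_dataset ever sees them:

```python
import numpy as np
from datasets import load_dataset

# "beans" is just an example image dataset; its "image"/"labels" columns are
# assumptions about that dataset rather than anything defined in this PR.
image_dataset = load_dataset("beans", split="train")

def pil_to_numpy(batch):
    # Applied lazily by with_transform, so PIL images become numpy arrays
    # only when examples are actually read
    batch["image"] = [np.asarray(img) for img in batch["image"]]
    return batch

image_dataset = image_dataset.with_transform(pil_to_numpy)
tf_ds = image_dataset.to_tf_dataset(columns=["image"], label_cols=["labels"], batch_size=4)
```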

@lhoestq (Member) left a comment

I think it's still not working, see my comment. Can you please add some tests ?

"encoded_tf_dataset = encoded_dataset['train'].to_tf_dataset(\n",
" columns=columns,\n",
" collate_fn=collate_fn,\n",
" batch_size=8,\n",
" shuffle=True,\n",
" dummy_labels=True\n",
A Member commented on this snippet:

It was pretty hard to find these little changes in this big diff xD

@Rocketknight1 (Member Author) replied:

Unfortunately, I don't know how to stop Jupyter saving all that junk when I just want to change one line!

@Rocketknight1 (Member Author)

@lhoestq @sgugger @gante

I think this should now be ready; it looks good in testing! I'll try a few more notebooks today and tomorrow to be sure before I merge. Key changes are:

  • No column autodetection magic (will make a separate PR to add this as a transformers function)
  • Drops non-numerical features automatically (this is more of a 'DataLoader' method; we'll have a separate method to expose 'raw' datasets to tf.data)
  • Better autodetection of numerical features.
  • Shouldn't randomly crash mid-function 💀

We definitely have some questions still to resolve about how to handle making a 'DataLoader' dataset versus a 'raw' dataset - see the Notion doc if you're interested. Still, since this PR is just fixes/improvements to an existing method which never supported non-numerical features anyway, we can merge it before we've resolved those issues, and then think about how to name and split things afterwards.

@Rocketknight1 (Member Author)

P.S. I'll take out the region comments at the end before I merge, I promise! They're just helpful while I'm editing it

@gante (Member) left a comment

Looks good! 👍

Can we also add a test case for .to_tf_dataset(), i.e. using defaults for all arguments?
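
A hedged sketch of the kind of test being asked for here (not the test that ended up in the PR; whether every argument, including batch_size, has a default in this version is not clear from the thread, so batch_size is still passed explicitly below):

```python
from datasets import Dataset

def test_to_tf_dataset_defaults_sketch():
    ds = Dataset.from_dict({"col_1": [1, 2, 3, 4], "col_2": [0.5, 1.5, 2.5, 3.5]})
    # Everything except batch_size is left at its default value
    tf_ds = ds.to_tf_dataset(batch_size=2)
    batch = next(iter(tf_ds))
    # With no columns/label_cols requested, all columns should come back as a dict
    assert set(batch.keys()) == {"col_1", "col_2"}
    assert batch["col_1"].shape == (2,)
```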

@lhoestq (Member) commented May 2, 2022

+1 for the tests

> Drops non-numerical features automatically

Can you give more details on how this works, and the rationale as well? This is not explained in the docs.

Also, why are you adding error_on_missing and auto_fix_label_names? The rationale is not clear to me. In particular, I think it is sensible enough to expect users not to ask for columns that don't exist, and to rename a label column when required.

@Rocketknight1 (Member Author)

@lhoestq I rewrote those parts - they were causing some other issues too! error_on_missing and auto_fix_label_names have been removed. The new logic is to simply drop (before batch collation) all columns the user doesn't ask for, but not to raise errors if the user asked for columns not in the dataset, as they may be added by the collator. Hopefully this cleans it up and matches the documentation better!

Comment on lines 400 to 412:

    # Following the logic in `transformers.Trainer`, we do not drop `label_ids` or `label` even if they
    # are not in the list of requested columns, because the collator may rename them
    # This might work better if moved to a method attached to our transformers Model objects, but doing so
    # could break backward compatibility
    unwanted_columns = [
        col
        for col in self.features.keys()
        if col not in columns and col not in label_cols and col not in ("label_ids", "label")
    ]
    dataset = dataset.remove_columns(unwanted_columns)
@Rocketknight1 (Member Author) commented on this snippet:

One thing I'm still not sure about is that we special-case "label_ids" and "label", because those columns are frequently renamed to "labels" in transformers and we don't want to drop them. In PyTorch this special-casing occurs too, but it happens inside Trainer, so it doesn't mix things up between the transformers and datasets libraries. We might have to leave it here for now until I can make the 'magic method' on our transformers models!

A Member replied:

I'd like this part to be removed since it's transformers.Trainer-specific. What needs to be backward compatible?

@Rocketknight1 (Member Author) replied:

The standard workflow for transformers is that the column in the dataset is called label, but the argument to the model is called labels, and the renaming is done by the collate_fn. Therefore, we must be careful not to drop columns called label or label_ids, even if the user doesn't request them, because the collate_fn might need them.

In PyTorch, columns are dropped by the Trainer, and it makes sure not to drop these. However, in TensorFlow, columns are dropped by the to_tf_dataset() method, and therefore this code needs to be in there.

I think a good solution would be to make a method in transformers that calls to_tf_dataset() and converts datasets for training, and then we could move the label and label_ids special casing in there. However, until we do that, we'll need to keep the special casing in to_tf_dataset() or all our examples will break!
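
To make the point concrete, here is an illustrative collate_fn of the kind described (a sketch, not the transformers implementation): it is the collator, not the dataset, that produces the labels key the model expects, which is why to_tf_dataset() cannot safely drop label / label_ids up front.

```python
import numpy as np

def collate_and_rename(features):
    # Stack each column across the batch...
    batch = {key: np.stack([np.asarray(f[key]) for f in features]) for key in features[0]}
    # ...and rename "label" to "labels", the argument name the model expects.
    # If to_tf_dataset() had already dropped "label", this step would have
    # nothing to rename and the model would get no labels at all.
    if "label" in batch:
        batch["labels"] = batch.pop("label")
    return batch
```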

@lhoestq (Member) left a comment

Cool, let me know if you need help for the tests :)

I think we need several additional tests, especially since the function is quite long and not trivial to maintain. In particular it would be nice to check the default behavior, and check the behavior of the main parameters.

@Rocketknight1 (Member Author)

@lhoestq New tests are now in!

@Rocketknight1 (Member Author)

Seeing some other random tests failing that don't look to be associated with this PR.

@Rocketknight1 (Member Author)

@lhoestq I can't figure out these test failures! They don't seem related to this PR at all, but I rebased to the latest version and they keep happening, even though they're not visible on master.

@lhoestq (Member) commented May 19, 2022

Thanks for the ping, will take a look tomorrow :)

Maybe the rebase didn't go well for the code recently merged about label alignment from #4277 ?

@Rocketknight1 (Member Author) commented May 20, 2022

@lhoestq Got it! It was caused by a name collision - I was importing typing.Sequence, but the code also needed features.Sequence. The tests from that PR were expecting the latter but got the former, and then crashed.
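
For context, the fix presumably takes a shape like the following (an assumption; the actual diff isn't shown in this thread): alias the typing import so it can no longer shadow the Datasets feature type.

```python
# Hypothetical sketch of the disambiguation, not the exact change from the PR
from typing import Sequence as TypingSequence  # generic typing alias for hints

from datasets.features import Sequence  # the feature type the tests expect
```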

@lhoestq (Member) left a comment

Oh good catch ! And thanks for the tests :)

I think we just need to update the docstring, and I also added a question about the default batch_size (sorry for not raising the question earlier, I missed it)

@lhoestq (Member) left a comment

Thank you! I added more comments about the default for shuffle and drop_remainder, and some suggestions.

@lhoestq (Member) left a comment

Cool ! This is a HUGE step !! Thanks a lot @Rocketknight1 :)

Here are my final comments, then I think we can merge 🚀

@Rocketknight1 (Member Author)

@lhoestq Thanks! Also, when you're ready, don't merge it immediately! I'd like to do a quick round of manual testing with the very final build once you're happy to make sure it still works in our notebooks and examples.

@lhoestq (Member) left a comment

Thanks ! Feel free to run all your tests and merge yourself when you're good and if the CI is green :)

@Rocketknight1 (Member Author)

@lhoestq Tests look good to me, merging now!

@Rocketknight1 merged commit e3f2bbb into master on Jun 6, 2022
@Rocketknight1 deleted the to_tf_dataset_tpu_warning branch on Jun 6, 2022 (14:22)