to_tf_dataset rewrite #4170
Conversation
The documentation is not available anymore as the PR was closed or merged.
The fact your changes make this very entangled with Transformers code makes me think you are trying to do too much here. In particular the code for guessing the labels or the numpy data collator from Transformers are very aimed at Transformer models, and the method in Datasets would be expected to work on any model. Copy-pasting code from Transformers is kind of a red flag ;-)
I think we need to rethink the design and make the base to_tf_dataset more basic and less magic, then have another method in Transformers that will do the magic guessing of defaults using the model (and use a different default collate).
Magic is now banned by decree of @sgugger. This is honestly much cleaner, and the functionality will make much more sense in
Better for me! Are there any real changes in the notebook? All I see in the diff is PyCharm adding useless metadata.
Looks good! Added a few minor comments, and I'd like to make a request as well: since we now have default arguments for everything, can we add a test case with no passed arguments?
(I don't want to approve because I don't have much datasets knowledge :) )
@gante I renamed the default collator to
Awesome thank you !
I added a few comments, and also found a bug when you don't specify the columns explicitly (a None is passed to _get_output_signature).
Feel free to also add some tests to make sure that the new default behavior works as expected !
src/datasets/arrow_dataset.py
Outdated
""" | ||
# TODO Try an Image dataset and see if we can do the conversion |
Image datasets return PIL.Image objects by default, which are not supported by TF.
However, we can improve .with_format("numpy") so that it returns a numpy array instead of the PIL Image; then you don't have to deal with custom types yourself in to_tf_dataset and can always assume that you'll get numpy arrays that work with TF :)
cc @mariosasko
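A minimal sketch of what such a conversion could look like on the user side. This is only an illustration: the `FakeImage` stand-in and the `"image"` column name are assumptions, and a real dataset would contain actual `PIL.Image` objects.

```python
import numpy as np

class FakeImage:
    """Stand-in for a PIL.Image; real PIL images also implement __array__."""
    def __array__(self, dtype=None, copy=None):
        return np.zeros((2, 2, 3), dtype=dtype if dtype is not None else np.uint8)

def images_to_numpy(batch):
    # Convert image-like objects to plain numpy arrays so that TF only
    # ever sees types it can turn into tensors.
    batch["image"] = [np.asarray(img) for img in batch["image"]]
    return batch

# In practice this would be applied via dataset.with_transform(images_to_numpy)
batch = images_to_numpy({"image": [FakeImage(), FakeImage()]})
```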
Now that we're not clobbering with_transform, users can also just do that!
I think it's still not working, see my comment. Can you please add some tests ?
"encoded_tf_dataset = encoded_dataset['train'].to_tf_dataset(\n", | ||
" columns=columns,\n", | ||
" collate_fn=collate_fn,\n", | ||
" batch_size=8,\n", | ||
" shuffle=True,\n", | ||
" dummy_labels=True\n", |
It was pretty hard to find these little changes in this big diff xD
Unfortunately, I don't know how to stop Jupyter saving all that junk when I just want to change one line!
I think this should now be ready, it looks good in testing! I'll try a few more notebooks today and tomorrow to be sure before I merge. Key changes are:
We definitely have some questions still to resolve about how to handle making a 'DataLoader' dataset versus a 'raw' dataset - see the Notion doc if you're interested. Still, since this PR is just fixes/improvements to an existing method which never supported non-numerical features anyway, we can merge it before we've resolved those issues, and then think about how to name and split things afterwards.
P.S. I'll take out the region comments at the end before I merge, I promise! They're just helpful while I'm editing it
Force-pushed from 652facf to 71f3982 (compare)
Looks good! 👍
Can we also add a test case for .to_tf_dataset(), i.e. using defaults for all arguments?
+1 for the tests
Can you give more details on how this works, and the rationale as well? This is not explained in the docs. Also, why are you adding
@lhoestq I rewrote those parts - they were causing some other issues too!
src/datasets/arrow_dataset.py
Outdated
# Following the logic in `transformers.Trainer`, we do not drop `label_ids` or `label` even if they
# are not in the list of requested columns, because the collator may rename them.
# This might work better if moved to a method attached to our transformers Model objects, but doing so
# could break backward compatibility
unwanted_columns = [
    col
    for col in self.features.keys()
    if col not in columns and col not in label_cols and col not in ("label_ids", "label")
]
dataset = dataset.remove_columns(unwanted_columns)
One thing I'm still not sure about is that we special-case "label_ids" and "label", because those columns are frequently renamed to "labels" in transformers and we don't want to drop them. In PyTorch this special-casing occurs too, but it happens inside Trainer, so it doesn't mix things up between the transformers and datasets libraries. We might have to leave it here for now until I can make the 'magic method' on our transformers models!
I'd like this part to be removed since it's transformers.Trainer-specific. What needs to be backward compatible ?
The standard workflow for transformers is that the column in the dataset is called label, but the argument to the model is called labels, and the renaming is done by the collate_fn. Therefore, we must be careful not to drop columns called label or label_ids, even if the user doesn't request them, because the collate_fn might need them.
In PyTorch, columns are dropped by the Trainer, which makes sure not to drop these. In TensorFlow, however, columns are dropped by the to_tf_dataset() method itself, and therefore this code needs to be in there.
I think a good solution would be to make a method in transformers that calls to_tf_dataset() and converts datasets for training, and then we could move the label and label_ids special-casing in there. However, until we do that, we'll need to keep the special-casing in to_tf_dataset() or all our examples will break!
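The renaming described above can be sketched as a minimal collate_fn. This is only an illustration of the pattern, not the actual transformers data collator:

```python
def collate_fn(features):
    # Stack a list of per-example dicts into one batch dict of lists
    batch = {key: [example[key] for example in features] for key in features[0]}
    # The dataset column is called "label", but the model argument is "labels",
    # so the collator performs the rename
    if "label" in batch:
        batch["labels"] = batch.pop("label")
    return batch

examples = [
    {"input_ids": [1, 2], "label": 0},
    {"input_ids": [3, 4], "label": 1},
]
batch = collate_fn(examples)
# batch == {"input_ids": [[1, 2], [3, 4]], "labels": [0, 1]}
```

This is why to_tf_dataset() cannot drop the "label" column just because the user didn't request it: the collator above would then have nothing to rename.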
Cool, let me know if you need help for the tests :)
I think we need several additional tests, especially since the function is quite long and not trivial to maintain. In particular it would be nice to check the default behavior, and check the behavior of the main parameters.
@lhoestq New tests are now in!
Force-pushed from a440a15 to 3da0cb4 (compare)
Seeing some other random tests failing that don't look to be associated with this PR.
Force-pushed from 3da0cb4 to 172b231 (compare)
@lhoestq I can't figure out these test failures! They don't seem related to this PR at all, but I rebased to the latest version and they keep happening, even though they're not visible on master.
Thanks for the ping, will take a look tomorrow :) Maybe the rebase didn't go well for the code recently merged about label alignment from #4277 ?
Force-pushed from 172b231 to c518d47 (compare)
@lhoestq Got it! It was caused by a name collision - I was importing
Oh good catch ! And thanks for the tests :)
I think we just need to update the docstring, and I also added a question about the default batch_size (sorry for not raising the question earlier, I missed it)
Thank you ! I added more comments about the defaults for shuffle and drop_remainder, and some suggestions:
Co-authored-by: Quentin Lhoest <[email protected]>
Cool ! This is a HUGE step !! Thanks a lot @Rocketknight1 :)
Here are my final comments, then I think we can merge 🚀
Co-authored-by: Quentin Lhoest <[email protected]>
@lhoestq Thanks! Also, when you're ready, don't merge it immediately! I'd like to do a quick round of manual testing with the very final build once you're happy, to make sure it still works in our notebooks and examples.
Thanks ! Feel free to run all your tests and merge yourself when you're good and if the CI is green :)
@lhoestq Tests look good to me, merging now!
This PR rewrites almost all of to_tf_dataset(), which makes it kind of hard to list all the changes, but the most critical ones are:

- Correctly handles cases where the collate_fn adds columns that aren't in the dataset
- Supports string columns (tf.String)
- Can accept a model argument and only return columns that are valid inputs to that model
- Removes the dummy_labels argument - this was a workaround for Keras issues that have been resolved by changes in transformers. Also removed it from tests and the Overview notebook.

I still have a couple of TODOs remaining and some testing to do, so don't merge yet, but it should be mostly ready for review at this point!
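The model-argument filtering mentioned in the list above could be sketched with inspect.signature. This is a hedged illustration under assumed names (dummy_model_call is a stand-in for a real model's call signature); the actual implementation in the PR may differ:

```python
import inspect

def columns_valid_for_model(call_fn, dataset_columns):
    # Keep only the dataset columns whose names match parameters of the
    # model's call signature, so invalid inputs are never passed to the model
    params = set(inspect.signature(call_fn).parameters)
    return [col for col in dataset_columns if col in params]

# Hypothetical model call signature, for illustration only
def dummy_model_call(input_ids=None, attention_mask=None, labels=None):
    return None

cols = columns_valid_for_model(
    dummy_model_call, ["input_ids", "attention_mask", "token_type_ids"]
)
# cols == ["input_ids", "attention_mask"]
```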