Remove library specific tokenizers #108 #7027
Conversation
…MTokenizer tests adjusted and moved into HFTransformersNLP tests
…Also moved tests from test_convert_tokenizer into test_convert_featurizer and adjusted accordingly
@koernerfelicia I started looking into the PR. The way you have implemented the intended logic looks good.
I am not sure what the ideal solution is here, but I'll think a bit more about the original problem itself -
What if we get around that problem by relaxing the constraint that certain tokenizers must necessarily be used with certain featurizers? @tabergma Thoughts on the above?
I think we definitely need to get rid of the dependency between …
@tabergma @dakshvar22 To summarize: I'm still not sure how to change the logic so that the …
Which solution should I select?
I would go with option 2. (Just as a note: we need to keep the old components and deprecate them; we cannot simply remove them.)
@tabergma @koernerfelicia I agree option 2 seems much better. Just wondering about this -
So, if we move all the logic inside …
Yes. For Chinese, our users would be able to use the …
Okay, sounds good. 👍
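To make the Chinese use case concrete, here is a minimal pipeline sketch of what option 2 would allow. The tokenizer choice (JiebaTokenizer) and the model weights are assumptions for illustration and are not taken from this thread:

```yaml
# Sketch only: a possible Chinese pipeline once the featurizer no longer depends
# on a library-specific tokenizer. JiebaTokenizer and the model weights are
# assumed for illustration.
pipeline:
  - name: JiebaTokenizer              # Chinese word segmentation
  - name: LanguageModelFeaturizer     # dense features from a pre-trained LM
    model_name: bert
    model_weights: bert-base-chinese
  - name: DIETClassifier
```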
…nguageModelFeaturizer. HFTransformersNLP is deprecated; however, it will still work as before if included in the pipeline (i.e. LanguageModelFeaturizer does not overwrite its tokens or features)
@dakshvar22 @tabergma I've moved the logic from … One tricky thing: since we are only deprecating … What do you think?
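A rough sketch of the deprecation path described in the commit message above; the exact configuration keys and component order are assumptions, not copied from the diff:

```yaml
# Sketch: an old-style pipeline keeps working. HFTransformersNLP is deprecated
# and only warns; LanguageModelFeaturizer does not overwrite its tokens/features.
pipeline:
  - name: WhitespaceTokenizer
  - name: HFTransformersNLP           # deprecated, but still honoured if present
    model_name: bert
  - name: LanguageModelFeaturizer     # reuses what HFTransformersNLP already set

# Sketch: the equivalent new-style pipeline without HFTransformersNLP.
# pipeline:
#   - name: WhitespaceTokenizer
#   - name: LanguageModelFeaturizer
#     model_name: bert
```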
Just took a brief look and added some first comments.
I think we can move the check so that we just print the warning once per batch. See inline comments.
Please also add a changelog entry.
@koernerfelicia We should merge this after #7089 is merged into …
@koernerfelicia Let me know if this is ready for another review round.
Hi @howl-anderson, you're right, it's a link to a private repo. Do you have any questions about the issue that I can answer?
Hi @koernerfelicia, I am just working on an issue related to this. So far everything is OK; if I have a question, I will let you know. Thank you!
Commit: 15c5047. The full report is available as an artifact. (Per-dataset result tables not reproduced here.)
Hey @koernerfelicia! 👋 To run model regression tests, comment with the /modeltest command.
Tips 💡: The model regression test will be run on …
Tips 💡: Every time you want to change a configuration, you should edit the comment with the previous configuration. You can copy this in your comment and customize:
/modeltest include:
- dataset: ["all"]
  config: ["BERT + DIET(bow) + ResponseSelector(bow)", "BERT + DIET(seq) + ResponseSelector(t2t)", "Sparse + BERT + DIET(bow) + ResponseSelector(bow)", "Sparse+BERT + DIET(seq) + ResponseSelector(t2t)"]
The model regression tests have started. It might take a while; please be patient. The configuration used can be found in the comment.
Commit: 15c5047. The full report is available as an artifact. (Per-dataset result tables not reproduced here.)
Research issue 108
Proposed changes
I set the sub-token information in HFTransformersNLP. This requires that tokenization occurs before the information is set.
We used to handle this by always tokenizing with the WhitespaceTokenizer inside HFTransformersNLP. Now, the Tokenizer must be placed before HFTransformersNLP in the pipeline, but any Tokenizer can be used. This may be confusing to the user.
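As an illustration of the ordering constraint described above, a minimal pipeline sketch; component parameters are assumptions for illustration:

```yaml
# Sketch: the tokenizer must now come before HFTransformersNLP, but any
# tokenizer can be used, not only WhitespaceTokenizer.
pipeline:
  - name: WhitespaceTokenizer         # or any other tokenizer
  - name: HFTransformersNLP           # sets the sub-token information
    model_name: bert
  - name: LanguageModelFeaturizer
  - name: DIETClassifier
```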
One alternative would be to move the logic of setting the sub-token information into LanguageModelFeaturizer. This requires moving several functions from HFTransformersNLP as well and reloading the model in LanguageModelFeaturizer.
Alternatively, the featurization and tokenization from HFTransformersNLP can be moved into LanguageModelFeaturizer, doing away with HFTransformersNLP entirely. The drawback may be less flexibility should we want to apply features from HFTransformersNLP differently in the future. I'm not sure if this is likely; maybe @dakshvar22 knows?
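For comparison, a sketch of this second alternative, where LanguageModelFeaturizer takes over both sub-token alignment and featurization itself; the configuration keys and weights are assumptions:

```yaml
# Sketch: HFTransformersNLP removed entirely; the featurizer loads the language
# model itself and aligns sub-tokens with the output of the preceding tokenizer.
pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: bert
    model_weights: bert-base-uncased
  - name: DIETClassifier
```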
Both Tokenizers will give a deprecation warning if used, and will revert to WhitespaceTokenizer.
Status (please check what you already did):
reformat files using black (please check Readme for instructions)