Adding a PretrainedTransformerTokenizer #3145
Conversation
lgtm, modulo (1) the one question about the test fixture, and (2) whether we want to replace the BertBasicWordSplitter with this (or just put a comment there to that effect, or whether we'll just wipe that code out eventually anyway).
class TestPretrainedTransformerTokenizer(AllenNlpTestCase):
    def test_splits_into_wordpieces(self):
        tokenizer = PretrainedTransformerTokenizer('bert-base-cased', do_lowercase=False)
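A hedged sketch (not the actual test body from the PR) of how such a test might continue; the example sentence and assertion are illustrative only:

```python
        # Illustrative continuation, not the real test: check that an
        # out-of-vocabulary word gets split into '##'-prefixed wordpieces.
        sentence = "AllenNLP is great"  # hypothetical example sentence
        tokens = [token.text for token in tokenizer.tokenize(sentence)]
        assert any(text.startswith("##") for text in tokens)
```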
do we want to download and use the full bert-base-cased tokenizer in our test (it's not that big, I guess), or would we rather rely on a smaller test fixture?
It would be good not to have the dependency on the web for the test, but I'm pretty sure it wouldn't work, because we rely on pytorch-transformers' AutoTokenizer.
Well, maybe it would work if I make sure the filename that I give for the fixture passes their heuristics for which actual tokenizer to use, but (1) I'm not sure that I can, because their heuristics might not allow it, and (2) that would make the test too dependent on the internals of their implementation.
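For context, a minimal sketch (not from the PR) of the AutoTokenizer behavior being referred to, assuming the standard pytorch-transformers API:

```python
# Minimal sketch of the AutoTokenizer dispatch referred to above; the pretrained
# name/path string is what determines which concrete tokenizer class is built.
from pytorch_transformers import AutoTokenizer

# "bert" in the name routes to the BERT tokenizer; a local test fixture would have
# to satisfy the same naming heuristics to be loaded the same way.
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
print(tokenizer.tokenize("AllenNLP is great"))
```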
The BertBasicWordSplitter is for a different use case, where you have mismatched tokenization and indexing. This can't actually replace that code. We should rename all of that mismatched code eventually, though, to make it more obvious what's going on and when you should use each option.
@@ -131,6 +131,7 @@
     'sqlparse>=0.2.4',
     'word2number>=1.1',
     'pytorch-pretrained-bert>=0.6.0',
+    'pytorch-transformers @ https://api.github.com/repos/huggingface/pytorch-transformers/tarball/a7b4cfe9194bf93c7044a42c9f1281260ce6279e',
Hi @matt-gardner, why not use pytorch-transformers==1.1.0?
PyPI won't allow a direct URL dependency like this one, as discussed in pypa/pip#6301.
An error like the following is raised when running twine upload:
HTTPError: 400 Client Error: Invalid value for requires_dist. Error: Can't have direct dependency: 'pytorch-transformers @ https://api.github.com/repos/huggingface/pytorch-transformers/tarball/a7b4cfe9194bf93c7044a42c9f1281260ce6279e' for url: https://upload.pypi.org/legacy/
Because the auto stuff hadn't been released when I wrote this. It has been now, so this can be updated. PR welcome.
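A sketch of what the updated dependency entry could look like, using the version suggested above; the exact version specifier is an assumption, not the merged change:

```python
# Sketch of the corresponding setup.py change, assuming the pin moves to the
# released package mentioned above (the exact specifier is up to the follow-up PR).
install_requires = [
    # ...
    'pytorch-transformers==1.1.0',
]
```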
@matt-gardner, I'm getting the following error when attempting to install the library, which I believe is caused by this change.
Perhaps we can fix this by upgrading to the recently released pytorch-transformers v1.1.0? https://pypi.org/project/pytorch-transformers/1.1.0/
there is a PR for this; I'm just waiting for it to update, and then I'll merge
Thanks, @joelgrus!
ok, it's merged
* Adding a PretrainedTransformerTokenizer
* pylint
* doc
First (of many) PRs merging our hackathon project into the main repo. This one adds a dependency on the new pytorch-transformers repo, and grabs a tokenizer from there.

This also fills a gap that we had in our data processing API, where there was no good way to get native BERT (or GPT, etc.) tokenization as input to your model - you had to use words, then split them further into wordpieces or byte pairs. With this tokenizer, you can have wordpieces or byte pairs be your base tokens, instead of words.
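As a hedged illustration of the usage this enables (the import path and example sentence are assumptions; the constructor arguments mirror the test above):

```python
# Sketch of the new usage described above: wordpieces become the base tokens,
# rather than words that get split further downstream.
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer('bert-base-cased', do_lowercase=False)
tokens = tokenizer.tokenize("AllenNLP is great")  # hypothetical sentence
print([token.text for token in tokens])  # wordpiece tokens, not whole words
```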