Adding a PretrainedTransformerTokenizer #3145
Conversation
lgtm, modulo (1) the one question about the test fixture, and (2) whether we want to replace the BertBasicWordSplitter with this (or just put a comment there to that effect, or whether we'll just wipe that code out eventually anyway).
class TestPretrainedTransformerTokenizer(AllenNlpTestCase):
    def test_splits_into_wordpieces(self):
        tokenizer = PretrainedTransformerTokenizer('bert-base-cased', do_lowercase=False)
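A hedged sketch (not the actual test body from the PR) of how such a test might continue; the example sentence and assertion are illustrative only:

```python
        # Illustrative continuation, not the real test: check that an
        # out-of-vocabulary word gets split into '##'-prefixed wordpieces.
        sentence = "AllenNLP is great"  # hypothetical example sentence
        tokens = [token.text for token in tokenizer.tokenize(sentence)]
        assert any(text.startswith("##") for text in tokens)
```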
do we want to download and use the full bert-base-cased tokenizer in our test (it's not that big, I guess), or would we rather rely on a smaller test fixture?
It would be good not to have the dependency on the web for the test, but I'm pretty sure it wouldn't work, because we rely on pytorch-transformers' AutoTokenizer.
Well, maybe it would work if I make sure the filename that I give for the fixture passes their heuristics for which actual tokenizer to use, but (1) I'm not sure that I can, because their heuristics might not allow it, and (2) that would make the test too dependent on the internals of their implementation.
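For context, a minimal sketch (not from the PR) of the AutoTokenizer behavior being referred to, assuming the standard pytorch-transformers API:

```python
# Minimal sketch of the AutoTokenizer dispatch referred to above; the pretrained
# name/path string is what determines which concrete tokenizer class is built.
from pytorch_transformers import AutoTokenizer

# "bert" in the name routes to the BERT tokenizer; a local test fixture would have
# to satisfy the same naming heuristics to be loaded the same way.
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
print(tokenizer.tokenize("AllenNLP is great"))
```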
The BertBasicWordSplitter is for a different use case, where you have mismatched tokenization and indexing. This can't actually replace that code. We should rename all of that mismatched code eventually, though, to make it more obvious what's going on and when you should use each option.
@@ -131,6 +131,7 @@
     'sqlparse>=0.2.4',
     'word2number>=1.1',
     'pytorch-pretrained-bert>=0.6.0',
+    'pytorch-transformers @ https://api.github.com/repos/huggingface/pytorch-transformers/tarball/a7b4cfe9194bf93c7044a42c9f1281260ce6279e',
Hi @matt-gardner, why not use pytorch-transformers==1.1.0?
PyPI won't allow a direct URL dependency like this one, as discussed in pypa/pip#6301.
An error like the following is raised when running twine upload:
HTTPError: 400 Client Error: Invalid value for requires_dist. Error: Can't have direct dependency: 'pytorch-transformers @ https://api.github.com/repos/huggingface/pytorch-transformers/tarball/a7b4cfe9194bf93c7044a42c9f1281260ce6279e' for url: https://upload.pypi.org/legacy/
Because the auto stuff hadn't been released when I wrote this. It has been now, so this can be updated. PR welcome.
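A sketch of what the updated dependency entry could look like, using the version suggested above; the exact version specifier is an assumption, not the merged change:

```python
# Sketch of the corresponding setup.py change, assuming the pin moves to the
# released package mentioned above (the exact specifier is up to the follow-up PR).
install_requires = [
    # ...
    'pytorch-transformers==1.1.0',
]
```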
@matt-gardner, I'm getting the following error when attempting to install the library, which I believe is caused by this change.
Perhaps we can fix this by upgrading to the recently released pytorch-transformers v1.1.0? https://pypi.org/project/pytorch-transformers/1.1.0/
there is a PR for this; I'm just waiting for it to update, and then I'll merge
Thanks, @joelgrus!
ok, it's merged
* Adding a PretrainedTransformerTokenizer
* pylint
* doc
First (of many) PRs merging our hackathon project into the main repo. This one adds a dependency on the new pytorch-transformers repo, and grabs a tokenizer from there.

This also fills a gap that we had in our data processing API, where there was no good way to get native BERT (or GPT, etc.) tokenization as input to your model - you had to use words, then split them further into wordpieces or byte pairs. With this tokenizer, you can have wordpieces or byte pairs be your base tokens, instead of words.
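As a hedged illustration of the usage this enables (the import path and example sentence are assumptions; the constructor arguments mirror the test above):

```python
# Sketch of the new usage described above: wordpieces become the base tokens,
# rather than words that get split further downstream.
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer('bert-base-cased', do_lowercase=False)
tokens = tokenizer.tokenize("AllenNLP is great")  # hypothetical sentence
print([token.text for token in tokens])  # wordpiece tokens, not whole words
```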