
Adding a PretrainedTransformerTokenizer #3145

Merged
matt-gardner merged 3 commits into allenai:master from wordpiece_tokenizer on Aug 13, 2019

Conversation

matt-gardner (Contributor)

First (of many) PRs merging our hackathon project into the main repo. This one adds a dependency on the new pytorch-transformers repo, and grabs a tokenizer from there.

This also fills a gap that we had in our data processing API, where there was no good way to get native BERT (or GPT, etc.) tokenization as input to your model - you had to use words, then split them further into wordpieces or byte pairs. With this tokenizer, you can have wordpieces or byte pairs be your base tokens, instead of words.
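For example, here is a minimal sketch of the intended usage (not from the PR itself; the import path and the exact wordpiece output are my assumptions, and tokenize is allennlp's standard Tokenizer API):

from allennlp.data.tokenizers import PretrainedTransformerTokenizer  # assumed import path

tokenizer = PretrainedTransformerTokenizer('bert-base-cased', do_lowercase=False)
tokens = tokenizer.tokenize("AllenNLP is great!")
# The base tokens are now wordpieces rather than words, e.g. something like
# ['Allen', '##NL', '##P', 'is', 'great', '!'] for BERT's cased vocabulary
# (the exact output depends on the underlying pytorch-transformers tokenizer).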

@joelgrus joelgrus left a comment

lgtm, modulo (1) the one question about the test fixture, and (2) do we want to replace the BertBasicWordSplitter with this (or just put a comment there to that effect, or will we just wipe that code out eventually anyway)?


class TestPretrainedTransformerTokenizer(AllenNlpTestCase):
    def test_splits_into_wordpieces(self):
        tokenizer = PretrainedTransformerTokenizer('bert-base-cased', do_lowercase=False)
Contributor

do we want to download and use the full bert-base-cased tokenizer in our test (it's not that big, I guess), or would we rather rely on a smaller test fixture?

Contributor Author

It would be good not to have the dependency on the web for this test, but I'm pretty sure it wouldn't work, because we rely on pytorch-transformers' AutoTokenizer.

Well, maybe it would work if I make sure the filename that I give for the fixture passes their heuristics for which actual tokenizer to use, but (1) I'm not sure that I can, because their heuristics might not allow it, and (2) that would make the test too dependent on the internals of their implementation.

@matt-gardner (Contributor Author)

The BertBasicWordSplitter is for a different use case, where you have mismatched tokenization and indexing. This can't actually replace that code. We should rename all of that mismatched code eventually, though, to make it more obvious what's going on and when you should use each option.
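For context, the mismatched setup contrasts with this PR roughly as follows (a sketch, not from the PR; the import paths and the PretrainedBertIndexer arguments reflect my reading of the allennlp API of the time and may differ):

# Mismatched: tokenize into words with BERT's basic (word-level) splitter,
# then let the token indexer break each word into wordpieces at indexing time.
from allennlp.data.tokenizers import WordTokenizer
from allennlp.data.tokenizers.word_splitter import BertBasicWordSplitter
from allennlp.data.token_indexers import PretrainedBertIndexer

word_tokenizer = WordTokenizer(word_splitter=BertBasicWordSplitter())
indexer = PretrainedBertIndexer(pretrained_model='bert-base-cased', do_lowercase=False)

# Direct: with this PR, wordpieces themselves are the base tokens.
from allennlp.data.tokenizers import PretrainedTransformerTokenizer
tokenizer = PretrainedTransformerTokenizer('bert-base-cased', do_lowercase=False)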

@matt-gardner matt-gardner merged commit 217022f into allenai:master Aug 13, 2019
@matt-gardner matt-gardner deleted the wordpiece_tokenizer branch August 13, 2019 15:12
@@ -131,6 +131,7 @@
'sqlparse>=0.2.4',
'word2number>=1.1',
'pytorch-pretrained-bert>=0.6.0',
'pytorch-transformers @ https://api.github.com/repos/huggingface/pytorch-transformers/tarball/a7b4cfe9194bf93c7044a42c9f1281260ce6279e',
Contributor

Hi @matt-gardner, why not use pytorch-transformers==1.1.0?

PyPI won't allow a direct URL like this one, as discussed in pypa/pip#6301, and an error message like the following is raised when running twine upload:

HTTPError: 400 Client Error: Invalid value for requires_dist. Error: Can't have direct dependency: 'pytorch-transformers @ https://api.github.com/repos/huggingface/pytorch-transformers/tarball/a7b4cfe9194bf93c7044a42c9f1281260ce6279e' for url: https://upload.pypi.org/legacy/

Contributor Author

Because the AutoTokenizer stuff hadn't been released to PyPI when I wrote this. It has been now, so this can be updated. PR welcome.
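The fix amounts to swapping the direct URL for the pinned release in setup.py, roughly like this (a sketch of the suggested change, assuming the v1.1.0 release on PyPI includes the Auto* classes):

-    'pytorch-transformers @ https://api.github.com/repos/huggingface/pytorch-transformers/tarball/a7b4cfe9194bf93c7044a42c9f1281260ce6279e',
+    'pytorch-transformers==1.1.0',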

@brendan-ai2 (Contributor)

@matt-gardner, I'm getting the following error when attempting to install the library which I believe is caused by this change.

brendanr.local ➜  entity_salience git:(integrate-bert) ✗ pip install --editable ../allennlp
Obtaining file:///Users/brendanr/repos/brendan-ai2/allennlp
Direct url requirement (like pytorch-transformers@ https://api.github.com/repos/huggingface/pytorch-transformers/tarball/a7b4cfe9194bf93c7044a42c9f1281260ce6279e) are not allowed for dependencies
You are using pip version 10.0.1, however version 19.2.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
brendanr.local ➜  entity_salience git:(integrate-bert) ✗ echo $?
1

Perhaps we can fix this by upgrading to the recently released pytorch-transformers v1.1.0? https://pypi.org/project/pytorch-transformers/1.1.0/

@joelgrus (Contributor)

there is a PR for this:

#3171 (review)

I'm just waiting for it to update and then I'll merge

@brendan-ai2 (Contributor)

Thanks, @joelgrus!

@joelgrus (Contributor)

ok, it's merged

reiyw pushed a commit to reiyw/allennlp that referenced this pull request Nov 12, 2019
* Adding a PretrainedTransformerTokenizer

* pylint

* doc