AutoTokenizer vs. BertTokenizer #17809

Closed

macleginn opened this issue Jun 21, 2022 · 7 comments · Fixed by #17836


System Info

- `transformers` version: 4.20.1
- Platform: Linux-5.17.4-200.fc35.x86_64-x86_64-with-glibc2.34
- Python version: 3.9.7
- Huggingface_hub version: 0.1.0
- PyTorch version (GPU?): 1.9.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no

Who can help?

With transformers-4.20.1 and tokenizers-0.12.1, I get the following behaviour:

In [1]: from transformers import AutoTokenizer, BertTokenizer
In [2]: auto_tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased')
In [3]: auto_tokens = auto_tokenizer('This is a sentence.'.split(), is_split_into_words=True)
In [4]: auto_tokens.word_ids()
Out[4]: [None, 0, 1, 2, 3, 3, None]
In [7]: bert_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
In [9]: bert_tokens = bert_tokenizer('This is a sentence.'.split(), is_split_into_words=True)
In [10]: bert_tokens.word_ids()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-d69d0750fb87> in <module>
----> 1 bert_tokens.word_ids()

/mount/arbeitsdaten33/projekte/tcl/Users/nikolady/embedalign/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in word_ids(self, batch_index)
    350         """
    351         if not self._encodings:
--> 352             raise ValueError("word_ids() is not available when using Python-based tokenizers")
    353         return self._encodings[batch_index].word_ids
    354 

ValueError: word_ids() is not available when using Python-based tokenizers

Regardless of whether this is expected, this behaviour is unintuitive and confusing. For example, am I even getting correct tokenisation by using the more general tokeniser class?
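
A quick sanity check, continuing the session above (not part of the original report, so treat it as an assumption): the two tokenizers should produce identical token ids for this input; only the availability of word_ids() differs.

# Sanity check (an assumption, continuing the session above): slow and fast
# BERT tokenizers produce the same token ids for this sentence.
assert auto_tokens["input_ids"] == bert_tokens["input_ids"]
print(auto_tokens.tokens())
# ['[CLS]', 'this', 'is', 'a', 'sentence', '.', '[SEP]']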

@SaulLu @LysandreJik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

See above.

Expected behavior

Word ids from BertTokenizer or a more informative error message.
macleginn added the bug label on Jun 21, 2022

NielsRogge commented Jun 22, 2022

Hi,

The AutoTokenizer defaults to a fast, Rust-based tokenizer. Hence, when typing AutoTokenizer.from_pretrained("bert-base-uncased"), it will instantiate a BertTokenizerFast behind the scenes. Fast tokenizers support word_ids.

Here you're comparing it to a BertTokenizer, which is a slow, Python-based tokenizer.

So the behaviour is expected, and the error message is pretty self-explanatory, if you ask me.
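
A minimal way to confirm this, using the same checkpoint as above (a sketch, not from the thread):

from transformers import AutoTokenizer, BertTokenizerFast

# Check which class AutoTokenizer actually resolved to.
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
print(type(tokenizer).__name__)  # BertTokenizerFast
print(tokenizer.is_fast)         # True
assert isinstance(tokenizer, BertTokenizerFast)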

macleginn (Author) commented

The docs for AutoTokenizer say,

The tokenizer class to instantiate is selected based on the model_type property of the config object (either passed as an argument or loaded from pretrained_model_name_or_path if possible), or when it’s missing, by falling back to using pattern matching on pretrained_model_name_or_path. <...>

bert — BertTokenizer or BertTokenizerFast (BERT model).

I do not pass a config, so I would assume that AutoTokenizer would instantiate BertTokenizer, which comes first in the list of options. Moreover, the docs for BertTokenizer and BertTokenizerFast do not mention that they are Python- and Rust-based respectively, so the user cannot really figure this out.


SaulLu commented Jun 22, 2022

Hi @macleginn ,

Thanks for letting us know that this behavior isn't intuitive for you!

Regarding the fact that AutoTokenizer.from_pretrained loads a fast tokenizer by default: the documentation for the from_pretrained method describes a use_fast argument that you can change. As indicated there, this argument defaults to True:

use_fast (bool, optional, defaults to True) — Whether or not to try to load the fast version of the tokenizer.
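
For illustration (a sketch, not from the thread), passing use_fast=False makes AutoTokenizer return the slow class:

from transformers import AutoTokenizer

# With use_fast=False, AutoTokenizer returns the slow, Python-based class.
slow_tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased", use_fast=False)
print(type(slow_tokenizer).__name__)  # BertTokenizer
print(slow_tokenizer.is_fast)         # False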

Do you think we should do something differently to make it clearer?

Regarding the error message that you're getting, do you think it would have been clearer to have:

ValueError: word_ids() is not available when using non-fast tokenizers (e.g. XxxTokenizerFast)

macleginn (Author) commented

Hi @SaulLu,

Regarding the error message that you're getting, do you think it would have been clearer to have:

ValueError: word_ids() is not available when using non-fast tokenizers (e.g. XxxTokenizerFast)

Yes, sure. Given this message, I would realise, first, that I need to use BertTokenizerFast if I want word_ids, and second, that this is what AutoTokenizer most likely resolved to.
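
For instance (a sketch, not from the thread), loading the fast class explicitly yields the word ids directly, matching the AutoTokenizer output from the original report:

from transformers import BertTokenizerFast

# The fast class supports word_ids(); output matches the report above.
fast_tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
tokens = fast_tokenizer("This is a sentence.".split(), is_split_into_words=True)
print(tokens.word_ids())  # [None, 0, 1, 2, 3, 3, None]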

Do you think we should do something differently to make it clearer?

Perhaps mention this in the preamble to the model list? Something along the lines of

Instantiate one of the tokenizer classes of the library from a pretrained model vocabulary.

The tokenizer class to instantiate is selected based on the model_type property of the config object (either passed as an argument or loaded from pretrained_model_name_or_path if possible), or when it’s missing, by falling back to using pattern matching on pretrained_model_name_or_path. The fast version of the tokenizer will be selected by default when available (see the use_fast parameter above).

But if you assume that the user should familiarise themselves with the params, it's okay as it is, as long as the error message points to something that can be found in the docs.

asartipi13 commented

Hi,
It seems the AutoTokenizer class has a problem with the character-based model google/canine-s. Even though I set use_fast to True, I still get the ValueError word_ids() is not available when using non-fast tokenizers.

NielsRogge (Contributor) commented

Hi,

CANINE is a bit of a special model: it doesn't have a fast implementation, since it's character-based (the Rust implementations only exist for subword tokenization algorithms like WordPiece, BPE, etc.). I'd recommend just using CanineTokenizer.

thomas-ferraz commented

Hello, using CanineTokenizer doesn't solve the problem: it doesn't have a "Fast" version with word_ids() implemented.
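
Since CANINE emits exactly one token per character, word ids can be reconstructed by hand for pre-split input. A minimal sketch (not from the thread, and assuming CanineTokenizer's default [CLS] ... [SEP] wrapping):

from transformers import CanineTokenizer

tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
words = "This is a sentence.".split()
encoding = tokenizer(" ".join(words))

def manual_word_ids(words):
    # [CLS] first, then one id per character of each word; the None after
    # each word covers the separating space (and [SEP] after the last word).
    ids = [None]
    for i, word in enumerate(words):
        ids.extend([i] * len(word))
        ids.append(None)
    return ids

word_ids = manual_word_ids(words)
assert len(word_ids) == len(encoding["input_ids"])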
