AutoTokenizer vs. BertTokenizer #17809

Closed

macleginn opened this issue Jun 21, 2022 · 7 comments · Fixed by #17836


System Info

- `transformers` version: 4.20.1
- Platform: Linux-5.17.4-200.fc35.x86_64-x86_64-with-glibc2.34
- Python version: 3.9.7
- Huggingface_hub version: 0.1.0
- PyTorch version (GPU?): 1.9.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no

Who can help?

With transformers-4.20.1 and tokenizers-0.12.1, I get the following behaviour:

In [1]: from transformers import AutoTokenizer, BertTokenizer
In [2]: auto_tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased')
In [3]: auto_tokens = auto_tokenizer('This is a sentence.'.split(), is_split_into_words=True)
In [4]: auto_tokens.word_ids()
Out[4]: [None, 0, 1, 2, 3, 3, None]
In [7]: bert_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
In [9]: bert_tokens = bert_tokenizer('This is a sentence.'.split(), is_split_into_words=True)
In [10]: bert_tokens.word_ids()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-d69d0750fb87> in <module>
----> 1 bert_tokens.word_ids()

/mount/arbeitsdaten33/projekte/tcl/Users/nikolady/embedalign/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in word_ids(self, batch_index)
    350         """
    351         if not self._encodings:
--> 352             raise ValueError("word_ids() is not available when using Python-based tokenizers")
    353         return self._encodings[batch_index].word_ids
    354 

ValueError: word_ids() is not available when using Python-based tokenizers

Regardless of whether this is expected, this behaviour is unintuitive and confusing. For example, am I even getting correct tokenisation by using the more general tokeniser class?
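
A quick sanity check, continuing the session above (not part of the original report, so treat it as an assumption): the two tokenizers should produce identical token ids for this input; only the availability of word_ids() differs.

# Sanity check (an assumption, continuing the session above): slow and fast
# BERT tokenizers produce the same token ids for this sentence.
assert auto_tokens["input_ids"] == bert_tokens["input_ids"]
print(auto_tokens.tokens())
# ['[CLS]', 'this', 'is', 'a', 'sentence', '.', '[SEP]']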

@SaulLu @LysandreJik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

See above.

Expected behavior

Word ids from BertTokenizer or a more informative error message.
macleginn added the bug label on Jun 21, 2022

NielsRogge commented Jun 22, 2022

Hi,

The AutoTokenizer defaults to a fast, Rust-based tokenizer. Hence, when typing AutoTokenizer.from_pretrained("bert-base-uncased"), it will instantiate a BertTokenizerFast behind the scenes. Fast tokenizers support word_ids.

Here you're comparing it to a BertTokenizer, which is a slow, Python-based tokenizer.

So the behaviour is expected, and the error message is pretty self-explanatory, if you ask me.
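
A minimal way to confirm this, using the same checkpoint as above (a sketch, not from the thread):

from transformers import AutoTokenizer, BertTokenizerFast

# Check which class AutoTokenizer actually resolved to.
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
print(type(tokenizer).__name__)  # BertTokenizerFast
print(tokenizer.is_fast)         # True
assert isinstance(tokenizer, BertTokenizerFast)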

macleginn (Author) commented

The docs for AutoTokenizer say,

The tokenizer class to instantiate is selected based on the model_type property of the config object (either passed as an argument or loaded from pretrained_model_name_or_path if possible), or when it’s missing, by falling back to using pattern matching on pretrained_model_name_or_path. <...>

bert — BertTokenizer or BertTokenizerFast (BERT model).

I do not pass a config, so I would assume that AutoTokenizer would instantiate BertTokenizer, which comes first in the list of options. Moreover, the docs for BertTokenizer and BertTokenizerFast do not mention that they are Python- and Rust-based respectively, so the user cannot really figure this out.


SaulLu commented Jun 22, 2022

Hi @macleginn ,

Thanks for letting us know that this behavior isn't intuitive for you!

Regarding the fact that AutoTokenizer.from_pretrained loads a fast tokenizer by default: the documentation for the from_pretrained method describes a use_fast argument that you can change. As indicated there, this argument defaults to True:

use_fast (bool, optional, defaults to True) — Whether or not to try to load the fast version of the tokenizer.
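
For illustration (a sketch, not from the thread), passing use_fast=False makes AutoTokenizer return the slow class:

from transformers import AutoTokenizer

# With use_fast=False, AutoTokenizer returns the slow, Python-based class.
slow_tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased", use_fast=False)
print(type(slow_tokenizer).__name__)  # BertTokenizer
print(slow_tokenizer.is_fast)         # False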

Do you think we should do something differently to make it clearer?

Regarding the error message that you're getting, do you think it would have been clearer to have:

ValueError: word_ids() is not available when using non-fast tokenizers (e.g. XxxTokenizerFast)

macleginn (Author) commented

Hi @SaulLu,

Regarding the error message that you're getting, do you think it would have been clearer to have:

ValueError: word_ids() is not available when using non-fast tokenizers (e.g. XxxTokenizerFast)

Yes, sure. Given this message, I would realise, first, that I need to use BertTokenizerFast if I want word_ids, and second, that this is what AutoTokenizer most likely resolved to.
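
For instance (a sketch, not from the thread), loading the fast class explicitly yields the word ids directly, matching the AutoTokenizer output from the original report:

from transformers import BertTokenizerFast

# The fast class supports word_ids(); output matches the report above.
fast_tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
tokens = fast_tokenizer("This is a sentence.".split(), is_split_into_words=True)
print(tokens.word_ids())  # [None, 0, 1, 2, 3, 3, None]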

Do you think we should do something differently to make it clearer?

Perhaps mention this in the preamble to the model list? Something along the lines of

Instantiate one of the tokenizer classes of the library from a pretrained model vocabulary.

The tokenizer class to instantiate is selected based on the model_type property of the config object (either passed as an argument or loaded from pretrained_model_name_or_path if possible), or when it’s missing, by falling back to using pattern matching on pretrained_model_name_or_path. The fast version of the tokenizer will be selected by default when available (see the use_fast parameter above).

But if you assume that the user should familiarise themselves with the params, it's okay as it is, as long as the error message points to something that can be found in the docs.

asartipi13 commented

Hi,
It seems the AutoTokenizer class has a problem with the character-based model google/canine-s. Even though I set use_fast to True, I still get the ValueError word_ids() is not available when using non-fast tokenizers.

NielsRogge (Contributor) commented

Hi,

CANINE is a bit of a special model: it doesn't have a fast implementation, since it's character-based (the Rust implementations only exist for subword tokenization algorithms like WordPiece, BPE, etc.). I'd recommend just using CanineTokenizer.

thomas-ferraz commented

Hello, using CanineTokenizer doesn't solve the problem: it doesn't have a "Fast" version with word_ids() implemented.
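
Since CANINE emits exactly one token per character, word ids can be reconstructed by hand for pre-split input. A minimal sketch (not from the thread, and assuming CanineTokenizer's default [CLS] ... [SEP] wrapping):

from transformers import CanineTokenizer

tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
words = "This is a sentence.".split()
encoding = tokenizer(" ".join(words))

def manual_word_ids(words):
    # [CLS] first, then one id per character of each word; the None after
    # each word covers the separating space (and [SEP] after the last word).
    ids = [None]
    for i, word in enumerate(words):
        ids.extend([i] * len(word))
        ids.append(None)
    return ids

word_ids = manual_word_ids(words)
assert len(word_ids) == len(encoding["input_ids"])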
