
run_squad_trainer doesn't actually use a Rust tokenizer + errors in squad_convert_example_to_features when using a Rust tokenizer #7492

Closed
k8si opened this issue Sep 30, 2020 · 4 comments

@k8si

k8si commented Sep 30, 2020

Environment info

  • transformers version: 3.3.1
  • Platform: Linux-4.14.35-1902.303.4.1.el7uek.x86_64-x86_64-with-oracle-7.8
  • Python version: 3.6.8
  • PyTorch version (GPU?): 1.6.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

and

  • transformers version: 3.3.1
  • Platform: macOS-10.15.6-x86_64-i386-64bit
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.6.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@mfuntowicz
@LysandreJik
@patil-suraj

Information

Model I am using (Bert, XLNet ...): bert-base-uncased

The problem arises when using:

  • the official example scripts: examples/question-answering/run_squad_trainer.py
  • my own modified scripts: the same script with the one-line tokenizer fix described below

The task I am working on is:

  • an official GLUE/SQuAD task: SQuAD 2.0

Firstly, in run_squad_trainer.py, I noticed that the "use_fast" arg doesn't get propagated into the tokenizer instantiation (the call to tokenizer = AutoTokenizer.from_pretrained(...)). It should probably be:

    tokenizer = AutoTokenizer.from_pretrained(
        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
        use_fast=model_args.use_fast
    )

However, when I make that change, the script hangs at the call to squad_convert_examples_to_features in SquadProcessor.

So, I did a little digging. The error is in squad_convert_example_to_features and seems to be due to inconsistencies in the behavior of tokenizer.encode_plus between the Python and Rust tokenizers, detailed below. I've also provided a gist that hopefully elucidates and will help reproduce each of these points. I tested both BertTokenizer/BertTokenizerFast and GPT2Tokenizer/GPT2TokenizerFast.

  1. Python tokenizers handle negative values for stride, while Rust tokenizers throw an exception (OverflowError: can't convert negative int to unsigned); see the sketch after this list.

  2. For sequence pairs, Python tokenizers are fine if the first arg (text) is a list of ints and the second arg (text_pair) is a list of strings. The Rust tokenizers throw an exception ValueError: PreTokenizedInputSequence must be Union[List[str], Tuple[str]]. (Furthermore, the type hints for these arguments indicate that a string, a list of strings, or a list of ints are all fine.)

  3. Leaving the is_split_into_words kwarg at its default value (False) and then running tokenizer.encode_plus(list of ints) works fine for the Python tokenizers. The Rust tokenizers raise an exception ValueError: TextInputSequence must be str (also shown in the sketch below).

  4. When running on a pair of sequences with return_tensors=None, the Python tokenizers return an output dict with input_ids (and other elements) as a list of ints, i.e. input_ids = [id1, id2, ...], whereas the Rust tokenizers return a dict with input_ids as a list of lists of ints, i.e. input_ids = [[id1, id2, ...]]. I also noticed that if you set return_tensors="pt", both the Python and Rust tokenizers return input_ids = tensor([[id1, id2, ...]]).

  5. When return_overflowing_tokens=True, the Python tokenizers return a list of the overflowing tokens at key overflowing_tokens as expected. The Rust tokenizers return them at key overflow_to_sample_mapping which is not documented anywhere, as far as I can tell. The values seem to be different for the Python output vs. Rust output.

  6. Running the same procedure on the same input twice produces the same result each time for the Python tokenizer. For the Rust tokenizer, the result of the second run is different. I am not familiar enough with the Rust tokenizer internals at this point to have a theory as to why this is the case. Anyway, this is the point at which I stopped debugging and decided to file an issue.
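
The following is a minimal, self-contained sketch of points 1 and 3 above. It assumes transformers 3.3.1 and bert-base-uncased; the max_length, stride, and toy question/context values are illustrative, not the exact arguments that squad_convert_example_to_features passes.

    # Minimal sketch of points 1 and 3 (assumes transformers 3.3.1, bert-base-uncased).
    from transformers import BertTokenizer, BertTokenizerFast

    slow = BertTokenizer.from_pretrained("bert-base-uncased")
    fast = BertTokenizerFast.from_pretrained("bert-base-uncased")

    question = "Who wrote Hamlet?"
    context = "William Shakespeare wrote the play Hamlet."
    query_ids = slow.encode(question, add_special_tokens=False)  # a list of ints

    # Point 1: a negative stride is accepted by the Python tokenizer, but the
    # Rust tokenizer raises OverflowError as soon as truncation is enabled.
    slow.encode_plus(question, context, max_length=16, truncation="only_second", stride=-1)
    try:
        fast.encode_plus(question, context, max_length=16, truncation="only_second", stride=-1)
    except OverflowError as e:
        print("Rust tokenizer, negative stride:", e)

    # Point 3: a list of ints as input works for the Python tokenizer
    # (is_split_into_words left at its default, False), but the Rust
    # tokenizer raises ValueError.
    slow.encode_plus(query_ids)
    try:
        fast.encode_plus(query_ids)
    except ValueError as e:
        print("Rust tokenizer, list-of-ints input:", e)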

To reproduce

Steps to reproduce the behavior:

  1. Download the SQuAD 2.0 dataset from the official SQuAD website
  2. Apply the fix to run_squad_trainer.py described above so that a Rust tokenizer is actually instantiated
  3. Run script: python examples/question-answering/run_squad_trainer.py --model_name_or_path bert-base-uncased --use_fast --output_dir "./outputs-squad" --do_train --data_dir "./squad-data" --version_2_with_negative

Also see the gist detailing the issues described above: https://gist.github.com/k8si/a143346dfa875c28d98e95cba1f82f1b

Expected behavior

  1. I expected run_squad_trainer.py to use a Rust tokenizer when the use_fast arg was set to True
  2. I expected SquadProcessor / squad_convert_example_to_features to not raise exceptions when processing SQuAD data with a Rust tokenizer
  3. I expected tokenizer.encode_plus to return the same outputs given the same inputs, regardless of whether the tokenizer is a Rust tokenizer or a Python tokenizer (a minimal version of this parity check is sketched below)
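
To illustrate expectation 3, here is a hypothetical parity check (not part of the gist; the model and inputs are illustrative). Ideally every comparison would print True, both for simple string inputs like these and for the pre-tokenized / negative-stride / overflowing-token variants listed above:

    # Hypothetical parity check: the slow and fast tokenizers should agree
    # on the same (question, context) pair.
    from transformers import BertTokenizer, BertTokenizerFast

    slow = BertTokenizer.from_pretrained("bert-base-uncased")
    fast = BertTokenizerFast.from_pretrained("bert-base-uncased")

    question = "Who wrote Hamlet?"
    context = "William Shakespeare wrote the play Hamlet."

    slow_out = slow.encode_plus(question, context, max_length=32, truncation="only_second")
    fast_out = fast.encode_plus(question, context, max_length=32, truncation="only_second")

    for key in ("input_ids", "token_type_ids", "attention_mask"):
        print(key, slow_out[key] == fast_out[key])
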
@LysandreJik
Member

Hello! Indeed, the Rust tokenizers are not handled by the SQuAD data processing. This is one item we would like to resolve when refactoring the data processing methods, which will soon be implemented directly in datasets rather than in transformers.

Thank you for your detailed issue!

@k8si
Author

k8si commented Oct 15, 2020

For what it's worth, my main concern is the behavioral differences between the Python and Rust tokenizers, not the SQuAD data processing itself (I can easily write my own SQuAD processor, but writing my own tokenizers is more work; the tokenizers are part of why I've been using the library in the first place).

It places a sizeable burden on people coding against the library that the same kwargs result in completely different behaviors (and different errors) in the Rust vs. Python implementations, and these differences often aren't documented. And item 6 above seems like a fundamental error somewhere in the Rust codebase.

Are there any plans to address these issues in the near future? For what it's worth, I've never had an issue with the Python tokenizers. I'd like to use the Rust ones because they're Fast, plus I can train them myself easily, but navigating the API weirdness has been a slog.

@LysandreJik
Member

LysandreJik commented Oct 16, 2020

You're correct that there is currently a mismatch between the Python and Rust tokenizers, and thank you for writing such a detailed issue explaining all of your pain points. We'll keep a close eye on the issues mentioned here as we continue working on and improving the compatibility between the two APIs, which is something we will be focusing on in the near future.

@stale

stale bot commented Dec 15, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Dec 15, 2020
@stale stale bot closed this as completed Dec 24, 2020