run_squad_trainer doesn't actually use a Rust tokenizer + errors in squad_convert_example_to_features when using a Rust tokenizer #7492
Environment info
transformers version: 3.3.1

Who can help
@mfuntowicz
@LysandreJik
@patil-suraj
Information
Model I am using (Bert, XLNet ...): bert-base-uncased
The problem arises when using: the official example scripts (examples/question-answering/run_squad_trainer.py)
The task I am working on is: an official SQuAD task (SQuAD v2, question answering)
Firstly, in `run_squad_trainer.py`, I noticed that the `use_fast` arg doesn't get propagated into the tokenizer instantiation:

transformers/examples/question-answering/run_squad_trainer.py, line 107 in 0acd1ff
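For reference, the local change I made is roughly the following sketch (excerpted from around the linked line; it assumes the `use_fast` field that the script's ModelArguments already defines):

```python
# In run_squad_trainer.py: forward the --use_fast flag into the tokenizer
# instantiation instead of dropping it.
tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=model_args.use_fast,  # previously not passed, so a Python (slow) tokenizer was always returned
)
```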
However, when I make that change, the script hangs at the call to `squad_convert_examples_to_features` in `SquadProcessor`. So, I did a little digging. The error is in `squad_convert_example_to_features` and seems to be due to inconsistencies in the behavior of `tokenizer.encode_plus` between the Python and Rust tokenizers, detailed below. I've also provided a gist that hopefully elucidates & will help reproduce each of these points; a shorter snippet covering two of them follows this list. I tested both BertTokenizer/BertTokenizerFast and GPT2Tokenizer/GPT2TokenizerFast.

1. Python tokenizers handle negative values for `stride`; Rust tokenizers throw an exception (`OverflowError: can't convert negative int to unsigned`).
2. For sequence pairs, Python tokenizers are fine if the first arg (`text`) is a list of ints and the second arg (`text_pair`) is a list of strings. The Rust tokenizers throw an exception: `ValueError: PreTokenizedInputSequence must be Union[List[str], Tuple[str]]`. (Furthermore, the type hints for these arguments indicate that a string, a list of strings, or a list of ints are all fine.)
3. Leaving the `is_split_into_words` kwarg at its default value (`False`), running `tokenizer.encode_plus(list_of_ints)` works fine for the Python tokenizers. The Rust tokenizers raise `ValueError: TextInputSequence must be str`.
4. When running on a pair of sequences with `return_tensors=None`, the Python tokenizers return an output dict with `input_ids` (and the other elements) as a list of ints, i.e. `input_ids = [id1, id2, ...]`, whereas the Rust tokenizers return a dict with `input_ids` as a list of lists of ints, i.e. `input_ids = [[id1, id2, ...]]`. I also noticed that if you set `return_tensors="pt"`, both the Python and Rust tokenizers return `input_ids = tensor([[id1, id2, ...]])`.
5. With `return_overflowing_tokens=True`, the Python tokenizers return a list of the overflowing tokens at key `overflowing_tokens`, as expected. The Rust tokenizers return them at key `overflow_to_sample_mapping`, which is not documented anywhere as far as I can tell, and the values seem to differ between the Python and Rust output.
6. Running the same procedure on the same input twice produces the same result each time for the Python tokenizer. For the Rust tokenizer, the result of the second run is different. I am not familiar enough with the Rust tokenizer internals at this point to have a theory as to why this is the case.

Anyway, this is the point at which I stopped debugging and decided to file an issue.
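To make the list above more concrete, here is a minimal sketch of points 1 and 3 (simplified from the gist; the model name and inputs are arbitrary, and the commented-out calls are the ones that raise):

```python
from transformers import BertTokenizer, BertTokenizerFast

python_tok = BertTokenizer.from_pretrained("bert-base-uncased")
rust_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")

question = "What is the capital of France?"
context = "Paris is the capital of France."

# Point 1: negative stride. The Python tokenizer accepts it:
python_tok.encode_plus(question, context, max_length=16, truncation=True, stride=-5)
# The Rust tokenizer raises OverflowError: can't convert negative int to unsigned
# rust_tok.encode_plus(question, context, max_length=16, truncation=True, stride=-5)

# Point 3: a list of token ids as `text`, with is_split_into_words left at its default (False).
# The Python tokenizer accepts it:
question_ids = python_tok.encode(question, add_special_tokens=False)
python_tok.encode_plus(question_ids)
# The Rust tokenizer raises ValueError: TextInputSequence must be str
# rust_tok.encode_plus(question_ids)
```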
To reproduce
Steps to reproduce the behavior:
1. Modify `run_squad_trainer.py` as described above to correctly instantiate a Rust tokenizer.
2. Run: `python examples/question-answering/run_squad_trainer.py --model_name_or_path bert-base-uncased --use_fast --output_dir "./outputs-squad" --do_train --data_dir "./squad-data" --version_2_with_negative`
Also see the gist detailing the issues described above: https://gist.github.com/k8si/a143346dfa875c28d98e95cba1f82f1b
Expected behavior
- `run_squad_trainer.py` to use a Rust tokenizer when the `use_fast` arg is set to True
- `SquadProcessor.squad_convert_example_to_features` to not raise exceptions when processing SQuAD data with a Rust tokenizer
- `tokenizer.encode_plus` to return the same outputs given the same inputs, regardless of whether the tokenizer is a Rust tokenizer or a Python tokenizer