Rust: How to handle models with precompiled_charsmap = null #1627

Open

kallebysantos opened this issue Sep 4, 2024 · 5 comments
kallebysantos commented Sep 4, 2024

Hi guys,
I'm currently working on supabase/edge-runtime#368, which aims to add a Rust implementation of pipeline().

While working on the translation task, I found that I can't load a Tokenizer instance for the Xenova/opus-mt-en-fr ONNX model and the other opus-mt-* variants.

I got the following:
let tokenizer_path = Path::new("opus-mt-en-fr/tokenizer.json");
let tokenizer = Tokenizer::from_file(tokenizer_path).unwrap();
thread 'main' panicked at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/normalizers/mod.rs:143:26:
Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/std/src/panicking.rs:662:5
   1: core::panicking::panic_fmt
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/panicking.rs:74:14
   2: core::result::unwrap_failed
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/result.rs:1679:5
   3: core::result::Result<T,E>::expect
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/result.rs:1059:23
   4: <tokenizers::normalizers::NormalizerWrapper as serde::de::Deserialize>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/normalizers/mod.rs:139:25
   5: <serde::de::impls::OptionVisitor<T> as serde::de::Visitor>::visit_some
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/impls.rs:916:9
   6: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_option
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1672:18
   7: serde::de::impls::<impl serde::de::Deserialize for core::option::Option<T>>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/impls.rs:935:9
   8: <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/mod.rs:801:9
   9: <serde_json::de::MapAccess<R> as serde::de::MapAccess>::next_value_seed
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2008:9
  10: serde::de::MapAccess::next_value
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/mod.rs:1874:9
  11: <tokenizers::tokenizer::serialization::TokenizerVisitor<M,N,PT,PP,D> as serde::de::Visitor>::visit_map
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/serialization.rs:132:55
  12: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_struct
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1840:31
  13: tokenizers::tokenizer::serialization::<impl serde::de::Deserialize for tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/serialization.rs:62:9
  14: <tokenizers::tokenizer::_::<impl serde::de::Deserialize for tokenizers::tokenizer::Tokenizer>::deserialize::__Visitor as serde::de::Visitor>::visit_newtype_struct
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:408:21
  15: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_newtype_struct
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1723:9
  16: tokenizers::tokenizer::_::<impl serde::de::Deserialize for tokenizers::tokenizer::Tokenizer>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:408:21
  17: serde_json::de::from_trait
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2478:22
  18: serde_json::de::from_str
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2679:5
  19: tokenizers::tokenizer::Tokenizer::from_file
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:439:25
  20: transformers_rs::pipeline::tasks::seq_to_seq::seq_to_seq
             at ./src/pipeline/tasks/seq_to_seq.rs:51:21
  21: app::main
             at ./examples/app/src/main.rs:78:5
  22: core::ops::function::FnOnce::call_once
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/ops/function.rs:250:5

I know that it occurs because their tokenizer.json file contains the following normalizer:

opus-mt-en-fr:

"normalizer": {
    "type": "Precompiled",
    "precompiled_charsmap": null
}

While the expected format should look something like this:

nllb-200-distilled-600M:

"normalizer": {                           
   "type": "Sequence",                     
   "normalizers": [                        
     {                                     
       "type": "Precompiled",              
       "precompiled_charsmap": "ALQCAACEAAA..."
     }                                    
   ]                                       
 }

Looking at the original Helsinki-NLP/opus-mt-en-fr repository, I noticed that there is no tokenizer.json file for it.

I would like to know: is precompiled_charsmap required to be non-null?

Maybe it could be handled as Option<_>?

Is there a workaround to run these models without changing the internal model files?
How can I handle an exported ONNX model that doesn't have a tokenizer.json file?

@kallebysantos kallebysantos changed the title How to handle models without precompiled_charsmap How to handle models with precompiled_charsmap = null Sep 4, 2024
@kallebysantos kallebysantos changed the title How to handle models with precompiled_charsmap = null Rust: How to handle models with precompiled_charsmap = null Sep 4, 2024
ankane (Contributor) commented Sep 18, 2024

I'm seeing the same error with Python when trying to read the tokenizer from Xenova/speecht5_tts.

wget https://huggingface.co/Xenova/speecht5_tts/resolve/main/tokenizer.json
from tokenizers import Tokenizer

Tokenizer.from_file("tokenizer.json")
thread '<unnamed>' panicked at /Users/runner/work/tokenizers/tokenizers/tokenizers/src/normalizers/mod.rs:143:26:
Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
...
pyo3_runtime.PanicException: Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)

With Tokenizers 0.19.0, this raised an error that could be handled, rather than panicking. It looks like this may be related to #1604.
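
On the Rust side, until the panic is turned back into a recoverable error, one possible stopgap (just a sketch; it only keeps the process alive, it does not make these tokenizers load) is to isolate the call with std::panic::catch_unwind:

```rust
use std::panic;

use tokenizers::Tokenizer;

/// Wrap Tokenizer::from_file so the deserialization panic shown above
/// surfaces as None instead of aborting the whole process.
fn try_load_tokenizer(path: &str) -> Option<Tokenizer> {
    let path = path.to_owned();
    panic::catch_unwind(move || Tokenizer::from_file(&path))
        .ok()? // the panic case ("invalid type: null, expected a borrowed string")
        .ok()  // the ordinary Err case (missing file, invalid JSON, ...)
}
```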

vicantwin commented

I'm also facing the same issue (#1645) with speecht5_tts.

ArthurZucker (Collaborator) commented

I think passing a "" might work. cc @xenova, not sure why you end up with nulls there, but we can probably sync and I can add support for Option!
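
For reference, a minimal sketch of that idea (assuming serde_json is available and that Tokenizer::from_bytes can be used, so the tokenizer.json on disk is left untouched; whether an empty charsmap is actually accepted by the Precompiled normalizer is not verified here):

```rust
use std::{error::Error, fs};

use serde_json::Value;
use tokenizers::Tokenizer;

/// Load a tokenizer.json whose Precompiled normalizer has a null charsmap,
/// patching the value in memory so the model files stay unchanged.
fn load_with_patched_charsmap(path: &str) -> Result<Tokenizer, Box<dyn Error + Send + Sync>> {
    let mut config: Value = serde_json::from_str(&fs::read_to_string(path)?)?;

    // Hypothetical fix-up: replace the null charsmap with an empty string,
    // matching the layout of the opus-mt-en-fr tokenizer.json shown above.
    if let Some(charsmap) = config.pointer_mut("/normalizer/precompiled_charsmap") {
        if charsmap.is_null() {
            *charsmap = Value::String(String::new());
        }
    }

    Tokenizer::from_bytes(serde_json::to_vec(&config)?)
}
```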

kallebysantos (Author) commented Oct 6, 2024

I think passing a "" might work. cc @xenova, not sure why you end up with nulls there, but we can probably sync and I can add support for Option!

The Xenova implementation doesn't read the value directly but iterates over the config normalizers; I think it just ignores the null values.

I agree with you, adding support for Option<_> may solve it.
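
As a self-contained illustration of why that would help (plain serde structs, not the actual tokenizers internals, which expect a borrowed string), a field typed as String rejects null while Option<String> accepts it:

```rust
use serde::Deserialize;

// Hypothetical stand-ins for the normalizer config, for demonstration only.
#[derive(Deserialize, Debug)]
struct Strict {
    precompiled_charsmap: String,
}

#[derive(Deserialize, Debug)]
struct Lenient {
    precompiled_charsmap: Option<String>,
}

fn main() {
    let json = r#"{ "precompiled_charsmap": null }"#;

    // Fails: `null` cannot be deserialized into a plain String.
    assert!(serde_json::from_str::<Strict>(json).is_err());

    // Succeeds: `Option<String>` maps `null` to None.
    let lenient: Lenient = serde_json::from_str(json).unwrap();
    assert_eq!(lenient.precompiled_charsmap, None);
}
```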

@vicantwin
Copy link

I've implemented null support for spm_precompiled at vicantwin/spm_precompiled, including a test for the null case, and all tests pass.

But I need some help with changing this repository, as I'm not entirely familiar with the codebase and am unsure how to implement the necessary changes. Any help would be greatly appreciated.
