Rust: How to handle models with precompiled_charsmap = null #1627

Open

kallebysantos opened this issue Sep 4, 2024 · 5 comments
kallebysantos commented Sep 4, 2024

Hi guys,
I'm currently working on supabase/edge-runtime#368, which aims to add a Rust implementation of pipeline().

While working on the translation task, I found that I can't load a Tokenizer instance for the Xenova/opus-mt-en-fr ONNX model and the other opus-mt-* variants.

I got the following:
let tokenizer_path = Path::new("opus-mt-en-fr/tokenizer.json");
let tokenizer = Tokenizer::from_file(tokenizer_path).unwrap();
thread 'main' panicked at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/normalizers/mod.rs:143:26:
Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/std/src/panicking.rs:662:5
   1: core::panicking::panic_fmt
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/panicking.rs:74:14
   2: core::result::unwrap_failed
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/result.rs:1679:5
   3: core::result::Result<T,E>::expect
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/result.rs:1059:23
   4: <tokenizers::normalizers::NormalizerWrapper as serde::de::Deserialize>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/normalizers/mod.rs:139:25
   5: <serde::de::impls::OptionVisitor<T> as serde::de::Visitor>::visit_some
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/impls.rs:916:9
   6: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_option
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1672:18
   7: serde::de::impls::<impl serde::de::Deserialize for core::option::Option<T>>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/impls.rs:935:9
   8: <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/mod.rs:801:9
   9: <serde_json::de::MapAccess<R> as serde::de::MapAccess>::next_value_seed
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2008:9
  10: serde::de::MapAccess::next_value
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/mod.rs:1874:9
  11: <tokenizers::tokenizer::serialization::TokenizerVisitor<M,N,PT,PP,D> as serde::de::Visitor>::visit_map
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/serialization.rs:132:55
  12: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_struct
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1840:31
  13: tokenizers::tokenizer::serialization::<impl serde::de::Deserialize for tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/serialization.rs:62:9
  14: <tokenizers::tokenizer::_::<impl serde::de::Deserialize for tokenizers::tokenizer::Tokenizer>::deserialize::__Visitor as serde::de::Visitor>::visit_newtype_struct
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:408:21
  15: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_newtype_struct
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1723:9
  16: tokenizers::tokenizer::_::<impl serde::de::Deserialize for tokenizers::tokenizer::Tokenizer>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:408:21
  17: serde_json::de::from_trait
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2478:22
  18: serde_json::de::from_str
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2679:5
  19: tokenizers::tokenizer::Tokenizer::from_file
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:439:25
  20: transformers_rs::pipeline::tasks::seq_to_seq::seq_to_seq
             at ./src/pipeline/tasks/seq_to_seq.rs:51:21
  21: app::main
             at ./examples/app/src/main.rs:78:5
  22: core::ops::function::FnOnce::call_once
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/ops/function.rs:250:5

I know that it occurs because their tokenizer.json file contains the following normalizer:

opus-mt-en-fr:

"normalizer": {
    "type": "Precompiled",
    "precompiled_charsmap": null
}

While the expected format should look something like this:

nllb-200-distilled-600M:

"normalizer": {                           
   "type": "Sequence",                     
   "normalizers": [                        
     {                                     
       "type": "Precompiled",              
       "precompiled_charsmap": "ALQCAACEAAA..."
     }                                    
   ]                                       
 }

Looking at the original Helsinki-NLP/opus-mt-en-fr repository, I noticed that there is no tokenizer.json file for it.

I would like to know: is precompiled_charsmap required to be non-null?

Maybe it could be handled as Option<_>?

Is there a workaround to run these models without changing the internal model files?
How can I handle an exported ONNX model that doesn't have a tokenizer.json file?

@kallebysantos kallebysantos changed the title How to handle models without precompiled_charsmap How to handle models with precompiled_charsmap = null Sep 4, 2024
@kallebysantos kallebysantos changed the title How to handle models with precompiled_charsmap = null Rust: How to handle models with precompiled_charsmap = null Sep 4, 2024
ankane (Contributor) commented Sep 18, 2024

I'm seeing the same error with Python when trying to read the tokenizer from Xenova/speecht5_tts.

wget https://huggingface.co/Xenova/speecht5_tts/resolve/main/tokenizer.json
from tokenizers import Tokenizer

Tokenizer.from_file("tokenizer.json")
thread '<unnamed>' panicked at /Users/runner/work/tokenizers/tokenizers/tokenizers/src/normalizers/mod.rs:143:26:
Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
...
pyo3_runtime.PanicException: Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)

With Tokenizers 0.19.0, this raised an error that could be handled, rather than panicking. It looks like this may be related to #1604.
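
On the Rust side, until the panic is turned back into a recoverable error, one possible stopgap (just a sketch; it only keeps the process alive, it does not make these tokenizers load) is to isolate the call with std::panic::catch_unwind:

```rust
use std::panic;

use tokenizers::Tokenizer;

/// Wrap Tokenizer::from_file so the deserialization panic shown above
/// surfaces as None instead of aborting the whole process.
fn try_load_tokenizer(path: &str) -> Option<Tokenizer> {
    let path = path.to_owned();
    panic::catch_unwind(move || Tokenizer::from_file(&path))
        .ok()? // the panic case ("invalid type: null, expected a borrowed string")
        .ok()  // the ordinary Err case (missing file, invalid JSON, ...)
}
```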

vicantwin commented

I'm also facing the same issue (#1645) with speecht5_tts.

ArthurZucker (Collaborator) commented

I think passing a "" might work. cc @xenova, not sure why you end up with nulls there, but we can probably sync and I can add support for Option!
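
For reference, a minimal sketch of that idea (assuming serde_json is available and that Tokenizer::from_bytes can be used, so the tokenizer.json on disk is left untouched; whether an empty charsmap is actually accepted by the Precompiled normalizer is not verified here):

```rust
use std::{error::Error, fs};

use serde_json::Value;
use tokenizers::Tokenizer;

/// Load a tokenizer.json whose Precompiled normalizer has a null charsmap,
/// patching the value in memory so the model files stay unchanged.
fn load_with_patched_charsmap(path: &str) -> Result<Tokenizer, Box<dyn Error + Send + Sync>> {
    let mut config: Value = serde_json::from_str(&fs::read_to_string(path)?)?;

    // Hypothetical fix-up: replace the null charsmap with an empty string,
    // matching the layout of the opus-mt-en-fr tokenizer.json shown above.
    if let Some(charsmap) = config.pointer_mut("/normalizer/precompiled_charsmap") {
        if charsmap.is_null() {
            *charsmap = Value::String(String::new());
        }
    }

    Tokenizer::from_bytes(serde_json::to_vec(&config)?)
}
```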

kallebysantos (Author) commented Oct 6, 2024

I think passing a "" might work. cc @xenova, not sure why you end up with nulls there, but we can probably sync and I can add support for Option!

The Xenova implementation doesn't read the value directly but iterates over the config normalizers; I think it just ignores the null values.

I agree with you, adding support for Option<_> may solve it.
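
As a self-contained illustration of why that would help (plain serde structs, not the actual tokenizers internals, which expect a borrowed string), a field typed as String rejects null while Option<String> accepts it:

```rust
use serde::Deserialize;

// Hypothetical stand-ins for the normalizer config, for demonstration only.
#[derive(Deserialize, Debug)]
struct Strict {
    precompiled_charsmap: String,
}

#[derive(Deserialize, Debug)]
struct Lenient {
    precompiled_charsmap: Option<String>,
}

fn main() {
    let json = r#"{ "precompiled_charsmap": null }"#;

    // Fails: `null` cannot be deserialized into a plain String.
    assert!(serde_json::from_str::<Strict>(json).is_err());

    // Succeeds: `Option<String>` maps `null` to None.
    let lenient: Lenient = serde_json::from_str(json).unwrap();
    assert_eq!(lenient.precompiled_charsmap, None);
}
```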

@vicantwin
Copy link

I've implemented null support for spm_precompiled at vicantwin/spm_precompiled, including a test for the null case, and all tests pass.

But I need some help with changing this repository, as I'm not entirely familiar with the codebase and am unsure how to implement the necessary changes. Any help would be greatly appreciated.
