Crash on particular emoji with detect_multiple_languages #203

Closed
PalmerAL opened this issue Dec 4, 2023 · 3 comments

PalmerAL commented Dec 4, 2023

Hi, thanks for writing this library, it's really useful!

I'm seeing a crash with particular emoji input on the latest version installed from PyPI, here's a testcase:

from lingua import Language, LanguageDetectorBuilder
langdetector = LanguageDetectorBuilder.from_all_languages().build()

langdetector.detect_multiple_languages_of('test 🙈')

thread '<unnamed>' panicked at 'byte index 6 is not a char boundary; it is inside '🙈' (bytes 5..9) of `test 🙈`', src/lib.rs:436:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "[...]/crash_repro.py", line 4, in <module>
    langdetector.detect_multiple_languages_of('test 🙈')
pyo3_runtime.PanicException: byte index 6 is not a char boundary; it is inside '🙈' (bytes 5..9) of `test 🙈`
@pemistahl
Owner

Hi @PalmerAL,

> Hi, thanks for writing this library, it's really useful!

Nice of you to say that, thank you. :) That motivates me to keep maintaining and improving the library.

The cause of your exception is that, whenever detect_multiple_languages_of() returns exactly one DetectionResult, the end index is erroneously calculated as a character offset on the Rust side. It should be a byte offset instead, which then gets converted to a character offset for the Python bindings. I'm going to release version 2.0.2 shortly, which will fix it.
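
To illustrate the character-versus-byte mismatch described above, here is a minimal Python sketch. It is not the library's code: `is_char_boundary` and `byte_to_char_offset` are hypothetical helpers written for this illustration. The first imitates the kind of boundary check Rust applies when a UTF-8 string is sliced at a byte index, and the second shows one way a byte offset could be converted to a character offset.

```python
def is_char_boundary(text: str, byte_index: int) -> bool:
    """Hypothetical helper: is byte_index the start of a UTF-8 sequence (or the end)?"""
    data = text.encode("utf-8")
    if byte_index < 0 or byte_index > len(data):
        return False
    if byte_index == 0 or byte_index == len(data):
        return True
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[byte_index] & 0xC0) != 0x80


def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Hypothetical helper: convert a valid UTF-8 byte offset to a character offset."""
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


text = "test 🙈"
char_len = len(text)                  # number of characters (code points)
byte_len = len(text.encode("utf-8"))  # number of UTF-8 bytes

print(char_len, byte_len)                   # 6 9
print(is_char_boundary(text, char_len))     # False: byte 6 lies inside the emoji
print(is_char_boundary(text, byte_len))     # True: byte 9 is the end of the string
print(byte_to_char_offset(text, byte_len))  # 6
```

The emoji occupies bytes 5..9, so using the character count 6 as a byte index lands inside it, which matches the panic message in the report. For ASCII-only input both offsets coincide, which would explain why the problem only shows up with multi-byte characters.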

@pemistahl added this to the Lingua 2.0.2 milestone Dec 6, 2023
@pemistahl
Owner

Fixed in pemistahl/lingua-rs@72f2d89. The fix will be released as soon as all issues in milestone 2.0.2 have been resolved.

@PalmerAL
Author

Thanks!
