Crash on particular emoji with detect_multiple_languages #203

PalmerAL · 2023-12-04T00:21:46Z

Hi, thanks for writing this library, it's really useful!

I'm seeing a crash with particular emoji input on the latest version installed from PyPI. Here's a test case:

from lingua import Language, LanguageDetectorBuilder
langdetector = LanguageDetectorBuilder.from_all_languages().build()

langdetector.detect_multiple_languages_of('test 🙈')

thread '<unnamed>' panicked at 'byte index 6 is not a char boundary; it is inside '🙈' (bytes 5..9) of `test 🙈`', src/lib.rs:436:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "[...]/crash_repro.py", line 4, in <module>
    langdetector.detect_multiple_languages_of('test 🙈')
pyo3_runtime.PanicException: byte index 6 is not a char boundary; it is inside '🙈' (bytes 5..9) of `test 🙈`

pemistahl · 2023-12-06T10:55:54Z

Hi @PalmerAL,

> Hi, thanks for writing this library, it's really useful!

Nice of you to say that, thank you. :) That motivates me to keep maintaining and improving the library.

The cause of the panic is that, whenever detect_multiple_languages_of() returns exactly one DetectionResult, the end index is erroneously calculated as a character offset on the Rust side. It should be a byte offset, which then gets converted to a character offset for the Python bindings. I'm going to release version 2.0.2 shortly, which will fix it.
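
To illustrate the character-versus-byte mismatch, here is a minimal Python sketch. It is not lingua's actual code: the helpers `is_char_boundary` and `byte_to_char_offset` are hypothetical names made up for this example. It only shows why a character count of 6 fails when used as a byte index into the UTF-8 encoding of `test 🙈`, where the emoji occupies bytes 5..9.

```python
# Illustration only: these helpers are hypothetical and not part of lingua.

def is_char_boundary(text: str, byte_index: int) -> bool:
    """Roughly mimic Rust's str::is_char_boundary on the UTF-8 bytes of text."""
    data = text.encode("utf-8")
    if byte_index == 0 or byte_index == len(data):
        return True
    if byte_index < 0 or byte_index > len(data):
        return False
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[byte_index] & 0xC0) != 0x80


def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a byte offset in the UTF-8 encoding into a character offset."""
    if not is_char_boundary(text, byte_offset):
        raise ValueError(f"byte index {byte_offset} is not a char boundary")
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


text = "test 🙈"
char_len = len(text)                  # 6 characters (code points)
byte_len = len(text.encode("utf-8"))  # 9 bytes, the emoji takes 4 of them

# Using the character count as a byte index lands inside the emoji,
# which is the situation the panic message describes.
print(is_char_boundary(text, char_len))     # False
# Using the byte length as the end index is valid and converts cleanly.
print(byte_to_char_offset(text, byte_len))  # 6
```

For ASCII-only input the two offsets coincide, which would explain why the problem only shows up with multi-byte characters such as emoji.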

pemistahl · 2023-12-07T10:03:16Z

Fixed in pemistahl/lingua-rs@72f2d89. Will be released as soon as all issues in milestone 2.0.2 have been resolved.

PalmerAL · 2023-12-10T21:11:25Z

Thanks!

pemistahl added this to the Lingua 2.0.2 milestone Dec 6, 2023

pemistahl closed this as completed Dec 7, 2023

pemistahl mentioned this issue Dec 7, 2023

detect_multiple_languages_of crashes on Arabic #205

Closed
