Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util::split_at panics with multibyte characters #453

Closed
kyoheiu opened this issue Oct 29, 2022 · 3 comments · Fixed by #456
Closed

util::split_at panics with multibyte characters #453

kyoheiu opened this issue Oct 29, 2022 · 3 comments · Fixed by #456
Labels

Comments

@kyoheiu
Copy link
Contributor

kyoheiu commented Oct 29, 2022

Sorry if duplicated.

To reproduce panic, run the following code:

use syntect::easy::HighlightLines;
use syntect::highlighting::{Style, ThemeSet};
use syntect::parsing::SyntaxSet;
use syntect::util::{as_24_bit_terminal_escaped, split_at, LinesWithEndings};

fn main() {
    let ps = SyntaxSet::load_defaults_newlines();
    let ts = ThemeSet::load_defaults();

    let syntax = ps.find_syntax_by_extension("rs").unwrap();
    let mut h = HighlightLines::new(syntax, &ts.themes["base16-ocean.dark"]);

    let s = "日本の首都は東京です";
    for line in LinesWithEndings::from(s) {
        let ranges: Vec<(Style, &str)> = h.highlight_line(line, &ps).unwrap();
        //thread 'main' panicked at 'byte index 4 is not a char boundary; it is inside '本' (bytes 3..6) of `日本の首都は東京です`', library/core/src/str/mod.rs:127:5
        let ranges = split_at(&ranges, 4);
        let escaped = as_24_bit_terminal_escaped(&ranges.0, false);
        print!("{}", escaped);
        println!();
    }
}

I think it would be better to return Result<(Vec..., Vec...), Error> instead of (Vec... , Vec...) to avoid this type of panic.

@kyoheiu
Copy link
Contributor Author

kyoheiu commented Oct 29, 2022

Seems this panic comes from this line:

        let (sa, sb) = rest[0].1.split_at(rest_split_i);

[str - Rust](https://doc.rust-lang.org/std/primitive.str.html#method.split_at)

Panics if mid is not on a UTF-8 code point boundary, or if it is past the end of the last code point of the string slice.

So it may be resolved by using is_char_boundary() ([str - Rust](https://doc.rust-lang.org/std/primitive.str.html#method.is_char_boundary)), but depends on how you think about this.

@Enselic
Copy link
Collaborator

Enselic commented Nov 2, 2022

Thanks for reporting. Would be great if you could create a PR with a proposed fix together with a regression test.

@kyoheiu
Copy link
Contributor Author

kyoheiu commented Nov 3, 2022

Created 2 PR to fix this issue: One is to fix this function to return Result, another to return the same type as before, just decreases the index to avoid panic. Which is preferred I cannot say, so please check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants