Optimize `escape_ascii` #125340
Conversation
rustbot has assigned @Mark-Simulacrum. Use `r?` to explicitly pick a reviewer.
r? @Kobzol
I'm probably not the best person to review this, but I can try. I have the same question as here, though: do you have some (micro)benchmarks to show that this is an improvement? :)
(Force-pushed from 6bfb89d to 8b94af3.)
@Kobzol, what's the best way to do a benchmark for this? Just create a standalone crate with two versions of this function, or is there a recommended way to test against different commits in this repo?
Well, that depends. On the microbenchmark side, you could show, e.g. on Godbolt, that this produces "objectively" better assembly. On the macrobenchmark side, you would probably bring some program that is actually improved by this change. Usually people have some explicit motivation for doing these kinds of optimizations, demonstrated by a change either in codegen or an improvement for some real-world code.
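For reference, one low-ceremony option before reaching for Criterion is a standalone crate with both versions behind feature flags, timed with a minimal std-only harness. This is an illustrative sketch, not project tooling; the two functions here are hypothetical stand-ins that both delegate to the stable `u8::escape_ascii`:

```rust
use std::hint::black_box;
use std::time::Instant;

// Hypothetical stand-ins for the two implementations under comparison;
// in a real harness these would be the "before" and "after" functions.
fn escape_len_before(data: &[u8]) -> usize {
    data.iter().flat_map(|b| b.escape_ascii()).count()
}

fn escape_len_after(data: &[u8]) -> usize {
    data.iter().flat_map(|b| b.escape_ascii()).count()
}

fn bench(name: &str, mut f: impl FnMut() -> usize) {
    // Warm up, then time a fixed number of iterations.
    for _ in 0..1_000 {
        black_box(f());
    }
    let start = Instant::now();
    let mut checksum = 0;
    for _ in 0..10_000 {
        checksum += black_box(f());
    }
    println!("{name}: {:?} (checksum {checksum})", start.elapsed());
}

fn main() {
    // 4 KiB covering every byte value, so all escape paths are exercised.
    let data: Vec<u8> = (0u8..=255).cycle().take(4096).collect();
    bench("before", || escape_len_before(&data));
    bench("after", || escape_len_after(&data));
}
```

The `checksum` accumulator keeps the optimizer from deleting the loop body entirely; `black_box` alone is not always enough for pure functions.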
I have updated the Godbolt link in the PR description to reflect the current changes, i.e. 3 fewer jumps and 7 fewer instructions. I have also done a micro benchmark using the following source:

```rust
#![feature(ascii_char)]
#![feature(ascii_char_variants)]
#![feature(let_chains)]
#![feature(inline_const)]
#![feature(const_option)]

use core::ascii;
use core::ops::Range;
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, PlotConfiguration};

const HEX_DIGITS: [ascii::Char; 16] = *b"0123456789abcdef".as_ascii().unwrap();

#[inline]
const fn backslash<const N: usize>(a: ascii::Char) -> ([ascii::Char; N], Range<u8>) {
    const { assert!(N >= 2) };
    let mut output = [ascii::Char::Null; N];
    output[0] = ascii::Char::ReverseSolidus;
    output[1] = a;
    (output, 0..2)
}

#[inline]
const fn escape_ascii_before<const N: usize>(byte: u8) -> ([ascii::Char; N], Range<u8>) {
    const { assert!(N >= 4) };
    match byte {
        b'\t' => backslash(ascii::Char::SmallT),
        b'\r' => backslash(ascii::Char::SmallR),
        b'\n' => backslash(ascii::Char::SmallN),
        b'\\' => backslash(ascii::Char::ReverseSolidus),
        b'\'' => backslash(ascii::Char::Apostrophe),
        b'\"' => backslash(ascii::Char::QuotationMark),
        byte => {
            let mut output = [ascii::Char::Null; N];
            if let Some(c) = byte.as_ascii()
                && !byte.is_ascii_control()
            {
                output[0] = c;
                (output, 0..1)
            } else {
                let hi = HEX_DIGITS[(byte >> 4) as usize];
                let lo = HEX_DIGITS[(byte & 0xf) as usize];
                output[0] = ascii::Char::ReverseSolidus;
                output[1] = ascii::Char::SmallX;
                output[2] = hi;
                output[3] = lo;
                (output, 0..4)
            }
        }
    }
}

#[inline]
const fn escape_ascii_after<const N: usize>(byte: u8) -> ([ascii::Char; N], Range<u8>) {
    const { assert!(N >= 4) };
    let mut output = [ascii::Char::Null; N];
    // NOTE: This `match` is roughly ordered by the frequency of ASCII
    // characters for performance.
    match byte.as_ascii() {
        Some(
            c @ ascii::Char::QuotationMark
            | c @ ascii::Char::Apostrophe
            | c @ ascii::Char::ReverseSolidus,
        ) => backslash(c),
        Some(c) if !byte.is_ascii_control() => {
            output[0] = c;
            (output, 0..1)
        }
        Some(ascii::Char::LineFeed) => backslash(ascii::Char::SmallN),
        Some(ascii::Char::CarriageReturn) => backslash(ascii::Char::SmallR),
        Some(ascii::Char::CharacterTabulation) => backslash(ascii::Char::SmallT),
        _ => {
            let hi = HEX_DIGITS[(byte >> 4) as usize];
            let lo = HEX_DIGITS[(byte & 0xf) as usize];
            output[0] = ascii::Char::ReverseSolidus;
            output[1] = ascii::Char::SmallX;
            output[2] = hi;
            output[3] = lo;
            (output, 0..4)
        }
    }
}

pub fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("escape_ascii");
    group.sample_size(1000);
    for i in [b'a', b'Z', b'\"', b'\t', b'\n', b'\xff'] {
        let i_s = if let Some(c) = i.as_ascii() {
            format!("{c:?}")
        } else {
            format!("'\\x{i:02x}'")
        };
        group.bench_with_input(BenchmarkId::new("before", &i_s), &i, |b, i| {
            b.iter(|| escape_ascii_before::<4>(*i));
        });
        group.bench_with_input(BenchmarkId::new("after", &i_s), &i, |b, i| {
            b.iter(|| escape_ascii_after::<4>(*i));
        });
    }
    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```

Output and graph attached (unfortunately the graph's Y-axis is not sorted by input).
Your benchmark was executed on a single-byte input? It would be good to also see how it behaves on something larger, e.g. a short/medium/long byte slice, to see the effects in practice.

Could you describe the motivation for this change? If I understand your comment correctly, "frequency of ASCII characters" means how often given characters appear in the input. It makes sense to me to optimize for the common case, which I would expect is that the input does not need to be escaped at all. So my intuition would be to start by first checking whether it's an alphabetic ASCII character, and then continue from there. This optimization seems reasonable, in general; I just wonder if you have some use case where this escaping is an actual bottleneck, so that we could actually see some wins in practice?

Btw, in general, fewer instructions don't necessarily mean that the code will be faster; microarchitecture simulation (e.g. with llvm-mca) gives a better estimate of throughput than instruction count alone.
Hmm. Omitting the non-ASCII case, perhaps this could be done with a lookup table? You could squeeze it down to just 127 bytes if you use the eighth bit to determine whether there should be a backslash, since the escaped character only needs 7 bits. This way, you don't need to worry about ordering things by prevalence. I have no idea what the current codegen looks like, so I don't know whether it'd be much faster, but that feels like the best route to me.
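As an illustrative sketch of this suggestion (my own stable-Rust port, not code from the PR): the table packs the replacement character into the low 7 bits and uses the eighth bit as the "escaped" flag, with `0x80` alone (an "escaped NUL", which never occurs) marking hex escapes:

```rust
// Low 7 bits: replacement character; 8th bit: "needs a backslash escape".
// 0x80 alone marks bytes that must be hex-escaped.
const LOOKUP: [u8; 256] = {
    let mut arr = [0u8; 256];
    let mut i = 0;
    while i < 256 {
        let b = i as u8;
        arr[i] = match b {
            b'\t' => 0x80 | b't',
            b'\r' => 0x80 | b'r',
            b'\n' => 0x80 | b'n',
            b'\\' | b'\'' | b'"' => 0x80 | b,
            0x00..=0x1F | 0x7F..=0xFF => 0x80, // hex-escape marker
            _ => b,                            // printable ASCII, verbatim
        };
        i += 1;
    }
    arr
};

/// Escape one byte the way `u8::escape_ascii` does, driven by the table.
fn escape_byte(b: u8) -> String {
    let e = LOOKUP[b as usize];
    match (e & 0x80 != 0, e & 0x7F) {
        (false, c) => (c as char).to_string(),   // verbatim
        (true, 0) => format!("\\x{b:02x}"),      // hex escape
        (true, c) => format!("\\{}", c as char), // backslash escape
    }
}

fn main() {
    // Cross-check against the standard library for every byte value.
    for b in 0u8..=255 {
        let std_escaped: String = b.escape_ascii().map(char::from).collect();
        assert_eq!(escape_byte(b), std_escaped);
    }
    println!("lookup table matches u8::escape_ascii for all 256 bytes");
}
```

Branching on the flag bits is kept here for clarity; the point of the table is that classification itself becomes a single indexed load.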
I have made some further changes and updated the Godbolt link in the PR description. The instruction count is again slightly lower, and LLVM-MCA now also shows fewer instructions and better IPC and throughput.

I re-ran the previous benchmark with larger inputs (a 100 MB file with random data, and a 100 MB JSON file). The results show no difference between the two functions.

I also ran LLVM-MCA locally for Cortex-M4, and it shows ~25% fewer instructions with ~35% higher throughput, via `cargo asm --features before --lib --target thumbv7em-none-eabihf --att --mca --mca-arg=-mcpu=cortex-m4` for the old version and `cargo asm --features after --lib --target thumbv7em-none-eabihf --att --mca --mca-arg=-mcpu=cortex-m4` for the new one.
I suspect that in the grand scheme of things (escaping strings, rather than single chars), this might not have such a large effect (btw, https://lemire.me/blog/2024/05/31/quickly-checking-whether-a-string-needs-escaping/ might be interesting to you). The code looked a bit more readable before, but no strong opinion on my side. r? libs
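The linked post's core idea, a cheap whole-slice pre-check before any per-byte work, can be sketched on stable Rust like this (illustrative only, not part of the PR):

```rust
use std::borrow::Cow;

// Does this byte require escaping under `escape_ascii` rules?
fn needs_escaping(b: u8) -> bool {
    !(0x20..=0x7E).contains(&b) || matches!(b, b'\\' | b'\'' | b'"')
}

/// Borrow the input unchanged when nothing needs escaping (the common case);
/// only fall back to byte-by-byte escaping when the pre-check finds a hit.
fn escape_fast(bytes: &[u8]) -> Cow<'_, str> {
    if bytes.iter().copied().any(needs_escaping) {
        Cow::Owned(
            bytes
                .iter()
                .flat_map(|b| b.escape_ascii())
                .map(char::from)
                .collect(),
        )
    } else {
        // The pre-check guarantees printable ASCII, so this cannot fail.
        Cow::Borrowed(std::str::from_utf8(bytes).unwrap())
    }
}

fn main() {
    assert!(matches!(escape_fast(b"plain text"), Cow::Borrowed(_)));
    assert_eq!(&*escape_fast(b"a\nb"), "a\\nb");
    println!("fast path taken for fully printable input");
}
```

The scan itself is a tight, branch-predictable loop that autovectorizes well, which is why this tends to win on mostly-clean input even though escaping work is doubled in the worst case.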
The current version really isn't particularly readable, so I don't think I can accept it. However, I found an even better version (at least according to llvm-mca) that is even more readable than the old one: https://rust.godbolt.org/z/8bfWP9aP8 (the top one). Do you want to try that? @rustbot author
@reitermarkus, any updates on this? Thanks!
@reitermarkus I'm closing this due to inactivity, as the PR hasn't been touched by the author in a few months. @rustbot label: +S-inactive
Optimize `escape_ascii` using a lookup table

Based upon my suggestion here: rust-lang#125340 (comment)

Effectively, we can take advantage of the fact that ASCII only needs 7 bits to make the eighth bit store whether the value should be escaped or not. This adds a 256-byte lookup table, but 256 bytes *should* be small enough that very few people will mind, according to my probably not incontrovertible opinion.

The generated assembly isn't clearly better (although it has fewer branches), so I decided to benchmark on three inputs: first on 200 KiB of random data, then on `/bin/cat`, then on the `Cargo.toml` for this repo. In all cases, the generated code ran faster on my machine (an old i7-8700). But if you want to try my benchmarking code for yourself:

<details><summary>Criterion code below. Replace <code>/home/ltdk/rustsrc</code> with the appropriate directory.</summary>

```rust
#![feature(ascii_char)]
#![feature(ascii_char_variants)]
#![feature(const_option)]
#![feature(let_chains)]

use core::ascii;
use core::ops::Range;
use criterion::{criterion_group, criterion_main, Criterion};
use rand::{thread_rng, Rng};

const HEX_DIGITS: [ascii::Char; 16] = *b"0123456789abcdef".as_ascii().unwrap();

#[inline]
const fn backslash<const N: usize>(a: ascii::Char) -> ([ascii::Char; N], Range<u8>) {
    const { assert!(N >= 2) };
    let mut output = [ascii::Char::Null; N];
    output[0] = ascii::Char::ReverseSolidus;
    output[1] = a;
    (output, 0..2)
}

#[inline]
const fn hex_escape<const N: usize>(byte: u8) -> ([ascii::Char; N], Range<u8>) {
    const { assert!(N >= 4) };
    let mut output = [ascii::Char::Null; N];
    let hi = HEX_DIGITS[(byte >> 4) as usize];
    let lo = HEX_DIGITS[(byte & 0xf) as usize];
    output[0] = ascii::Char::ReverseSolidus;
    output[1] = ascii::Char::SmallX;
    output[2] = hi;
    output[3] = lo;
    (output, 0..4)
}

#[inline]
const fn verbatim<const N: usize>(a: ascii::Char) -> ([ascii::Char; N], Range<u8>) {
    const { assert!(N >= 1) };
    let mut output = [ascii::Char::Null; N];
    output[0] = a;
    (output, 0..1)
}

/// Escapes an ASCII character.
///
/// Returns a buffer and the length of the escaped representation.
const fn escape_ascii_old<const N: usize>(byte: u8) -> ([ascii::Char; N], Range<u8>) {
    const { assert!(N >= 4) };
    match byte {
        b'\t' => backslash(ascii::Char::SmallT),
        b'\r' => backslash(ascii::Char::SmallR),
        b'\n' => backslash(ascii::Char::SmallN),
        b'\\' => backslash(ascii::Char::ReverseSolidus),
        b'\'' => backslash(ascii::Char::Apostrophe),
        b'\"' => backslash(ascii::Char::QuotationMark),
        0x00..=0x1F => hex_escape(byte),
        _ => match ascii::Char::from_u8(byte) {
            Some(a) => verbatim(a),
            None => hex_escape(byte),
        },
    }
}

/// Escapes an ASCII character.
///
/// Returns a buffer and the length of the escaped representation.
const fn escape_ascii_new<const N: usize>(byte: u8) -> ([ascii::Char; N], Range<u8>) {
    /// Lookup table helps us determine how to display character.
    ///
    /// Since ASCII characters will always be 7 bits, we can exploit this to store the 8th bit to
    /// indicate whether the result is escaped or unescaped.
    ///
    /// We additionally use 0x80 (escaped NUL character) to indicate hex-escaped bytes, since
    /// escaped NUL will not occur.
    const LOOKUP: [u8; 256] = {
        let mut arr = [0; 256];
        let mut idx = 0;
        loop {
            arr[idx as usize] = match idx {
                // use 8th bit to indicate escaped
                b'\t' => 0x80 | b't',
                b'\r' => 0x80 | b'r',
                b'\n' => 0x80 | b'n',
                b'\\' => 0x80 | b'\\',
                b'\'' => 0x80 | b'\'',
                b'"' => 0x80 | b'"',
                // use NUL to indicate hex-escaped
                0x00..=0x1F | 0x7F..=0xFF => 0x80 | b'\0',
                _ => idx,
            };
            if idx == 255 {
                break;
            }
            idx += 1;
        }
        arr
    };

    let lookup = LOOKUP[byte as usize];

    // 8th bit indicates escape
    let lookup_escaped = lookup & 0x80 != 0;

    // SAFETY: We explicitly mask out the eighth bit to get a 7-bit ASCII character.
    let lookup_ascii = unsafe { ascii::Char::from_u8_unchecked(lookup & 0x7F) };

    if lookup_escaped {
        // NUL indicates hex-escaped
        if matches!(lookup_ascii, ascii::Char::Null) {
            hex_escape(byte)
        } else {
            backslash(lookup_ascii)
        }
    } else {
        verbatim(lookup_ascii)
    }
}

fn escape_bytes(bytes: &[u8], f: impl Fn(u8) -> ([ascii::Char; 4], Range<u8>)) -> Vec<ascii::Char> {
    let mut vec = Vec::new();
    for b in bytes {
        let (buf, range) = f(*b);
        vec.extend_from_slice(&buf[range.start as usize..range.end as usize]);
    }
    vec
}

pub fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("escape_ascii");
    group.sample_size(1000);

    let rand_200k = &mut [0; 200 * 1024];
    thread_rng().fill(&mut rand_200k[..]);
    let cat = include_bytes!("/bin/cat");
    let cargo_toml = include_bytes!("/home/ltdk/rustsrc/Cargo.toml");

    group.bench_function("old_rand", |b| {
        b.iter(|| escape_bytes(rand_200k, escape_ascii_old));
    });
    group.bench_function("new_rand", |b| {
        b.iter(|| escape_bytes(rand_200k, escape_ascii_new));
    });
    group.bench_function("old_bin", |b| {
        b.iter(|| escape_bytes(cat, escape_ascii_old));
    });
    group.bench_function("new_bin", |b| {
        b.iter(|| escape_bytes(cat, escape_ascii_new));
    });
    group.bench_function("old_cargo_toml", |b| {
        b.iter(|| escape_bytes(cargo_toml, escape_ascii_old));
    });
    group.bench_function("new_cargo_toml", |b| {
        b.iter(|| escape_bytes(cargo_toml, escape_ascii_new));
    });

    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```

</details>

My benchmark results:

```
escape_ascii/old_rand        time: [1.6965 ms 1.7006 ms 1.7053 ms]
Found 22 outliers among 1000 measurements (2.20%)
  4 (0.40%) high mild
  18 (1.80%) high severe
escape_ascii/new_rand        time: [1.6749 ms 1.6953 ms 1.7158 ms]
Found 38 outliers among 1000 measurements (3.80%)
  38 (3.80%) high mild
escape_ascii/old_bin         time: [224.59 µs 225.40 µs 226.33 µs]
Found 39 outliers among 1000 measurements (3.90%)
  17 (1.70%) high mild
  22 (2.20%) high severe
escape_ascii/new_bin         time: [164.86 µs 165.63 µs 166.58 µs]
Found 107 outliers among 1000 measurements (10.70%)
  43 (4.30%) high mild
  64 (6.40%) high severe
escape_ascii/old_cargo_toml  time: [23.397 µs 23.699 µs 24.014 µs]
Found 204 outliers among 1000 measurements (20.40%)
  21 (2.10%) high mild
  183 (18.30%) high severe
escape_ascii/new_cargo_toml  time: [16.404 µs 16.438 µs 16.483 µs]
Found 88 outliers among 1000 measurements (8.80%)
  56 (5.60%) high mild
  32 (3.20%) high severe
```

Random: 1.7006 ms => 1.6953 ms (<1% speedup)
Binary: 225.40 µs => 165.63 µs (26% speedup)
Text: 23.699 µs => 16.438 µs (30% speedup)
Follow-up to #124307. CC @joboet
Alternative/addition to #125317.
Based on #124307 (comment), it doesn't look like this function is the cause for the regression, but this change produces even fewer instructions (https://rust.godbolt.org/z/nebzqoveG).
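For context, the routine being optimized backs the stable `u8::escape_ascii` and `<[u8]>::escape_ascii` APIs; a quick demonstration of the observable behavior (independent of this PR's internals):

```rust
fn main() {
    // `escape_ascii` leaves printable ASCII alone, backslash-escapes
    // \t, \r, \n, \\, ' and ", and hex-escapes everything else.
    let escaped: String = b"tab:\t byte:\xff"
        .escape_ascii()
        .map(char::from)
        .collect();
    assert_eq!(escaped, "tab:\\t byte:\\xff");
    println!("{escaped}"); // prints: tab:\t byte:\xff
}
```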