perf: fix SIMD-inlining #131

Merged
merged 2 commits into from
Apr 18, 2023

AaronO
Contributor

@AaronO AaronO commented Apr 13, 2023

Drastically improving throughput on larger inputs (3x+ for large URIs or header-values)

There are 2 optimizations in this PR:

  1. Removing two unnecessary instructions when computing trailing_zeros / bytes validated.
    We don't need to OR the upper half of the register with 0xFF; we can instead compute trailing zeros on only the meaningful bits by using eax (u32) instead of rax (u64) and ax (u16) instead of eax (u32) for AVX2 and SSE4.2 respectively.
  2. Correctly scoping target_feature attributes so that the SIMD validators can be inlined; when they are called in a loop, we benefit from greater register reuse, etc. See:
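
To make the two optimizations above concrete, here is a minimal Rust sketch. It is not the code from this PR: the names (valid_len_widened, valid_len_narrow, non_space_prefix_16, non_space_prefix) and the "scan until the first space" validator are hypothetical, and SSE2 stands in for SSE4.2/AVX2 so the example runs on any x86_64 machine. It assumes the movemask has a set bit for every byte that failed validation, so the number of trailing zeros equals the number of validated bytes.

```rust
/// Assumed shape of the older approach: widen the 16-bit mask to 32 bits and
/// OR the upper half with ones so that an all-zero mask still counts 16.
pub fn valid_len_widened(mask: u16) -> u32 {
    (0xffff_0000u32 | u32::from(mask)).trailing_zeros()
}

/// Narrow approach: count trailing zeros on the 16 meaningful bits only.
/// `u16::trailing_zeros` already returns 16 for a zero mask.
pub fn valid_len_narrow(mask: u16) -> u32 {
    mask.trailing_zeros()
}

#[cfg(target_arch = "x86_64")]
mod sse {
    use std::arch::x86_64::*;

    /// Per-block validator (hypothetical): number of leading non-space bytes
    /// in the 16 bytes at `ptr`. It carries the same target feature as its
    /// caller, so the compiler is allowed to inline it into the loop.
    #[inline]
    #[target_feature(enable = "sse2")]
    unsafe fn non_space_prefix_16(ptr: *const u8) -> usize {
        let block = _mm_loadu_si128(ptr as *const __m128i);
        let spaces = _mm_cmpeq_epi8(block, _mm_set1_epi8(b' ' as i8));
        // One bit per byte; keep only the 16 meaningful bits.
        let mask = _mm_movemask_epi8(spaces) as u16;
        mask.trailing_zeros() as usize
    }

    /// Loop over full 16-byte blocks, then finish the tail byte by byte.
    #[target_feature(enable = "sse2")]
    pub unsafe fn non_space_prefix(buf: &[u8]) -> usize {
        let mut i = 0;
        while i + 16 <= buf.len() {
            let n = non_space_prefix_16(buf.as_ptr().add(i));
            i += n;
            if n < 16 {
                return i;
            }
        }
        buf[i..]
            .iter()
            .position(|&b| b == b' ')
            .map_or(buf.len(), |p| i + p)
    }
}

/// Portable entry point: index of the first space, or `buf.len()` if none.
pub fn non_space_prefix(buf: &[u8]) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            // SAFETY: SSE2 support was checked at runtime just above.
            return unsafe { sse::non_space_prefix(buf) };
        }
    }
    buf.iter().position(|&b| b == b' ').unwrap_or(buf.len())
}
```

In this sketch the feature check happens once at the entry point, and everything below it shares one target_feature scope. If only the inner per-block function were annotated, the compiler could not inline it into a caller compiled without that feature, which is the kind of situation the second point appears to describe.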

Benchmarks

Summary table

(Disclaimer: aggregated by ChatGPT, which "computed" the ratio rows; they aren't exactly correct but are close enough.)

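| Test (ns/iter) | 128b | 256b | 512b | 1024b | 2048b | 4096b |
| --- | --- | --- | --- | --- | --- | --- |
| Header (before) | 38 | 66 | 123 | 263 | 484 | 946 |
| URI (before) | 19 | 44 | 116 | 237 | 465 | 937 |
| Header (after) | 30 | 39 | 55 | 88 | 193 | 300 |
| URI (after) | 12 | 20 | 35 | 65 | 127 | 270 |
| Header ratio (improvement) | ~1.5x | ~1.5x | ~2.0x | ~3.0x | ~2.5x | ~3.0x |
| URI ratio (improvement) | ~1.5x | ~2.0x | ~3.5x | ~3.5x | ~3.5x | ~3.5x |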
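
Since the ratio rows are only approximate, they can be recomputed directly from the before/after timings. The following is a small sketch in Rust; the helper names (speedups, format_row) and the constants are made up for this example, and the numbers are copied from the benchmark figures in this description.

```rust
// Timings in ns/iter for 128b, 256b, 512b, 1024b, 2048b and 4096b inputs,
// copied from the benchmark results in this PR description.
pub const HEADER_BEFORE: [u32; 6] = [38, 66, 123, 263, 484, 946];
pub const HEADER_AFTER: [u32; 6] = [30, 39, 55, 88, 193, 300];
pub const URI_BEFORE: [u32; 6] = [19, 44, 116, 237, 465, 937];
pub const URI_AFTER: [u32; 6] = [12, 20, 35, 65, 127, 270];

/// Speedup per input size: time before divided by time after.
pub fn speedups(before: &[u32], after: &[u32]) -> Vec<f64> {
    before
        .iter()
        .zip(after)
        .map(|(&b, &a)| f64::from(b) / f64::from(a))
        .collect()
}

/// Render one row of ratios, rounded to two decimals.
pub fn format_row(label: &str, ratios: &[f64]) -> String {
    let cells: Vec<String> = ratios.iter().map(|r| format!("{:.2}x", r)).collect();
    format!("{}: {}", label, cells.join(" "))
}
```

Calling format_row("Header", &speedups(&HEADER_BEFORE, &HEADER_AFTER)) gives "Header: 1.27x 1.69x 2.24x 2.99x 2.51x 3.15x", and the URI row comes out as "URI: 1.58x 2.20x 3.31x 3.65x 3.66x 3.47x". The speedup grows with input size and levels off at roughly 3x to 3.7x for inputs of 1024 bytes and above.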

Raw benches

before:
test header/value_128b ... bench:           38 ns/iter (+/- 3)
test header/value_256b ... bench:           66 ns/iter (+/- 0)
test header/value_512b ... bench:           123 ns/iter (+/- 2)
test header/value_1024b ... bench:          263 ns/iter (+/- 13)
test header/value_2048b ... bench:          484 ns/iter (+/- 19)
test header/value_4096b ... bench:          946 ns/iter (+/- 7)

test uri/uri_128b ... bench:          19 ns/iter (+/- 3)
test uri/uri_256b ... bench:          44 ns/iter (+/- 1)
test uri/uri_512b ... bench:         116 ns/iter (+/- 1)
test uri/uri_1024b ... bench:         237 ns/iter (+/- 3)
test uri/uri_2048b ... bench:         465 ns/iter (+/- 3)
test uri/uri_4096b ... bench:         937 ns/iter (+/- 58)

after:
test header/value_128b ... bench:           30 ns/iter (+/- 1)
test header/value_256b ... bench:           39 ns/iter (+/- 1)
test header/value_512b ... bench:           55 ns/iter (+/- 2)
test header/value_1024b ... bench:          88 ns/iter (+/- 4)
test header/value_2048b ... bench:          193 ns/iter (+/- 49)
test header/value_4096b ... bench:          300 ns/iter (+/- 4)

test uri/uri_128b ... bench:          12 ns/iter (+/- 3)
test uri/uri_256b ... bench:          20 ns/iter (+/- 0)
test uri/uri_512b ... bench:          35 ns/iter (+/- 1)
test uri/uri_1024b ... bench:          65 ns/iter (+/- 4)
test uri/uri_2048b ... bench:         127 ns/iter (+/- 2)
test uri/uri_4096b ... bench:         270 ns/iter (+/- 36)

AaronO added a commit to AaronO/httparse that referenced this pull request Apr 14, 2023
@seanmonstar seanmonstar merged commit d745bd2 into seanmonstar:master Apr 18, 2023
AaronO added a commit to AaronO/httparse that referenced this pull request Apr 18, 2023
@AaronO AaronO deleted the perf/simd-inlining branch April 24, 2023 19:02