Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize inflate::core::init_tree by precomputing reversed bits #132

Merged
merged 3 commits into from
Jan 10, 2023

Conversation

connorskees
Copy link
Contributor

@connorskees connorskees commented Jan 10, 2023

Somewhat inspired by #82. For small inputs, this function can take up more than 50% of the runtime. As noted in the linked issue, in absolute terms this function is not notable, accounting for at most a couple hundred microseconds, but it is still useful to optimize.

There is only one relevant benchmark in the existing suite:

// before
test oxide::decompress_short_lvl_1      ... bench:       2,966 ns/iter (+/- 67)
// after
test oxide::decompress_short_lvl_1      ... bench:       2,377 ns/iter (+/- 88)

A roughly 25% improvement (though in absolute terms only 500ns). On the image linked in #82, the total runtime improves by about 13%. This benchmark includes the total time to decode the PNG. In absolute terms the performance increase is about 0.4ms on the server I use for benchmarking.

I'm a bit skeptical about the precomputed reversed bits table, and have left a GitHub comment below discussing this.

This PR in its current form should close #82.

Potential future work

There likely isn't much benefit to spending more time on this function. If one really wanted to squeeze more performance out of this, doubling the size of the precomputed reversed bits lookup table should have some impact, though I haven't measured the performance of making such a change. The while rev_code < FAST_LOOKUP_SIZE loop is extremely tight, but I haven't looked at any potential savings there. The for i in 1..16 loop looks extremely similar to prefix-sum which can be optimized with SIMD, but such an optimization would likely require unsafe use of intrinsics.

bug appeared when running `cargo t --all`. the solution to use `.sum()` on the iterator makes it slightly slower, so just completely reverting
@connorskees connorskees changed the title optimize inflate::core::init_tree optimize inflate::core::init_tree by precomputing reversed bits Jan 10, 2023
@oyvindln
Copy link
Collaborator

@notgull does it look fine to you as well now?

@notgull
Copy link

notgull commented Jan 10, 2023

Yeah, it looks good to me. This is basically how I'd do it.

@oyvindln
Copy link
Collaborator

aight. let me know when you want a new release. Can wait a bit if you have more work on the way.

@oyvindln oyvindln merged commit bf66097 into Frommi:master Jan 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

inflate::core::init_tree() is slow
3 participants