optimize inflate::core::init_tree
by precomputing reversed bits
#132
Somewhat inspired by #82. For small inputs, this function can take up more than 50% of the runtime. As noted in the linked issue, in absolute terms this function is not notable, accounting for at most a couple hundred microseconds, but it is still useful to optimize.
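For context, the precomputation in question can be sketched roughly like this (a hedged illustration, not the actual PR code: the table construction, the `reverse_bits` helper, and the 16-bit width are assumptions for the example):

```rust
// Sketch: reverse the low `len` bits of a Huffman code using a small
// precomputed table of byte reversals, instead of reversing bit by bit.

// REVERSED[b] holds byte `b` with its 8 bits reversed.
const fn build_reversed() -> [u8; 256] {
    let mut table = [0u8; 256];
    let mut i = 0;
    while i < 256 {
        let mut b = i as u8;
        let mut r = 0u8;
        let mut bit = 0;
        while bit < 8 {
            r = (r << 1) | (b & 1);
            b >>= 1;
            bit += 1;
        }
        table[i] = r;
        i += 1;
    }
    table
}

const REVERSED: [u8; 256] = build_reversed();

/// Reverse the low `len` (1..=16) bits of `code` with two table lookups.
fn reverse_bits(code: u16, len: u32) -> u16 {
    // Reverse the full 16 bits by swapping the two reversed bytes...
    let rev = ((REVERSED[(code & 0xFF) as usize] as u16) << 8)
        | REVERSED[(code >> 8) as usize] as u16;
    // ...then shift so only the low `len` bits remain.
    rev >> (16 - len)
}

fn main() {
    // 0b1011 reversed over 4 bits is 0b1101.
    assert_eq!(reverse_bits(0b1011, 4), 0b1101);
    assert_eq!(reverse_bits(0b100, 3), 0b001);
}
```

The trade-off discussed below is table size versus lookup count: a 256-entry byte table needs two lookups per code, while a larger table could do it in one.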
There is only one relevant benchmark in the existing suite:
A roughly 25% improvement, though only about 500ns in absolute terms. On the image linked in #82, a benchmark that includes the total time to decode the PNG, total runtime improves by about 13%; in absolute terms that is roughly 0.4ms on the server I use for benchmarking.
I'm a bit skeptical about the precomputed reversed bits table, and have left a GitHub comment below discussing this.
This PR in its current form should close #82.
Potential future work
There likely isn't much benefit to spending more time on this function. If one really wanted to squeeze more performance out of it, doubling the size of the precomputed reversed-bits lookup table should have some impact, though I haven't measured such a change. The `while rev_code < FAST_LOOKUP_SIZE` loop is extremely tight, but I haven't looked for potential savings there. The `for i in 1..16` loop looks very similar to a prefix sum, which can be optimized with SIMD, but such an optimization would likely require unsafe use of intrinsics.