Granular benchmarks of palette transformations. #460

Merged (5 commits) on Jan 27, 2024

Conversation

anforowicz
Contributor

No description provided.

This commit introduces a new `transform.rs` module and moves
row transformation functions into the new module:

* From `util.rs`:
    - `unpack_bits` (no longer needs to be `pub`)
    - `expand_trns_line`
    - `expand_trns_line16`
    - `expand_trns_and_strip_line16`
* From `mod.rs`:
    - `expand_paletted`
    - `expand_gray_u8`

This commit also renames `util.rs` into `adam7.rs`, because after the
refactoring above this module contains only Adam7-related functionality:

- `struct Adam7Iterator`
- `fn expand_pass`, which operates on already-transformed, but
  still-interlaced rows

This commit is intended to be pure refactoring (i.e. no changes
in behavior or performance are expected).

This ensures that all the public functions in the `transform.rs` module
are infallible.

This commit tweaks `transform.rs` so that all the functions take the
same parameters: `input: &[u8], output: &mut [u8], info: &Info`.

This is achieved by:

1. Taking `info: &Info` instead of
   `trns: Option<&[u8]>, channels: usize` for `expand_trns_line`,
   `expand_trns_line16`, `expand_trns_and_strip_line16`.
2. Removing `trns: Option<Option<&[u8]>>` parameter from
   `expand_paletted` and `expand_gray_u8` by splitting these functions
   into two separate flavors: ones that emit an alpha channel and ones
   that don't.
Instead of deciding which function to use for every row, memoize and
reuse the first decision.

One desirable outcome of this commit is making the public API of the
`transform.rs` module quite thin (just the `TransformFn` type alias and
the `create_transform_fn` function) - this makes this functionality
easier to test and benchmark.
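The thin API described above can be sketched roughly as follows. This is a hypothetical illustration, not the crate's actual code: the stand-in `Info` struct, the function names `expand_paletted_into_rgb8` / `expand_paletted_into_rgba8`, and the placeholder transformation bodies are all assumptions made for the sake of the example.

```rust
/// Minimal stand-in for the decoder's `Info` struct (assumption: the
/// real type carries much more state; only `trns` matters here).
struct Info {
    trns: Option<Vec<u8>>,
}

/// All row transformations share one signature, so a single function
/// pointer type can describe any of them.
type TransformFn = fn(input: &[u8], output: &mut [u8], info: &Info);

/// Flavor used when no tRNS chunk is present (no alpha channel emitted).
/// Placeholder body: replicates each index 3 times to fake RGB expansion.
fn expand_paletted_into_rgb8(input: &[u8], output: &mut [u8], _info: &Info) {
    for (i, &px) in input.iter().enumerate() {
        output[i * 3..i * 3 + 3].fill(px);
    }
}

/// Flavor used when a tRNS chunk is present (alpha channel emitted).
fn expand_paletted_into_rgba8(input: &[u8], output: &mut [u8], _info: &Info) {
    for (i, &px) in input.iter().enumerate() {
        output[i * 4..i * 4 + 3].fill(px);
        output[i * 4 + 3] = 0xFF;
    }
}

/// Decide once, up front, which transformation applies; the caller then
/// memoizes the returned function pointer and reuses it for every row.
fn create_transform_fn(info: &Info) -> TransformFn {
    if info.trns.is_some() {
        expand_paletted_into_rgba8
    } else {
        expand_paletted_into_rgb8
    }
}
```

With this shape, the per-row hot loop contains no branching on `trns`; the branch happens exactly once, inside `create_transform_fn`.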

Another desirable outcome is a small runtime improvement in most
benchmarks (compared to the baseline just before the commit that
introduces `transform.rs`):

decode/paletted-zune.png: [-8.7989% -7.4940% -6.1466%] (p = 0.00 < 0.05)
decode/kodim02.png: [-4.4824% -4.0883% -3.6232%] (p = 0.00 < 0.05)
decode/Transparency.png: [-4.5886% -3.5213% -2.2121%] (p = 0.00 < 0.05)
decode/kodim17.png: [-2.4406% -2.0663% -1.7093%] (p = 0.00 < 0.05)
decode/kodim07.png: [-3.4461% -2.8264% -2.2676%] (p = 0.00 < 0.05)
decode/kodim23.png: [-1.7490% -1.3101% -0.7639%] (p = 0.00 < 0.05)
decode/Lohengrin: [-2.9387% -2.3664% -1.7545%] (p = 0.00 < 0.05)
generated-noncompressed-4k-idat/8x8.png: [-4.0353% -3.5931% -3.1529%] (p = 0.00 < 0.05)
generated-noncompressed-4k-idat/128x128.png: [-5.2607% -4.6452% -4.0279%] (p = 0.00 < 0.05)
generated-noncompressed-4k-idat/2048x2048.png: [-3.0347% -1.7376% -0.4028%] (p = 0.03 < 0.05)
generated-noncompressed-64k-idat/128x128.png: [-2.3769% -1.7924% -1.2211%] (p = 0.00 < 0.05)
generated-noncompressed-64k-idat/2048x2048.png: [-12.113% -9.8099% -7.2633%] (p = 0.00 < 0.05)
generated-noncompressed-64k-idat/12288x12288.png: [-5.0077% -1.4750% +1.4708%] (p = 0.43 > 0.05)
generated-noncompressed-2g-idat/12288x12288.png: [-9.1860% -8.2857% -7.3934%] (p = 0.00 < 0.05)

Some regressions were observed in 2 benchmarks:

generated-noncompressed-4k-idat/12288x12288.png:
[+2.5010% +3.1616% +3.8445%] (p = 0.00 < 0.05)
[+3.6046% +4.6592% +5.8580%] (p = 0.00 < 0.05)
[+4.6484% +5.4718% +6.4193%] (p = 0.00 < 0.05)

generated-noncompressed-2g-idat/2048x2048.png:
[-0.6455% +1.9676% +3.9191%] (p = 0.13 > 0.05)
[+6.7491% +8.4227% +10.791%] (p = 0.00 < 0.05)
[+5.9926% +7.2249% +8.5428%] (p = 0.00 < 0.05)
@anforowicz (Contributor, Author)

@fintelia, can you PTAL?

  • I think it may be best to avoid squashing the commits when merging - it seems desirable to preserve the individual commits

  • Memoization (the 4th commit in this PR) seems to result in a positive performance change in general, except for 2 benchmarks that operate on generated inputs. I think this shouldn’t block merging of this PR: 1) the regressions are minor (the same size as, or smaller than, the improvements in other benchmarks), 2) improvements on real-world inputs should probably outweigh the regressions on (some) generated inputs, 3) I don’t understand why memoization would help most benchmarks except these two, and therefore I am sheepishly leaning toward dismissing them as something incidental like memory-layout effects.

  • Once/if this PR gets merged, I want to try working on the following series of commits (not sure how to split them across PRs + if all performance improvement ideas will bear fruit):

    • Next PR: Initial memoization for palette transformation

      • Commit A: separate transform/palette.rs module (also move unit tests)
      • Commit B: fn => Box<dyn Fn> (end-to-end perf impact via extra allocation + dyn; other commits will only impact paletted images)
      • Commit C: simple memoization
        • Introduce create_rgba_palette(plte, trns) -> [[u8;4]; 256] - a single loop over plte (with a trns lookup, falling back to 0xFF) - this should be easier to SIMD-ify later (I think…)
        • Add a palette.rs module-level comment: 1) a link to the slide deck by ARM engineers + a permalink to the code in libpng + links to the top500 and QOI corpora, where ~35% of the images are paletted, 2) a “Break-even point” section with TODOs to measure the cut-off where the old code is faster than the new (and an outline of how this can be measured / calculated)
        • The production code (when compiled in the test [and/or debug?] config) should verify that all versions of the functions return the same result: 1) expansion: old (non-memoized) VS new (memoized) VS (future new/SIMD), 2) create_rgba_palette: regular VS (future SIMD)
    • Other PRs:

      • Commit D: create_rgba_palette - read 4-bytes-at-a-time from plte
      • Commit E: std::simd - create_rgba_palette?
      • Commit F: std::simd - expand_palette? Only work with 8-bit input (otherwise fall back to non-SIMD). Maybe expand_rgba is already auto-vectorized, but probably expand_rgb will get better with SIMD?
      • Commit G: Fall back to the old, non-memoized code for small inputs?
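The `create_rgba_palette` helper proposed in Commit C above could look roughly like the sketch below. This follows the signature and fallback behavior given in the comment, but everything else (the `expand_palette` companion function and its layout) is an assumption for illustration; the eventual implementation in the crate may differ.

```rust
/// Build a 256-entry RGBA lookup table from the PLTE chunk, taking alpha
/// from the tRNS chunk where present and falling back to 0xFF (opaque),
/// as described in the plan above.
fn create_rgba_palette(plte: &[u8], trns: Option<&[u8]>) -> [[u8; 4]; 256] {
    let mut palette = [[0u8, 0, 0, 0xFF]; 256];
    // Single loop over PLTE; each chunk is one 3-byte RGB entry.
    for (i, rgb) in plte.chunks_exact(3).take(256).enumerate() {
        palette[i][..3].copy_from_slice(rgb);
        // tRNS may be shorter than PLTE; missing entries are opaque.
        palette[i][3] = trns.and_then(|t| t.get(i).copied()).unwrap_or(0xFF);
    }
    palette
}

/// Hypothetical companion: with the table precomputed, expanding a row of
/// 8-bit indices becomes a single lookup + 4-byte copy per pixel.
fn expand_palette(indices: &[u8], rgba_palette: &[[u8; 4]; 256], output: &mut [u8]) {
    for (i, &idx) in indices.iter().enumerate() {
        output[i * 4..i * 4 + 4].copy_from_slice(&rgba_palette[idx as usize]);
    }
}
```

Because the table build is a single pass with a fixed-size output, both the 4-bytes-at-a-time read (Commit D) and the `std::simd` variants (Commits E/F) would only need to change the loop bodies, not the callers.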

/cc @Shnatsel

@fintelia fintelia merged commit 6cad99d into image-rs:master Jan 27, 2024
19 checks passed
@anforowicz anforowicz deleted the palette-benchmarks-func2 branch January 29, 2024 16:41