Granular benchmarks of palette transformations. #460

Merged (5 commits) on Jan 27, 2024

Conversation

anforowicz
Contributor

No description provided.

This commit introduces a new `transform.rs` module and moves
row transformation functions into the new module:

* From `util.rs`:
    - `unpack_bits` (no longer needs to be `pub`)
    - `expand_trns_line`
    - `expand_trns_line16`
    - `expand_trns_and_strip_line16`
* From `mod.rs`:
    - `expand_paletted`
    - `expand_gray_u8`

This commit also renames `util.rs` into `adam7.rs`, because after the
refactoring above this module contains only Adam7-related functionality:

- `struct Adam7Iterator`
- `fn expand_pass`, which operates on already-transformed, but
  still-interlaced rows

This commit is intended to be pure refactoring (i.e. no changes
in behavior or performance are expected).

This ensures that all the public functions in the `transform.rs` module
are infallible.

This commit tweaks `transform.rs` so that all the functions take the
same parameters: `input: &[u8], output: &mut [u8], info: &Info`.

This is achieved by:

1. Taking `info: &Info` instead of
   `trns: Option<&[u8]>, channels: usize` for `expand_trns_line`,
   `expand_trns_line16`, `expand_trns_and_strip_line16`.
2. Removing `trns: Option<Option<&[u8]>>` parameter from
   `expand_paletted` and `expand_gray_u8` by splitting these functions
   into two separate flavors: ones that emit an alpha channel and ones
   that don't.
Instead of deciding which function to use for every row, memoize and
reuse the first decision.

One desirable outcome of this commit is making the public API of the
`transform.rs` module quite thin (just the `TransformFn` type alias and
the `create_transform_fn` function) - this makes this functionality
easier to test and benchmark.
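The thin API described above can be sketched roughly as follows. This is a hypothetical illustration, not the crate's actual code: the stand-in `Info` struct, the function names `expand_paletted_into_rgb8` / `expand_paletted_into_rgba8`, and the placeholder transformation bodies are all assumptions made for the sake of the example.

```rust
/// Minimal stand-in for the decoder's `Info` struct (assumption: the
/// real type carries much more state; only `trns` matters here).
struct Info {
    trns: Option<Vec<u8>>,
}

/// All row transformations share one signature, so a single function
/// pointer type can describe any of them.
type TransformFn = fn(input: &[u8], output: &mut [u8], info: &Info);

/// Flavor used when no tRNS chunk is present (no alpha channel emitted).
/// Placeholder body: replicates each index 3 times to fake RGB expansion.
fn expand_paletted_into_rgb8(input: &[u8], output: &mut [u8], _info: &Info) {
    for (i, &px) in input.iter().enumerate() {
        output[i * 3..i * 3 + 3].fill(px);
    }
}

/// Flavor used when a tRNS chunk is present (alpha channel emitted).
fn expand_paletted_into_rgba8(input: &[u8], output: &mut [u8], _info: &Info) {
    for (i, &px) in input.iter().enumerate() {
        output[i * 4..i * 4 + 3].fill(px);
        output[i * 4 + 3] = 0xFF;
    }
}

/// Decide once, up front, which transformation applies; the caller then
/// memoizes the returned function pointer and reuses it for every row.
fn create_transform_fn(info: &Info) -> TransformFn {
    if info.trns.is_some() {
        expand_paletted_into_rgba8
    } else {
        expand_paletted_into_rgb8
    }
}
```

With this shape, the per-row hot loop contains no branching on `trns`; the branch happens exactly once, inside `create_transform_fn`.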

Another desirable outcome is a small runtime improvement in most
benchmarks (compared to the baseline just before the commit that
introduces `transform.rs`):

decode/paletted-zune.png: [-8.7989% -7.4940% -6.1466%] (p = 0.00 < 0.05)
decode/kodim02.png: [-4.4824% -4.0883% -3.6232%] (p = 0.00 < 0.05)
decode/Transparency.png: [-4.5886% -3.5213% -2.2121%] (p = 0.00 < 0.05)
decode/kodim17.png: [-2.4406% -2.0663% -1.7093%] (p = 0.00 < 0.05)
decode/kodim07.png: [-3.4461% -2.8264% -2.2676%] (p = 0.00 < 0.05)
decode/kodim23.png: [-1.7490% -1.3101% -0.7639%] (p = 0.00 < 0.05)
decode/Lohengrin: [-2.9387% -2.3664% -1.7545%] (p = 0.00 < 0.05)
generated-noncompressed-4k-idat/8x8.png: [-4.0353% -3.5931% -3.1529%] (p = 0.00 < 0.05)
generated-noncompressed-4k-idat/128x128.png: [-5.2607% -4.6452% -4.0279%] (p = 0.00 < 0.05)
generated-noncompressed-4k-idat/2048x2048.png: [-3.0347% -1.7376% -0.4028%] (p = 0.03 < 0.05)
generated-noncompressed-64k-idat/128x128.png: [-2.3769% -1.7924% -1.2211%] (p = 0.00 < 0.05)
generated-noncompressed-64k-idat/2048x2048.png: [-12.113% -9.8099% -7.2633%] (p = 0.00 < 0.05)
generated-noncompressed-64k-idat/12288x12288.png: [-5.0077% -1.4750% +1.4708%] (p = 0.43 > 0.05)
generated-noncompressed-2g-idat/12288x12288.png: [-9.1860% -8.2857% -7.3934%] (p = 0.00 < 0.05)

Some regressions were observed in 2 benchmarks:

generated-noncompressed-4k-idat/12288x12288.png:
[+2.5010% +3.1616% +3.8445%] (p = 0.00 < 0.05)
[+3.6046% +4.6592% +5.8580%] (p = 0.00 < 0.05)
[+4.6484% +5.4718% +6.4193%] (p = 0.00 < 0.05)

generated-noncompressed-2g-idat/2048x2048.png:
[-0.6455% +1.9676% +3.9191%] (p = 0.13 > 0.05)
[+6.7491% +8.4227% +10.791%] (p = 0.00 < 0.05)
[+5.9926% +7.2249% +8.5428%] (p = 0.00 < 0.05)
@anforowicz (Contributor, Author)

@fintelia, can you PTAL?

  • I think it may be best to avoid squashing the commits when merging - it seems desirable to preserve the individual commits

  • Memoization (the 4th commit in this PR) seems to result in a positive performance change in general, except for 2 benchmarks that operate on generated inputs. I think this shouldn’t block merging of this PR: 1) the regressions are minor (the same size as, or smaller than, the improvements in other benchmarks), 2) improvements on real-world inputs should probably outweigh the regressions on (some) generated inputs, 3) I don’t understand why memoization would help most benchmarks except these two, and therefore I am sheepishly leaning toward dismissing them as something incidental like memory-layout effects.

  • Once/if this PR gets merged, I want to try working on the following series of commits (not sure how to split them across PRs + if all performance improvement ideas will bear fruit):

    • Next PR: Initial memoization for palette transformation

      • Commit A: separate transform/palette.rs module (also move unit tests)
      • Commit B: fn => Box<dyn Fn> (end-to-end perf impact via extra allocation + dyn; other commits will only impact paletted images)
      • Commit C: simple memoization
        • Introduce create_rgba_palette(plte, trns) -> [[u8;4]; 256] - a single loop over plte (with a trns lookup, falling back to 0xFF) - this should be easier to SIMD-ify later (I think…)
        • Add a palette.rs module-level comment: 1) a link to the slide deck by ARM engineers + a permalink to the code in libpng + links to the top500 and QOI corpora, where ~35% of the images are paletted, 2) a “Break-even point” section with TODOs to measure the cut-off where the old code is faster than the new (and an outline of how this can be measured / calculated)
        • The production code (when compiled in the test [and/or debug?] config) should verify that all versions of the functions return the same result: 1) expansion: old (non-memoized) VS new (memoized) VS (future new/SIMD), 2) create_rgba_palette: regular VS (future SIMD)
    • Other PRs:

      • Commit D: create_rgba_palette - read 4-bytes-at-a-time from plte
      • Commit E: std::simd - create_rgba_palette?
      • Commit F: std::simd - expand_palette? Only work with 8-bit input (otherwise fall back to non-SIMD). Maybe expand_rgba is already auto-vectorized, but probably expand_rgb will get better with SIMD?
      • Commit G: Fall back to the old, non-memoized code for small inputs?
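The `create_rgba_palette` helper proposed in Commit C above could look roughly like the sketch below. This follows the signature and fallback behavior given in the comment, but everything else (the `expand_palette` companion function and its layout) is an assumption for illustration; the eventual implementation in the crate may differ.

```rust
/// Build a 256-entry RGBA lookup table from the PLTE chunk, taking alpha
/// from the tRNS chunk where present and falling back to 0xFF (opaque),
/// as described in the plan above.
fn create_rgba_palette(plte: &[u8], trns: Option<&[u8]>) -> [[u8; 4]; 256] {
    let mut palette = [[0u8, 0, 0, 0xFF]; 256];
    // Single loop over PLTE; each chunk is one 3-byte RGB entry.
    for (i, rgb) in plte.chunks_exact(3).take(256).enumerate() {
        palette[i][..3].copy_from_slice(rgb);
        // tRNS may be shorter than PLTE; missing entries are opaque.
        palette[i][3] = trns.and_then(|t| t.get(i).copied()).unwrap_or(0xFF);
    }
    palette
}

/// Hypothetical companion: with the table precomputed, expanding a row of
/// 8-bit indices becomes a single lookup + 4-byte copy per pixel.
fn expand_palette(indices: &[u8], rgba_palette: &[[u8; 4]; 256], output: &mut [u8]) {
    for (i, &idx) in indices.iter().enumerate() {
        output[i * 4..i * 4 + 4].copy_from_slice(&rgba_palette[idx as usize]);
    }
}
```

Because the table build is a single pass with a fixed-size output, both the 4-bytes-at-a-time read (Commit D) and the `std::simd` variants (Commits E/F) would only need to change the loop bodies, not the callers.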

/cc @Shnatsel

@fintelia fintelia merged commit 6cad99d into image-rs:master Jan 27, 2024
19 checks passed
@anforowicz anforowicz deleted the palette-benchmarks-func2 branch January 29, 2024 16:41