Memoization of RGBA palette when expanding palette indices into RGB8 or RGBA8 #462

Merged: fintelia merged 6 commits into image-rs:master from plte--memoization on Feb 2, 2024

Conversation

@anforowicz (Contributor)

No description provided.

This commit moves `expand_paletted_into_rgb8` and
`expand_paletted_into_rgba8` (and their unit tests) into a separate
`transform/palette.rs` module.  This prepares room for encapsulating
extra complexity in this module in follow-up commits, where we will
start to precompute and memoize some data when creating a `TransformFn`.

This commit just moves the code around - it should have no impact on
correctness or performance.
The `PLTE` chunk's size should be a multiple of 3 (since it contains RGB
entries - 3 bytes per entry).

Additionally, taking 10000 samples in the `bench_create_fn` benchmarks
is a bit excessive after memoization.
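
For illustration, a benchmark palette with a valid length could be generated with a helper along these lines (hypothetical; not the PR's actual code):

```rust
// Hypothetical benchmark helper: a valid PLTE payload is a whole number
// of RGB triples, i.e. its length must be a multiple of 3.
fn make_plte(num_entries: usize) -> Vec<u8> {
    assert!(num_entries <= 256, "PLTE holds at most 256 entries");
    (0..num_entries * 3).map(|i| (i % 256) as u8).collect()
}
```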
This commit changes the `TransformFn` type alias from `fn(...)` into
`Box<dyn Fn(...)>`.  This allows the `TransformFn` to store some
precomputed, memoized state that we plan to add in follow-up commits.

In theory this commit may have a negative performance impact, but in the
grand scheme of things it disappears into the measurement noise.  In
particular, when there is no captured state, `Box` shouldn't allocate.
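
A minimal sketch of the shape of this change (hedged - the crate's actual parameter types may differ; `Info` is a stand-in here):

```rust
// Stand-in for the crate's per-image metadata type.
struct Info;

// Before: a plain, stateless function pointer:
//     type TransformFn = fn(&[u8], &mut [u8], &Info);
// After: a boxed closure that may capture precomputed, memoized state:
type TransformFn = Box<dyn Fn(&[u8], &mut [u8], &Info)>;

// Boxing a capture-less closure is a zero-sized allocation, so `Box`
// doesn't actually allocate for stateless transforms.
fn stateless_transform() -> TransformFn {
    Box::new(|input, output, _info| output[..input.len()].copy_from_slice(input))
}
```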
Before this commit `expand_paletted_into_rgba8` would:

* Perform 2 lookups - `palette.get(i)` and `trns.get(i)`
* Check via `unwrap_or` if `i` was within the bounds of `palette`/`trns`

This commit introduces `create_rgba_palette` which combines `palette`
and `trns` into a fixed-size `[[u8;4]; 256]` look-up table (called
`rgba_palette` in the code).  After this commit
`expand_paletted_into_rgba8` only needs to perform a single look-up and
doesn't need to check the bounds (a sketch of this approach appears
after the numbers below).  This helps to improve the expansion time by
60+%:

- expand_paletted(exec)/trns=yes/src_bits=4/src_size=5461:
  [-60.208% -60.057% -59.899%] (p = 0.00 < 0.05)
- expand_paletted(exec)/trns=yes/src_bits=8/src_size=5461:
  [-77.520% -77.407% -77.301%] (p = 0.00 < 0.05)

`expand_paletted_into_rgb8` performs only a single lookup before and
after this commit, but avoiding bounds checks still helps to improve the
expansion time by ~12%:

- expand_paletted(exec)/trns=no/src_bits=4/src_size=5461:
  [-12.357% -12.005% -11.664%] (p = 0.00 < 0.05)
- expand_paletted(exec)/trns=no/src_bits=8/src_size=5461:
  [-13.135% -12.584% -12.092%] (p = 0.00 < 0.05)

Understandably, this commit regresses the time of `create_transform_fn`.
Future commits will reduce this regression by a factor of 2-4:

- expand_paletted(ctor)/plte=256/trns=256:
  [+3757.2% +3763.8% +3770.5%] (p = 0.00 < 0.05)
- expand_paletted(ctor)/plte=224/trns=32:
  [+3807.3% +3816.2% +3824.6%] (p = 0.00 < 0.05)
- expand_paletted(ctor)/plte=16/trns=1:
  [+1672.0% +1675.0% +1678.1%] (p = 0.00 < 0.05)
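
A hedged sketch of the approach (same idea and names as the PR, but simplified - the real implementation lives in `transform/palette.rs`):

```rust
/// Builds a fixed-size RGBA look-up table from the `PLTE` palette (RGB
/// triples) and the optional `tRNS` alpha entries.  Indices without a
/// `tRNS` entry stay fully opaque; out-of-range indices fall back to
/// opaque black here (an assumed default, standing in for the old
/// `unwrap_or` behavior).
fn create_rgba_palette(palette: &[u8], trns: &[u8]) -> [[u8; 4]; 256] {
    let mut rgba_palette = [[0, 0, 0, 0xFF]; 256];
    for (dst, rgb) in rgba_palette.iter_mut().zip(palette.chunks_exact(3)) {
        dst[..3].copy_from_slice(rgb);
    }
    for (dst, &alpha) in rgba_palette.iter_mut().zip(trns) {
        dst[3] = alpha;
    }
    rgba_palette
}

/// A single look-up per pixel; `idx as usize` is always < 256, so
/// indexing the fixed-size table needs no bounds check.
fn expand_paletted_into_rgba8(input: &[u8], output: &mut [u8], rgba_palette: &[[u8; 4]; 256]) {
    for (&idx, out) in input.iter().zip(output.chunks_exact_mut(4)) {
        out.copy_from_slice(&rgba_palette[idx as usize]);
    }
}
```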
Before this commit `expand_into_rgb8` would copy 3 bytes at a time into
the output.  After this commit it copies 4 bytes at a time (possibly
clobbering pixels that will be populated during the next iteration -
this is ok; see the sketch below).  This improved the performance as follows:

expand_paletted(exec)/trns=no/src_bits=8/src_size=5461
time:   [-23.852% -23.593% -23.319%] (p = 0.00 < 0.05)
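
A hedged sketch of the 4-byte-copy trick (simplified; the real code may differ):

```rust
/// Writes 4 bytes per pixel into an RGB8 output while only advancing by
/// 3 - the 4th byte of each write is clobbered by the next iteration.
/// The last pixel is copied with exactly 3 bytes so we never write past
/// the end of `output`.
fn expand_paletted_into_rgb8(input: &[u8], output: &mut [u8], rgba_palette: &[[u8; 4]; 256]) {
    assert_eq!(output.len(), input.len() * 3);
    let Some((&last, rest)) = input.split_last() else { return };
    for (i, &idx) in rest.iter().enumerate() {
        // A single 4-byte copy instead of a 3-byte copy.
        output[i * 3..i * 3 + 4].copy_from_slice(&rgba_palette[idx as usize]);
    }
    let i = rest.len();
    output[i * 3..].copy_from_slice(&rgba_palette[last as usize][..3]);
}
```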
This improves the performance of the constructor benchmarks as follows:

- expand_paletted(ctor)/plte=256/trns=256
  [-40.581% -40.396% -40.211%] (p = 0.00 < 0.05)
- expand_paletted(ctor)/plte=224/trns=32
  [-24.070% -23.840% -23.592%] (p = 0.00 < 0.05)

Small palettes are mostly unaffected:

- expand_paletted(ctor)/plte=16/trns=1
  [-0.2525% +0.0338% +0.3239%] (p = 0.81 > 0.05)
@anforowicz (Contributor, Author)

@fintelia, can you PTAL?

/cc @Shnatsel (thanks again for pointing out my earlier benchmarking mistake; I read your article on eliminating bounds checks, but wasn’t able to successfully apply it - more details below)

  • As before, I think it may be best to avoid squashing the commits when merging - I think it is desirable to preserve the individual commits
  • The end-to-end speed-up looks like this: decode/paletted-zune.png: time: [-12.282% -11.986% -11.625%] (p = 0.00 < 0.05). (This is the only image in tests/benches for which identify -verbose reports png:IHDR.color_type: 3 (Indexed).)
  • When working through my old list of planned changes I made some adjustments:
    • I wasn’t able to achieve speed-ups via std::simd (see below for more details)
    • I decided that falling back to non-memoized processing isn’t worth it (see below for more details)
    • I decided to compare memoized-vs-non-memoized results only in palette-specific tests (rather than in #[cfg(test)] parts of the product code itself)
    • I’ve also realized that some palette benchmarks were using a PLTE chunk whose length wasn’t a multiple of 3… oops… this is fixed in the very first commit of this PR
    • And I’ve also realized that in my previous PR I missed an opportunity to make most of the transform.rs functions private to that module

SIMD

Memoization

My attempt to SIMD-ify create_rgba_palette can be found in c71f944. It seems to actually make execution slower by ~5%.

A short link to the non-SIMD-ified code: https://godbolt.org/z/KeKrfWcs5. There is no auto-vectorization AFAICT.

Execution

My attempt to SIMD-ify expand_8bit_into_rgb8 can be found in 63d4595. It seems to actually make execution slower by ~60%. Using an unsafe gather_select_unchecked doesn’t make a difference AFAICT.

The non-SIMD-ified code can be found here: https://godbolt.org/z/T9zv891dc. There is no auto-vectorization AFAICT - this makes me a bit surprised that this is faster than the SIMD-ified version.

The SIMD-ified version can be found here: https://godbolt.org/z/P19TPP33G. AFAICT it does in fact compile into SIMD instructions.

There are some inefficiencies that (I think) may be addressed by (somehow) hoisting bounds checks outside of the loop (both in the non-SIMD-ified and the SIMD-ified version):

  • each loop not only maintains pointers into input and output, but also keeps updating the slice length
  • each loop iteration does some bounds checks (e.g. jumping to core::slice::index::slice_end_index_len_fail in the SIMD-ified version or to core::panicking::panic_bounds_check in the non-SIMD-ified version).

I don’t know how to address these inefficiencies / how to do the hoisting. (I’ve tried some assert!s about the relative output.len() and input.len() - roughly as sketched below - but it didn’t help.)
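
Roughly the kind of assert I mean (a sketch, not the exact code I benchmarked):

```rust
fn expand_8bit_into_rgb8(input: &[u8], output: &mut [u8], rgba_palette: &[[u8; 4]; 256]) {
    // Up-front hint tying the two lengths together, in the hope that
    // LLVM hoists the per-iteration bounds checks out of the loop.
    // In my experiments this didn't change the generated code.
    assert_eq!(output.len(), input.len() * 3);
    for (i, &idx) in input.iter().enumerate() {
        output[i * 3..i * 3 + 3].copy_from_slice(&rgba_palette[idx as usize][..3]);
    }
}
```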

Break-even point

Memoization is a trade-off:

  • On one hand, memoization requires spending X ns before starting to call
    expand_paletted_... functions.
  • On the other hand, memoization improves the throughput of the
    expand_paletted_... functions - they take Y ns less to process each byte

If we process B bytes, then:

  • With memoization, expansion costs X + B*BaselineThroughput - B*Y
  • Without memoization, expansion costs B*BaselineThroughput

Setting the two costs equal gives the break-even point: B = X/Y.

Note that X and Y are typically measured via microbenchmarks, which are somewhat unrealistic
(e.g. the benchmarks don't compete with other code for the L1 cache). Therefore the break-even
point calculated this way may be a bit too optimistic - for example, it won't take into account
the negative impact of the additional ~1kB of cache/memory pressure from the memoized
rgba_palette.

Measurements

Without memoization we get:

  • Summary:
    • Construction average: 21.029 ns
    • Expansion average: 0.4643 ns per output byte
    • RGB8 expansion: 0.4181 ns per output byte
    • RGBA8 expansion: 0.5245 ns per output byte
  • Details:
    • expand_paletted(exec)/trns=no/src_bits=4/src_size=5461: 1.6935 GiB/s
    • expand_paletted(exec)/trns=no/src_bits=8/src_size=5461: 2.7610 GiB/s
    • expand_paletted(exec)/trns=yes/src_bits=4/src_size=5461: 1.4933 GiB/s
    • expand_paletted(exec)/trns=yes/src_bits=8/src_size=5461: 2.0579 GiB/s
    • expand_paletted(ctor)/plte=256/trns=256: 21.012 ns
    • expand_paletted(ctor)/plte=224/trns=32: 20.990 ns
    • expand_paletted(ctor)/plte=16/trns=1: 21.088 ns

With memoization we get:

  • Summary:
    • Construction average: 177.065 ns (+156.036 ns)
    • Expansion average: 0.1981 ns (-0.2661 ns) per output byte
    • RGB8 expansion: 0.1845 ns (-0.3259 ns) per output byte
    • RGBA8 expansion: 0.1534 ns (-0.3711 ns) per output byte
  • Details:
    • expand_paletted(exec)/trns=no/src_bits=4/src_size=5461: 2.2342 GiB/s: +32.042%
    • expand_paletted(exec)/trns=no/src_bits=8/src_size=5461: 4.4242 GiB/s: +60.373%
    • expand_paletted(exec)/trns=yes/src_bits=4/src_size=5461: 3.8193 GiB/s: +152.83%
    • expand_paletted(exec)/trns=yes/src_bits=8/src_size=5461: 8.3201 GiB/s: +300.30%
    • expand_paletted(ctor)/plte=256/trns=256: 231.59 ns: +996.76%
    • expand_paletted(ctor)/plte=224/trns=32: 222.06 ns: +945.54%
    • expand_paletted(ctor)/plte=16/trns=1: 77.547 ns: +266.00%

Results

The measurements above give the following thresholds:

  • Overall: 586 output bytes
  • RGB8: 478 output bytes = 159 pixels = ~13x13 pixels
  • RGBA8: 420 output bytes = 105 pixels = ~10x10 pixels

Such small images seem relatively rare. IMHO it is unjustified (in terms of extra code complexity, binary size, etc.) to implement a fallback to the non-memoized implementation for these kinds of images. (See the quick arithmetic check below.)
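
Sketch of that arithmetic check (X is the added construction cost, Y the per-output-byte saving; numbers from the measurements above):

```rust
fn main() {
    // X = 177.065 - 21.029 = 156.036 ns of extra construction cost.
    let x_ns = 156.036_f64;
    // Y = the measured per-output-byte savings quoted above.
    for (label, y_ns_per_byte) in [("Overall", 0.2661), ("RGB8", 0.3259), ("RGBA8", 0.3711)] {
        let b = (x_ns / y_ns_per_byte).floor();
        println!("{label}: break-even at ~{b} output bytes"); // 586, 478, 420
    }
}
```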

@Shnatsel (Contributor) commented Feb 1, 2024

I am surprised the crate isn't doing this already! Turning the chunks into a lookup table seems like a no-brainer.

I'll take a look at the bounds checks.

@Shnatsel (Contributor) commented Feb 1, 2024

What machine did you run those benchmarks on? I am getting dramatically higher numbers on a Zen 4 CPU.

Benchmark results on Zen 4:

```
expand_paletted(exec)/trns=no/src_bits=4/src_size=5461
time:   [7.0350 µs 7.0351 µs 7.0352 µs]
thrpt:  [4.3376 GiB/s 4.3376 GiB/s 4.3377 GiB/s]
Found 73 outliers among 500 measurements (14.60%)
7 (1.40%) low severe
35 (7.00%) low mild
13 (2.60%) high mild
18 (3.60%) high severe

expand_paletted(exec)/trns=no/src_bits=8/src_size=5461
time: [1.3668 µs 1.3669 µs 1.3670 µs]
thrpt: [11.161 GiB/s 11.162 GiB/s 11.163 GiB/s]
Found 53 outliers among 500 measurements (10.60%)
14 (2.80%) low severe
12 (2.40%) low mild
18 (3.60%) high mild
9 (1.80%) high severe

expand_paletted(exec)/trns=yes/src_bits=4/src_size=5461
time: [5.7850 µs 5.7859 µs 5.7867 µs]
thrpt: [7.0312 GiB/s 7.0322 GiB/s 7.0333 GiB/s]
Found 92 outliers among 500 measurements (18.40%)
3 (0.60%) low severe
36 (7.20%) low mild
32 (6.40%) high mild
21 (4.20%) high severe

expand_paletted(exec)/trns=yes/src_bits=8/src_size=5461
time: [1.3811 µs 1.3811 µs 1.3812 µs]
thrpt: [14.729 GiB/s 14.730 GiB/s 14.731 GiB/s]
Found 12 outliers among 500 measurements (2.40%)
1 (0.20%) low severe
1 (0.20%) low mild
6 (1.20%) high mild
4 (0.80%) high severe

expand_paletted(ctor)/plte=256/trns=256
time: [139.95 ns 140.01 ns 140.08 ns]
Found 111 outliers among 1000 measurements (11.10%)
80 (8.00%) low mild
21 (2.10%) high mild
10 (1.00%) high severe

expand_paletted(ctor)/plte=224/trns=32
time: [138.72 ns 138.87 ns 139.02 ns]
Found 9 outliers among 1000 measurements (0.90%)
9 (0.90%) high mild

expand_paletted(ctor)/plte=16/trns=1
time: [49.913 ns 49.963 ns 50.013 ns]
Found 22 outliers among 1000 measurements (2.20%)
1 (0.10%) low severe
11 (1.10%) low mild
10 (1.00%) high mild

```

@Shnatsel (Contributor) commented Feb 1, 2024

Just so you know, with -C target-cpu=znver4 your expand_8bit_into_rgb8 from this PR gets autovectorized into AVX-512 instructions: https://godbolt.org/z/3bTG8oYz3
And it absolutely tanks performance on the micro-benchmarks, with some of them taking more than 2x as long. It also regresses the end-to-end decoding benchmark for the zune-paletted file by 23%.

I suspect the expansion function ends up getting called a lot on very small chunks, so the vector code is wasted and only gets in the way. I'll try to confirm this hunch.

@Shnatsel (Contributor) commented Feb 1, 2024

Hmm, that function seems to be called for every row. I am surprised that autovectorization actually hurts it even on rather large images.

@fintelia (Contributor) commented Feb 1, 2024

Zen4 has a somewhat weird AVX-512 implementation, so I wonder if LLVM just isn't realizing that the autovectorized version will be slower.

As another data point, I tested on a Zen3 CPU and don't see any huge performance differences when compiling with -C target-cpu=znver3.

@Shnatsel (Contributor) commented Feb 1, 2024

I couldn't convince the compiler to remove the bounds checks from the loop. But it seems to speculate right past them without too much trouble.

A naive iterator version that removes bounds checks but copies 3 bytes instead of 4 is 25% slower than this code.
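
For context, the naive iterator version meant here might look roughly like this (an assumption on my part, not necessarily the exact code that was benchmarked):

```rust
// Bounds checks are elided thanks to the zip + chunks_exact_mut shape,
// but each pixel is only a 3-byte copy - measured ~25% slower than the
// 4-byte-copy loop in this PR.
fn expand_paletted_into_rgb8_naive(input: &[u8], output: &mut [u8], rgba_palette: &[[u8; 4]; 256]) {
    for (&idx, out) in input.iter().zip(output.chunks_exact_mut(3)) {
        out.copy_from_slice(&rgba_palette[idx as usize][..3]);
    }
}
```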

```rust
{
    let mut palette_iter = palette;
    let mut rgba_iter = &mut rgba_palette[..];
    while palette_iter.len() >= 4 {
```
Contributor:

Suggested change:

```diff
-while palette_iter.len() >= 4 {
+while palette_iter.len() >= 4 && !rgba_iter.is_empty() {
```

This modifies https://godbolt.org/z/KeKrfWcs5; I haven't benchmarked or tested it, however.
It removes the panic but doesn't unroll the loop like the original: https://rust.godbolt.org/z/53MaeEano

Contributor:

This results in a 25% regression.

The compiler is aware that the branches leading to the panic are cold, and so decides to unroll the loop. The CPU should speculate right past the branch leading to a panic. So it is best to keep it as a clearly-cold panic rather than a regular branch.

Comment on lines +99 to +100:

```rust
input = &input[1..];
output = &mut output[3..];
```
Contributor:

Replacing these with get_unchecked() showed exactly zero performance difference on the benchmarks. It seems eliminating bounds checks here is not worthwhile.

@anforowicz (Contributor, Author)

> What machine did you run those benchmarks on?

AMD EPYC 7B12.

`lscpu` output

```
$ lscpu
Architecture:           x86_64
CPU op-mode(s):         32-bit, 64-bit
Address sizes:          48 bits physical, 48 bits virtual
Byte Order:             Little Endian
CPU(s):                 128
On-line CPU(s) list:    0-127
Vendor ID:              AuthenticAMD
Model name:             AMD EPYC 7B12
CPU family:             23
Model:                  49
Thread(s) per core:     2
Core(s) per socket:     32
Socket(s):              2
Stepping:               0
BogoMIPS:               4499.99
Flags:                  fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save umip rdpid
Virtualization features:
  Hypervisor vendor:    KVM
  Virtualization type:  full
Caches (sum of all):
  L1d:                  2 MiB (64 instances)
  L1i:                  2 MiB (64 instances)
  L2:                   32 MiB (64 instances)
  L3:                   256 MiB (16 instances)
NUMA:
  NUMA node(s):         2
  NUMA node0 CPU(s):    0-31,64-95
  NUMA node1 CPU(s):    32-63,96-127
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Not affected
  Mmio stale data:      Not affected
  Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
  Spec rstack overflow: Vulnerable: Safe RET, no microcode
  Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:           Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                Not affected
  Tsx async abort:      Not affected
```

FWIW, I also didn't use -C target-cpu=native or anything like that, because AFAIK Chromium builds and ships a single binary for a given target platform/architecture. AFAIU this in general still allows some (lowest-common-denominator) auto-vectorization (e.g. see #414 (comment)).

@Shnatsel (Contributor) commented Feb 1, 2024

Ah, that runs at half the frequency of Zen4 desktop parts, so those numbers make sense. I did not use -C target-cpu=native either.

I suppose we could use multiversioning if wider SIMD helped, but at least in palette expansion it doesn't.

@fintelia merged commit b13388f into image-rs:master on Feb 2, 2024
19 checks passed
@anforowicz deleted the plte--memoization branch February 2, 2024 16:38
hubot pushed a commit to google/skia that referenced this pull request Oct 24, 2024
Before this CL, `png::Transformations::EXPAND` was used unconditionally,
which meant that the `png` crate was responsible for 1) expanding to at
least 8 bits, 2) expanding `ColorType::Rgb` to `Rgba` (and `Grayscale`
to `GrayscaleAlpha`) based on `tRNS` chunk, and 3) expanding
`ColorType::Indexed` based on `PLTE` and `tRNS` chunks.

After this CL, expanding `ColorType::Indexed` images is done on Skia
side instead.  This has some performance benefits:

* Performance measurements show that the total runtime of decoding
  PNG images gathered from top 500 websites goes down - the runtime
  after this CL is ~94% of the runtime before this CL.  For comparison
  with other potential interventions, please see the document here:
  https://docs.google.com/document/d/16P0I_4AglenbkU1IBM1Qfe1zAQrMXa5NoEazZqOTHZE/edit?usp=sharing
* Arm-chair analysis:
    - There are fewer transformation "hops" after this CL:
         - Before this CL: indexed data ==png-crate==> rgb or rgba
           ==SkSwizzler-and/or-skcms_Transform==> final data
         - After this CL: indexed data ==SkSwizzler==> final data
         - The indexed=>rgb/rgba transformation is still happening,
           but it is happening within SkSwizzler, so there is (maybe?)
           less loop overhead and less memory pressure (although one
           image row is not _that_ much)
         - The skcms_Transform may still be needed, but it can be
           applied to the palette (at most 256 rgba pixels) rather than
            to all the image pixels.  (This may also apply to some
            stages handled by SkSwizzler, like alpha-premultiplication.)
    - I note that the indexed=>rgb/rgba transformation depends on
      auto-vectorization both before and after this CL.  So far the
      `png` crate didn't even see improvements from using `std::simd`
      (see the notes in
      image-rs/image-png#462 (comment)).
      Similarly, `SkSwizzler` doesn't provide SIMD-based routines for
      this transformation (see `swizzle_small_index_to_n32` and
      `swizzle_index_to_n32`).  There may be additional performance
      improvement opportunities here, because `libpng` expansion does
      use SIMD intrinsics - see the notes in https://crbug.com/40512957.

Fixed: chromium:356882657
Change-Id: I4bf63bd126140401ae1f38f82e613ffe7d9d13e8
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/911438
Reviewed-by: Daniel Dilan <[email protected]>
Commit-Queue: Łukasz Anforowicz <[email protected]>