
Very slow decoding of paletted PNG images #393

Closed
Shnatsel opened this issue Apr 1, 2023 · 19 comments

Comments

@Shnatsel
Contributor

Shnatsel commented Apr 1, 2023

zune-png benchmarks show that the png crate is much slower than other decoders on indexed images - a whopping 3x slower than zune-png.

CPU profile shows that 71% of the time is spent in png::utils::unpack_bits, specifically this hot loop.

The code is full of indexing and doesn't look amenable to autovectorization. I think the entire function will have to be rewritten; it's probably a good idea to copy zune-png here.

@Shnatsel
Contributor Author

Shnatsel commented Apr 1, 2023

Measured on v0.17.8-rc with this image: speed_bench_palette

@Shnatsel
Contributor Author

Shnatsel commented Apr 1, 2023

zune-png uses a vectorization-friendly implementation of expansion, which explains the difference in performance. It can be found here: https://github.com/etemesi254/zune-image/blob/dev/zune-png/src/utils.rs

@fintelia
Contributor

fintelia commented Apr 2, 2023

The version in this crate handles paletted images with 1/2/4-bit indices, whereas zune-png seems not to. However, the overwhelming majority of images probably use 8-bit indices... so it is probably worth special-casing that path.
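To illustrate the suggested special case, here is a minimal sketch (hypothetical helper name, not the crate's actual code): when each input byte is already a full palette index, expansion becomes a single forward pass with no bit fiddling, which the optimizer handles much better than the general bit-unpacking loop.

```rust
// Hypothetical 8-bit fast path: one palette index per input byte,
// expanded to RGB in a simple forward pass over separate buffers.
fn expand_palette_8bit(indices: &[u8], palette: &[u8], out: &mut [u8]) {
    assert_eq!(out.len(), indices.len() * 3);
    let black = [0u8; 3];
    for (&idx, rgb_out) in indices.iter().zip(out.chunks_exact_mut(3)) {
        let base = 3 * idx as usize;
        // Out-of-range indices fall back to black, matching the crate's closure.
        let rgb = palette.get(base..base + 3).unwrap_or(&black);
        rgb_out.copy_from_slice(rgb);
    }
}

fn main() {
    let palette = [10u8, 20, 30, 40, 50, 60];
    let mut out = [0u8; 6];
    expand_palette_8bit(&[1, 0], &palette, &mut out);
    assert_eq!(out, [40, 50, 60, 10, 20, 30]);
}
```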

@Shnatsel
Contributor Author

Shnatsel commented Apr 2, 2023

Based on this comment, I understand that zune-png expands the bits to bytes first and then resolves those to the correct palette values in a second pass.

So it does handle lower bit depth indices correctly, but they're expanded to 8 bits before being used for palette lookups.
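A minimal sketch of that first pass as I understand it (names are mine, not zune-png's): sub-byte indices are widened to one byte per pixel in a simple forward loop, after which the 8-bit palette-lookup path can be reused unchanged.

```rust
// Widen packed 1/2/4-bit palette indices to one byte per pixel.
// 8-bit input is already in the target layout.
fn unpack_indices(packed: &[u8], bit_depth: u8) -> Vec<u8> {
    assert!(matches!(bit_depth, 1 | 2 | 4 | 8));
    if bit_depth == 8 {
        return packed.to_vec();
    }
    let mask = (1u8 << bit_depth) - 1;
    let per_byte = (8 / bit_depth) as usize;
    let mut out = Vec::with_capacity(packed.len() * per_byte);
    for &byte in packed {
        // The PNG spec packs the leftmost pixel into the most significant bits.
        for i in 0..per_byte {
            let shift = 8 - bit_depth as usize * (i + 1);
            out.push((byte >> shift) & mask);
        }
    }
    out
}

fn main() {
    // 0b10_01_01_10 at 2 bits/pixel -> indices 2, 1, 1, 2.
    assert_eq!(unpack_indices(&[0b1001_0110], 2), vec![2, 1, 1, 2]);
}
```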

@okaneco
Contributor

okaneco commented Apr 4, 2023

I thought maybe the step_by code wasn't optimizing well, but it doesn't seem like much can currently be improved.

My guess is that the bottleneck is the access pattern: the same buffer holds the indices and receives the expanded colors, written back-to-front. If a separate buffer with the indices were passed into the unpack_bits function, simpler iterators could be used, since you wouldn't have to hold indices at both ends of the buffer simultaneously. I'm not sure if there's an available vector on the current decoder struct that could be used for this purpose.

pub fn unpack_bits<F>(buf: &mut [u8], palette_indices: &[u8], channels: usize, bit_depth: u8, func: F)

Vectorization, or at least fewer bounds checks, seems a likely benefit. L36 can be simplified to remove a shift, but it won't make a difference.

image-png/src/utils.rs

Lines 27 to 38 in 2f53fc4

let i = (0..entries)
    .rev() // reverse iterator
    .flat_map(|idx|
        // this has to be reversed too
        (0..8).step_by(bit_depth.into())
            .zip(repeat(idx)))
    .skip(skip);
let j = (0..=buf.len() - channels).rev().step_by(channels);
for ((shift, i), j) in i.zip(j) {
    let pixel = (buf[i] & (mask << shift)) >> shift;
    func(pixel, &mut buf[j..(j + channels)])
}

The hot loop is calling the closure here. The indexing in the closure can be replaced with TryInto to remove some more bounds checks, but that also made no big difference that I could see.

utils::unpack_bits(buffer, 3, info.bit_depth as u8, |i, chunk| {
    let rgb = palette
        .get(3 * i as usize..3 * i as usize + 3)
        .unwrap_or(&black);
    chunk[0] = rgb[0];
    chunk[1] = rgb[1];
    chunk[2] = rgb[2];
})
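For reference, a sketch of the TryInto idea (a hypothetical standalone helper equivalent to the closure body above, not the crate's actual code): converting the slices to fixed-size array references up front lets the three-byte copy happen without per-element bounds checks.

```rust
use std::convert::TryInto;

// Copy one palette entry into a 3-byte output chunk, using array
// references instead of per-element indexing.
fn copy_rgb(palette: &[u8], i: u8, chunk: &mut [u8]) {
    let black = [0u8; 3];
    let base = 3 * i as usize;
    // Slice-to-array conversion replaces three checked index operations.
    let rgb: &[u8; 3] = palette
        .get(base..base + 3)
        .map(|s| s.try_into().unwrap())
        .unwrap_or(&black);
    let chunk: &mut [u8; 3] = chunk.try_into().unwrap();
    *chunk = *rgb;
}

fn main() {
    let palette = [1u8, 2, 3, 4, 5, 6];
    let mut chunk = [0u8; 3];
    copy_rgb(&palette, 1, &mut chunk);
    assert_eq!(chunk, [4, 5, 6]);
}
```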

I also tried unpacking the iterators to understand the logic and maybe get better performance.

It's definitely clumsy trying to convert that iterator chain to some loop format. I converted it to the following to see if there was a better way of special-casing the different depths; note that 8-bit will always have a shift of 0 and should never have a skip.

let entry_indices = (0..entries).rev().skip(usize::from(skip != 0));
let mut j = buf.len() - channels;
let mut curr = buf[entries - 1];

if skip != 0 {
    for shift in (0..8).step_by(bit_depth as usize).skip(skip) {
        let pixel = (curr >> shift) & mask;
        func(pixel, &mut buf[j..j + channels]);
        if j < channels {
            return;
        }
        j -= channels;
    }
}

for idx in entry_indices {
    curr = buf[idx];
    for shift in (0..8).step_by(bit_depth as usize) {
        let pixel = (curr >> shift) & mask;
        func(pixel, &mut buf[j..j + channels]);
        if j < channels {
            return;
        }
        j -= channels;
    }
}

@fintelia
Contributor

fintelia commented Apr 4, 2023

This reminded me that I had started working on more refactoring, which included changing the transformations to not operate in place. Just created #396 with the current state.

@anforowicz
Contributor

Looking at the top-500 websites, paletted images account for 35% of PNG images. Source:

OTOH, so far I have only had negative results when trying to improve paletted images: #416 (comment)

@Shnatsel
Contributor Author

Shnatsel commented Jan 7, 2024

It's the bit unpacking hot loop that's slowing things down. zune-png has already figured out a much more performant way to do it that gets auto-vectorized: https://github.com/etemesi254/zune-image/blob/54cc956ccc01ea942456c0dcebf8d97bda614666/crates/zune-png/src/utils.rs#L217-L317

It should be fairly straightforward to integrate into the png crate.

@okaneco
Contributor

okaneco commented Jan 7, 2024

It's the bit unpacking hot loop that's slowing things down.

That code was changed in #405 and looks like this now.

The distinction I'd make is that it's not the bit-unpacking as much as retrieving colors from the lookup table and filling those into the buffer. Trying to do it in 2 passes ended up slower than the current behavior of unpacking and then doing the lookup in 1 pass. The code didn't seem to get autovectorized even using chunks_exact at the time.

I'm not sure that it's straightforward or easy to get similar results without larger architectural changes, as mentioned in #416 (comment), where several different methods were tried unsuccessfully.

@Shnatsel
Contributor Author

Shnatsel commented Jan 7, 2024

It is clearly possible to do better here still, as evidenced by zune-png. Even with the latest code it is 50% faster on a big paletted image. Perhaps its code is worth a closer look.

@anforowicz
Contributor

It is clearly possible to do better here still, as evidenced by zune-png. Even with the latest code it is 50% faster on a big paletted image. Perhaps its code is worth a closer look.

FWIW, I've tried running perf record / perf report on the linked image and it seems that most time is spent below the png crate (i.e. not in fn expand_paletted in png/src/decoder/mod.rs which is implemented in the png crate). I understand that inlining and other optimizations may make the perf report somewhat inaccurate/crude, but AFAIU things like fdeflate::decompress::Decompressor::read should only accidentally include things inlined from dependencies of fdeflate (like the simd-adler32 crate), but not from clients of fdeflate (like the png crate). I guess LTO can potentially deduplicate code across fdeflate and png and continue attributing profiling samples to fdeflate, but this seems unlikely for fn expand_paletted.

Maybe this means that the observed runtime delta (compared to zune-png) comes from other areas?

For the record, here are the perf report results I got for that paletted image:

  55.34%  decoder-1b5b1e4  decoder-1b5b1e46cb85776a  [.] fdeflate::decompress::Decompressor::read
  17.43%  decoder-1b5b1e4  decoder-1b5b1e46cb85776a  [.] <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
  12.61%  decoder-1b5b1e4  libc-2.31.so              [.] 0x00000000001602f2
   3.75%  decoder-1b5b1e4  [kernel.kallsyms]         [k] clear_page_erms
   2.56%  decoder-1b5b1e4  [kernel.kallsyms]         [k] read_hpet
   2.33%  decoder-1b5b1e4  decoder-1b5b1e46cb85776a  [.] crc32fast::specialized::pclmulqdq::calculate
   1.29%  decoder-1b5b1e4  decoder-1b5b1e46cb85776a  [.] fdeflate::decompress::Decompressor::build_tables
   0.70%  decoder-1b5b1e4  decoder-1b5b1e46cb85776a  [.] <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
   0.49%  decoder-1b5b1e4  decoder-1b5b1e46cb85776a  [.] simd_adler32::imp::avx2::imp::update_imp
...

@Shnatsel
Contributor Author

debug=true in Cargo.toml combined with perf --call-graph=dwarf should attribute inlined code correctly. I'll profile the decoding of this image later and post the results.

@Shnatsel
Contributor Author

You need to call decoder.set_transformations(png::Transformations::EXPAND) to activate conversion from the palette indices to RGBA. If this is not set, the conversion from palette indices to RGB is never performed, and will not show up on the profile at all.

If you set that decoder option, then 40% of the total decoding time is spent in expand_paletted - here's the profile: https://share.firefox.dev/422gfkL

It is measured on this image, meant to be a more realistic use case for indexed PNG: Stadt_Onex_2021_indexed_oxipng, originally sourced from Wikimedia and converted to a paletted image by myself.

It is possible that Chromium prefers to resolve the indices itself using the palette chunk instead of getting RGB or RGBA output from the decoder, in which case this bottleneck is not relevant to Chromium.

This hot loop is where 40% of the decoding time for this image is spent:

utils::unpack_bits(row, buffer, 3, info.bit_depth as u8, |i, chunk| {
    let rgb = palette
        .get(3 * i as usize..3 * i as usize + 3)
        .unwrap_or(&black);
    chunk[0] = rgb[0];
    chunk[1] = rgb[1];
    chunk[2] = rgb[2];
})

Code used for measurement: https://gist.github.com/Shnatsel/ec75b05205ef55c96b9d256fdc36c2ec

@anforowicz
Contributor

You need to call decoder.set_transformations(png::Transformations::EXPAND) to activate conversion from the palette indices to RGBA. If this is not set, the conversion from palette indices to RGB is never performed, and will not show up on the profile at all.

Ooops. You're right, my bad. I knew about that but forgot :-(.

If you set that decoder option, then 40% of the total decoding time is spent in expand_paletted - here's the profile: https://share.firefox.dev/422gfkL

Thanks for sharing.

It is measured on this image, meant to be a more realistic use case for indexed PNG: Stadt_Onex_2021_indexed_oxipng, originally sourced from Wikimedia and converted to a paletted image by myself.

Ack.

It is possible that Chromium prefers to resolve the indices itself using the palette chunk instead of getting RGB or RGBA output from the decoder, in which case this bottleneck is not relevant to Chromium.

No, Chromium also wants to expand into RGBA (at a high-level, there are some other unimportant details like BGRA and/or alpha-premul).

This hot loop is where 40% of the decoding time for this image is spent:

utils::unpack_bits(row, buffer, 3, info.bit_depth as u8, |i, chunk| {
    let rgb = palette
        .get(3 * i as usize..3 * i as usize + 3)
        .unwrap_or(&black);
    chunk[0] = rgb[0];
    chunk[1] = rgb[1];
    chunk[2] = rgb[2];
})

Code used for measurement: https://gist.github.com/Shnatsel/ec75b05205ef55c96b9d256fdc36c2ec

I've tried to replicate your results with the code at https://github.com/anforowicz/image-png/tree/palette-measurements, but I got different results. I am very curious what may be causing the difference.

Repro steps:

  1. Run the following

    # booted with isolcpus=nohz,domain,4-5
    # (I forgot to disable TurboBoost and move kernel work off of cores 4-5,
    # but hopefully this shouldn't matter for comparing how much a function takes in a profile)
    $ git reset --hard
    $ git checkout 1f57fd0cd050d1fa8ed43c3d94c781e6057aa483 # this is the `palette-measurements` branch
    $ rustup run stable rustc --version
    rustc 1.75.0 (82e1608df 2023-12-21)
    $ rustup run stable cargo build --bench=decoder --release   
    ...
    $ taskset --cpu-list 4-5 nice -n -19 sudo perf record --call-graph=dwarf target/release/deps/decoder-01bdb75be1fb05ce --bench --profile-time 10 big-palletted
    $ sudo perf script -F +pid > /tmp/test.perf
    ...
    
  2. Upload the profile to https://profiler.firefox.com/ (this is great - thanks for teaching me about this tool)

  3. Share the profile

In the profile I got, https://share.firefox.dev/3tOf0ZM, I can see the png::decoder::expand_paletted::_$u7b$$u7b$closure$u7d$$u7d$::hfd10b92927fc4892 closure in the call tree when using the inverted call stack, but it only accounts for 0.1% of the whole profile.

@Shnatsel
Contributor Author

If this is a benchmark, you need to set debug = true for the bench profile and not just the release one. Cargo uses the bench profile when compiling benchmarks:

Cargo has 4 built-in profiles: dev, release, test, and bench.

(from Cargo docs)
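Concretely, that means the Cargo.toml needs something like the following (a sketch; in recent Cargo versions the bench profile inherits from release, but setting it explicitly is unambiguous):

```toml
# Keep debug info in benchmark builds so the profiler can
# attribute samples to inlined functions.
[profile.bench]
debug = true
```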


Speaking of tools, samply is amazing. Try this:

cargo install samply
echo '0' | sudo tee /proc/sys/kernel/perf_event_paranoid  # to be able to profile as non-root
samply record /path/to/binary arg1 arg2

Its processed results are more accurate than perf's, it produces them much faster than perf script, it opens the Firefox Profiler automatically, and you can double-click flame graph bars while Samply is running to open the code view, which shows samples attributed to both lines of code and assembly instructions.

@Shnatsel
Contributor Author

Ah, I've found your problem: there are two different Decoder instances in the benchmarks and you applied the transform to one but not the other. The patched one:
https://github.com/anforowicz/image-png/blob/1f57fd0cd050d1fa8ed43c3d94c781e6057aa483/benches/decoder.rs#L60-L61
is not performing the actual decoding, this one is:
https://github.com/anforowicz/image-png/blob/1f57fd0cd050d1fa8ed43c3d94c781e6057aa483/benches/decoder.rs#L69
And you forgot to set the EXPAND transformation on it as well.

@anforowicz
Contributor

Ah, I've found your problem: there are two different Decoder instances in the benchmarks and you applied the transform to one but not the other.

Ugh. You're right, thank you for pointing this out. That's rather embarrassing (since I abandoned my earlier attempts at improving expand_paletted because of this incorrect approach to benchmarking).

For now I've opened #453 to cover the expand_paletted in the end-to-end benchmarks (adding one of the images you've pointed out above).

I'll try to find some time next week to also work on:

@Shnatsel
Contributor Author

The first step will be figuring out why the end-to-end benchmarks move, even though this was supposed to be a performance-neutral refactoring that just hides expand_paletted in a module that exposes a simple, testable, benchable API.

Probably because of inlining. More on inlining in Rust

You can use https://github.com/pacak/cargo-show-asm to inspect the generated assembly, or just objdump -d to keep it old school. You're mostly interested in whether a given function is present as a separate function or not (inlined ones disappear from the listing).
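A minimal illustration of the inlining point (a toy function, not code from the crate): marking a function #[inline(never)] keeps it as a separate symbol, so it stays visible in `objdump -d` output and receives its own samples in a profile instead of being folded into its callers.

```rust
// With #[inline(never)] this function remains a distinct symbol in the
// binary; without it, a small hot function like this is usually inlined
// and disappears from disassembly listings and profiler call trees.
#[inline(never)]
fn sum_bytes(data: &[u8]) -> u32 {
    data.iter().map(|&b| u32::from(b)).sum()
}

fn main() {
    assert_eq!(sum_bytes(&[1, 2, 3]), 6);
}
```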

@Shnatsel
Contributor Author

Fixed by #462
