
Very slow decoding of paletted PNG images #393

Closed
Shnatsel opened this issue Apr 1, 2023 · 19 comments

Comments

@Shnatsel
Contributor

Shnatsel commented Apr 1, 2023

zune-png benchmarks show that the png crate is much slower than other decoders on indexed images - a whopping 3x slower than zune-png.

CPU profile shows that 71% of the time is spent in png::utils::unpack_bits, specifically this hot loop.

The code is full of indexing and doesn't look amenable to autovectorization. I think the entire function will have to be rewritten; it's probably a good idea to copy zune-png here.

@Shnatsel
Contributor Author

Shnatsel commented Apr 1, 2023

Measured on v0.17.8-rc with this image: speed_bench_palette

@Shnatsel
Contributor Author

Shnatsel commented Apr 1, 2023

zune-png uses a vectorization-friendly implementation of expansion, which explains the difference in performance. It can be found here: https://github.com/etemesi254/zune-image/blob/dev/zune-png/src/utils.rs

@fintelia
Contributor

fintelia commented Apr 2, 2023

The version in this crate handles paletted images with 1/2/4-bit indices, whereas zune-png seems not to. However, the overwhelming majority of images probably use 8-bit indices... so it is probably worth special-casing that path.
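To illustrate the suggested special case, here is a minimal sketch (hypothetical helper name, not the crate's actual code): when each input byte is already a full palette index, expansion becomes a single forward pass with no bit fiddling, which the optimizer handles much better than the general bit-unpacking loop.

```rust
// Hypothetical 8-bit fast path: one palette index per input byte,
// expanded to RGB in a simple forward pass over separate buffers.
fn expand_palette_8bit(indices: &[u8], palette: &[u8], out: &mut [u8]) {
    assert_eq!(out.len(), indices.len() * 3);
    let black = [0u8; 3];
    for (&idx, rgb_out) in indices.iter().zip(out.chunks_exact_mut(3)) {
        let base = 3 * idx as usize;
        // Out-of-range indices fall back to black, matching the crate's closure.
        let rgb = palette.get(base..base + 3).unwrap_or(&black);
        rgb_out.copy_from_slice(rgb);
    }
}

fn main() {
    let palette = [10u8, 20, 30, 40, 50, 60];
    let mut out = [0u8; 6];
    expand_palette_8bit(&[1, 0], &palette, &mut out);
    assert_eq!(out, [40, 50, 60, 10, 20, 30]);
}
```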

@Shnatsel
Contributor Author

Shnatsel commented Apr 2, 2023

Based on this comment, I understand that zune-png expands the bits to bytes first and then resolves those to the correct palette values in a second pass.

So it does handle lower bit depth indices correctly, but they're expanded to 8 bits before being used for palette lookups.
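A minimal sketch of that first pass as I understand it (names are mine, not zune-png's): sub-byte indices are widened to one byte per pixel in a simple forward loop, after which the 8-bit palette-lookup path can be reused unchanged.

```rust
// Widen packed 1/2/4-bit palette indices to one byte per pixel.
// 8-bit input is already in the target layout.
fn unpack_indices(packed: &[u8], bit_depth: u8) -> Vec<u8> {
    assert!(matches!(bit_depth, 1 | 2 | 4 | 8));
    if bit_depth == 8 {
        return packed.to_vec();
    }
    let mask = (1u8 << bit_depth) - 1;
    let per_byte = (8 / bit_depth) as usize;
    let mut out = Vec::with_capacity(packed.len() * per_byte);
    for &byte in packed {
        // The PNG spec packs the leftmost pixel into the most significant bits.
        for i in 0..per_byte {
            let shift = 8 - bit_depth as usize * (i + 1);
            out.push((byte >> shift) & mask);
        }
    }
    out
}

fn main() {
    // 0b10_01_01_10 at 2 bits/pixel -> indices 2, 1, 1, 2.
    assert_eq!(unpack_indices(&[0b1001_0110], 2), vec![2, 1, 1, 2]);
}
```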

@okaneco
Contributor

okaneco commented Apr 4, 2023

I thought maybe the step_by code wasn't optimizing well, but it doesn't seem like much can currently be improved.

My guess is that the bottleneck is the access pattern: the same buffer holds the indices and receives the expanded colors, written back-to-front. If a separate buffer with the indices were passed into the unpack_bits function, simpler iterators could be used, since you wouldn't have to hold indices at both ends of the buffer simultaneously. I'm not sure if there's an available vector on the current decoder struct that could be used for this purpose.

pub fn unpack_bits<F>(buf: &mut [u8], palette_indices: &[u8], channels: usize, bit_depth: u8, func: F)

Vectorization, or at least fewer bounds checks, seems a likely benefit. L36 can be simplified to remove a shift, but it won't make a difference.

image-png/src/utils.rs

Lines 27 to 38 in 2f53fc4

let i = (0..entries)
    .rev() // reverse iterator
    .flat_map(|idx|
        // this has to be reversed too
        (0..8).step_by(bit_depth.into())
            .zip(repeat(idx)))
    .skip(skip);
let j = (0..=buf.len() - channels).rev().step_by(channels);
for ((shift, i), j) in i.zip(j) {
    let pixel = (buf[i] & (mask << shift)) >> shift;
    func(pixel, &mut buf[j..(j + channels)])
}

The hot loop is calling the closure here. The indexing in the closure can be replaced with TryInto to remove some more bounds checks, but that also made no big difference that I could see.

utils::unpack_bits(buffer, 3, info.bit_depth as u8, |i, chunk| {
    let rgb = palette
        .get(3 * i as usize..3 * i as usize + 3)
        .unwrap_or(&black);
    chunk[0] = rgb[0];
    chunk[1] = rgb[1];
    chunk[2] = rgb[2];
})
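For reference, a sketch of the TryInto idea (a hypothetical standalone helper equivalent to the closure body above, not the crate's actual code): converting the slices to fixed-size array references up front lets the three-byte copy happen without per-element bounds checks.

```rust
use std::convert::TryInto;

// Copy one palette entry into a 3-byte output chunk, using array
// references instead of per-element indexing.
fn copy_rgb(palette: &[u8], i: u8, chunk: &mut [u8]) {
    let black = [0u8; 3];
    let base = 3 * i as usize;
    // Slice-to-array conversion replaces three checked index operations.
    let rgb: &[u8; 3] = palette
        .get(base..base + 3)
        .map(|s| s.try_into().unwrap())
        .unwrap_or(&black);
    let chunk: &mut [u8; 3] = chunk.try_into().unwrap();
    *chunk = *rgb;
}

fn main() {
    let palette = [1u8, 2, 3, 4, 5, 6];
    let mut chunk = [0u8; 3];
    copy_rgb(&palette, 1, &mut chunk);
    assert_eq!(chunk, [4, 5, 6]);
}
```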

I also tried unpacking the iterators to understand the logic and maybe get better performance.

It's definitely clumsy trying to convert that iterator chain to some loop format. I converted it to the following to see if there was a better way of special-casing the different depths; note that 8-bit will always have a shift of 0 and should never have a skip.

let entry_indices = (0..entries).rev().skip(usize::from(skip != 0));
let mut j = buf.len() - channels;
let mut curr = buf[entries - 1];

if skip != 0 {
    for shift in (0..8).step_by(bit_depth as usize).skip(skip) {
        let pixel = (curr >> shift) & mask;
        func(pixel, &mut buf[j..j + channels]);
        if j < channels {
            return;
        }
        j -= channels;
    }
}

for idx in entry_indices {
    curr = buf[idx];
    for shift in (0..8).step_by(bit_depth as usize) {
        let pixel = (curr >> shift) & mask;
        func(pixel, &mut buf[j..j + channels]);
        if j < channels {
            return;
        }
        j -= channels;
    }
}

@fintelia
Contributor

fintelia commented Apr 4, 2023

This reminded me that I had started working on more refactoring, which included changing the transformations to not operate in place. Just created #396 with the current state.

@anforowicz
Contributor

Looking at the top-500 websites, paletted images account for 35% of PNG images. Source:

OTOH, so far I have only had negative results when trying to improve paletted images: #416 (comment)

@Shnatsel
Contributor Author

Shnatsel commented Jan 7, 2024

It's the bit unpacking hot loop that's slowing things down. zune-png has already figured out a much more performant way to do it that gets auto-vectorized: https://github.com/etemesi254/zune-image/blob/54cc956ccc01ea942456c0dcebf8d97bda614666/crates/zune-png/src/utils.rs#L217-L317

It should be fairly straightforward to integrate into the png crate.

@okaneco
Contributor

okaneco commented Jan 7, 2024

It's the bit unpacking hot loop that's slowing things down.

That code was changed in #405 and looks like this now.

The distinction I'd make is that it's not the bit-unpacking as much as retrieving colors from the lookup table and filling those into the buffer. Trying to do it in 2 passes ended up slower than the current behavior of unpacking and then doing the lookup in 1 pass. The code didn't seem to get autovectorized even using chunks_exact at the time.

I'm not sure that it's straightforward or easy to get similar results without larger architectural changes, as mentioned in #416 (comment), where several different methods were tried unsuccessfully.

@Shnatsel
Contributor Author

Shnatsel commented Jan 7, 2024

It is clearly possible to do better here still, as evidenced by zune-png. Even with the latest code it is 50% faster on a big paletted image. Perhaps its code is worth a closer look.

@anforowicz
Contributor

It is clearly possible to do better here still, as evidenced by zune-png. Even with the latest code it is 50% faster on a big paletted image. Perhaps its code is worth a closer look.

FWIW, I've tried running perf record / perf report on the linked image and it seems that most time is spent below the png crate (i.e. not in fn expand_paletted in png/src/decoder/mod.rs which is implemented in the png crate). I understand that inlining and other optimizations may make the perf report somewhat inaccurate/crude, but AFAIU things like fdeflate::decompress::Decompressor::read should only accidentally include things inlined from dependencies of fdeflate (like the simd-adler32 crate), but not from clients of fdeflate (like the png crate). I guess LTO can potentially deduplicate code across fdeflate and png and continue attributing profiling samples to fdeflate, but this seems unlikely for fn expand_paletted.

Maybe this means that the observed runtime delta (compared to zune-png) comes from other areas?

For the record, here are the perf report results I got for that paletted image:

  55.34%  decoder-1b5b1e4  decoder-1b5b1e46cb85776a  [.] fdeflate::decompress::Decompressor::read
  17.43%  decoder-1b5b1e4  decoder-1b5b1e46cb85776a  [.] <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
  12.61%  decoder-1b5b1e4  libc-2.31.so              [.] 0x00000000001602f2
   3.75%  decoder-1b5b1e4  [kernel.kallsyms]         [k] clear_page_erms
   2.56%  decoder-1b5b1e4  [kernel.kallsyms]         [k] read_hpet
   2.33%  decoder-1b5b1e4  decoder-1b5b1e46cb85776a  [.] crc32fast::specialized::pclmulqdq::calculate
   1.29%  decoder-1b5b1e4  decoder-1b5b1e46cb85776a  [.] fdeflate::decompress::Decompressor::build_tables
   0.70%  decoder-1b5b1e4  decoder-1b5b1e46cb85776a  [.] <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
   0.49%  decoder-1b5b1e4  decoder-1b5b1e46cb85776a  [.] simd_adler32::imp::avx2::imp::update_imp
...

@Shnatsel
Contributor Author

debug=true in Cargo.toml combined with perf --call-graph=dwarf should attribute inlined code correctly. I'll profile the decoding of this image later and post the results.

@Shnatsel
Contributor Author

You need to call decoder.set_transformations(png::Transformations::EXPAND) to activate conversion from the palette indices to RGBA. If this is not set, the conversion from palette indices to RGB is never performed, and will not show up on the profile at all.

If you set that decoder option, then 40% of the total decoding time is spent in expand_paletted - here's the profile: https://share.firefox.dev/422gfkL

It is measured on this image, meant to be a more realistic use case for indexed PNG: Stadt_Onex_2021_indexed_oxipng, originally sourced from Wikimedia and converted to a paletted image by myself.

It is possible that Chromium prefers to resolve the indices itself using the palette chunk instead of getting RGB or RGBA output from the decoder, in which case this bottleneck is not relevant to Chromium.

This hot loop is where 40% of the decoding time for this image is spent:

utils::unpack_bits(row, buffer, 3, info.bit_depth as u8, |i, chunk| {
    let rgb = palette
        .get(3 * i as usize..3 * i as usize + 3)
        .unwrap_or(&black);
    chunk[0] = rgb[0];
    chunk[1] = rgb[1];
    chunk[2] = rgb[2];
})

Code used for measurement: https://gist.github.com/Shnatsel/ec75b05205ef55c96b9d256fdc36c2ec

@anforowicz
Contributor

You need to call decoder.set_transformations(png::Transformations::EXPAND) to activate conversion from the palette indices to RGBA. If this is not set, the conversion from palette indices to RGB is never performed, and will not show up on the profile at all.

Ooops. You're right, my bad. I knew about that but forgot :-(.

If you set that decoder option, then 40% of the total decoding time is spent in expand_paletted - here's the profile: https://share.firefox.dev/422gfkL

Thanks for sharing.

It is measured on this image, meant to be a more realistic use case for indexed PNG: Stadt_Onex_2021_indexed_oxipng, originally sourced from Wikimedia and converted to a paletted image by myself.

Ack.

It is possible that Chromium prefers to resolve the indices itself using the palette chunk instead of getting RGB or RGBA output from the decoder, in which case this bottleneck is not relevant to Chromium.

No, Chromium also wants to expand into RGBA (at a high-level, there are some other unimportant details like BGRA and/or alpha-premul).

This hot loop is where 40% of the decoding time for this image is spent:

utils::unpack_bits(row, buffer, 3, info.bit_depth as u8, |i, chunk| {
    let rgb = palette
        .get(3 * i as usize..3 * i as usize + 3)
        .unwrap_or(&black);
    chunk[0] = rgb[0];
    chunk[1] = rgb[1];
    chunk[2] = rgb[2];
})

Code used for measurement: https://gist.github.com/Shnatsel/ec75b05205ef55c96b9d256fdc36c2ec

I've tried to replicate your results with the code at https://github.com/anforowicz/image-png/tree/palette-measurements, but I got different results. I am very curious what may be causing the difference.

Repro steps:

  1. Run the following

    # booted with isolcpus=nohz,domain,4-5
    # (I forgot to disable TurboBoost and move kernel work off of cores 4-5,
    # but hopefully this shouldn't matter for comparing how much a function takes in a profile)
    $ git reset --hard
    $ git checkout 1f57fd0cd050d1fa8ed43c3d94c781e6057aa483 # this is the `palette-measurements` branch
    $ rustup run stable rustc --version
    rustc 1.75.0 (82e1608df 2023-12-21)
    $ rustup run stable cargo build --bench=decoder --release   
    ...
    $ taskset --cpu-list 4-5 nice -n -19 sudo perf record --call-graph=dwarf target/release/deps/decoder-01bdb75be1fb05ce --bench --profile-time 10 big-palletted
    $ sudo perf script -F +pid > /tmp/test.perf
    ...
    
  2. Upload the profile to https://profiler.firefox.com/ (this is great - thanks for teaching me about this tool)

  3. Share the profile

In the profile I got, https://share.firefox.dev/3tOf0ZM, I can see the png::decoder::expand_paletted::_$u7b$$u7b$closure$u7d$$u7d$::hfd10b92927fc4892 closure in the call tree when using the inverted call stack, but it only accounts for 0.1% of the whole profile.

@Shnatsel
Contributor Author

If this is a benchmark, you need to set debug = true for the bench profile and not just the release one. Cargo uses the bench profile when compiling benchmarks:

Cargo has 4 built-in profiles: dev, release, test, and bench.

(from Cargo docs)
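Concretely, that means the Cargo.toml needs something like the following (a sketch; in recent Cargo versions the bench profile inherits from release, but setting it explicitly is unambiguous):

```toml
# Keep debug info in benchmark builds so the profiler can
# attribute samples to inlined functions.
[profile.bench]
debug = true
```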


Speaking of tools, samply is amazing. Try this:

cargo install samply
echo '0' | sudo tee /proc/sys/kernel/perf_event_paranoid  # to be able to profile as non-root
samply record /path/to/binary arg1 arg2

Its processed results are more accurate than perf's, it produces them much faster than perf script, it opens the Firefox Profiler automatically, and you can double-click flame graph bars while Samply is running to open the code view, which shows samples attributed to both lines of code and assembly instructions.

@Shnatsel
Contributor Author

Ah, I've found your problem: there are two different Decoder instances in the benchmarks and you applied the transform to one but not the other. The patched one:
https://github.com/anforowicz/image-png/blob/1f57fd0cd050d1fa8ed43c3d94c781e6057aa483/benches/decoder.rs#L60-L61
is not performing the actual decoding, this one is:
https://github.com/anforowicz/image-png/blob/1f57fd0cd050d1fa8ed43c3d94c781e6057aa483/benches/decoder.rs#L69
And you forgot to set the EXPAND transformation on it as well.

@anforowicz
Contributor

Ah, I've found your problem: there are two different Decoder instances in the benchmarks and you applied the transform to one but not the other.

Ugh. You're right, thank you for pointing this out. That's rather embarrassing (since I abandoned my earlier attempts at improving expand_paletted because of this incorrect approach to benchmarking).

For now I've opened #453 to cover the expand_paletted in the end-to-end benchmarks (adding one of the images you've pointed out above).

I'll try to find some time next week to also work on:

@Shnatsel
Contributor Author

The first step will be figuring out why the end-to-end benchmarks move, even though this was supposed to be a performance-neutral refactoring that just hides expand_paletted in a module that exposes a simple, testable, benchable API.

Probably because of inlining. More on inlining in Rust

You can use https://github.com/pacak/cargo-show-asm to inspect the generated assembly, or just objdump -d to keep it old school. You're mostly interested in whether a given function is present as a separate function or not (inlined ones disappear from the listing).
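A minimal illustration of the inlining point (a toy function, not code from the crate): marking a function #[inline(never)] keeps it as a separate symbol, so it stays visible in `objdump -d` output and receives its own samples in a profile instead of being folded into its callers.

```rust
// With #[inline(never)] this function remains a distinct symbol in the
// binary; without it, a small hot function like this is usually inlined
// and disappears from disassembly listings and profiler call trees.
#[inline(never)]
fn sum_bytes(data: &[u8]) -> u32 {
    data.iter().map(|&b| u32::from(b)).sum()
}

fn main() {
    assert_eq!(sum_bytes(&[1, 2, 3]), 6);
}
```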

@Shnatsel
Contributor Author

Fixed by #462
