Comparing the time it takes to load DXT/BCn compressed DDS images via
OIIO, I noticed that it's several times slower than in another codebase
(Blender). So I started profiling and fixing some low-hanging fruit.
- After decoding a BCn block (always 4x4 pixels), it was written into the
  destination image with quite a few branches in the innermost loop. The
  code still takes care not to write outside a destination image whose
  dimensions might not be multiples of 4, just with far fewer branches
  (see the first sketch after this list).
- BCn decompression is multi-threaded via parallel_for_chunked, using at
  least 8 block rows per job (i.e. 32 pixel rows); see the second sketch
  below. This is similar to the internal multi-threading done by the EXR
  and TIFF readers.
- Avoided one case of a std::vector resize (which zero-initializes the
  memory) immediately followed by overwriting that buffer with bytes read
  from the file. Switched to a unique_ptr for that array, similar to how
  other input plugins do it (third sketch below).
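
For the block write, here's a minimal sketch of the branch-reduction idea
(not the actual patch; write_block_rgba and the fixed RGBA layout are
illustrative assumptions). The bounds are clamped once per block, so the
inner loop is a plain memcpy:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy one decoded 4x4 RGBA block into the image,
// clamping the extents once per block instead of branching per pixel.
static void write_block_rgba(const uint8_t block[4][16], uint8_t* dst,
                             int width, int height, int bx, int by)
{
    const int x0 = bx * 4, y0 = by * 4;
    // Only blocks on the right/bottom edge of an image whose size is not
    // a multiple of 4 get clipped; for all interior blocks w == h == 4.
    const int w = std::min(4, width - x0);
    const int h = std::min(4, height - y0);
    for (int y = 0; y < h; ++y) {
        uint8_t* row = dst + ((size_t)(y0 + y) * width + x0) * 4;
        memcpy(row, block[y], (size_t)w * 4);  // branch-free inner copy
    }
}
```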
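For the multi-threading, a sketch of the chunked dispatch, assuming OIIO's
parallel_for_chunked from OpenImageIO/parallel.h (decode_block_row is a
hypothetical stand-in for the per-row BCn decode):

```cpp
#include <cstdint>
#include <OpenImageIO/parallel.h>

void decode_block_row(int64_t by);  // hypothetical: decodes one row of 4x4 blocks

void decode_all_rows(int height)  // image height in pixels
{
    const int64_t block_rows = (height + 3) / 4;
    // Each job covers at least 8 block rows (32 pixel rows), so thread
    // dispatch overhead is not paid per individual block row.
    OIIO::parallel_for_chunked(0, block_rows, /*chunksize=*/8,
        [](int64_t ybegin, int64_t yend) {
            for (int64_t by = ybegin; by < yend; ++by)
                decode_block_row(by);
        });
}
```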
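And the buffer allocation change boils down to this pattern (ioread here
is a stand-in for the plugin's actual file-read call):

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

bool ioread(void* data, size_t size);  // stand-in for the real file read

std::unique_ptr<uint8_t[]> load_compressed(size_t nbytes)
{
    // std::vector::resize() would zero-fill all nbytes here, only for the
    // read below to immediately overwrite them; new[] skips the zero-fill.
    std::unique_ptr<uint8_t[]> buf(new uint8_t[nbytes]);
    if (!ioread(buf.get(), nbytes))
        return nullptr;
    return buf;
}
```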
Performance data: I'm timing how long it takes to run "iinfo --hash" on
24 DDS files, each 4096x4096 pixels. Quite a lot of that time goes into
the SHA-1 hash calculation itself, which is unrelated to loading, so I
also include timings for just DDSInput::read_native_scanline, where all
the actual DDS loading/decoding happens.
Timings on a Windows 10 PC (VS2022 Release build, AMD Ryzen 5950X):
- BC1: total 3.68s -> 2.76s, read_native_scanline 1.54s -> 0.59s
- BC3: total 3.74s -> 2.84s, read_native_scanline 1.83s -> 0.66s
- BC4: total 1.26s -> 0.82s, read_native_scanline 0.74s -> 0.24s
- BC5: total 2.29s -> 1.57s, read_native_scanline 1.31s -> 0.46s
- BC6: total 7.50s -> 4.30s, read_native_scanline 4.67s -> 1.60s
- BC7: total 6.18s -> 3.09s, read_native_scanline 4.34s -> 0.84s
Possible future optimization: implement DDSInput::read_native_scanlines,
which could avoid one extra temporary memory buffer copy. Profiling shows
that clearing this extra buffer and eventually copying it into the user's
final destination pixels is a non-trivial cost. But it's also quite
involved to handle all the possible edge cases (e.g. the user requests
scanlines that don't start on 4-pixel row boundaries); the code would have
to do juggling similar to what the EXR reader does for "not on chunk
boundaries" reads (rough sketch below). It could also interact with the
DDS cubemap/volume texture reading ("tiled") code paths, for which there
is no test coverage at all right now. Maybe for some other day.
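
For reference, the boundary handling would start with something like this
(purely hypothetical sketch; none of it is implemented in this change):

```cpp
#include <utility>

// Expand a requested scanline range [ybegin, yend) to whole 4-row block
// rows; the reader would decode that expanded range into a temporary
// strip, then copy only rows [ybegin, yend) into the caller's buffer.
std::pair<int, int> block_aligned_range(int ybegin, int yend)
{
    const int yb = ybegin & ~3;      // round down to a 4-row boundary
    const int ye = (yend + 3) & ~3;  // round up to a 4-row boundary
    return { yb, ye };
}
```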