Comparing the time it takes to load DXT/BCn compressed DDS images via
OIIO, I noticed that it's several times slower than in another codebase
(Blender). So I started profiling and fixing some low-hanging fruit.
- After decoding a BCn block (always 4x4 pixels), it was written into the
  destination image with quite a few branches in the innermost loop. The
  code still takes care not to write outside a destination image whose
  dimensions might not be multiples of 4, just with far fewer branches
  (see the first sketch after this list).
- BCn decompression is multi-threaded via parallel_for_chunked, using at
  least 8 block rows per job (i.e. 32 pixel rows); see the second sketch
  below. This is similar to the internal multi-threading done by the EXR
  and TIFF readers.
- Avoided one case of a std::vector resize (which zero-initializes the
  memory) immediately followed by overwriting that buffer with bytes read
  from the file. Switched to a unique_ptr for that array, similar to how
  other input plugins do it (third sketch below).
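
For the block write, here's a minimal sketch of the branch-reduction idea
(not the actual patch; write_block_rgba and the fixed RGBA layout are
illustrative assumptions). The bounds are clamped once per block, so the
inner loop is a plain memcpy:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy one decoded 4x4 RGBA block into the image,
// clamping the extents once per block instead of branching per pixel.
static void write_block_rgba(const uint8_t block[4][16], uint8_t* dst,
                             int width, int height, int bx, int by)
{
    const int x0 = bx * 4, y0 = by * 4;
    // Only blocks on the right/bottom edge of an image whose size is not
    // a multiple of 4 get clipped; for all interior blocks w == h == 4.
    const int w = std::min(4, width - x0);
    const int h = std::min(4, height - y0);
    for (int y = 0; y < h; ++y) {
        uint8_t* row = dst + ((size_t)(y0 + y) * width + x0) * 4;
        memcpy(row, block[y], (size_t)w * 4);  // branch-free inner copy
    }
}
```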
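For the multi-threading, a sketch of the chunked dispatch, assuming OIIO's
parallel_for_chunked from OpenImageIO/parallel.h (decode_block_row is a
hypothetical stand-in for the per-row BCn decode):

```cpp
#include <cstdint>
#include <OpenImageIO/parallel.h>

void decode_block_row(int64_t by);  // hypothetical: decodes one row of 4x4 blocks

void decode_all_rows(int height)  // image height in pixels
{
    const int64_t block_rows = (height + 3) / 4;
    // Each job covers at least 8 block rows (32 pixel rows), so thread
    // dispatch overhead is not paid per individual block row.
    OIIO::parallel_for_chunked(0, block_rows, /*chunksize=*/8,
        [](int64_t ybegin, int64_t yend) {
            for (int64_t by = ybegin; by < yend; ++by)
                decode_block_row(by);
        });
}
```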
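And the buffer allocation change boils down to this pattern (ioread here
is a stand-in for the plugin's actual file-read call):

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

bool ioread(void* data, size_t size);  // stand-in for the real file read

std::unique_ptr<uint8_t[]> load_compressed(size_t nbytes)
{
    // std::vector::resize() would zero-fill all nbytes here, only for the
    // read below to immediately overwrite them; new[] skips the zero-fill.
    std::unique_ptr<uint8_t[]> buf(new uint8_t[nbytes]);
    if (!ioread(buf.get(), nbytes))
        return nullptr;
    return buf;
}
```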
Performance data: I'm timing how long it takes to run "iinfo --hash" on
24 DDS files, each 4096x4096 pixels. Quite a lot of that time goes into
the SHA-1 hash calculation itself, which is unrelated to loading, so I
also include timings for just DDSInput::read_native_scanline, where all
the actual DDS loading/decoding happens.
Timings on a Windows 10 PC (VS2022 Release build, AMD Ryzen 5950X):
- BC1: total 3.68s -> 2.76s, read_native_scanline 1.54s -> 0.59s
- BC3: total 3.74s -> 2.84s, read_native_scanline 1.83s -> 0.66s
- BC4: total 1.26s -> 0.82s, read_native_scanline 0.74s -> 0.24s
- BC5: total 2.29s -> 1.57s, read_native_scanline 1.31s -> 0.46s
- BC6: total 7.50s -> 4.30s, read_native_scanline 4.67s -> 1.60s
- BC7: total 6.18s -> 3.09s, read_native_scanline 4.34s -> 0.84s
Possible future optimization: implement DDSInput::read_native_scanlines,
which could avoid one extra temporary memory buffer copy. Profiling shows
that clearing this extra buffer and eventually copying it into the user's
final destination pixels is a non-trivial cost. But it's also quite
involved to handle all the possible edge cases (e.g. the user requests
scanlines that don't start on 4-pixel row boundaries); the code would have
to do juggling similar to what the EXR reader does for "not on chunk
boundaries" reads (rough sketch below). It could also interact with the
DDS cubemap/volume texture reading ("tiled") code paths, for which there
is no test coverage at all right now. Maybe for some other day.
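
For reference, the boundary handling would start with something like this
(purely hypothetical sketch; none of it is implemented in this change):

```cpp
#include <utility>

// Expand a requested scanline range [ybegin, yend) to whole 4-row block
// rows; the reader would decode that expanded range into a temporary
// strip, then copy only rows [ybegin, yend) into the caller's buffer.
std::pair<int, int> block_aligned_range(int ybegin, int yend)
{
    const int yb = ybegin & ~3;      // round down to a 4-row boundary
    const int ye = (yend + 3) & ~3;  // round up to a 4-row boundary
    return { yb, ye };
}
```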