DDS: optimize loading of compressed images (3-5x) #3583

Merged
merged 1 commit on Oct 6, 2022
Commits on Oct 5, 2022

  1. DDS: optimize loading of compressed images (3-5x)

    Comparing the time it takes to load DXT/BCn compressed DDS images via
    OIIO, I noticed that it's several times slower than in another codebase
    (Blender). So I started profiling and fixing some low-hanging fruit.
    
    - After decoding a BCn block (always 4x4 pixels), it was written into
      the destination image with quite a few branches in the innermost loop.
      The code still takes care not to write outside a destination image
      whose size might not be a multiple of 4, just with far fewer branches
      (see the first sketch after this list).
    - BCn decompression is multi-threaded via parallel_for_chunked, using
      at least 8 block rows per job (i.e. 32 pixel rows). This is similar
      to the internal multi-threading done by the EXR or TIFF readers
      (second sketch below).
    - Avoided one case of a std::vector resize (which zero-fills memory)
      immediately followed by overwriting that buffer with bytes read from
      the file. Switched to a unique_ptr for that array, similar to what
      other input plugins do (third sketch below).
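
    A minimal sketch of the block-write idea (not the actual plugin code),
    assuming a 4-channel RGBA destination: the clamping against the image
    edge is computed once per block, so the inner loop is a plain row copy
    with no per-pixel branches.

    ```cpp
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Write one decoded 4x4 BCn block into an RGBA destination image.
    static void write_block_rgba(uint8_t* dst, int dst_width, int dst_height,
                                 int block_x, int block_y,
                                 const uint8_t decoded[4 * 4 * 4])
    {
        const int x0 = block_x * 4, y0 = block_y * 4;
        const int w = std::min(4, dst_width - x0);   // partial block at right edge
        const int h = std::min(4, dst_height - y0);  // partial block at bottom edge
        for (int row = 0; row < h; ++row) {
            const uint8_t* src = decoded + row * 4 * 4;
            uint8_t* out = dst + ((size_t)(y0 + row) * dst_width + x0) * 4;
            std::memcpy(out, src, (size_t)w * 4);
        }
    }
    ```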
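
    Rough sketch of the chunked threading, assuming OIIO's
    parallel_for_chunked from <OpenImageIO/parallel.h>; decode_bcn_block is
    a hypothetical stand-in for the per-block decode, not a real function
    in the plugin.

    ```cpp
    #include <cstdint>
    #include <OpenImageIO/parallel.h>

    // Hypothetical helper: decodes one 4x4 block and writes it into the image.
    void decode_bcn_block(const uint8_t* compressed, uint8_t* rgba,
                          int width, int height, int bx, int by);

    void decompress_all_blocks(const uint8_t* compressed, uint8_t* rgba,
                               int width, int height)
    {
        const int xblocks = (width + 3) / 4;
        const int yblocks = (height + 3) / 4;
        // Each job gets at least 8 block rows (32 pixel rows), so small
        // images do not pay threading overhead for tiny work items.
        OIIO::parallel_for_chunked(0, yblocks, 8,
            [&](int64_t ybegin, int64_t yend) {
                for (int64_t by = ybegin; by < yend; ++by)
                    for (int bx = 0; bx < xblocks; ++bx)
                        decode_bcn_block(compressed, rgba, width, height,
                                         bx, (int)by);
            });
    }
    ```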
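
    Sketch of the buffer change; the fread-based I/O and the names here are
    illustrative only, not the plugin's exact read path.

    ```cpp
    #include <cstdint>
    #include <cstdio>
    #include <memory>

    // The old approach resized a std::vector (which zero-fills) and then
    // immediately overwrote it from the file; a plain heap array skips
    // the redundant zero-fill.
    static bool read_compressed(FILE* file, size_t nbytes,
                                std::unique_ptr<uint8_t[]>& out)
    {
        out.reset(new uint8_t[nbytes]);  // no zero-fill
        return fread(out.get(), 1, nbytes, file) == nbytes;
    }
    ```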
    
    Performance data: I'm timing how long it takes to do "iinfo --hash" on
    24 DDS files, each 4096x4096 in size. Quite a lot of that time is taken
    by the SHA hash calculation itself, which is unrelated to loading, so I
    also include timings for just DDSInput::read_native_scanline, where all
    the actual DDS loading/decoding happens.
    
    Timings on a Windows 10 PC (VS2022 Release build, AMD Ryzen 5950X):
    - BC1: total 3.68s -> 2.76s, read_native_scanline 1.54s -> 0.59s
    - BC3: total 3.74s -> 2.84s, read_native_scanline 1.83s -> 0.66s
    - BC4: total 1.26s -> 0.82s, read_native_scanline 0.74s -> 0.24s
    - BC5: total 2.29s -> 1.57s, read_native_scanline 1.31s -> 0.46s
    - BC6: total 7.50s -> 4.30s, read_native_scanline 4.67s -> 1.60s
    - BC7: total 6.18s -> 3.09s, read_native_scanline 4.34s -> 0.84s
    
    Possible future optimization: implement DDSInput::read_native_scanlines,
    which could avoid one extra temporary memory buffer copy. That extra
    buffer clear & eventual copy from it into the final user destination
    pixels shows up as a non-trivial cost in profiling. But it's also quite
    involved to handle all the possible edge cases (e.g. the user requesting
    scanlines not on 4-pixel row boundaries); the code would have to do
    similar juggling to what the EXR reader does to handle "not on chunk
    boundaries" reads. It could also interact with the DDS cubemap/volume
    texture reading ("tiled") code paths, for which there is no test
    coverage at all right now. Maybe for some other day.
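
    Just to make that idea concrete, a very rough sketch of what such a
    read_native_scanlines could look like (hypothetical, nothing like this
    exists in the plugin today): round the requested range out to 4-row
    block boundaries, decode those block rows into a temporary strip, and
    copy only the requested scanlines into the caller's buffer.

    ```cpp
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hypothetical helper: decode all blocks covering pixel rows
    // [yblock_begin*4, yblock_end*4) into 'strip' (RGBA, width pixels wide).
    void decode_block_rows(int yblock_begin, int yblock_end, int width,
                           std::vector<uint8_t>& strip);

    // Sketch only: decode whole block rows, then hand back just the
    // scanlines [ybegin, yend) that were actually requested.
    bool read_scanlines_sketch(int ybegin, int yend, int width, uint8_t* data)
    {
        const int yb0 = ybegin / 4;        // first block row touched
        const int yb1 = (yend + 3) / 4;    // one past the last block row
        std::vector<uint8_t> strip;
        decode_block_rows(yb0, yb1, width, strip);
        for (int y = ybegin; y < yend; ++y) {
            const uint8_t* src = strip.data() + (size_t)(y - yb0 * 4) * width * 4;
            std::memcpy(data + (size_t)(y - ybegin) * width * 4, src,
                        (size_t)width * 4);
        }
        return true;
    }
    ```
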
    aras-p committed Oct 5, 2022
    Commit ae919bf