DDS: optimize loading of compressed images (3-5x) #3583
Conversation
Comparing the time it takes to load DXT/BCn compressed DDS images via OIIO, I noticed that it's several times slower than in another codebase (Blender). So I started profiling and fixing some low-hanging fruit:

- After decoding a BCn block (always 4x4 pixels), it was written into the destination image using quite a few branches in the innermost loop. Now it still takes care not to write outside a destination image that might not be a multiple of 4 in size, just with far fewer branches (see the sketch below).
- BCn decompression is multi-threaded via parallel_for_chunked, using at least 8 block rows per job (i.e. 32 pixel rows). This is similar to the internal multi-threading done by the EXR and TIFF readers.
- Avoid one case of a std::vector resize (which clears memory) immediately followed by overwriting that buffer with bytes read from the file. Switched to using unique_ptr for that array, similar to how other input plugins do it (also sketched below).

Performance data: I'm timing how long it takes to do "iinfo --hash" on 24 DDS files, each 4096x4096 in size. Quite a lot of that time is taken by the SHA hash calculation itself, which is unrelated to loading, so I also have timing info for just DDSInput::read_native_scanline, where all the actual DDS loading/decoding happens. Timings on PC (Windows 10, VS2022 Release build), AMD Ryzen 5950X:

- BC1: total 3.68s -> 2.76s, read_native_scanline 1.54s -> 0.59s
- BC3: total 3.74s -> 2.84s, read_native_scanline 1.83s -> 0.66s
- BC4: total 1.26s -> 0.82s, read_native_scanline 0.74s -> 0.24s
- BC5: total 2.29s -> 1.57s, read_native_scanline 1.31s -> 0.46s
- BC6: total 7.50s -> 4.30s, read_native_scanline 4.67s -> 1.60s
- BC7: total 6.18s -> 3.09s, read_native_scanline 4.34s -> 0.84s

Possible future optimization: implement DDSInput::read_native_scanlines, which could avoid one extra temporary memory buffer copy. That extra buffer clear, and the eventual copy from it into the user's final destination pixels, is a non-trivial cost in the profiles. But it's also quite involved to handle all the possible edge cases (e.g. the user requests scanlines not on 4-pixel row boundaries); the code would have to do juggling similar to what the EXR reader does for "not on chunk boundaries" reads. It could also interact with the DDS cubemap/volume texture ("tiled") reading code paths, for which there is no test coverage at all right now. Maybe for some other day.
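To illustrate the first point, here is a minimal sketch of writing a decoded 4x4 block with the bounds checks hoisted out of the per-pixel loop. The names here (write_block and its parameters) are hypothetical, not the PR's actual code:

#include <algorithm>
#include <cstdint>
#include <cstring>

constexpr int kBlockSize = 4;

// Write one decoded kBlockSize x kBlockSize block into an image whose
// width/height need not be multiples of 4. The clamping happens once per
// block instead of once per pixel, and each surviving row is one memcpy.
void write_block(uint8_t* dst, int width, int height, int channels,
                 int bx, int by,        // block coordinates, in blocks
                 const uint8_t* block)  // decoded block pixels
{
    const int x0 = bx * kBlockSize, y0 = by * kBlockSize;
    const int copyW = std::min(kBlockSize, width - x0);   // columns inside image
    const int copyH = std::min(kBlockSize, height - y0);  // rows inside image
    for (int row = 0; row < copyH; ++row)
        std::memcpy(dst + (size_t(y0 + row) * width + x0) * channels,
                    block + size_t(row) * kBlockSize * channels,
                    size_t(copyW) * channels);
}

And the third point in miniature -- std::vector's resize value-initializes (zeroes) every byte before the file read overwrites it, while a plain array behind a unique_ptr is left uninitialized. Again just a sketch of the idea, not the plugin's actual code:

#include <memory>

std::unique_ptr<unsigned char[]> alloc_read_buffer(size_t n)
{
    // before: std::vector<unsigned char> buf(n);  // zero-fills n bytes first
    return std::unique_ptr<unsigned char[]>(new unsigned char[n]);  // uninitialized
    // ...the caller then reads n bytes from the file straight into the buffer
}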
Huh, the two CI failures are the most curious. I don't yet quite see how they would be related to the single .cpp source file change here.
Don't sweat the failure on "unit_timer" -- every once in a while we get a slow / overloaded GHA runner and it throws this test off enough to fail its sanity check. It is not important. Not sure what happened in the "bleeding edge" test, so I'm triggering it to re-run. Sometimes there are spurious failures because the runners fail to configure properly, have a bad package installed, or fail to install a dependency. The environments on GHA aren't super locked down, and the fact that some deps are installed on the fly means all sorts of rare failures can happen. I just rerun and hope for the best.
Ah, looks like the timer test has passed on a rerun. But the bleeding edge linker failure has not; however, I see it's also happening on CI runs of other branches here in OIIO.
The bleeding edge test error appears related to libtiff -- which we build from source at the tip of its master branch for this test, as we do for several other dependencies. That's why it's "bleeding edge." Stuff breaks there any time somebody checks a bug into one of those packages. Sometimes we're even the ones to alert them to their mistake! But anyway, it seems unrelated to your change.
Awesome, LGTM!
const int widthInBlocks = (width + kBlockSize - 1) / kBlockSize;
const int heightInBlocks = (height + kBlockSize - 1) / kBlockSize;
paropt opt = paropt(0, paropt::SplitDir::Y, 8);
parallel_for_chunked(
Nice use of parallel_for_chunked!
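For context, the call might continue along these lines -- an illustrative sketch, not the PR's exact body; decode_bcn_block is a hypothetical stand-in for the decoder, and write_block is the clamped block write sketched earlier:

// Each job gets whole rows of 4x4 blocks, so jobs touch disjoint pixel rows
// and need no synchronization. Chunk size 0 lets the library pick, subject
// to the paropt above (at least 8 block rows, i.e. 32 pixel rows, per job).
parallel_for_chunked(
    0, heightInBlocks, /*chunksize=*/0,
    [&](int64_t yBlockBegin, int64_t yBlockEnd) {
        for (int64_t by = yBlockBegin; by < yBlockEnd; ++by)
            for (int bx = 0; bx < widthInBlocks; ++bx) {
                uint8_t block[kBlockSize * kBlockSize * 4];  // one decoded RGBA block
                decode_bcn_block(bx, (int)by, block);  // hypothetical decoder call
                write_block(dst, width, height, /*channels=*/4, bx, (int)by, block);
            }
    },
    opt);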
Aside for @aras-p to know next time this comes up: a good way to time how long it takes to do a full read with minimal conversions and no SHA-1 overhead is
Tests
Locally, this passes the extended DDS test suite from #3581 for me. The code changes are purely an optimization; no behavior changes.
Checklist:

- I have previously submitted a Contributor License Agreement (individual, and if there is any way my employers might think my programming belongs to them, then also corporate).
- I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).