DDS: optimize loading of compressed images (3-5x) #3583

Merged: 1 commit merged into AcademySoftwareFoundation:master on Oct 6, 2022

Conversation

@aras-p (Contributor) commented Oct 5, 2022

Description

Comparing the time it takes to load DXT/BCn compressed DDS images via OIIO, I noticed that it is several times slower than in another codebase (Blender). So I started profiling and fixing some low-hanging fruit.

  • After decoding a BCn block (always 4x4 pixels), it was written into the destination image using quite a few branches in the innermost loop. The new code still takes care not to write outside a destination image whose dimensions might not be multiples of 4, just with far fewer branches (see the sketch after this list).
  • BCn decompression is multi-threaded via parallel_for_chunked, using at least 8 block rows per job (i.e. 32 pixel rows). This is similar to the internal multi-threading done by the EXR and TIFF readers.
  • Avoided one case of a std::vector resize (which zero-clears the memory) immediately followed by overwriting that buffer with bytes read from the file. Switched to a unique_ptr for that array, similar to how other input plugins do it.
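A minimal sketch of the block-write idea from the first bullet, using hypothetical names (write_block and its parameters are made up for illustration; this is not the actual DDS reader code):

// Illustrative sketch only: copy a decoded 4x4 BCn block into a destination
// image whose width/height may not be multiples of 4. The copy extent is
// clamped once per block, so the inner loop needs no per-pixel bounds checks.
#include <algorithm>
#include <cstddef>
#include <cstring>

constexpr int kBlockSize = 4;  // BCn blocks are always 4x4 pixels

static void write_block(const unsigned char* block,  // decoded 4x4 block, interleaved channels
                        unsigned char* dst,          // destination image, interleaved channels
                        int bx, int by,              // block coordinates
                        int width, int height,       // image size in pixels
                        int channels)
{
    // Clamp the copy extent once per block instead of branching per pixel.
    const int copy_w = std::min(kBlockSize, width - bx * kBlockSize);
    const int copy_h = std::min(kBlockSize, height - by * kBlockSize);
    const std::size_t dst_stride = std::size_t(width) * channels;
    unsigned char* dst_row
        = dst
          + (std::size_t(by) * kBlockSize * width + std::size_t(bx) * kBlockSize)
                * channels;
    for (int y = 0; y < copy_h; ++y) {
        // One memcpy per block row replaces per-pixel branching.
        std::memcpy(dst_row, block + y * kBlockSize * channels,
                    std::size_t(copy_w) * channels);
        dst_row += dst_stride;
    }
}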

Performance data: I'm timing how long it takes to run iinfo --hash on 24 DDS files, each 4096x4096 in size. Quite a lot of that time goes into the SHA hash calculation itself, which is unrelated to loading, so I also include timings for just DDSInput::read_native_scanline, where all the actual DDS loading/decoding happens.

Timings on a PC (Windows 10, VS2022 Release build, AMD Ryzen 5950X):

  • BC1: total 3.68s -> 2.76s, read_native_scanline 1.54s -> 0.59s
  • BC3: total 3.74s -> 2.84s, read_native_scanline 1.83s -> 0.66s
  • BC4: total 1.26s -> 0.82s, read_native_scanline 0.74s -> 0.24s
  • BC5: total 2.29s -> 1.57s, read_native_scanline 1.31s -> 0.46s
  • BC6: total 7.50s -> 4.30s, read_native_scanline 4.67s -> 1.60s
  • BC7: total 6.18s -> 3.09s, read_native_scanline 4.34s -> 0.84s

Possible future optimization: implement DDSInput::read_native_scanlines, which could avoid one extra temporary memory buffer copy. That extra buffer clear and the eventual copy from it into the final user destination pixels is a non-trivial cost according to profiling. But it is also quite involved to handle all the possible edge cases (e.g. the user requests scanlines not on 4-pixel row boundaries); the code would have to do similar juggling to what the EXR reader does for "not on chunk boundaries" reads (a rough sketch of the boundary math is below). It could potentially also interact with the DDS cubemap/volume texture reading ("tiled") code paths, for which there is no test coverage at all right now. Maybe for some other day.
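For illustration, here is a rough sketch of the block-row boundary math such a read_native_scanlines would need; the names are hypothetical and this is not OIIO code:

// Map a requested scanline range [ybegin, yend) onto whole 4-row BCn block
// rows: the reader would decode block rows [first, end), then copy only
// scanlines [ybegin, yend) from that decoded region into the caller's buffer.
struct BlockRowRange {
    int first;  // first block row covering the request
    int end;    // one past the last block row covering the request
};

static BlockRowRange block_rows_covering(int ybegin, int yend /* exclusive */)
{
    constexpr int kBlockSize = 4;  // BCn block height in pixels
    BlockRowRange r;
    r.first = ybegin / kBlockSize;                   // round down to a block boundary
    r.end   = (yend + kBlockSize - 1) / kBlockSize;  // round up to a block boundary
    return r;
}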

Tests

Locally, this passes the extended DDS test suite from #3581. The code changes are purely an optimization; there are no behavior changes.

Checklist:

  • I have read the contribution guidelines.
  • If this is more extensive than a small change to existing code, I
    have previously submitted a Contributor License Agreement
    (individual, and if there is any way my
    employers might think my programming belongs to them, then also
    corporate).
  • I have updated the documentation, if applicable.
  • I have ensured that the change is tested somewhere in the testsuite
    (adding new test cases if necessary).
  • My code follows the prevailing code style of this project.

@aras-p marked this pull request as ready for review October 5, 2022 17:15
@aras-p (Contributor, author) commented Oct 5, 2022

Huh, the two CI failures are curious. I don't yet quite see how they could be related to the single .cpp source file change here.

The Ubuntu bleeding edge gcc12 C++20 py3.10 OCIO/libtiff/exr-master boost1.74 avx2 one fails linking lib/libOpenImageIO.so.2.5.0 with an error coming from libtiff (?):

/usr/bin/ld: /home/runner/work/oiio/oiio/ext/dist/lib/libtiff.a(tif_unix.c.o): warning: relocation against `stderr@@GLIBC_2.2.5' in read-only section `.text'
/usr/bin/ld: /home/runner/work/oiio/oiio/ext/dist/lib/libtiff.a(tif_unix.c.o): relocation R_X86_64_PC32 against symbol `stderr@@GLIBC_2.2.5' can not be used when making a shared object; recompile with -fPIC

The failure on windows-2019 VS2019 is the unit_timer test failing:

141/157 Test #144: unit_timer ..............................***Failed    2.90 sec
Timer begin/end cost is 2.13647e+07 /sec
Out of 10000000 queries, 7785692 had no time
Longest was 0.0007168 s.
Checkpoint: All 0.438635 selective 0.150176
D:/a/oiio/oiio/src/libutil/timer_test.cpp:105:
FAILED: all() == 0.5
	values were '0.753666' and '0.5', diff was 0.253656
D:/a/oiio/oiio/src/libutil/timer_test.cpp:112:
FAILED: all() == 0.5
	values were '0.753693' and '0.5', diff was 0.253692
D:/a/oiio/oiio/src/libutil/timer_test.cpp:116:
FAILED: all() == 0.6
	values were '0.897358' and '0.6', diff was 0.297346
Checkpoint2: All 0.897369 selective 0.301812
string ctr      :    0.0 ns (+/- 0.0ns), unreliable
usleep(1000)    :   40.4 ms (+/-28.2ms),    0.0 k/s
add             :    1.6 ns (+/- 0.0ns),  624.6 M/s
cos             :    5.6 ns (+/- 0.0ns),  178.1 M/s
acos            :    8.5 ns (+/- 0.2ns),  117.1 M/s
fast_acos       :    3.2 ns (+/- 0.0ns),  311.9 M/s
sqrt            :    1.6 ns (+/- 0.0ns),  623.7 M/s
simd sqrt       :    1.2 ns (+/- 0.0ns), 3325.0 Mvals/s, 831.2 Mcalls/s
ERRORS!

@lgritz (Collaborator) commented Oct 5, 2022

Don't sweat the failure on "unit_timer" -- every once in a while we get a slow / overloaded GHA runner and it throws this test off enough to fail its sanity check. It is not important.

Not sure what happened in the "bleeding edge" test, so I'm triggering it to re-run. Sometimes there are spurious failures because the runners fail to configure properly, have a bad package installed, or fail to install a dependency. The environments on GHA aren't super locked down, and the fact that some deps are installed on the fly means all sorts of rare failures can happen. I just rerun and hope for the best.

@aras-p (Contributor, author) commented Oct 6, 2022

Ah, looks like the timer test passed on a rerun. The bleeding-edge linker failure did not; however, I see it's also happening on CI runs of other branches here in OIIO.

@lgritz (Collaborator) commented Oct 6, 2022

The bleeding edge test error appears related to libtiff -- which we build from source at the tip of its master for this test, as we do for several other dependencies. That's why it's bleeding edge. Stuff breaks there any time somebody checks a bug into one of those packages. Sometimes we're even the ones to alert them to their mistake! But anyway, it seems unrelated to your change.

@lgritz (Collaborator) left a review:
Awesome, LGTM!

// Image size rounded up to whole 4x4 blocks.
const int widthInBlocks = (width + kBlockSize - 1) / kBlockSize;
const int heightInBlocks = (height + kBlockSize - 1) / kBlockSize;
// Parallelize over block rows, with at least 8 block rows (32 pixel rows) per job.
paropt opt = paropt(0, paropt::SplitDir::Y, 8);
parallel_for_chunked(
@lgritz (Collaborator) commented on this diff hunk:
Nice use of parallel_for_chunked!
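For readers unfamiliar with the API, a minimal sketch of dispatching block-row decoding with parallel_for_chunked (decode_block_row is a hypothetical stand-in; the real PR code differs in its arguments and body):

// Minimal sketch, not the actual PR code: decode rows of 4x4 BCn blocks in
// parallel, asking for chunks of at least 8 block rows (32 pixel rows).
#include <OpenImageIO/parallel.h>
#include <cstdint>

static void decode_block_row(int64_t /*block_row*/)
{
    // ... hypothetical: decode one row of 4x4 blocks into the output buffer ...
}

static void decode_blocks_parallel(int heightInBlocks)
{
    using namespace OIIO;
    paropt opt = paropt(0, paropt::SplitDir::Y, 8);
    parallel_for_chunked(
        0, heightInBlocks, /*chunksize=*/8,
        [&](int64_t ybegin, int64_t yend) {
            for (int64_t by = ybegin; by < yend; ++by)
                decode_block_row(by);
        },
        opt);
}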

@lgritz merged commit 6080074 into AcademySoftwareFoundation:master on Oct 6, 2022
lgritz pushed a commit to lgritz/OpenImageIO that referenced this pull request Oct 6, 2022
@lgritz (Collaborator) commented Oct 6, 2022

Aside for @aras-p to know next time this comes up: a good way to time how long it takes to do a full read with minimal conversions and no SHA-1 overhead is

time oiiotool --native -i:now=1 file.exr

The now=1 modifier causes the input to be read in full, all at once, rather than going through the ImageCache. And --native keeps the data in the file's own data type if at all possible (versus the oiiotool default of converting everything to float on input so that image processing operations run at full speed and precision).

@aras-p deleted the dds-opt branch October 7, 2022 19:56