DDS: optimize loading of compressed images (3-5x) #3583

Merged: 1 commit merged into AcademySoftwareFoundation:master on Oct 6, 2022

Conversation

@aras-p (Contributor) commented Oct 5, 2022

Description

Comparing the time it takes to load DXT/BCn compressed DDS images via OIIO, I noticed that it is several times slower than in another codebase (Blender). So I started profiling and fixing some low-hanging fruit.

  • After decoding a BCn block (always 4x4 pixels), it was written into the destination image using quite a few branches in the innermost loop. The new code still takes care not to write outside a destination image whose dimensions might not be multiples of 4, just with far fewer branches (see the sketch after this list).
  • BCn decompression is multi-threaded via parallel_for_chunked, using at least 8 block rows per job (i.e. 32 pixel rows). This is similar to the internal multi-threading done by the EXR and TIFF readers.
  • Avoided one case of a std::vector resize (which zero-clears the memory) immediately followed by overwriting that buffer with bytes read from the file. Switched to a unique_ptr for that array, similar to how other input plugins do it.
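A minimal sketch of the block-write idea from the first bullet, using hypothetical names (write_block and its parameters are made up for illustration; this is not the actual DDS reader code):

// Illustrative sketch only: copy a decoded 4x4 BCn block into a destination
// image whose width/height may not be multiples of 4. The copy extent is
// clamped once per block, so the inner loop needs no per-pixel bounds checks.
#include <algorithm>
#include <cstddef>
#include <cstring>

constexpr int kBlockSize = 4;  // BCn blocks are always 4x4 pixels

static void write_block(const unsigned char* block,  // decoded 4x4 block, interleaved channels
                        unsigned char* dst,          // destination image, interleaved channels
                        int bx, int by,              // block coordinates
                        int width, int height,       // image size in pixels
                        int channels)
{
    // Clamp the copy extent once per block instead of branching per pixel.
    const int copy_w = std::min(kBlockSize, width - bx * kBlockSize);
    const int copy_h = std::min(kBlockSize, height - by * kBlockSize);
    const std::size_t dst_stride = std::size_t(width) * channels;
    unsigned char* dst_row
        = dst
          + (std::size_t(by) * kBlockSize * width + std::size_t(bx) * kBlockSize)
                * channels;
    for (int y = 0; y < copy_h; ++y) {
        // One memcpy per block row replaces per-pixel branching.
        std::memcpy(dst_row, block + y * kBlockSize * channels,
                    std::size_t(copy_w) * channels);
        dst_row += dst_stride;
    }
}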

Performance data: I'm timing how long it takes to run iinfo --hash on 24 DDS files, each 4096x4096 in size. Quite a lot of that time goes into the SHA hash calculation itself, which is unrelated to loading, so I also include timings for just DDSInput::read_native_scanline, where all the actual DDS loading/decoding happens.

Timings on a PC (Windows 10, VS2022 Release build, AMD Ryzen 5950X):

  • BC1: total 3.68s -> 2.76s, read_native_scanline 1.54s -> 0.59s
  • BC3: total 3.74s -> 2.84s, read_native_scanline 1.83s -> 0.66s
  • BC4: total 1.26s -> 0.82s, read_native_scanline 0.74s -> 0.24s
  • BC5: total 2.29s -> 1.57s, read_native_scanline 1.31s -> 0.46s
  • BC6: total 7.50s -> 4.30s, read_native_scanline 4.67s -> 1.60s
  • BC7: total 6.18s -> 3.09s, read_native_scanline 4.34s -> 0.84s

Possible future optimization: implement DDSInput::read_native_scanlines, which could avoid one extra temporary memory buffer copy. That extra buffer clear and the eventual copy from it into the final user destination pixels is a non-trivial cost according to profiling. But it is also quite involved to handle all the possible edge cases (e.g. the user requests scanlines not on 4-pixel row boundaries); the code would have to do similar juggling to what the EXR reader does for "not on chunk boundaries" reads (a rough sketch of the boundary math is below). It could potentially also interact with the DDS cubemap/volume texture reading ("tiled") code paths, for which there is no test coverage at all right now. Maybe for some other day.
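For illustration, here is a rough sketch of the block-row boundary math such a read_native_scanlines would need; the names are hypothetical and this is not OIIO code:

// Map a requested scanline range [ybegin, yend) onto whole 4-row BCn block
// rows: the reader would decode block rows [first, end), then copy only
// scanlines [ybegin, yend) from that decoded region into the caller's buffer.
struct BlockRowRange {
    int first;  // first block row covering the request
    int end;    // one past the last block row covering the request
};

static BlockRowRange block_rows_covering(int ybegin, int yend /* exclusive */)
{
    constexpr int kBlockSize = 4;  // BCn block height in pixels
    BlockRowRange r;
    r.first = ybegin / kBlockSize;                   // round down to a block boundary
    r.end   = (yend + kBlockSize - 1) / kBlockSize;  // round up to a block boundary
    return r;
}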

Tests

Locally, this passes the extended DDS test suite from #3581. The code changes are purely an optimization; there are no behavior changes.

Checklist:

  • I have read the contribution guidelines.
  • If this is more extensive than a small change to existing code, I
    have previously submitted a Contributor License Agreement
    (individual, and if there is any way my
    employers might think my programming belongs to them, then also
    corporate).
  • I have updated the documentation, if applicable.
  • I have ensured that the change is tested somewhere in the testsuite
    (adding new test cases if necessary).
  • My code follows the prevailing code style of this project.

@aras-p marked this pull request as ready for review October 5, 2022 17:15
@aras-p (Contributor, author) commented Oct 5, 2022

Huh, the two CI failures are curious. I don't yet quite see how they could be related to the single .cpp source file change here.

The Ubuntu bleeding edge gcc12 C++20 py3.10 OCIO/libtiff/exr-master boost1.74 avx2 one fails linking lib/libOpenImageIO.so.2.5.0 with an error coming from libtiff (?):

/usr/bin/ld: /home/runner/work/oiio/oiio/ext/dist/lib/libtiff.a(tif_unix.c.o): warning: relocation against `stderr@@GLIBC_2.2.5' in read-only section `.text'
/usr/bin/ld: /home/runner/work/oiio/oiio/ext/dist/lib/libtiff.a(tif_unix.c.o): relocation R_X86_64_PC32 against symbol `stderr@@GLIBC_2.2.5' can not be used when making a shared object; recompile with -fPIC

The failure on windows-2019 VS2019 is the unit_timer test failing:

141/157 Test #144: unit_timer ..............................***Failed    2.90 sec
Timer begin/end cost is 2.13647e+07 /sec
Out of 10000000 queries, 7785692 had no time
Longest was 0.0007168 s.
Checkpoint: All 0.438635 selective 0.150176
D:/a/oiio/oiio/src/libutil/timer_test.cpp:105:
FAILED: all() == 0.5
	values were '0.753666' and '0.5', diff was 0.253656
D:/a/oiio/oiio/src/libutil/timer_test.cpp:112:
FAILED: all() == 0.5
	values were '0.753693' and '0.5', diff was 0.253692
D:/a/oiio/oiio/src/libutil/timer_test.cpp:116:
FAILED: all() == 0.6
	values were '0.897358' and '0.6', diff was 0.297346
Checkpoint2: All 0.897369 selective 0.301812
string ctr      :    0.0 ns (+/- 0.0ns), unreliable
usleep(1000)    :   40.4 ms (+/-28.2ms),    0.0 k/s
add             :    1.6 ns (+/- 0.0ns),  624.6 M/s
cos             :    5.6 ns (+/- 0.0ns),  178.1 M/s
acos            :    8.5 ns (+/- 0.2ns),  117.1 M/s
fast_acos       :    3.2 ns (+/- 0.0ns),  311.9 M/s
sqrt            :    1.6 ns (+/- 0.0ns),  623.7 M/s
simd sqrt       :    1.2 ns (+/- 0.0ns), 3325.0 Mvals/s, 831.2 Mcalls/s
ERRORS!

@lgritz (Collaborator) commented Oct 5, 2022

Don't sweat the failure on "unit_timer" -- every once in a while we get a slow / overloaded GHA runner and it throws this test off enough to fail its sanity check. It is not important.

Not sure what happened in the "bleeding edge" test, so I'm triggering it to re-run. Sometimes there are spurious failures because the runners fail to configure properly, have a bad package installed, or fail to install a dependency. The environments on GHA aren't super locked down, and the fact that some deps are installed on the fly means all sorts of rare failures can happen. I just rerun and hope for the best.

@aras-p (Contributor, author) commented Oct 6, 2022

Ah, looks like the timer test passed on a rerun. The bleeding-edge linker failure did not; however, I see it's also happening on CI runs of other branches here in OIIO.

@lgritz (Collaborator) commented Oct 6, 2022

The bleeding edge test error appears related to libtiff -- which we build from source at the tip of its master for this test, as we do for several other dependencies. That's why it's bleeding edge. Stuff breaks there any time somebody checks a bug into one of those packages. Sometimes we're even the ones to alert them to their mistake! But anyway, it seems unrelated to your change.

@lgritz (Collaborator) left a review:
Awesome, LGTM!

// Image size rounded up to whole 4x4 blocks.
const int widthInBlocks = (width + kBlockSize - 1) / kBlockSize;
const int heightInBlocks = (height + kBlockSize - 1) / kBlockSize;
// Parallelize over block rows, with at least 8 block rows (32 pixel rows) per job.
paropt opt = paropt(0, paropt::SplitDir::Y, 8);
parallel_for_chunked(
@lgritz (Collaborator) commented on this diff hunk:
Nice use of parallel_for_chunked!
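For readers unfamiliar with the API, a minimal sketch of dispatching block-row decoding with parallel_for_chunked (decode_block_row is a hypothetical stand-in; the real PR code differs in its arguments and body):

// Minimal sketch, not the actual PR code: decode rows of 4x4 BCn blocks in
// parallel, asking for chunks of at least 8 block rows (32 pixel rows).
#include <OpenImageIO/parallel.h>
#include <cstdint>

static void decode_block_row(int64_t /*block_row*/)
{
    // ... hypothetical: decode one row of 4x4 blocks into the output buffer ...
}

static void decode_blocks_parallel(int heightInBlocks)
{
    using namespace OIIO;
    paropt opt = paropt(0, paropt::SplitDir::Y, 8);
    parallel_for_chunked(
        0, heightInBlocks, /*chunksize=*/8,
        [&](int64_t ybegin, int64_t yend) {
            for (int64_t by = ybegin; by < yend; ++by)
                decode_block_row(by);
        },
        opt);
}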

@lgritz merged commit 6080074 into AcademySoftwareFoundation:master on Oct 6, 2022
lgritz pushed a commit to lgritz/OpenImageIO that referenced this pull request Oct 6, 2022
@lgritz (Collaborator) commented Oct 6, 2022

Aside for @aras-p to know next time this comes up: a good way to time how long it takes to do a full read with minimal conversions and no SHA-1 overhead is

time oiiotool --native -i:now=1 file.exr

The now=1 modifier causes the input to be read in full, all at once, rather than going through the ImageCache. And --native keeps the data in the file's own data type if at all possible (versus the oiiotool default of converting everything to float on input so that image processing operations run at full speed and precision).

@aras-p deleted the dds-opt branch October 7, 2022 19:56