feat(oiiotool): oiiotool --parallel-frames #3849

Merged
merged 1 commit into from
Jun 6, 2023

lgritz
Collaborator

@lgritz lgritz commented May 24, 2023

Background info: oiiotool multithreads within any individual image operation (such as `--over`), but the series of operations comprising an oiiotool command is done serially: each operation finishes before the next starts. Similarly, for multi-frame operations using frame ranges, frame number wildcards, or view wildcards, the set of commands for each frame runs to completion before the next frame begins. For complex oiiotool commands, especially when running over frame ranges, this is not a particularly efficient way of parallelizing, since any serialized operation (which often includes the expensive writing of each frame's result image to disk) can substantially serialize the entire run.

This PR adds a new oiiotool `--parallel-frames` option that assumes all frames in the range can be computed independently and in any order, and can therefore be run in parallel. As long as there are at least about as many frames as cores, and you have enough RAM to handle that many image pipelines at once, this is a more efficient way of parallelizing the work than multithreading individual image processing operations.
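The idea of treating each frame as an independent unit of work can be illustrated with a minimal Python sketch. This is not how oiiotool implements the option; `process_frame`, `run_frames`, the worker count, and the output naming are hypothetical stand-ins chosen only to show frames being dispatched in parallel rather than one after another.

```python
from concurrent.futures import ThreadPoolExecutor


def process_frame(frame: int) -> str:
    """Hypothetical stand-in for one frame's whole pipeline
    (read, cut, resize, color convert, write)."""
    # Illustrative output name only; the padding is an assumption.
    return f"tmp/test.{frame:04d}.tif"


def run_frames(frames, max_workers: int = 4) -> list:
    """Run every frame's pipeline as an independent task.

    Frames may execute in any order, but pool.map returns the
    results in the order the frames were submitted.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_frame, frames))


outputs = run_frames(range(1, 101))
print(len(outputs), outputs[0], outputs[-1])
```

The sketch only works because no frame depends on another frame's result, which is the same assumption the option makes.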

Benchmarking: On my 32-core workstation, I tried the following oiiotool command, contrived but representative of a reasonable oiiotool command complexity that might be a production task:

    oiiotool --runstats -v --frames 1-100 tmp/noise.exr -cut 2880x1620+960+540 -resize 1920x1080 -colorconvert lin_srgb srgb -d uint8 -o "tmp/test.#.tif"

The noise.exr file was a previously created 4k, 3-channel 'half' exr image. The command loads the file, cuts out a 3k section of it, resizes that to HD res, does a color space transformation, then outputs an 8-bit TIFF file, for each of 100 sequential frames.

    threads   real time    user  sys      peak memory
    1         4:44         4:23  0:21     122 MB
    2         2:58         4:24  0:29     122 MB
    4         1:40         4:36  0:29     122 MB
    8         1:10         5:00  0:30     127 MB
    16        0:56         5:28  0:37     131 MB
    all(32)   0:55         8:12  0:54     127 MB

    parallel  0:25         8:13  1:04     3.4 GB

Here we can see the diminishing returns of using multiple threads within each frame: for low thread counts we improve steadily, but we pretty much peak at 16 threads, see virtually no further gain at 32, and max out at about 5x the single-threaded performance. This is due to serialization of the I/O and increasing lock contention at higher thread counts.

However, using `--parallel-frames` gives us another doubling of total throughput (for a total of 11.3x speedup versus single-threaded), though at the expense of 28x the memory use! That memory explosion is acceptable in this case, since even 3.4 GB is really nothing on a modern machine. But other oiiotool command lines might be more memory intensive, and having dozens of frame pipelines computing simultaneously could use a problematic amount of memory. So use this feature carefully and with purpose, always checking that your own workflows experience sufficient speedup and don't use more memory than you have available.
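The speedup and memory figures quoted above follow directly from the benchmark table. As a quick check, here is a small Python sketch that recomputes them; the helper `to_seconds` and the choice to read 3.4 GB as 3400 MB are assumptions made for the illustration.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'M:SS' wall-clock string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# "real time" column, copied from the benchmark table.
real_time = {
    "1": "4:44",
    "2": "2:58",
    "4": "1:40",
    "8": "1:10",
    "16": "0:56",
    "all(32)": "0:55",
    "parallel": "0:25",
}

# Peak memory in MB; 3.4 GB is taken as 3400 MB (an assumption).
peak_mb = {"1": 122, "parallel": 3400}

baseline = to_seconds(real_time["1"])
speedup = {label: baseline / to_seconds(t) for label, t in real_time.items()}

for label, factor in speedup.items():
    print(f"{label:>9}: {factor:5.2f}x")

# Gain of frame-level parallelism over the best per-operation threading.
parallel_vs_all = to_seconds(real_time["all(32)"]) / to_seconds(real_time["parallel"])
memory_ratio = peak_mb["parallel"] / peak_mb["1"]
print(f"parallel vs all(32): {parallel_vs_all:.1f}x")
print(f"memory ratio: {memory_ratio:.1f}x")
```

Running the same calculation on timings from your own workload is a simple way to judge whether the extra memory buys enough throughput.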

@lgritz lgritz merged commit f40f980 into AcademySoftwareFoundation:master Jun 6, 2023
lgritz added a commit to lgritz/OpenImageIO that referenced this pull request Jun 7, 2023
…#3849)
lgritz added a commit to imageworks/OpenImageIO that referenced this pull request Jun 7, 2023
…#3849)
@lgritz lgritz deleted the lg-parallel2 branch June 14, 2023 20:41