Don't read the mdat box from ISOBMFF files #1961

dhoulder · 2021-10-14T07:50:13Z

BmffImage::boxHandler() reads every box it encounters, whether it's interested in it or not. This includes mdat, which can be huge. Here's a quick benchmark comparing a CR3 file from a Canon R5 that has a 40Mb mdat box with a another file with the mdat box chopped off:

david@blackbox:build$ time bash -c 'for i in $(seq 1 100); do ./bin/exiv2 -pp ~/Downloads/Canon-R5.CR3 >/dev/null; done'

real	0m2.937s
user	0m0.782s
sys	0m2.159s
david@blackbox:build$ time bash -c 'for i in $(seq 1 100); do ./bin/exiv2 -pp ../test/data/Canon-R6-pruned.CR3 >/dev/null; done'

real	0m0.558s
user	0m0.377s
sys	0m0.189s
david@blackbox:build$

This could be a significant issue for darktable and rawtherapee which will want to scan a bunch of files when importing. The memory consumption also spikes, but that's transient and probably less of a problem.

The fix should be pretty easy:

if (box_type == TAG_mdat ) {
   // skip it or return appropriate offset or something
  ...
}

before io_->read(data.data(), data.size());

There may be other boxes that can be skipped, but mdat is typically the big one where the main image data lives.

The text was updated successfully, but these errors were encountered:

clanmills · 2021-10-14T08:42:04Z

Very well explained, @dhoulder. We currently control/direct the boxHandler() with a couple of predicates. superBox() says: recursively read subboxes, fullBox() read flags defined in the spec. It sounds like we should add skipBox() to say don't allocate storage or read the contents. This would be a small code change with substantial performance implications for large files. We're currently only discussing .CR3 files, however it's probable that there are boxes in other BMFF files which can take advantage of skipBox().

1div0 · 2021-10-16T18:29:32Z

@dhoulder Thanks for the good hint for code improvement.

dhoulder · 2021-10-16T22:30:44Z

While it's probably not a big problem, I should also mention that the current handling of recursive boxes involves some unnecessary reads too: the outer box gets read in its entirety, then each box inside it gets re-read, and so on all the way down.

I suspect this might not be worth changing, but I haven't actually done any benchmarking. All the data will be in page cache anyway after the first read, and unlike the mdat case, we're typically not talking about large amounts in the first place.

clanmills · 2021-10-16T22:44:49Z

These are excellent observations, David @dhoulder and deserve attention. It would be great if you and @1div0 (and other volunteers) polish the bmff implementation.

dhoulder · 2021-10-19T06:52:29Z

Working on this now

clanmills · 2021-10-19T07:33:12Z

I suspect (without reading the code) that skipBox() is actually the default (don't allocate, don't read, just seek past the box). So, we don't need a skipBox(), however we need knownBox() function to deal with ILOC etc. So the code will be something like:

address = 0 ;
  while ( something ) {
    seek(address);
    readBoxHeader;
  if ( superBox(....) ) {
    boxHandler();
  } else if ( knownBox(...) ) {
     allocate; read;
     perform magic;
  }
address=address+box.length+8;

I'm unclear about how to interact with @paolobenve and the darktable guys. Do you know Matt McGuire in Sydney? He's one of the darktable folks. I'd like to get 0.27.50 ready (0.27.5 Preview) with your code changes and ask darktable to build with that code send it to Paolo for testing.

It's all kind of vague. Something will work out.

dhoulder · 2021-10-19T11:59:45Z

@clanmills Yes, I think something like that would be the ideal approach: read the header and then only read the box body if required, and in the superBox() case, just the stuff before the inner boxes. I think that might involve a substantial rewrite, so I was thinking of something less intrusive that would achieve most of the gains of that approach.

See dhoulder@4e587fa

Quick benchmark against that R5 file I used in #1961 (comment)

david@blackbox:build$ time bash -c 'for i in $(seq 1 100); do ./bin/exiv2 -pp ~/Downloads/Canon-R5.CR3 >/dev/null; done'

real	0m0.534s
user	0m0.327s
sys	0m0.214s

0.534s vs 2.937s. Much better! All tests pass too.

Comments welcome. If you know of other box types that are likely to be big (trak maybe?) I can skip those too, but I think mdat is the low hanging fruit.

clanmills · 2021-10-19T12:43:09Z

I think the "big win" is to not read mdat. I'm not sure this requires a "substantial rewrite". We're only talking about the control of the boxHandler() which is mostly concerned with parsing boxes and should be unchanged. If this requires lots of changes, we should not risk putting this in v0.27.5.

Can we take a two stage approach:

For v0.27.5, we add two features to the code:
a) The PRVW/THMB code
b) don't alloc/read mdat
More extensive changes to follow in future after v0.27.5 has shipped.

dhoulder · 2021-10-19T20:37:01Z

@clanmills : Yep, that two stage approach sounds good to me.

I will prepare a pull request and use that mergifyio trick to create a backport to 0.27-maintenance

dhoulder · 2021-10-20T06:43:56Z

Looks like the github CI stuff has got into another tangle - see #1974

clanmills · 2021-10-20T07:01:06Z

Look's like the the Windows-10/CI has gone bonkers. Try your git "empty" trick to trigger another build.

Current runner version: '2.283.3'
 2 Operating System
 6 Virtual Environment
11 Virtual Environment Provisioner
13 GITHUB_TOKEN Permissions
27 Prepare workflow directory
28 Prepare all required actions
29 Getting action download info
30 Error: Internal Server Error. Internal Server Error. Internal Server Error

We've had a genuine issue on macOS/CI. That was fixed yesterday with a change in cmake/compilerFlags.cmake.

dhoulder mentioned this issue Oct 19, 2021

no preview for cr3 image #1893

Closed

dhoulder mentioned this issue Oct 19, 2021

BMFF Performance boost: don't read boxes we're not interested in #1974

Merged

kevinbackhouse closed this as completed in #1974 Oct 20, 2021

nehaljwani mentioned this issue Apr 30, 2023

Is a new release happening soon? #2406

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't read the mdat box from ISOBMFF files #1961

Don't read the mdat box from ISOBMFF files #1961

dhoulder commented Oct 14, 2021

clanmills commented Oct 14, 2021 •

edited

Loading

1div0 commented Oct 16, 2021

dhoulder commented Oct 16, 2021 •

edited

Loading

clanmills commented Oct 16, 2021

dhoulder commented Oct 19, 2021

clanmills commented Oct 19, 2021

dhoulder commented Oct 19, 2021

clanmills commented Oct 19, 2021

dhoulder commented Oct 19, 2021

dhoulder commented Oct 20, 2021

clanmills commented Oct 20, 2021

Don't read the mdat box from ISOBMFF files #1961

Don't read the mdat box from ISOBMFF files #1961

Comments

dhoulder commented Oct 14, 2021

clanmills commented Oct 14, 2021 • edited Loading

1div0 commented Oct 16, 2021

dhoulder commented Oct 16, 2021 • edited Loading

clanmills commented Oct 16, 2021

dhoulder commented Oct 19, 2021

clanmills commented Oct 19, 2021

dhoulder commented Oct 19, 2021

clanmills commented Oct 19, 2021

dhoulder commented Oct 19, 2021

dhoulder commented Oct 20, 2021

clanmills commented Oct 20, 2021

clanmills commented Oct 14, 2021 •

edited

Loading

dhoulder commented Oct 16, 2021 •

edited

Loading