Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't read the mdat box from ISOBMFF files #1961

Closed
dhoulder opened this issue Oct 14, 2021 · 11 comments · Fixed by #1974
Closed

Don't read the mdat box from ISOBMFF files #1961

dhoulder opened this issue Oct 14, 2021 · 11 comments · Fixed by #1974

Comments

@dhoulder
Copy link
Contributor

BmffImage::boxHandler() reads every box it encounters, whether it's interested in it or not. This includes mdat, which can be huge. Here's a quick benchmark comparing a CR3 file from a Canon R5 that has a 40Mb mdat box with a another file with the mdat box chopped off:

david@blackbox:build$ time bash -c 'for i in $(seq 1 100); do ./bin/exiv2 -pp ~/Downloads/Canon-R5.CR3 >/dev/null; done'

real	0m2.937s
user	0m0.782s
sys	0m2.159s
david@blackbox:build$ time bash -c 'for i in $(seq 1 100); do ./bin/exiv2 -pp ../test/data/Canon-R6-pruned.CR3 >/dev/null; done'

real	0m0.558s
user	0m0.377s
sys	0m0.189s
david@blackbox:build$ 

This could be a significant issue for darktable and rawtherapee which will want to scan a bunch of files when importing. The memory consumption also spikes, but that's transient and probably less of a problem.

The fix should be pretty easy:

if (box_type == TAG_mdat ) {
   // skip it or return appropriate offset or something
  ...
}

before io_->read(data.data(), data.size());

There may be other boxes that can be skipped, but mdat is typically the big one where the main image data lives.

@clanmills
Copy link
Collaborator

clanmills commented Oct 14, 2021

Very well explained, @dhoulder. We currently control/direct the boxHandler() with a couple of predicates. superBox() says: recursively read subboxes, fullBox() read flags defined in the spec. It sounds like we should add skipBox() to say don't allocate storage or read the contents. This would be a small code change with substantial performance implications for large files. We're currently only discussing .CR3 files, however it's probable that there are boxes in other BMFF files which can take advantage of skipBox().

@1div0
Copy link
Collaborator

1div0 commented Oct 16, 2021

@dhoulder Thanks for the good hint for code improvement.

@dhoulder
Copy link
Contributor Author

dhoulder commented Oct 16, 2021

While it's probably not a big problem, I should also mention that the current handling of recursive boxes involves some unnecessary reads too: the outer box gets read in its entirety, then each box inside it gets re-read, and so on all the way down.

I suspect this might not be worth changing, but I haven't actually done any benchmarking. All the data will be in page cache anyway after the first read, and unlike the mdat case, we're typically not talking about large amounts in the first place.

@clanmills
Copy link
Collaborator

These are excellent observations, David @dhoulder and deserve attention. It would be great if you and @1div0 (and other volunteers) polish the bmff implementation.

@dhoulder
Copy link
Contributor Author

Working on this now

@clanmills
Copy link
Collaborator

I suspect (without reading the code) that skipBox() is actually the default (don't allocate, don't read, just seek past the box). So, we don't need a skipBox(), however we need knownBox() function to deal with ILOC etc. So the code will be something like:

address = 0 ;
  while ( something ) {
    seek(address);
    readBoxHeader;
  if ( superBox(....) ) {
    boxHandler();
  } else if ( knownBox(...) ) {
     allocate; read;
     perform magic;
  }
address=address+box.length+8;

I'm unclear about how to interact with @paolobenve and the darktable guys. Do you know Matt McGuire in Sydney? He's one of the darktable folks. I'd like to get 0.27.50 ready (0.27.5 Preview) with your code changes and ask darktable to build with that code send it to Paolo for testing.

It's all kind of vague. Something will work out.

@dhoulder
Copy link
Contributor Author

@clanmills Yes, I think something like that would be the ideal approach: read the header and then only read the box body if required, and in the superBox() case, just the stuff before the inner boxes. I think that might involve a substantial rewrite, so I was thinking of something less intrusive that would achieve most of the gains of that approach.

See dhoulder@4e587fa

Quick benchmark against that R5 file I used in #1961 (comment)

david@blackbox:build$ time bash -c 'for i in $(seq 1 100); do ./bin/exiv2 -pp ~/Downloads/Canon-R5.CR3 >/dev/null; done'

real	0m0.534s
user	0m0.327s
sys	0m0.214s

0.534s vs 2.937s. Much better! All tests pass too.

Comments welcome. If you know of other box types that are likely to be big (trak maybe?) I can skip those too, but I think mdat is the low hanging fruit.

@clanmills
Copy link
Collaborator

I think the "big win" is to not read mdat. I'm not sure this requires a "substantial rewrite". We're only talking about the control of the boxHandler() which is mostly concerned with parsing boxes and should be unchanged. If this requires lots of changes, we should not risk putting this in v0.27.5.

Can we take a two stage approach:

  1. For v0.27.5, we add two features to the code:
    a) The PRVW/THMB code
    b) don't alloc/read mdat
  2. More extensive changes to follow in future after v0.27.5 has shipped.

@dhoulder
Copy link
Contributor Author

@clanmills : Yep, that two stage approach sounds good to me.

I will prepare a pull request and use that mergifyio trick to create a backport to 0.27-maintenance

@dhoulder
Copy link
Contributor Author

Looks like the github CI stuff has got into another tangle - see #1974

@clanmills
Copy link
Collaborator

Look's like the the Windows-10/CI has gone bonkers. Try your git "empty" trick to trigger another build.

Current runner version: '2.283.3'
 2 Operating System
 6 Virtual Environment
11 Virtual Environment Provisioner
13 GITHUB_TOKEN Permissions
27 Prepare workflow directory
28 Prepare all required actions
29 Getting action download info
30 Error: Internal Server Error. Internal Server Error. Internal Server Error

We've had a genuine issue on macOS/CI. That was fixed yesterday with a change in cmake/compilerFlags.cmake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants