Dangerzone is massively inflating file sizes by default - am I missing something? #970

Closed
TechReverie opened this issue Oct 24, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@TechReverie

What happened?

When I run a PDF file through Dangerzone, the output file is huge compared to the original: for example 4MB --> 20MB, 50MB --> 282MB, 85MB --> 287MB.

I was under the impression that as part of the conversion the files were compressed. Did I get that wrong?

Linux distribution

Linux Mint 21.3 or Fedora 40. Both produce the same inflation, and both inflate the files to the same file size.

Dangerzone version

0.7.1

Podman info

No response

Document conversion logs

No response

Additional info

No response

@TechReverie TechReverie added the bug Something isn't working label Oct 24, 2024
@almet
Contributor

almet commented Oct 25, 2024

Hey, thanks for opening a bug report, that certainly seems suspicious. We've already seen this, though I believe not to the extent you're reporting now, and we have an issue for it here: #239

If some of the PDFs that lead to this inflation are shareable, would it be possible to send them to us at [email protected] (or attach them here, if you feel like it)?

@TechReverie
Author

Hi, thanks for getting back to me so quickly.

Because of copyright, I'm not sure I can share the exact files I tried this with, but they can be downloaded straight from the publisher here: https://magpi.raspberrypi.com/issues, if that helps.

The attached image is a directory listing showing the before/after of converting issues 136, 137, and 146.
[Image: comparing sizes of dangerzone converted files]

I've had a further play with some of my old archived instruction manuals, which show differing results, so I wonder if some particular kind of PDF format may be tripping up the software. It's beyond my skills to tell the difference between these files, so I've attached the originals for your perusal, if that helps.

Note from the directory image that the FreeNAS guide inflated massively, whereas the HD20 and P9657AA manuals shrank as expected/hoped.

freenas9.2.1_guide.pdf

P9657AA-Manual-EN-v1.0-090406.pdf
HD20-M-en-GB.pdf

I cannot upload the 'safe' version of the converted freenas guide as it's over the upload file size limit.

If you need the converted versions of the other two I can upload those if you require, or if I can assist further do let me know.

Thank you.

@apyrgio
Contributor

apyrgio commented Oct 29, 2024

Thanks for the link to the documents! I did a quick check and I can reproduce the size inflation you're noticing. However, I'm afraid it's kind of an expected side-effect of the way Dangerzone converts documents. The original file size does not affect the final file size, but the number of pages does.

You see, Dangerzone first renders each document page to pixels (RGB at 150 DPI), and then it reconstructs the document from said pixels. We did some measurements in #526, and for typical A4 documents, each page should take about 6.22 MiB at 150 DPI. Let's see how this applies to your documents:

| Document | Pages | Expected size (MiB) | Final size (MiB) |
|---|---|---|---|
| freenas9.2.1_guide.pdf | 280 | 1,741.6 | 89 |
| MagPI 146 | 133 | 827.26 | 128 |
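
For reference, the expected sizes above can be reproduced with a little arithmetic. This is only a sketch: it assumes an A4 page of 8.27 x 11.69 inches, 3 bytes per RGB pixel, and pixel dimensions truncated to whole numbers. The function names are illustrative and are not part of Dangerzone.

```python
# Assumptions (not taken from Dangerzone's code): A4 size in inches,
# 8-bit RGB (3 bytes per pixel), pixel dimensions truncated to integers.
A4_INCHES = (8.27, 11.69)
DPI = 150
BYTES_PER_PIXEL = 3
MIB = 1024 * 1024


def raw_page_mib(width_in=A4_INCHES[0], height_in=A4_INCHES[1], dpi=DPI):
    """Uncompressed size of one rendered page, in MiB."""
    width_px = int(width_in * dpi)
    height_px = int(height_in * dpi)
    return width_px * height_px * BYTES_PER_PIXEL / MIB


def expected_document_mib(pages, page_mib):
    """Uncompressed size of a whole document, in MiB."""
    return pages * page_mib


page_mib = round(raw_page_mib(), 2)
print(page_mib)  # 6.22

for name, pages in [("freenas9.2.1_guide.pdf", 280), ("MagPI 146", 133)]:
    print(name, round(expected_document_mib(pages, page_mib), 2))
```

Note that the original file size does not appear anywhere in this calculation, only the page count and the page dimensions.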

And here's where the compression comes into play. The table above tells us the following:

  1. The final file size is much less than the expected one. Compression is doing a good job there!
  2. The amount of graphics in a page affects the compression efficiency. You can see that the final MagPI document takes much more space than the FreeNAS guide, even though the MagPI document has about half the pages! That's because the MagPI document has lots of pictures and graphics, whereas FreeNAS is more lean.
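
To put rough numbers on these two points, here is a small sketch that derives the compression ratio and the average compressed size per page from the figures in the table. The helper name `compression_stats` is made up for illustration.

```python
# Figures copied from the table above; nothing here is measured anew.
documents = {
    "freenas9.2.1_guide.pdf": {"pages": 280, "expected_mib": 1741.6, "final_mib": 89},
    "MagPI 146": {"pages": 133, "expected_mib": 827.26, "final_mib": 128},
}


def compression_stats(pages, expected_mib, final_mib):
    """Return the compression ratio and the average compressed MiB per page."""
    return {
        "ratio": expected_mib / final_mib,
        "mib_per_page": final_mib / pages,
    }


for name, doc in documents.items():
    stats = compression_stats(**doc)
    print(f"{name}: {stats['ratio']:.1f}x smaller than raw pixels, "
          f"{stats['mib_per_page']:.2f} MiB per page")
```

With these inputs, the FreeNAS guide compresses roughly 20x and the MagPI issue roughly 6.5x, so an average MagPI page ends up about three times larger than an average FreeNAS page.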

All in all, I think that Dangerzone can't do much better here, given the constraint that it has to convert pages to pixels. If your archiving method is doing something similar though, and you get better results, we'd like to know more.

In the meantime, I'll close this issue, but feel free to drop a comment.

@apyrgio closed this as not planned Oct 29, 2024