(client) Handle RGB pages not fitting in temporary directories #526

Closed
apyrgio opened this issue Aug 21, 2023 · 4 comments

Comments

@apyrgio
Contributor

apyrgio commented Aug 21, 2023

(this issue is a follow up of #518, best done after #443)

The size of a single A4 page in pixels is:

  • 1 A4 page at 72 DPI = 595 x 842 pixels
  • 1 A4 page at 150 DPI = 1240 x 1754 pixels

We also need to account for the 3 color channels (RGB, one byte each), meaning that the final size is:

  • 1 A4 page at 72 DPI = 3 x 595 x 842 bytes = 1.43 MiB
  • 1 A4 page at 150 DPI = 3 x 1240 x 1754 bytes = 6.22 MiB

If we have 1 GiB of RAM available, it takes roughly 714 pages (72 DPI) or 165 pages (150 DPI) to fill it up. It seems that pdftoppm uses 150 DPI by default for the conversion to PPM, meaning that users with limited RAM (e.g., 1 GiB) will not be able to convert PDFs with more than about 165 pages. Note that this is the case because the conversion to pixels stores the RGB files in a temporary directory, which may be a RAM-backed tmpfs.
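
The arithmetic above can be reproduced with a short Python sketch. The pixel dimensions are the ones quoted in this comment; the function names are made up for illustration, and one byte per color channel is assumed.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
CHANNELS = 3  # R, G and B, assuming one byte per channel


def rgb_page_bytes(width_px, height_px):
    """Size in bytes of one uncompressed RGB page."""
    return CHANNELS * width_px * height_px


def pages_that_fit(budget_bytes, width_px, height_px):
    """Number of whole uncompressed pages that fit in the given budget."""
    return budget_bytes // rgb_page_bytes(width_px, height_px)


# A4 pixel dimensions at the two resolutions discussed above
A4_PIXELS = {72: (595, 842), 150: (1240, 1754)}

for dpi, (width, height) in A4_PIXELS.items():
    size = rgb_page_bytes(width, height)
    fit = pages_that_fit(GIB, width, height)
    print(f"{dpi} DPI: {size / MIB:.2f} MiB per page, {fit} whole pages per GiB")
```

Integer division counts only whole pages, so the result is the last page that still fits; the page after it is the one that overflows the budget.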

This is a limitation that does not affect all users or files, but we need to find a solution for it.

@apyrgio
Contributor Author

apyrgio commented Aug 21, 2023

Possible solutions

  1. Can we pipe the pages from container 1 to container 2, and call the program that "unites" them with the stdin as an argument?

    This would essentially remove any intermediate step and greatly reduce the bytes we'd need to store on a temp dir. However, this requires two things that we don't have right now:

    • An architecture where we spawn 2 containers that speak to each other.
    • A program that reads pages from stdin, instead of a filesystem.
  2. Can we compress each page that we receive from the first container?

    Yes, we saw that compressing an RGB page into PNG leads to 30x - 40x size reduction for typical document types (letters on white background), and only 2x size reduction for photos. This is probably fine, as we expect most multi-page documents to not be dominated by photos.

    Note that this means that the 1st container must not save pages in a temporary filesystem, but stream them instead to the host, and the host must immediately convert them to PNG, e.g., using python-pil. We already have an open issue for this: Containers: have progress streamed instead of via mounted volumes (and deprecate doc_to_pixels_qubes_wrapper.py) #443

  3. Can we store the pages in a data dir?

    Previously, the way we stored intermediate pages was in the config dir of the user. This brought some issues of its own (see Ensure Intermediate Directories are cleaned in the case of exceptions #317), but also undermined the confidentiality of these documents, as traces of them could remain in the user's computer. Consider the case where the original files are in an encrypted device or tmp dir. Therefore, this is something that we can't do.

From the above solutions, it seems that (2) is the one we should go with.
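
As a rough illustration of solution (2), the sketch below compresses one raw RGB page into PNG with Pillow. It is a minimal sketch, not the project's code: the function name `rgb_to_png` and the all-white test page are hypothetical, and the page is assumed to arrive as tightly packed 8-bit RGB bytes with known dimensions.

```python
import io

from PIL import Image  # Pillow, a third-party package


def rgb_to_png(rgb_bytes, width, height):
    """Compress one raw RGB page (3 bytes per pixel, row by row) into PNG bytes."""
    image = Image.frombytes("RGB", (width, height), rgb_bytes)
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


# Hypothetical all-white page at the 150 DPI A4 size discussed in this issue
width, height = 1240, 1754
blank_page = b"\xff" * (3 * width * height)
png = rgb_to_png(blank_page, width, height)
print(f"{len(blank_page)} raw bytes -> {len(png)} PNG bytes")
```

A blank page compresses far better than real documents do, so the ratio printed here is not comparable to the 30x - 40x figure measured on typical documents.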

Workarounds

If you are affected by this problem and you are on Linux, you can consider the following workarounds:

  1. You can increase the size of /tmp through /etc/fstab. See https://www.looklinux.com/how-to-resize-tmpfs-on-linux/
  2. You can specify a different temp dir using the TEMP environment variable (e.g., TEMP=/home/tmp dangerzone)
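
To see how workaround (2) takes effect, here is a small sketch of how Python's standard `tempfile` module picks its directory. It assumes the application resolves its temporary directory through `tempfile`, which this thread does not confirm; `resolve_tempdir` is a hypothetical helper written for this illustration.

```python
import os
import tempfile

TEMP_VARS = ("TMPDIR", "TEMP", "TMP")  # checked by tempfile in this order


def resolve_tempdir(env_overrides):
    """Return the directory tempfile would pick with only the given variables set."""
    saved_env = {name: os.environ.get(name) for name in TEMP_VARS}
    saved_cache = tempfile.tempdir
    try:
        for name in TEMP_VARS:
            os.environ.pop(name, None)
        os.environ.update(env_overrides)
        tempfile.tempdir = None  # drop the cached value so the environment is re-read
        return tempfile.gettempdir()
    finally:
        # Restore the environment and the cached directory exactly as they were
        for name, value in saved_env.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value
        tempfile.tempdir = saved_cache


default_dir = tempfile.gettempdir()
roomy_dir = tempfile.mkdtemp()  # stand-in for a directory on a larger disk
print(resolve_tempdir({"TEMP": roomy_dir}))
print(resolve_tempdir({"TMPDIR": default_dir, "TEMP": roomy_dir}))  # TMPDIR wins
os.rmdir(roomy_dir)
```

Because `TMPDIR` is consulted before `TEMP`, setting `TEMP` has no effect on systems where `TMPDIR` is already defined; in that case `TMPDIR` is the variable to override.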

@apyrgio apyrgio changed the title Handle RGB pages not fitting in memory Handle RGB pages not fitting in temporary directories Aug 21, 2023
@apyrgio apyrgio added this to the 0.5.0 milestone Aug 22, 2023
@apyrgio
Contributor Author

apyrgio commented Aug 22, 2023

We also need to find out if this affects Windows / macOS platforms, i.e., if tmpfs is used there.

@deeplow
Contributor

deeplow commented Oct 19, 2023

> From the above solutions, it seems that (2) is the one we should go with.

I agree with this. And we can use the Pillow Python module to convert from RGB to PNG (or, if needed, even PDF).

@deeplow deeplow changed the title Handle RGB pages not fitting in temporary directories (client) Handle RGB pages not fitting in temporary directories Oct 24, 2023
deeplow added a commit that referenced this issue Nov 2, 2023
Storing all RGB files on the host was leading to a fast-filling `/tmp`.
This solution essentially converts all the RGB files to PNGs (which are
compressed) saving valuable space in the process. This conversion is
made with the Pillow (PIL) module, without the need for any external
dependencies.

Fixes #526
@apyrgio apyrgio modified the milestones: 0.6.0, Bookmarks Nov 22, 2023
@apyrgio apyrgio removed this from the 0.6.0 milestone Feb 12, 2024
@apyrgio
Contributor Author

apyrgio commented Oct 29, 2024

As part of #625, we no longer store RGB files in temporary directories. The safe PDF is now created on the fly, and each page is compressed immediately, once it's received from the conversion sandbox. This means that there will be at most 1 uncompressed RGB page in-flight at all times. The rest of the pages will be compressed, which really improves the situation.
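
The "at most 1 uncompressed page in flight" behaviour can be sketched with a generator that reads one fixed-size page at a time and compresses it before reading the next. This is only an illustration under stated assumptions: `compress_pages` is a hypothetical name, pages are assumed to have a known size in bytes, the stream is assumed to be buffered, and `zlib` stands in for whatever compression the real code applies.

```python
import io
import zlib


def compress_pages(stream, page_size):
    """Yield compressed pages, holding only one uncompressed page at a time.

    Assumes a buffered binary stream, where read(n) returns fewer than
    n bytes only at end of input.
    """
    while True:
        raw = stream.read(page_size)
        if not raw:
            break  # clean end of stream
        if len(raw) != page_size:
            raise ValueError("stream ended in the middle of a page")
        yield zlib.compress(raw)  # the raw page can be freed after this point


# Hypothetical stream of three tiny 2 x 2 RGB "pages" (12 bytes each)
page_size = 3 * 2 * 2
pages = [bytes([value]) * page_size for value in (0, 128, 255)]
stream = io.BytesIO(b"".join(pages))

compressed = list(compress_pages(stream, page_size))
print(f"{len(compressed)} pages compressed")
```

Since the generator is lazy, a consumer that writes each compressed page out as soon as it is yielded never holds more than one raw page in memory.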
