
Memory management for large PDFs #19

Open
plaisted opened this issue May 22, 2018 · 4 comments

@plaisted

Currently it isn't possible to parse/edit/write PDFs if their size approaches the available memory on the computer. Unidoc parses the object tree and stores all of the objects in memory. Based on the architecture of a PDF file, we should be able to parse (and, for example, extract text from) arbitrarily large PDFs that greatly exceed the memory capacity of the server. #128 could resolve a lot of this, but would require pages/objects to be able to be freed from memory when no longer needed.

Additionally, when writing large PDFs, something would need to be implemented to either cache objects to disk before writing out the completed PDF, or stream the PDF output as each page is completed and release its memory. This would get tricky with shared objects.

@gunnsth
Contributor

gunnsth commented May 23, 2018

I think unidoc/#128 (lazy loading) would help with this. We could also put a size limit on the object cache, freeing the objects with the oldest load timestamp first, and simply load them again when needed.
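A rough illustration of that idea (purely hypothetical; `objectCache`, `newObjectCache`, and the `load` callback here are not part of the unidoc API, only `core.PdfObject` is an existing type):

```go
package cache

import (
	"container/list"

	"github.com/unidoc/unipdf/v3/core" // assumed import path for core.PdfObject
)

// objectCache keeps at most maxSize parsed objects in memory and evicts the
// object with the oldest load timestamp when the limit is exceeded.
type objectCache struct {
	maxSize int
	order   *list.List                              // object numbers, oldest load first
	objects map[int64]core.PdfObject                // loaded objects by object number
	load    func(num int64) (core.PdfObject, error) // re-parses the object from the file on a miss
}

func newObjectCache(maxSize int, load func(int64) (core.PdfObject, error)) *objectCache {
	return &objectCache{
		maxSize: maxSize,
		order:   list.New(),
		objects: make(map[int64]core.PdfObject),
		load:    load,
	}
}

// Get returns the cached object, loading (and possibly evicting) as needed.
func (c *objectCache) Get(num int64) (core.PdfObject, error) {
	if obj, ok := c.objects[num]; ok {
		return obj, nil
	}
	obj, err := c.load(num)
	if err != nil {
		return nil, err
	}
	if c.order.Len() >= c.maxSize {
		// Free the object with the oldest load timestamp first.
		oldest := c.order.Remove(c.order.Front()).(int64)
		delete(c.objects, oldest)
	}
	c.order.PushBack(num)
	c.objects[num] = obj
	return obj, nil
}
```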

Note: Lazy loading is supported in v3 with NewPdfReaderLazy().
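For reference, a minimal sketch of reading a large file lazily in v3 and extracting text page by page (license setup and robust error handling omitted; uses the `model` and `extractor` packages as I understand the public v3 API):

```go
package main

import (
	"fmt"
	"os"

	"github.com/unidoc/unipdf/v3/extractor"
	"github.com/unidoc/unipdf/v3/model"
)

func main() {
	f, err := os.Open("large.pdf")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Lazy reader: objects are resolved from the file as they are accessed
	// instead of parsing the whole object tree up front.
	reader, err := model.NewPdfReaderLazy(f)
	if err != nil {
		panic(err)
	}

	numPages, err := reader.GetNumPages()
	if err != nil {
		panic(err)
	}

	for i := 1; i <= numPages; i++ {
		page, err := reader.GetPage(i)
		if err != nil {
			panic(err)
		}
		ex, err := extractor.New(page)
		if err != nil {
			panic(err)
		}
		text, err := ex.ExtractText()
		if err != nil {
			panic(err)
		}
		fmt.Printf("--- page %d ---\n%s\n", i, text)
	}
}
```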

@gunnsth
Contributor

gunnsth commented Apr 14, 2019

Presumably most of the memory consumed is from large stream objects (it would be good to verify this against a large corpus). A potential fix would be to change PdfObjectStream, unexport its Stream []byte field, and instead provide a higher-level StreamAccessor / StreamCache which can return the []byte directly if it is loaded in memory, or load it from the file when needed.

The tricky part might be knowing whether the data is encoded or not and applying an encoding. If the stream has been changed (e.g. has been encoded), it would need to be kept in memory, or stored in some temporary storage where it could be re-accessed without staying resident the entire time, such as boltdb or something similar.
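To make the idea concrete, the accessor might look roughly like this (purely a proposal sketch; StreamAccessor and its methods do not exist in the current API):

```go
// StreamAccessor is a hypothetical abstraction over the raw stream data of a
// PdfObjectStream. Instead of holding Stream []byte on the object, callers go
// through the accessor, which may serve the bytes from memory or re-read them
// from the source file on demand.
type StreamAccessor interface {
	// GetRawBytes returns the stream data as stored in the file
	// (i.e. still encoded with whatever filters the stream declares).
	GetRawBytes() ([]byte, error)

	// GetDecodedBytes returns the stream data after applying the stream's
	// filters (FlateDecode, DCTDecode, ...).
	GetDecodedBytes() ([]byte, error)

	// SetBytes replaces the stream data. Modified data can no longer be
	// re-read from the original file, so the implementation has to keep it
	// in memory or spill it to temporary storage (e.g. boltdb).
	SetBytes(data []byte, encoded bool) error
}
```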

gunnsth transferred this issue from unidoc/unidoc May 23, 2019
@seankungrubrik

Is there a solution for writing a large PDF whose content doesn't fit in memory?

@ipod4g

ipod4g commented Aug 13, 2024

@seankungrubrik

We're currently working on implementing a solution for handling very large PDFs, including features like lazy writing.
In the meantime, you can use methods such as NewPdfReaderLazy() or lazy mode for large images with SetLazy() to help manage large PDFs more efficiently.
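For example, a rough sketch of copying pages out of a huge input file with a lazy reader and writing them back out (license setup and detailed error handling omitted; note this mainly keeps the read side lazy, while the write side is what the lazy-writing work mentioned above addresses):

```go
package main

import (
	"os"

	"github.com/unidoc/unipdf/v3/model"
)

func main() {
	in, err := os.Open("huge-input.pdf")
	if err != nil {
		panic(err)
	}
	defer in.Close()

	// Lazy reader: page objects are resolved from the file on access.
	reader, err := model.NewPdfReaderLazy(in)
	if err != nil {
		panic(err)
	}

	writer := model.NewPdfWriter()

	numPages, err := reader.GetNumPages()
	if err != nil {
		panic(err)
	}
	for i := 1; i <= numPages; i++ {
		page, err := reader.GetPage(i)
		if err != nil {
			panic(err)
		}
		if err := writer.AddPage(page); err != nil {
			panic(err)
		}
	}

	if err := writer.WriteToFile("output.pdf"); err != nil {
		panic(err)
	}
}
```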

If you have any questions or need further assistance, please let us know.

Thanks
