
Memory management for large PDFs #19

Open
plaisted opened this issue May 22, 2018 · 4 comments

@plaisted

Currently it isn't possible to parse/edit/write PDFs if their size approaches the available memory on the computer. Unidoc parses the object tree and stores all of the objects in memory. Based on the architecture of a PDF file, we should be able to parse (and, for example, extract text from) arbitrarily large PDFs that greatly exceed the memory capacity of the server. #128 could resolve a lot of this, but would require pages/objects to be able to be freed from memory when no longer needed.

Additionally, when writing large PDFs, something would need to be implemented to either cache objects to disk before writing out the completed PDF, or stream the PDF output as each page is completed and release its memory. This would get tricky with shared objects.

@gunnsth
Contributor

gunnsth commented May 23, 2018

I think unidoc/#128 (lazy loading) would help with this. We could also put a size limit on the object cache, freeing the objects with the oldest load timestamp first, and simply load them again when needed.
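A rough illustration of that idea (purely hypothetical; `objectCache`, `newObjectCache`, and the `load` callback here are not part of the unidoc API, only `core.PdfObject` is an existing type):

```go
package cache

import (
	"container/list"

	"github.com/unidoc/unipdf/v3/core" // assumed import path for core.PdfObject
)

// objectCache keeps at most maxSize parsed objects in memory and evicts the
// object with the oldest load timestamp when the limit is exceeded.
type objectCache struct {
	maxSize int
	order   *list.List                              // object numbers, oldest load first
	objects map[int64]core.PdfObject                // loaded objects by object number
	load    func(num int64) (core.PdfObject, error) // re-parses the object from the file on a miss
}

func newObjectCache(maxSize int, load func(int64) (core.PdfObject, error)) *objectCache {
	return &objectCache{
		maxSize: maxSize,
		order:   list.New(),
		objects: make(map[int64]core.PdfObject),
		load:    load,
	}
}

// Get returns the cached object, loading (and possibly evicting) as needed.
func (c *objectCache) Get(num int64) (core.PdfObject, error) {
	if obj, ok := c.objects[num]; ok {
		return obj, nil
	}
	obj, err := c.load(num)
	if err != nil {
		return nil, err
	}
	if c.order.Len() >= c.maxSize {
		// Free the object with the oldest load timestamp first.
		oldest := c.order.Remove(c.order.Front()).(int64)
		delete(c.objects, oldest)
	}
	c.order.PushBack(num)
	c.objects[num] = obj
	return obj, nil
}
```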

Note: Lazy loading is supported in v3 with NewPdfReaderLazy().
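For reference, a minimal sketch of reading a large file lazily in v3 and extracting text page by page (license setup and robust error handling omitted; uses the `model` and `extractor` packages as I understand the public v3 API):

```go
package main

import (
	"fmt"
	"os"

	"github.com/unidoc/unipdf/v3/extractor"
	"github.com/unidoc/unipdf/v3/model"
)

func main() {
	f, err := os.Open("large.pdf")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Lazy reader: objects are resolved from the file as they are accessed
	// instead of parsing the whole object tree up front.
	reader, err := model.NewPdfReaderLazy(f)
	if err != nil {
		panic(err)
	}

	numPages, err := reader.GetNumPages()
	if err != nil {
		panic(err)
	}

	for i := 1; i <= numPages; i++ {
		page, err := reader.GetPage(i)
		if err != nil {
			panic(err)
		}
		ex, err := extractor.New(page)
		if err != nil {
			panic(err)
		}
		text, err := ex.ExtractText()
		if err != nil {
			panic(err)
		}
		fmt.Printf("--- page %d ---\n%s\n", i, text)
	}
}
```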

@gunnsth
Contributor

gunnsth commented Apr 14, 2019

Presumably most of the memory consumed is from large stream objects (it would be good to verify this against a large corpus). A potential fix would be to change PdfObjectStream, unexport its Stream []byte field, and instead provide a higher-level StreamAccessor / StreamCache which can return the []byte directly if it is loaded in memory, or load it from the file when needed.

The tricky part might be knowing whether the data is encoded or not and applying an encoding. If the stream has been changed (e.g. has been encoded), it would need to be kept in memory, or stored in some temporary storage where it could be re-accessed without staying resident the entire time, such as boltdb or something similar.
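To make the idea concrete, the accessor might look roughly like this (purely a proposal sketch; StreamAccessor and its methods do not exist in the current API):

```go
// StreamAccessor is a hypothetical abstraction over the raw stream data of a
// PdfObjectStream. Instead of holding Stream []byte on the object, callers go
// through the accessor, which may serve the bytes from memory or re-read them
// from the source file on demand.
type StreamAccessor interface {
	// GetRawBytes returns the stream data as stored in the file
	// (i.e. still encoded with whatever filters the stream declares).
	GetRawBytes() ([]byte, error)

	// GetDecodedBytes returns the stream data after applying the stream's
	// filters (FlateDecode, DCTDecode, ...).
	GetDecodedBytes() ([]byte, error)

	// SetBytes replaces the stream data. Modified data can no longer be
	// re-read from the original file, so the implementation has to keep it
	// in memory or spill it to temporary storage (e.g. boltdb).
	SetBytes(data []byte, encoded bool) error
}
```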

gunnsth transferred this issue from unidoc/unidoc May 23, 2019
@seankungrubrik

Is there a solution for writing a large PDF whose content doesn't fit in memory?

@ipod4g

ipod4g commented Aug 13, 2024

@seankungrubrik

We're currently working on implementing a solution for handling very large PDFs, including features like lazy writing.
In the meantime, you can use methods such as NewPdfReaderLazy() or lazy mode for large images with SetLazy() to help manage large PDFs more efficiently.
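For example, a rough sketch of copying pages out of a huge input file with a lazy reader and writing them back out (license setup and detailed error handling omitted; note this mainly keeps the read side lazy, while the write side is what the lazy-writing work mentioned above addresses):

```go
package main

import (
	"os"

	"github.com/unidoc/unipdf/v3/model"
)

func main() {
	in, err := os.Open("huge-input.pdf")
	if err != nil {
		panic(err)
	}
	defer in.Close()

	// Lazy reader: page objects are resolved from the file on access.
	reader, err := model.NewPdfReaderLazy(in)
	if err != nil {
		panic(err)
	}

	writer := model.NewPdfWriter()

	numPages, err := reader.GetNumPages()
	if err != nil {
		panic(err)
	}
	for i := 1; i <= numPages; i++ {
		page, err := reader.GetPage(i)
		if err != nil {
			panic(err)
		}
		if err := writer.AddPage(page); err != nil {
			panic(err)
		}
	}

	if err := writer.WriteToFile("output.pdf"); err != nil {
		panic(err)
	}
}
```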

If you have any questions or need further assistance, please let us know.

Thanks
