TikaService - scan max pages - for PDF documents #159

rsoika · 2021-12-02T20:24:08Z

Provide a new option to scan only a maximum number of pages of a PDF document.
This prevents very large documents from blocking the Tika service with CPU-intensive OCR scanning.

The method 'doORCProcessing' should accept a new optional parameter MaxPages and truncate the PDF document if its page count exceeds this value.

We can use PDFBox to implement this feature.
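A minimal sketch of the truncation step, assuming that pages beyond the limit are dropped from the end of the document and that a MaxPages value of 0 or less means "no limit". The `PagedDocument` interface, the `PageLimiter` class and the `InMemoryDocument` stand-in are hypothetical names used only for illustration. The two interface methods mirror `getNumberOfPages()` and `removePage(int)` of PDFBox's `PDDocument`, so the sketch runs without PDFBox on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class PageLimiter {

    /** Hypothetical abstraction; both methods mirror PDFBox's PDDocument. */
    public interface PagedDocument {
        int getNumberOfPages();

        void removePage(int pageIndex); // zero-based index
    }

    /**
     * Removes trailing pages until at most maxPages remain.
     * Assumption: maxPages <= 0 means "no limit", so nothing is removed.
     *
     * @return the number of pages that were removed
     */
    public static int truncate(PagedDocument doc, int maxPages) {
        if (maxPages <= 0) {
            return 0;
        }
        int removed = 0;
        // Remove from the end so the indices of the kept pages stay stable.
        while (doc.getNumberOfPages() > maxPages) {
            doc.removePage(doc.getNumberOfPages() - 1);
            removed++;
        }
        return removed;
    }

    /** In-memory stand-in for a PDF, used only to demonstrate the logic. */
    public static class InMemoryDocument implements PagedDocument {
        private final List<Integer> pages = new ArrayList<>();

        public InMemoryDocument(int pageCount) {
            for (int i = 1; i <= pageCount; i++) {
                pages.add(i);
            }
        }

        @Override
        public int getNumberOfPages() {
            return pages.size();
        }

        @Override
        public void removePage(int pageIndex) {
            pages.remove(pageIndex); // List.remove(int index)
        }
    }
}
```

With PDFBox, a thin adapter around `PDDocument` could implement `PagedDocument`. The truncated document would then be saved to a byte array before it is sent to the Tika service. How this fits into 'doORCProcessing' is not specified in this issue.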


rsoika added enhancement feature labels Dec 2, 2021

rsoika added this to the 2.2.14 milestone Dec 2, 2021

rsoika changed the title ~~TikaService - scan max pages~~ TikaService - scan max pages - for PDF documents Dec 2, 2021

rsoika added a commit that referenced this issue Dec 2, 2021

implementation, docu

91b24ec

Issue #159

rsoika added the testing label Dec 2, 2021

rsoika closed this as completed Oct 26, 2022
