TikaService - scan max pages - for PDF documents #159

Closed
rsoika opened this issue Dec 2, 2021 · 0 comments
rsoika commented Dec 2, 2021

Provide a new option to scan only a maximum number of pages of a PDF document.
This prevents very large documents from blocking the Tika service with CPU-intensive OCR scanning.

The method 'doORCProcessing' should accept a new optional parameter MaxPages and cut the PDF document if its number of pages exceeds this parameter.

We can use Apache PDFBox to implement this feature.
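
The cutting step could look roughly like the sketch below. It is not the project's implementation: `MaxPagesSketch`, `PagedDocument`, `ListDocument` and `limitPages` are hypothetical names, and treating `maxPages <= 0` as "no limit" is an assumption about how the optional parameter might behave. The small `PagedDocument` interface only mirrors the two PDFBox `PDDocument` methods the idea relies on, `getNumberOfPages()` and `removePage(int)`, so that the logic can run without any PDF library on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class MaxPagesSketch {

    /**
     * Hypothetical minimal view of a paged document. PDFBox's PDDocument
     * offers methods with the same names; page indexes are zero-based.
     */
    interface PagedDocument {
        int getNumberOfPages();

        void removePage(int pageIndex);
    }

    /**
     * Removes all pages beyond maxPages and returns the number of removed pages.
     * Assumption: maxPages <= 0 means "no limit", so the document is left untouched.
     */
    static int limitPages(PagedDocument doc, int maxPages) {
        if (maxPages <= 0) {
            return 0;
        }
        int removed = 0;
        // Remove from the end so the indexes of the remaining pages do not shift.
        for (int i = doc.getNumberOfPages() - 1; i >= maxPages; i--) {
            doc.removePage(i);
            removed++;
        }
        return removed;
    }

    /** In-memory stand-in for a real PDF, used only to demonstrate the logic. */
    static class ListDocument implements PagedDocument {
        final List<String> pages = new ArrayList<>();

        ListDocument(int pageCount) {
            for (int i = 1; i <= pageCount; i++) {
                pages.add("page " + i);
            }
        }

        @Override
        public int getNumberOfPages() {
            return pages.size();
        }

        @Override
        public void removePage(int pageIndex) {
            pages.remove(pageIndex);
        }
    }

    public static void main(String[] args) {
        ListDocument doc = new ListDocument(10);
        int removed = limitPages(doc, 3);
        // prints "7 removed, 3 left"
        System.out.println(removed + " removed, " + doc.getNumberOfPages() + " left");
    }
}
```

With PDFBox, the same loop could presumably be applied to a loaded `PDDocument` before the truncated document is saved and handed to the OCR step. Documents that already fit within the limit pass through unchanged.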

rsoika added this to the 2.2.14 milestone Dec 2, 2021
rsoika changed the title from "TikaService - scan max pages" to "TikaService - scan max pages - for PDF documents" Dec 2, 2021
rsoika added a commit that referenced this issue Dec 2, 2021
rsoika added the testing label Dec 2, 2021
rsoika closed this as completed Oct 26, 2022