Add support for --output-type pdf
#222
Comments
Well, since the app itself doesn't create the new PDF content, I would assume there is a problem with OCRmyPDF itself. If possible, please post the problematic file here, or check what happens if you invoke OCRmyPDF directly from the CLI with the problematic PDF as input.
I ran these tests:
Here, the file is neither touched nor modified. After running it with: the file gets "corrupted".
Can you repeat your test with the There may be a problem with the metadata of the PDF, as observed in this issue. Metadata can be preserved if the output file is not an archival PDF/A file but a regular PDF file, created with the additional command line option --output-type pdf
Thanks for checking this, @bahnwaerter! If the
@bahnwaerter Sure. Here is the output:
Thanks for the verbose output @SKB-CGN. Now we can see that there are two problems in the input PDF file:
The first problem indeed concerns the PDF metadata. The tool that generated the PDF embedded characters in the metadata using an encoding that is not permitted by the PDF/A standard. The second problem is more of a warning: apparently an interpolated image is to be embedded here, which the PDF/A standard does not allow either. @SKB-CGN: What tool was used to create the PDF?
Sure, we could add this flag to the command line call performed by this app. However, this should never be done by default, only when necessary (especially if PDF files were not generated in accordance with the PDF/A standard). Can we create an optional configuration in the workflow settings for this?
@bahnwaerter The tool is 'WISO Vermieter'.
Sure, sounds like the best and most flexible solution 👍
@SKB-CGN Maybe that's one for the Buhl Data support team. I'd suggest they produce PDF/A-compliant documents 😄
@R0Wi Perhaps they should. But you know, big company with their own rules 😁
So to summarize: the problem mentioned here is mainly related to some PDF/A compliance issues which cannot be handled by For this app, that would mean releasing a new feature:
--output-type pdf
Thanks @SKB-CGN for sharing the tool's name. Most likely this tool does not create PDF/A-compatible documents. However, it may also implement the latest version of the PDF/A standard, which Ghostscript may not currently support. Feel free to check the PDF/A version of your document. If the version number is supported by Ghostscript, then the tool is faulty. In that case we would appreciate it if you contacted the Buhl Data support team and reported the error. @R0Wi: Your suggestion for adding a workflow setting sounds good. Possibly we can implement all flags for the command line option
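For anyone wanting to check the PDF/A version as suggested above, here is a rough sketch. It assumes the file declares its PDF/A level in an uncompressed XMP metadata packet via the `pdfaid:part` and `pdfaid:conformance` properties; the helper name `detect_pdfa_version` is made up for illustration. It only reports what the file claims about itself and does not validate conformance.

```python
import re
from typing import Optional

# XMP may store the values as elements (<pdfaid:part>1</pdfaid:part>)
# or as attributes (pdfaid:part="1"); both forms are matched here.
_PDFA_PART = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*(\d)")
_PDFA_CONF = re.compile(rb"pdfaid:conformance(?:>|=[\"'])\s*([A-Za-z])")


def detect_pdfa_version(data: bytes) -> Optional[str]:
    """Return the declared PDF/A level (e.g. 'PDF/A-1b'), or None if absent."""
    part = _PDFA_PART.search(data)
    if part is None:
        return None  # no PDF/A claim found in the raw bytes
    conf = _PDFA_CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return f"PDF/A-{part.group(1).decode('ascii')}{level}"


def detect_pdfa_version_in_file(path: str) -> Optional[str]:
    """Convenience wrapper that reads the whole file into memory first."""
    with open(path, "rb") as handle:
        return detect_pdfa_version(handle.read())
```

If this returns `None`, the document either makes no PDF/A claim or stores its metadata in a form this simple byte search cannot see.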
@bahnwaerter In general I agree. But I'm still not sure if we really need a dropdown for this (I think we would need both a switch and a dropdown then...), since according to the docs:
So my understanding is that omitting the
Yes, the text in the documentation clearly states that the default configuration of OCRmyPDF is the explicitly set If at some point the two-valued configuration logic is no longer sufficient, we can always introduce a dropdown list of flags.
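A minimal sketch of the two-valued setting discussed here, written in Python purely for illustration. The helper `build_ocrmypdf_args` and the boolean `keep_regular_pdf` are hypothetical names, not taken from this app's code. The flag `--output-type pdf` is only added when the switch is on, so OCRmyPDF's default behaviour is left alone otherwise.

```python
from typing import List


def build_ocrmypdf_args(input_pdf: str, output_pdf: str,
                        keep_regular_pdf: bool = False) -> List[str]:
    """Assemble an ocrmypdf command line as an argument list."""
    args = ["ocrmypdf"]
    if keep_regular_pdf:
        # Opt-in only: produce a regular PDF instead of converting to PDF/A.
        args += ["--output-type", "pdf"]
    # Options go first, then the input and output files.
    args += [input_pdf, output_pdf]
    return args
```

The resulting list could be handed to something like `subprocess.run(args, check=True)` on a machine where `ocrmypdf` is installed.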
Hi,
which now produces a lot of warnings/errors inside Nextcloud:
Warning | workflow_ocr | OCRmyPDF succeeded with warning(s): PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr, | 2023-09-21T17:00:12+0200
Should this normally be suppressed, as the application should not produce any output?
This is a known problem which we're currently working on (see #232).
@SKB-CGN FYI: when implementing #233, I reviewed your problem as well, but I came to the conclusion that logging a warning is mandatory if As a workaround, please raise your log level so that, for example, only errors are logged. You can also use log rotation to control the size of your logs.
With #272, you can now add custom CLI arguments for
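As a rough illustration of what a custom-arguments setting could translate to on the command line (this is not the implementation from #272; the helper name `build_command_with_custom_args` is an assumption), the free-text value can be split shell-style and placed before the input and output files:

```python
import shlex
from typing import List


def build_command_with_custom_args(input_pdf: str, output_pdf: str,
                                   custom_args: str = "") -> List[str]:
    """Split a user-supplied argument string and build the ocrmypdf call."""
    # shlex.split honours quoting, so an argument containing spaces stays intact.
    extra = shlex.split(custom_args)
    return ["ocrmypdf", *extra, input_pdf, output_pdf]
```

For example, a value of `--output-type pdf --skip-text` would request a regular PDF and skip pages that already contain text.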
Hi,
I am not sure how this error is connected to an upgrade of Nextcloud from 25 to 26.
I have uploaded a new PDF, which was processed but displays the wrong character set.
This is the original text:
This is the text of the converted one:
But when I select the text with the mouse and copy it, the correct text is shown, which is:
anbei erhaltenSiedieBetriebskostenabrech nungfürdasJah r2022. Bei derBerech nungwirdIhrNutzungszeitraum vom 01.03.2022-31.12.2022berücksic
It would be great if you knew what kind of issue this could be.
Thank you!