Add support for --output-type pdf #222

Closed
SKB-CGN opened this issue Aug 20, 2023 · 18 comments
Labels
enhancement New feature or request

Comments

@SKB-CGN

SKB-CGN commented Aug 20, 2023

Hi,
I am not sure how this error is connected to an upgrade of Nextcloud from 25 to 26.

I uploaded a new PDF, which was processed, but it now displays the wrong character set.

This is the original text:
[image: f1]

This is the text of the converted one:
[image: f2]

However, selecting the text with the mouse and copying it yields the correct text:
anbei erhaltenSiedieBetriebskostenabrech nungfürdasJah r2022. Bei derBerech nungwirdIhrNutzungszeitraum vom 01.03.2022-31.12.2022berücksic

It would be great if you knew what kind of issue this could be.

Thank you!

@SKB-CGN added the bug label Aug 20, 2023

@R0Wi
Contributor

R0Wi commented Aug 20, 2023

Well, since the app itself doesn't create the new PDF content, I would assume there is a problem with ocrMyPdf itself. If possible, please post the problematic file here, or try invoking ocrMyPdf directly from the CLI with the problematic PDF as input and see what happens.

@SKB-CGN
Author

SKB-CGN commented Aug 20, 2023

I ran these tests:

root@webserver:/home/webserveradmin# ocrmypdf Abrechnung.pdf test_out.pdf
Using Tesseract OpenMP thread limit 2
Start processing 4 pages concurrently
OCR: 0%| | 0.0/4.0 [00:00<?, ?page/s]
PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr

Here, the file is neither touched nor modified.

After running it with:
root@webserver:/home/webserveradmin# ocrmypdf Abrechnung.pdf test_out.pdf --skip-text
[00:00<00:00, 55.94page/s]
Using Tesseract OpenMP thread limit 2
Start processing 4 pages concurrently
2 skipping all processing on this page
3 skipping all processing on this page
1 skipping all processing on this page
4 skipping all processing on this page
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
Output file is a PDF/A-2B (as expected)

the file gets "corrupted".

@bahnwaerter
Collaborator

Can you repeat your test with the -v command line option to get more verbose output?
Maybe this will reveal more about the problem.

There may be a problem with the PDF's metadata, as observed in this issue. Metadata can be preserved if the output is a regular PDF rather than an archival PDF/A file; the additional command-line option --output-type pdf produces such a file.

@R0Wi
Contributor

R0Wi commented Aug 21, 2023

Thanks for checking this @bahnwaerter ! If the --output-type pdf flag fixes this issue, we should think about adding it to the command queried by this app, what do you think? Is there any particular reason why this is not the default? 😄

@SKB-CGN
Author

SKB-CGN commented Aug 21, 2023

@bahnwaerter Sure. Here is the output:

root@webserver:/home/webserveradmin# ocrmypdf Abrechnung.pdf test_out.pdf --skip-text -v
ocrmypdf 10.3.1+dfsg
Running: ['tesseract', '--list-langs']
No language specified; assuming --language eng
Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['tesseract', '-l', 'eng', '--print-parameters', 'pdf']
Running: ['gs', '--version']
Found gs 9.53.3
pikepdf mmap enabled
os.symlink(Abrechnung.pdf, /tmp/com.github.ocrmypdf.u18nzsj8/origin)
os.symlink(/tmp/com.github.ocrmypdf.u18nzsj8/origin, /tmp/com.github.ocrmypdf.u18nzsj8/origin.pdf)
pikepdf mmap enabled
pikepdf mmap enabled
Scanning contents: 100%|████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 83.58page/s]
Using Tesseract OpenMP thread limit 2
Start processing 4 pages concurrently
pikepdf mmap enabled
pikepdf mmap enabled
pikepdf mmap enabled
pikepdf mmap enabled
Rotations for page 0: [text, auto, misalign, content] = 0, 0, 0, 0
    1 skipping all processing on this page
Rotations for page 1: [text, auto, misalign, content] = 0, 0, 0, 0
    2 skipping all processing on this page
Rotations for page 3: [text, auto, misalign, content] = 0, 0, 0, 0
    4 skipping all processing on this page
Rotations for page 2: [text, auto, misalign, content] = 0, 0, 0, 0
    3 skipping all processing on this page
OCR: 100%|█████████████████████████████████████████████████████████████████| 4.0/4.0 [00:00<00:00, 207.91page/s]
Running: ['gs', '-dQUIET', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/com.github.ocrmypdf.u18nzsj8/fix_docinfo.pdf', '/tmp/com.github.ocrmypdf.u18nzsj8/pdfa.ps']
stderr = GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: PDFA doesn't allow images with Interpolate true.

Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/}MetadataDate'}
XrefExt(xref=23, ext='.png')
Optimizable images: JPEGs: 0 PNGs: 1
JPEGs: 0image [00:00, ?image/s]
Optimizable images: JBIG2 groups: (0,)
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
os.symlink(/tmp/com.github.ocrmypdf.u18nzsj8/optimize.opt.pdf, /tmp/com.github.ocrmypdf.u18nzsj8/optimize.pdf)
/tmp/com.github.ocrmypdf.u18nzsj8/optimize.pdf -> test_out.pdf
Output file is a PDF/A-2B (as expected)

@bahnwaerter
Collaborator

Thanks for the verbose output @SKB-CGN. Now we can see that there are two problems in the input PDF file:

  • UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
  • PDFA doesn't allow images with Interpolate true

The first problem does concern the PDF metadata: the tool that generated the PDF embedded characters in the metadata using an encoding that the PDF/A standard does not permit. The second problem is better understood as a warning: the file apparently embeds an image with interpolation enabled, which the PDF/A standard does not allow either.
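
For longer verbose logs it can help to collapse the repeated Ghostscript lines into one entry per distinct message. The following is only a rough sketch: the function name and the regular expression are my own assumptions based on the line format in the log above, not part of OCRmyPDF or this app.

```python
import re
from collections import Counter

# Matches lines such as "GPL Ghostscript 9.53.3: <message>", optionally
# prefixed with "stderr = " as in the verbose log above (assumed format).
GS_LINE = re.compile(r"^(?:stderr = )?GPL Ghostscript [\d.]+: (?P<msg>.+)$")


def summarize_ghostscript_warnings(log_text: str) -> Counter:
    """Count each distinct Ghostscript message found in a verbose log."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = GS_LINE.match(line.strip())
        if match:
            # Drop a trailing period so identical messages compare equal.
            counts[match.group("msg").rstrip(".")] += 1
    return counts


sample = """\
stderr = GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: PDFA doesn't allow images with Interpolate true.
"""

for message, count in summarize_ghostscript_warnings(sample).items():
    print(f"{count}x {message}")
```

Run against the full log, this would reduce the output to the two distinct messages listed above.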

@SKB-CGN: What tool was used to create the PDF?

@bahnwaerter
Collaborator

> If the --output-type pdf flag fixes this issue, we should think about adding it to the command queried by this app, what do you think?
> Is there any particular reason why this is not the default?

Sure, we could add this flag to the command line call performed by this app. However, this should never be done by default, but only when necessary (especially if PDF files were not generated in accordance with the PDF/A standard).

Can we create an optional configuration in the workflow settings for this?

@SKB-CGN
Author

SKB-CGN commented Aug 21, 2023

@bahnwaerter the tool is 'WISO Vermieter'.
German tool from Buhl Data, to create invoices.

@R0Wi
Contributor

R0Wi commented Aug 21, 2023

> Can we create an optional configuration in the workflow settings for this?

Sure, sounds like the best and most flexible solution 👍

> @bahnwaerter the tool is 'WISO Vermieter'.
> German tool from Buhl Data, to create invoices.

@SKB-CGN maybe that's one for the Buhl Data support team. I'd suggest they should produce PDF/A compliant documents 😄

@SKB-CGN
Author

SKB-CGN commented Aug 21, 2023

@R0Wi Perhaps they should. But you know - big company with their own rules 😁

@R0Wi
Contributor

R0Wi commented Aug 21, 2023

So to summarize: the problem mentioned here is mainly related to PDF/A compliance issues that ocrmypdf cannot handle.

For this app, this means releasing a new feature:

  • Introduce a new per-workflow settings switch "Output type pdf" which (if set) sets --output-type pdf. If not set, --output-type is omitted.
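
As a rough illustration of that switch (not the app's actual code, which is not written in Python), the mapping from the setting to the command line arguments could look like this. `WorkflowSettings`, `output_type_pdf` and `output_type_args` are hypothetical names chosen for this sketch; only the `--output-type pdf` option itself comes from OCRmyPDF.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkflowSettings:
    # Hypothetical per-workflow switch; the real setting may be named differently.
    output_type_pdf: bool = False


def output_type_args(settings: WorkflowSettings) -> List[str]:
    """Return the extra ocrmypdf arguments implied by the switch."""
    if settings.output_type_pdf:
        return ["--output-type", "pdf"]
    # Switch not set: omit --output-type and let ocrmypdf use its default.
    return []


print(output_type_args(WorkflowSettings(output_type_pdf=True)))
print(output_type_args(WorkflowSettings()))
```

Keeping the "off" case as an empty argument list means existing workflows behave exactly as before.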

@R0Wi added the enhancement label and removed the bug label Aug 21, 2023
@R0Wi changed the title from "Wrong signs inside pdf after ocr'ing" to "Add support for --output-type pdf" Aug 21, 2023

@bahnwaerter
Collaborator

Thanks @SKB-CGN for sharing the tool's name.

It is most likely that this tool does not create PDF/A-compatible documents. However, the tool may also implement a newer version of the PDF/A standard that Ghostscript does not yet support. Feel free to check the PDF/A version of your document. If Ghostscript supports that version, the tool is faulty; in that case we would appreciate it if you contacted the Buhl Data support team and reported the error.
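
One quick way to see which PDF/A version a file claims is to look for the `pdfaid:part` and `pdfaid:conformance` entries in its XMP metadata. The sketch below is a heuristic of my own, not an OCRmyPDF or Ghostscript feature: it assumes the XMP packet is stored uncompressed in the file, and it only reports what the file claims, not whether the file actually conforms.

```python
import re
from typing import Optional

# XMP can store the values as elements (<pdfaid:part>2</pdfaid:part>)
# or as attributes (pdfaid:part="2"); both forms are matched here.
PDFA_PART = re.compile(rb"""pdfaid:part(?:>|=["'])\s*(\d)""")
PDFA_CONF = re.compile(rb"""pdfaid:conformance(?:>|=["'])\s*([A-Za-z])""")


def claimed_pdfa_level(pdf_bytes: bytes) -> Optional[str]:
    """Return e.g. '2B' if the raw bytes contain a PDF/A claim, else None."""
    part = PDFA_PART.search(pdf_bytes)
    if part is None:
        return None
    level = part.group(1).decode("ascii")
    conformance = PDFA_CONF.search(pdf_bytes)
    if conformance is not None:
        level += conformance.group(1).decode("ascii").upper()
    return level


# Usage with a real file (path taken from the commands above):
#   from pathlib import Path
#   print(claimed_pdfa_level(Path("Abrechnung.pdf").read_bytes()))
```

A result of `None` would suggest the document makes no PDF/A claim at all, which matches the first possibility mentioned above.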

@R0Wi: Your suggestion for adding a workflow setting sounds good. Possibly we can implement all flags for the command line option --output-type, namely pdf and pdfa, as a dropdown selection box in the UI. Supporting all flags for this option allows you to have regular PDF files converted directly into PDF/A documents for archiving if desired.

@R0Wi
Contributor

R0Wi commented Aug 21, 2023

> @R0Wi: Your suggestion for adding a workflow setting sounds good. Possibly we can implement all flags for the command line option --output-type, namely pdf and pdfa, as a dropdown selection box in the UI. Supporting all flags for this option allows you to have regular PDF files converted directly into PDF/A documents for archiving if desired.

@bahnwaerter In general I agree. But still I'm not sure if we really need a dropdown for this (I think we would need both a switch and a dropdown then...) since according to the docs:

> By default, OCRmyPDF produces archival PDFs – PDF/A, which are a stricter subset of PDF features designed for long term archives. If regular PDFs are desired, this can be disabled with --output-type pdf.

So my understanding is that omitting the --output-type flag is basically the same as setting --output-type pdfa?

@bahnwaerter
Collaborator

Yes, the documentation clearly states that OCRmyPDF's default behavior is the same as explicitly setting --output-type pdfa. Because of this, we don't actually need a dropdown list of flags, so I totally agree with you.

If at some point the two-valued configuration logic is no longer sufficient, we can always introduce a dropdown list of flags.

@SKB-CGN
Author

SKB-CGN commented Sep 22, 2023

Hi,
because the pages got the wrong characters after conversion, I changed the workflow to use:

  • skip file completely

which now produces a lot of warnings/errors inside Nextcloud.

Warnung | workflow_ocr | OCRmyPDF succeeded with warning(s): PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr, | 2023-09-21T17:00:12+0200

Should this normally be suppressed as the application should not produce any output?

@R0Wi
Contributor

R0Wi commented Sep 22, 2023

> Hi, because the pages got the wrong characters after conversion, I changed the workflow to use:
>
>   • skip file completely
>
> which now produces a lot of warnings/errors inside Nextcloud.
>
> Warnung | workflow_ocr | OCRmyPDF succeeded with warning(s): PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr, | 2023-09-21T17:00:12+0200
>
> Should this normally be suppressed as the application should not produce any output?

This is a known problem which we're currently working on (see #232).

@R0Wi
Contributor

R0Wi commented Oct 5, 2023

@SKB-CGN FYI: while implementing #233 I reviewed your problem as well. I came to the conclusion that logging a warning is mandatory whenever ocrmypdf writes something to stderr, but I removed the notification that was sent in that case. Unfortunately, in my opinion, we cannot reliably tell whether a stderr message should be ignored. For example, parsing the message, searching for "PriorOcrFoundError", and not logging anything if the OCR mode is set to "skip file" seems quite error-prone to me and depends heavily on the ocrmypdf version in use. We could even miss other warnings printed by ocrmypdf if we skipped the warning in general. So to me that would just be bad design.

As a workaround, please raise your log level so that, for example, only errors are logged. You can also use log rotation to control the size of your logs.
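
To illustrate the policy and the workaround together, here is a small sketch using Python's standard `logging` module. It is not the app's implementation; `report_stderr`, `ListHandler` and the logger name are made up for the example. Every non-empty stderr text is logged as a warning without inspecting its content, and raising the log level to errors hides it.

```python
import logging


def report_stderr(stderr: str, logger: logging.Logger) -> None:
    """Log any non-empty stderr text as a warning, without content filtering."""
    text = stderr.strip()
    if text:
        logger.warning("OCRmyPDF succeeded with warning(s): %s", text)


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the effect is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


logger = logging.getLogger("workflow_ocr_sketch")  # hypothetical logger name
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

logger.setLevel(logging.WARNING)
report_stderr("PriorOcrFoundError: page already has text!", logger)  # recorded

logger.setLevel(logging.ERROR)  # the workaround: keep errors only
report_stderr("PriorOcrFoundError: page already has text!", logger)  # dropped

print(len(handler.messages))
```

The trade-off is the one described above: nothing is parsed, so no warning can be silently lost by a filter, at the cost of noisier logs at the warning level.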

@R0Wi
Contributor

R0Wi commented Dec 13, 2024

With #272, you can now add custom CLI arguments for ocrMyPdf. This should solve your issue if you just put --output-type pdf there.
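
Conceptually, a custom argument string like that gets split into individual arguments and added to the ocrmypdf call. The sketch below only shows the idea; how #272 actually implements it may differ, and `build_command` is a hypothetical helper.

```python
import shlex
from typing import List


def build_command(input_pdf: str, output_pdf: str, custom_args: str = "") -> List[str]:
    """Assemble an ocrmypdf call with user-supplied extra arguments."""
    # shlex.split respects quoting, so a quoted value with spaces stays one argument.
    return ["ocrmypdf", *shlex.split(custom_args), input_pdf, output_pdf]


print(build_command("Abrechnung.pdf", "test_out.pdf", "--skip-text --output-type pdf"))
```

Passing the result as a list (rather than one shell string) avoids shell interpretation of the user-supplied text.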

@R0Wi closed this as completed Dec 13, 2024