Add support for `--output-type pdf` #222

SKB-CGN · 2023-08-20T21:00:53Z

Hi,
I am not sure how this error is connected to an upgrade of Nextcloud from 25 to 26.

I have uploaded a new PDF, which was processed, but it displays the wrong character set.

This is the original text:

This is the text of the converted one:

However, when I select the text with the mouse and copy it, the correct text comes out:
anbei erhaltenSiedieBetriebskostenabrech nungfürdasJah r2022. Bei derBerech nungwirdIhrNutzungszeitraum vom 01.03.2022-31.12.2022berücksic

It would be great if you knew what kind of issue this could be.

Thank you!

R0Wi · 2023-08-20T21:04:57Z

Well, since the app itself doesn't create the new PDF content, I would assume there is a problem with ocrMyPdf itself. If possible, please post the problematic file here, or try invoking ocrMyPdf directly from the CLI with the problematic PDF as input.

SKB-CGN · 2023-08-20T21:19:22Z

I ran these tests:

root@webserver:/home/webserveradmin# ocrmypdf Abrechnung.pdf test_out.pdf
Using Tesseract OpenMP thread limit 2
Start processing 4 pages concurrently
OCR: 0%| | 0.0/4.0 [00:00<?, ?page/s]
PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr

Here, the file is neither touched nor modified.

After running it with:
root@webserver:/home/webserveradmin# ocrmypdf Abrechnung.pdf test_out.pdf --skip-text
[00:00<00:00, 55.94page/s]
Using Tesseract OpenMP thread limit 2
Start processing 4 pages concurrently
2 skipping all processing on this page
3 skipping all processing on this page
1 skipping all processing on this page
4 skipping all processing on this page
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
Output file is a PDF/A-2B (as expected)

the file gets "corrupted".

bahnwaerter · 2023-08-21T05:27:14Z

Can you repeat your test with the -v command line option to get more verbose output?
Maybe this will reveal more about the problem.

There may be a problem with the metadata of the PDF, as observed in this issue. Metadata can be preserved if the output is a regular PDF rather than an archival PDF/A file, which the additional command line option --output-type pdf selects.

R0Wi · 2023-08-21T05:49:23Z

Thanks for checking this @bahnwaerter! If the --output-type pdf flag fixes this issue, we should think about adding it to the command invoked by this app, what do you think? Is there any particular reason why this is not the default? 😄

SKB-CGN · 2023-08-21T06:17:31Z

@bahnwaerter Sure. Here is the output:

root@webserver:/home/webserveradmin# ocrmypdf Abrechnung.pdf test_out.pdf --skip-text -v
ocrmypdf 10.3.1+dfsg
Running: ['tesseract', '--list-langs']
No language specified; assuming --language eng
Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['tesseract', '-l', 'eng', '--print-parameters', 'pdf']
Running: ['gs', '--version']
Found gs 9.53.3
pikepdf mmap enabled
os.symlink(Abrechnung.pdf, /tmp/com.github.ocrmypdf.u18nzsj8/origin)
os.symlink(/tmp/com.github.ocrmypdf.u18nzsj8/origin, /tmp/com.github.ocrmypdf.u18nzsj8/origin.pdf)
pikepdf mmap enabled
pikepdf mmap enabled
Scanning contents: 100%|████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 83.58page/s]
Using Tesseract OpenMP thread limit 2
Start processing 4 pages concurrently
pikepdf mmap enabled
pikepdf mmap enabled
pikepdf mmap enabled
pikepdf mmap enabled
Rotations for page 0: [text, auto, misalign, content] = 0, 0, 0, 0
    1 skipping all processing on this page
Rotations for page 1: [text, auto, misalign, content] = 0, 0, 0, 0
    2 skipping all processing on this page
Rotations for page 3: [text, auto, misalign, content] = 0, 0, 0, 0
    4 skipping all processing on this page
Rotations for page 2: [text, auto, misalign, content] = 0, 0, 0, 0
    3 skipping all processing on this page
OCR: 100%|█████████████████████████████████████████████████████████████████| 4.0/4.0 [00:00<00:00, 207.91page/s]
Running: ['gs', '-dQUIET', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/com.github.ocrmypdf.u18nzsj8/fix_docinfo.pdf', '/tmp/com.github.ocrmypdf.u18nzsj8/pdfa.ps']
stderr = GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
GPL Ghostscript 9.53.3: PDFA doesn't allow images with Interpolate true.

Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/}MetadataDate'}
XrefExt(xref=23, ext='.png')
Optimizable images: JPEGs: 0 PNGs: 1
JPEGs: 0image [00:00, ?image/s]
Optimizable images: JBIG2 groups: (0,)
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
os.symlink(/tmp/com.github.ocrmypdf.u18nzsj8/optimize.opt.pdf, /tmp/com.github.ocrmypdf.u18nzsj8/optimize.pdf)
/tmp/com.github.ocrmypdf.u18nzsj8/optimize.pdf -> test_out.pdf
Output file is a PDF/A-2B (as expected)

bahnwaerter · 2023-08-21T06:57:19Z

Thanks for the verbose output @SKB-CGN. Now we can see that there are two problems in the input PDF file:

UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
PDFA doesn't allow images with Interpolate true

The first problem indeed concerns the PDF metadata. Here, the tool that generated the PDF embedded characters in the metadata with an encoding that is not permitted in the PDF/A standard. The second problem is better understood as a warning. Apparently the input contains an image with the Interpolate flag set, which the PDF/A standard does not allow either.
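The verbose log repeats the same Ghostscript message several times, which makes the distinct problems easy to miss. As a rough sketch (the helper name `summarize_ghostscript_warnings` and the `stderr = ` prefix handling are assumptions based only on the log format shown above, not part of ocrmypdf or this app), the messages could be grouped and counted like this:

```python
from collections import Counter
from typing import Dict

# Prefix seen in the verbose output above; other Ghostscript builds may differ.
GS_PREFIX = "GPL Ghostscript"
LOG_PREFIX = "stderr = "


def summarize_ghostscript_warnings(log_text: str) -> Dict[str, int]:
    """Count distinct Ghostscript messages in a captured ocrmypdf -v log."""
    counts: Counter = Counter()
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        # The first Ghostscript line in the log is prefixed with "stderr = ".
        if line.startswith(LOG_PREFIX):
            line = line[len(LOG_PREFIX):]
        if not line.startswith(GS_PREFIX):
            continue
        # Drop the "GPL Ghostscript <version>" part and keep the message.
        _, _, message = line.partition(": ")
        if message:
            counts[message.strip()] += 1
    return dict(counts)
```

Applied to the log above, this would report the DOCINFO message five times and the Interpolate message once.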

@SKB-CGN: What tool was used to create the PDF?

bahnwaerter · 2023-08-21T07:03:16Z

> If the --output-type pdf flag fixes this issue, we should think about adding it to the command invoked by this app, what do you think?
> Is there any particular reason why this is not the default?

Sure, we could add this flag to the command line call performed by this app. However, it should never be set by default, only when necessary (especially if the PDF files were not generated in accordance with the PDF/A standard).

Can we create an optional configuration in the workflow settings for this?

SKB-CGN · 2023-08-21T07:31:46Z

@bahnwaerter the tool is 'WISO Vermieter'.
German tool from Buhl Data, to create invoices.

R0Wi · 2023-08-21T08:51:25Z

> Can we create an optional configuration in the workflow settings for this?

Sure, sounds like the best and most flexible solution 👍

> @bahnwaerter the tool is 'WISO Vermieter'.
> German tool from Buhl Data, to create invoices.

@SKB-CGN maybe that's one for the Buhl Data support team. I'd suggest they should produce PDF/A compliant documents 😄

SKB-CGN · 2023-08-21T08:59:50Z

@R0Wi Perhaps they should. But you know - big company with their own rules 😁

R0Wi · 2023-08-21T09:13:38Z

So to summarize: the problem mentioned here is mainly related to some PDF/A compliance issues which cannot be handled by ocrmypdf.

For this app, that means releasing a new feature:

Introduce a new per-workflow settings switch "Output type pdf" which (if set) sets --output-type pdf. If not set, --output-type is omitted.
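A minimal sketch of the proposed switch, assuming a hypothetical boolean setting `keep_regular_pdf` and a hypothetical helper `build_ocrmypdf_args` (neither name comes from this app; only the `--output-type pdf` and `--skip-text` flags are taken from the thread):

```python
from typing import List


def build_ocrmypdf_args(
    input_pdf: str,
    output_pdf: str,
    keep_regular_pdf: bool = False,
    skip_text: bool = False,
) -> List[str]:
    """Build an ocrmypdf argument list from per-workflow settings."""
    args = ["ocrmypdf"]
    if skip_text:
        args.append("--skip-text")
    # Only pass --output-type when the switch is set; otherwise the flag is
    # omitted and ocrmypdf falls back to its default output type.
    if keep_regular_pdf:
        args += ["--output-type", "pdf"]
    args += [input_pdf, output_pdf]
    return args
```

With the switch off the command stays exactly as before, so existing workflows would be unaffected.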

bahnwaerter · 2023-08-21T09:27:32Z

Thanks @SKB-CGN for sharing the tool's name.

It is most likely the case that this tool does not create PDF/A compatible documents. However, this tool may also implement the latest version of the PDF/A standard, which Ghostscript may not currently support. Feel free to check the PDF/A version of your document. If the version number is supported by Ghostscript, then the tool is faulty. In that case we would appreciate it if you contacted the Buhl Data support team and reported the error.

@R0Wi: Your suggestion for adding a workflow setting sounds good. Possibly we can implement all values of the command line option --output-type, namely pdf and pdfa, as a dropdown selection box in the UI. Supporting all values for this option allows you to have regular PDF files converted directly into PDF/A documents for archiving if desired.
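One way to check which PDF/A version a document claims is to look at the `pdfaid:part` and `pdfaid:conformance` entries in its XMP metadata. The sketch below assumes the XMP packet has already been extracted as text (how that is done is left open here), and the helper name `pdfa_claim` is made up for illustration:

```python
import re
from typing import Optional, Tuple

# The PDF/A identification entries may be written as XML elements
# (<pdfaid:part>2</pdfaid:part>) or as attributes (pdfaid:part="2").
_PART = re.compile(r"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")
_CONFORMANCE = re.compile(r"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def pdfa_claim(xmp_packet: str) -> Optional[Tuple[int, str]]:
    """Return (part, conformance) claimed in the XMP, or None if absent."""
    part = _PART.search(xmp_packet)
    if part is None:
        return None  # The document does not claim to be PDF/A at all.
    conformance = _CONFORMANCE.search(xmp_packet)
    level = conformance.group(1).upper() if conformance else ""
    return int(part.group(1)), level
```

Note that this only reads what the file claims about itself; it does not validate that the file actually conforms.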

R0Wi · 2023-08-21T09:33:02Z

> @R0Wi: Your suggestion for adding a workflow setting sounds good. Possibly we can implement all flags for the command line option --output-type, namely pdf and pdfa, as a dropdown selection box in the UI. Supporting all flags for this option allows you to have regular PDF files converted directly into PDF/A documents for archiving if desired.

@bahnwaerter In general I agree. But still I'm not sure if we really need a dropdown for this (I think we would need both a switch and a dropdown then...) since according to the docs:

By default, OCRmyPDF produces archival PDFs – PDF/A, which are a stricter subset of PDF features designed for long term archives. If regular PDFs are desired, this can be disabled with --output-type pdf.

So my understanding is that omitting the --output-type flag is basically the same as setting --output-type pdfa?

bahnwaerter · 2023-08-21T17:04:38Z

Yes, the documentation clearly states that OCRmyPDF's default behavior is the same as explicitly setting --output-type pdfa. Because of this, we don't actually need a dropdown list of flags. So I totally agree with you.

If at some point the two-valued configuration logic is no longer sufficient, we can always introduce a dropdown list of flags.

SKB-CGN · 2023-09-22T06:24:44Z

Hi,
because the pages got the wrong characters after converting, I changed the workflow to use:

skip file completely

which now produces a lot of warnings/errors inside Nextcloud.

Warnung | workflow_ocr | OCRmyPDF succeeded with warning(s): PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr, | 2023-09-21T17:00:12+0200

Should this normally be suppressed as the application should not produce any output?

R0Wi · 2023-09-22T06:28:37Z

> Hi, because the pages got the wrong characters after converting, I changed the workflow to use:
>
> skip file completely
>
> which now produces a lot of warnings/errors inside Nextcloud.
>
> Warnung | workflow_ocr | OCRmyPDF succeeded with warning(s): PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr, | 2023-09-21T17:00:12+0200
>
> Should this normally be suppressed as the application should not produce any output?

This is a known problem which we're currently working on (see #232)

R0Wi · 2023-10-05T07:34:20Z

@SKB-CGN FYI: when implementing #233, I reviewed your problem as well. I came to the conclusion that logging a warning is mandatory whenever ocrmypdf writes something to stderr, but I removed the notification which was sent in that case. Unfortunately (in my opinion) we cannot reliably tell whether a stderr message should be ignored. For example, parsing the message, searching for "PriorOcrFoundError" and not logging anything if the OCR mode is set to "skip file" seems quite error-prone to me and depends heavily on the ocrmypdf version in use. We could even miss other warnings printed by ocrmypdf if we skipped the warning in general. So to me that would just be bad design.

As a workaround, please increase your log level so that, for example, only errors are logged. You can also use log rotation to control the size of your logs.
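The policy described above can be sketched roughly as follows. This is an illustration only: the app's actual code is not shown in this thread, the function name `report_ocr_result` and its return labels are made up, and Python's `logging` module stands in for the Nextcloud logger.

```python
import logging


def report_ocr_result(succeeded: bool, stderr_text: str, logger: logging.Logger) -> str:
    """Log the outcome of an OCR run without inspecting the stderr content."""
    message = stderr_text.strip()
    if not succeeded:
        logger.error("OCRmyPDF failed: %s", message)
        return "error"
    if message:
        # Any stderr output is logged as a warning. The text is deliberately
        # not parsed (for example for "PriorOcrFoundError"), because its
        # wording may change between ocrmypdf versions.
        logger.warning("OCRmyPDF succeeded with warning(s): %s", message)
        return "warning"
    return "ok"
```

In this sketch, raising the logger's threshold to `logging.ERROR` would hide the warnings while still recording failures, which mirrors the log level workaround.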

R0Wi · 2024-12-13T21:48:47Z

With #272, you can now add custom CLI arguments for ocrMyPdf. This should solve your issue if you just put --output-type pdf there.

SKB-CGN added the bug Something isn't working label Aug 20, 2023

SKB-CGN assigned R0Wi Aug 20, 2023

R0Wi added enhancement New feature or request and removed bug Something isn't working labels Aug 21, 2023

R0Wi changed the title ~~Wrong signs inside pdf after ocr'ing~~ Add support for --output-type pdf Aug 21, 2023

R0Wi closed this as completed Dec 13, 2024
