Generating PDF/A conforming PDFs #630

sheppie123 · 2018-05-15T15:02:39Z

Is it possible to generate PDFs that conform to PDF/A using WeasyPrint?
From Wikipedia:

Other key elements to PDF/A compatibility include:

Audio and video content are forbidden.

JavaScript and executable file launches are forbidden.

All fonts must be embedded and also must be legally embeddable for
unlimited, universal rendering. This also applies to the so-called
PostScript standard fonts such as Times or Helvetica.

Colorspaces specified in a device-independent manner.

Encryption is disallowed.

Use of standards-based metadata is mandated.

Many Thanks
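
A few of the restrictions quoted above (no JavaScript, no file launches, no encryption, no audio or video) can be spot-checked with a rough byte scan. This is a minimal sketch, not a conformance check: the helper name and token table are hypothetical, and the scan misses anything stored inside compressed object streams and may report tokens that only appear in page text. A real validator is still needed for an actual verdict.

```python
# Hypothetical helper: crude scan for PDF name tokens tied to features
# that PDF/A forbids. It only looks at raw bytes, so it can miss tokens
# in compressed streams and is no substitute for a validator.
FORBIDDEN_TOKENS = {
    b"/JavaScript": "JavaScript actions",
    b"/Launch": "executable file launches",
    b"/Encrypt": "encryption",
    b"/Sound": "audio content",
    b"/Movie": "video content",
}


def quick_pdfa_red_flags(pdf_bytes):
    """Return the reasons for every forbidden token found in the raw bytes."""
    return [reason for token, reason in FORBIDDEN_TOKENS.items()
            if token in pdf_bytes]


if __name__ == "__main__":
    sample = b"%PDF-1.7\n1 0 obj << /S /JavaScript >> endobj\n"
    print(quick_pdfa_red_flags(sample))
```

An empty list only means that none of these tokens were visible in the raw bytes; font embedding, color spaces and metadata are not examined at all.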

LukasKlement · 2018-06-19T10:02:31Z

I opened a ticket on PDF/X-3 compliance: #640

Perhaps to start the discussion on what direction WeasyPrint should take, it may be worthwhile to collect the purposes of the different standards:

PDF/A -> a standard used predominantly for document archiving
PDF/X -> a standard used predominantly for professional print (e.g. offset print)

For detailed differences between the two standards, see page 17 of this document: https://www.impressed.de/DOWNLOADS/pdfToolbox_Server/callas_pdfEngine_Reference.pdf

I attach great importance to PDF/X, as I believe achieving full print-compliance is an absolute necessity for a mature PDF creation/conversion tool.

liZe · 2018-08-07T17:05:08Z

I've tried to give Acrobat various PDF files generated by WeasyPrint… It's awful, there are many, many, many things to fix before reaching PDF/A or PDF/X conformance.

> I attach great importance to PDF/X, as I believe achieving full print-compliance is an absolute necessity for a mature PDF creation/conversion tool.

I agree, but there's a long way to go.

hejsan · 2020-04-13T11:35:50Z

Hi - opening this can of worms - can we list the things needed to conform to PDF/A?
@liZe You mention giving Acrobat various PDF files generated by WeasyPrint, how does it tell you what's amiss?
I thought it was mostly about not referencing any outside files and embedding all fonts - is WeasyPrint not complying with that already?

liZe · 2020-04-13T11:51:31Z

> opening this can of worms

🐛🐛🐛🐛🐛🐛🐛🐛

> can we list the things needed to conform to PDF/A?

That would be really useful.

> @liZe You mention giving Acrobat various PDF files generated by WeasyPrint, how does it tell you what's amiss?

I don’t really remember, but I think that there’s a PDF validator in Acrobat (not in Reader, it’s not free 😢).

Does anyone know an open source (or at least free) tool to check PDF/A and PDF/X conformance?

> I thought it was mostly about not referencing any outside files and embedding all fonts - is weasyprint not complying with that already?

As far as I can remember, there were lots of errors, and most of them were just impossible to fix with Cairo. I think that we need a dedicated PDF generator for that (see #841).

hejsan · 2020-04-13T13:38:35Z

I seem to recall Apache PDFBox having some features for this, I'll have to look into it more closely though.

> I think that we need a dedicated PDF generator for that

Maybe this is another use for a post-processor that would parse through the PDF and do what is needed. Seems like a massive undertaking though if it is supposed to support changing everything to be PDF/A compliant. Might be smart to start by being able to convert simple PDFs that don't include edge cases like embedded video etc.

edit: I was looking through #841 and must say I somewhat disagree about getting rid of external dependencies unless they're proving to be severe limiting factors (Maybe Cairo is?). They're literally what make big open source projects viable and not just a massive liability to the developers.

liZe · 2020-04-13T15:16:30Z

> Might be smart to start by being able to convert simple pdf's that don't include edge cases like embedded video etc.

The current post-processor only knows how to parse PDF files generated by Cairo. That removes a lot of edge cases.

> edit: I was looking through #841 and must say I somewhat disagree about getting rid of external dependencies unless they're proving to be severe limiting factors (Maybe Cairo is?). They're literally what make big opensource projects viable and not just a massive liability to the developers.

Of course, removing all external dependencies is not a goal per se. But there are some reasons why it would be interesting to consider getting rid of some of them:

- Having non-Python dependencies is the source of many, many, many installation problems, at least on Windows and macOS.
- We’ve had many problems with Cairo. More than 20% of the reported issues have the "Cairo" word in their comments.
- Cairo releases are … sometimes late. "SVG getting mangled when I export to pdf" (#278) is a good example of why it’s been really frustrating to work with its dev team.
- Cairo does a lot of things WeasyPrint’s not interested in. Generating PNG is useful for WeasyPrint, but it could be done with a PDF-to-PNG converter. Cairo is complex, and it probably won’t get new PDF-only features soon (the latest stable version is the first one providing metadata and links, for example).
- Pango should be useless for us. We use it to break lines, but HTML has requirements that are really different from "normal" use cases. That’s why we have a lot of workarounds for texts. We should use HarfBuzz instead, and break lines using a custom algorithm, just as other browsers do. See "Rewrite the line breaking algorithm" (#301), for example.

So. Here’s what I think.

- Using a "real" PDF generator would be hard but not impossible. I don’t really like ReportLab for many reasons, but something like that would be really useful.
- Having a real line-breaking algorithm would make Pango useless.
- FontConfig is really convenient for Pango, but it should be used only on Linux where it’s the standard library. We could probably rely on macOS and Windows APIs to find fonts (what do other browsers do?).
- We have to keep HarfBuzz.

hejsan · 2020-04-13T17:47:10Z

Ok, I understand and agree with your points.

> I don’t really like ReportLab for many reasons

I agree that we should steer away from any freemium solutions as they tend to become a liability down the road when they refuse to push features to their "community" versions.

Do you see this new PDF generator as a separate project or would it be part of WeasyPrint?

liZe · 2020-04-13T19:02:33Z

> I agree to steer away from any freemium solutions as they tend to become a liability down the road when they refuse to push features to their "community" versions.

👍

> Do you see this new PDF generator as a separate project or would it be part of WeasyPrint?

It can be a separate project, with a fairly low-level API. The hard part is probably to handle fonts, by creating a PangoCairo equivalent.

(If anyone knows how to convert PDF to PNG in pure Python, that would be useful too 😒.)

hejsan · 2020-04-14T11:23:39Z

I found an open-source PDF/A conformance checker that is pretty cool: https://verapdf.org/
(Download here: https://verapdf.org/software/)
There's both a simple GUI for checking individual files and a command-line tool that can be used for automatic testing.
It gives you a simple breakdown report that links to details about each error (they're all hosted in this list: https://github.com/veraPDF/veraPDF-validation-profiles/wiki/PDFA-Part-1-rules).
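
For the automatic testing mentioned above, the command-line tool could be wrapped from Python roughly as follows. This is a minimal sketch under assumptions: the function names are hypothetical, the executable is assumed to be on the PATH as `verapdf`, and the `--flavour` and `--format` options should be checked against `verapdf --help` for the installed version.

```python
import subprocess


def build_verapdf_command(pdf_path, flavour="1b", executable="verapdf"):
    """Build the argument list for one veraPDF validation run.

    Assumption: the installed CLI accepts --flavour (PDF/A part and level,
    e.g. "1b") and --format (report format); verify with `verapdf --help`.
    """
    return [executable, "--flavour", flavour, "--format", "text", pdf_path]


def run_verapdf(pdf_path, flavour="1b"):
    """Run veraPDF on one file and return its textual report (stdout)."""
    result = subprocess.run(
        build_verapdf_command(pdf_path, flavour),
        capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_verapdf() once veraPDF is installed.
    print(build_verapdf_command("output.pdf"))
```

Keeping the command construction separate from the call makes it possible to test the wrapper without veraPDF being installed.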

liZe · 2020-04-15T13:20:13Z

> I found an opensource PDF/A conformance checker that is pretty cool: https://verapdf.org/

That’s really cool, thanks!

> It gives you a simple breakdown report that links to details about each error (they're all hosted in this list: https://github.com/veraPDF/veraPDF-validation-profiles/wiki/PDFA-Part-1-rules)

That’s really impressive.

Having PDF/A conformance is probably one of the best features we can get once we have a new PDF generator. I’m currently working on that 😉. (That = the generator, not the PDF/A conformance yet)

hejsan · 2020-04-15T15:58:05Z

> I’m currently working on that

Cool, do you have an open repo for it yet? I had been pondering the same.
Thinking out loud: would PDF/A conformance have to be an option, since it would impact speed and available features?

malnajdi · 2020-04-15T16:12:20Z

@liZe is teasing a lot about this new generator. If you need help let me know 😄

oleg-medovikov · 2021-01-19T12:15:05Z

How is it going?

liZe · 2021-01-19T14:43:54Z

> How is it going?

Pretty well! The new PDF generator (called pydyf) is now used in master, almost everything works fine (missing SVG images support is the major point we have to fix).

An online PDF validator thinks that many PDF files we generate are already PDF/A compliant, but I suppose that we still have a lot of work (for tags at least, I think). We have to check with veraPDF too.

As explained in #1232, the next step is to have a master branch with the same features as the current stable versions. When it’s done, we’ll have to implement features that people want, but we can’t do everything at the same time 😉. And before that, we’re currently implementing sponsored features (#247, #1057) that may also be useful for you!

guidocioni · 2021-04-19T08:17:51Z

If I get the latest version from Conda is this already inside? Because I've been trying to produce quite simple (no images or weird components) PDF/A compliant files and from the file info I can see that the version is only 1.5 and they're not PDF/A compliant. :( So maybe the version that I'm using (52.4) still does not include pydyf support?
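
The version shown in the file info comes from the header at the start of the file, which can also be read directly. A minimal sketch (the function name is hypothetical); note that a document may declare a later version in its catalog, which this does not look at, and that the header version says nothing by itself about PDF/A conformance.

```python
import re


def pdf_header_version(data):
    """Return the version declared in the %PDF-x.y header, or None.

    Only the first kilobyte is searched, where the header is expected.
    """
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return match.group(1).decode("ascii") if match else None


if __name__ == "__main__":
    print(pdf_header_version(b"%PDF-1.5\n%\xe2\xe3\xcf\xd3\n"))
```

For a file on disk, pass the result of `open(path, "rb").read(1024)`.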

grewn0uille · 2021-04-19T09:49:31Z

Hello @guidocioni!

The latest version on Conda (52.5) doesn’t include pydyf. All 52.x versions are using (and will use) Cairo.

Currently there is no release working with pydyf, but the current master branch uses it so you can give it a try if you want 😀

guidocioni · 2021-04-19T10:39:38Z

Would be good, the problem is that where I'm deploying this I can only use conda to install anything :D Is there a way to install the master with conda? As you can imagine also converting a PDF to PDF/A using solely conda/python installation is kind of a nightmare :D

grewn0uille · 2021-04-19T13:41:49Z

I don’t think there is an easy way to install the master branch directly with Conda, but you can use pip in a Conda environment and so install the master branch with pip.

guidocioni · 2021-04-19T13:44:52Z

eh eh I wish it would be so easy. Unfortunately I can only give a list of dependencies to install through conda forge and access a Python environment running with Spark. No access to pip or the underlying unix system. Thanks for the help anyway! I hope someday this will make its way in the stable release

guidocioni · 2021-04-26T11:59:09Z

@grewn0uille I managed to install the latest 53.0b1 version (which uses pydyf) in our system and produce a PDF. When looking in the file info I can see it was generated according to the 1.7 standard but when checking in the online validator unfortunately I get these errors:

The value of the key Flags is 8 but must be either symbolic or non-symbolic.
The value of the key Flags is 10 but must be either symbolic or non-symbolic.
The value of the key Flags is 8 but must be either symbolic or non-symbolic.
The document does not conform to the requested standard.
The document contains fonts without embedded font programs or encoding information (CMAPs).
The document doesnot conform to the PDF 1.7 standard.

any idea where are those coming from?

liZe · 2021-04-26T13:08:56Z

> any idea where are those coming from?

They come from a bug that’s just been fixed by f804d59. Thanks a lot for the report!
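
The validator messages about Flags refer to the bit field in a PDF font descriptor. In the PDF specification, exactly one of the Symbolic (bit 3) and Nonsymbolic (bit 6) bits has to be set, and the reported values 8 and 10 set neither. The following sketch decodes such values; the helper names are hypothetical and this is not WeasyPrint's code.

```python
# Bit positions (1-based) of the font descriptor flags in the PDF
# specification.
FONT_FLAGS = {
    1: "FixedPitch", 2: "Serif", 3: "Symbolic", 4: "Script",
    6: "Nonsymbolic", 7: "Italic", 17: "AllCap", 18: "SmallCap",
    19: "ForceBold",
}


def decode_font_flags(flags):
    """List the names of all flag bits set in the integer value."""
    return [name for bit, name in FONT_FLAGS.items()
            if flags & (1 << (bit - 1))]


def has_valid_symbolic_bits(flags):
    """True when exactly one of Symbolic and Nonsymbolic is set."""
    symbolic = bool(flags & (1 << 2))
    nonsymbolic = bool(flags & (1 << 5))
    return symbolic != nonsymbolic


if __name__ == "__main__":
    for value in (8, 10, 32):
        print(value, decode_font_flags(value), has_valid_symbolic_bits(value))
```

With these definitions, 8 decodes to Script only and 10 to Serif plus Script, which matches the complaint that the value is neither symbolic nor non-symbolic.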

guidocioni · 2021-04-26T13:29:14Z

Nice, when implementing f804d59 the PDF is now 1.7 conform but still not PDF/A compliant :(

liZe · 2021-04-26T13:31:04Z

> Nice, when implementing f804d59 the PDF is now 1.7 conform but still not PDF/A compliant :(

And there’s a long way ahead… But at least now we can generate the PDF we want.

grewn0uille · 2021-05-05T15:47:24Z

Hello!

(The survey is now closed. Thanks for all your answers! We’ll share the results soon 😉)

If you’re interested in PDF/A compliance, we created a short survey where you can give a boost to this feature and help us to improve WeasyPrint 😉

Vote for it!

guidocioni · 2021-05-11T13:25:20Z

So is there no way to force pydyf to produce a PDF/A compliant file? Like changing default fonts or other things...
Or is it just something that can't be controlled right now?

liZe · 2021-05-12T06:29:04Z

> So is there no way to force pydyf to produce a PDF/A compliant file? Like changing default fonts or other things...
> Or is it just something that can't be controlled right now?

It can’t be controlled right now, at least without code being added to WeasyPrint. pydyf is theoretically able to generate PDF/A-compliant PDFs, but the current code of WeasyPrint doesn’t follow the PDF/A rules. pydyf is only the first step.

guidocioni · 2021-05-12T15:47:08Z

Ok thanks. For the moment I'm using Ghostscript to convert what comes out of WeasyPrint directly to PDF/A: WeasyPrint writes to a temporary file that serves as the input, and the output goes to the filesystem. Of course it would be amazing to have such a feature built into the tool. Anyway, keep up the good work!

winklemint · 2024-01-31T12:01:10Z

Hi, can you share this code? How are you converting an existing PDF to PDF/A using Ghostscript? I am trying, but it is not working for me.

guidocioni · 2024-02-01T08:20:54Z

This is something I used in the past but I'm not sure it is still working now

import os
import subprocess


def convert_to_pdfa(sourceFile, targetFile):
    """Convert an existing PDF to PDF/A with Ghostscript's pdfwrite device."""
    ghostScriptExec = ['gs', '-dPDFA', '-dBATCH', '-dNOPAUSE',
                       '-sColorConversionStrategy=UseDeviceIndependentColor',
                       '-sDEVICE=pdfwrite', '-dPDFACompatibilityPolicy=2']
    # Because of a Ghostscript bug that does not allow parameters longer than
    # 255 characters, run gs from the target directory and pass only the
    # output file name. The source path is made absolute so that it still
    # resolves from there.
    targetDir = os.path.dirname(os.path.abspath(targetFile))
    try:
        # cwd= only affects the child process, so the caller's working
        # directory is untouched even when Ghostscript fails.
        subprocess.check_output(
            ghostScriptExec + ['-sOutputFile=' + os.path.basename(targetFile),
                               os.path.abspath(sourceFile)],
            cwd=targetDir)
    except subprocess.CalledProcessError as e:
        raise RuntimeError("command '{}' returned with error (code {}): {}".format(
            e.cmd, e.returncode, e.output))

winklemint · 2024-02-02T19:43:26Z

Hi, thanks for this solution. I tried different policies and made multiple changes to make the file PDF/A-3B compliant, and veraPDF validated it. I am now looking for a way to embed an XML file in it so that it follows the Factur-X standard. Any suggestion or help is highly appreciated. Thanks

FelixSchwarz · 2024-05-12T19:26:38Z

> I am trying to look for a way to attach an XML to it like embedd and XML in it to make this with Factur-X standard, Any suggestion or help is highly appreciated.

@winklemint WeasyPrint does not use GitHub discussions but maybe you can open an issue about Factur-X support. My idea is to gather snippets and advice on how to generate Factur-X PDFs using WeasyPrint.

liZe added the "feature" label (New feature that should be supported) May 15, 2018

liZe added this to the 43 milestone Aug 3, 2018

liZe removed this from the 43 milestone Aug 7, 2018

liZe mentioned this issue Nov 15, 2018

[RFC] using Weasyprint to generate pdf OCA/reporting-engine#254

Closed

LukasKlement mentioned this issue Apr 1, 2019

Support color profiles #844

Open

liZe added a commit that referenced this issue Apr 26, 2021

Fix font flags

f804d59

Related to #630.

liZe added this to the 56.0 milestone May 17, 2022

liZe pinned this issue May 17, 2022

grewn0uille added the "sponsored" label (Issues sponsored to be resolved faster) May 17, 2022

liZe added a commit that referenced this issue May 20, 2022

Add options for PDF/A generation

f6a2afb

Fix #630.

liZe closed this as completed in deda575 Jun 13, 2022

grewn0uille unpinned this issue Jul 7, 2022

s-u mentioned this issue Sep 15, 2022

pdf/a compatibility s-u/Cairo#38

Open
