Reproducible PDF generation #1553

Closed
varac opened this issue Jan 26, 2022 · 17 comments
Labels
feature New feature that should be supported
Milestone

Comments

@varac

varac commented Jan 26, 2022

Hi, first of all thanks for this awesome project!

I noticed that PDF generation creates a different PDF from the same HTML each time.

❯ weasyprint http://txti.es/ /tmp/txti-website.pdf
❯ md5sum /tmp/txti-website.pdf
04e141b9765cb2d4143f968d989ddda1  /tmp/txti-website.pdf

❯ weasyprint http://txti.es/ /tmp/txti-website.pdf
❯ md5sum /tmp/txti-website.pdf                    
e7d3657ccd871115d329b1ebb46facb8  /tmp/txti-website.pdf

Why is that? I'm using this simple URL as an example. I first noticed the problem when generating PDFs from code that I check into git: each time I run weasyprint, I end up with a binary git diff for the PDFs. This is unfortunate, because I only want a diff when I actually change something in the code itself.

Is there any way or option to fix this?

@varac
Author

varac commented Jan 26, 2022

The only reference to reproducibility I found is this

@liZe
Member

liZe commented Jan 26, 2022

> Hi, first of all thanks for this awesome project!

❤️

> I noticed that PDF generation creates a different PDF from the same HTML each time.
> Why is that?

The PDF generation is not reproducible because of the way we handle fonts. The main problem is in fonttools (and may need a fix in fonttools), but there’s also another "problem" in WeasyPrint (because of 8acedd3).

@liZe liZe added the feature New feature that should be supported label Jan 26, 2022
@liZe liZe added this to the 55.0 milestone Jan 26, 2022
@varac
Author

varac commented Jan 27, 2022

I noticed a regression in this regard from v51 to v54.
Here's the diff between the two files produced by running weasyprint http://txti.es/ /tmp/txti-wp51.pdf; weasyprint http://txti.es/ /tmp/txti2-wp51.pdf with wp v51:

❯ diffoscope /tmp/txti-wp51.pdf /tmp/txti2-wp51.pdf
--- /tmp/txti-wp51.pdf
+++ /tmp/txti2-wp51.pdf
│   --- /tmp/txti-wp51.pdf
├── +++ /tmp/txti2-wp51.pdf
│┄ Document info
│ @@ -1,6 +1,6 @@
│  Author: 'Barry T. Smith'
│ -CreationDate: "D:20220127083332+01'00"
│ +CreationDate: "D:20220127083337+01'00"
│  Keywords: ''
│  Producer: 'cairo 1.16.0 (https://cairographics.org)'
│  Subject: 'Txti is a free service that lets you create the fastest, simplest, most shareable web pages on the internet using any phone, tablet, or computer you have.'
│  Title: 'txti - Fast web pages for everybody'

And here is the diff from weasyprint http://txti.es/ /tmp/txti-wp54.pdf; weasyprint http://txti.es/ /tmp/txti2-wp54.pdf with wp v54. Since it's too long I uploaded it as a file:
diffoscope-wp54.log

@varac
Author

varac commented Jan 27, 2022

I also noticed that wp54 no longer adds the CreationDate metadata tag by default, which is nice for reproducibility (although, on the other hand, it introduced this huge diff).

@varac
Author

varac commented Jan 27, 2022

As a workaround, I'd pin wp to version 51 until this gets fixed, but is there a way to prevent the CreationDate metadata tag from being added?

@liZe
Member

liZe commented Jan 27, 2022

We actually changed the whole PDF generation library, that’s why you get an impressive diff. Things are thus not as easy as reverting the commits that introduced the regression.

> As a workaround, I'd pin wp to version 51 until this gets fixed, but is there a way to prevent the CreationDate metadata tag from being added?

You can include a meta tag in your HTML document to change this value: <meta name="dcterms.created" content="XXX">.
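
As a rough illustration of this workaround, the sketch below builds an HTML document with a pinned dcterms.created value. The helper name html_with_fixed_date and the example timestamp are hypothetical, and the sketch assumes that an ISO 8601 date string is an acceptable content value; the comment above only shows XXX.

```python
from html import escape


def html_with_fixed_date(title: str, body: str, created: str) -> str:
    """Return a minimal HTML document whose creation date is pinned via a meta tag."""
    return (
        "<!doctype html>"
        f"<title>{escape(title)}</title>"
        # The meta tag name comes from the thread; the value format is an assumption.
        f'<meta name="dcterms.created" content="{escape(created)}">'
        f"<p>{escape(body)}</p>"
    )


# Hypothetical fixed timestamp: any constant value keeps this metadata stable.
document = html_with_fixed_date("foo", "bar", "2022-01-27T00:00:00+00:00")
print(document)
```

The resulting string could then be passed to weasyprint in place of the original HTML, so the creation date no longer follows the wall clock.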

@varac
Author

varac commented Jan 27, 2022

Awesome, thanks that worked!

@skruppy

skruppy commented Feb 18, 2022

Wouldn't it be sufficient to base the "in PDF font hash" on the font file content, e.g. by replacing sha.update(str(font_hash).encode()) with sha.update(file_content)?

That way, the font_hash (which causes the reproducibility issues, because it's based on the cdata hash) would only be used to prevent duplicates.

At least in my small "Hello World" example, it works with SOURCE_DATE_EPOCH.
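
A minimal sketch of the idea, not WeasyPrint's actual code: only the sha.update(file_content) call comes from the comment above. The function name font_content_key and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def font_content_key(file_content: bytes) -> str:
    """Derive a key from the font file's bytes, so it is the same in every process."""
    sha = hashlib.sha256()  # assumption: the thread does not say which algorithm is used
    sha.update(file_content)
    return sha.hexdigest()


# Stand-in bytes; a real caller would pass the contents of a font file.
key_a = font_content_key(b"fake font data")
key_b = font_content_key(b"fake font data")
key_c = font_content_key(b"other font data")
print(key_a == key_b, key_a == key_c)
```

Because the key depends only on the bytes, two runs that embed the same font would produce the same identifier, whereas a hash tied to an in-memory object can change from one process to the next.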

@liZe
Member

liZe commented Feb 18, 2022

> Wouldn't it be sufficient to base the "in PDF font hash" on the font file content, e.g. by replacing sha.update(str(font_hash).encode()) with sha.update(file_content)?

As far as I can remember, it works but it’s noticeably slower when a lot of fonts are included. It should be quite easy to find a solution to fix that though.

But that’s not the only problem: fonttools’ font subsetter is also not deterministic. We don’t know why yet.

@skruppy

skruppy commented Feb 18, 2022

> But that’s not the only problem: fonttools’ font subsetter is also not deterministic. We don’t know why yet.

Do you have an example or a link to an issue describing the problem in more detail? Maybe I can look into it.

For my minimal example, it was sufficient to set the environment variable SOURCE_DATE_EPOCH=0 to get reproducible results (without it, the embedded font changed between runs). Looking at the changelog, the fonttools project seems really keen on producing deterministic output (their unit tests require this property to work). Maybe non-deterministic output from fonttools is a problem of the past?

@liZe
Member

liZe commented Feb 19, 2022

> For my minimal example, it was sufficient to set the environment variable SOURCE_DATE_EPOCH=0 to get reproducible results (without it, the embedded font changed between runs).

You’re right, it’s probably the source of the variation, it looks like it’s reproducible with SOURCE_DATE_EPOCH=0.

We now have two goals:

  1. Find a way to get a reproducible hash for font faces. We’ve tried various things in the past, and surprisingly there are many details we must take care of.

  2. Find a way to generate fonts that don’t depend on the current date.

@liZe
Member

liZe commented Mar 6, 2022

> 2. Find a way to generate fonts that don’t depend on the current date.

I didn’t know that SOURCE_DATE_EPOCH was "specified". We can assume that this point is solved, and that users who want reproducible generation will be OK to set this environment variable.
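
For readers unfamiliar with the convention, here is a sketch of how a tool can honour SOURCE_DATE_EPOCH: it uses the variable's value (seconds since the Unix epoch) instead of the current time whenever it is set. The function name build_timestamp is hypothetical, and this is not the actual fonttools or WeasyPrint code.

```python
import os
import time
from datetime import datetime, timezone
from typing import Mapping


def build_timestamp(environ: Mapping[str, str] = os.environ) -> datetime:
    """Return the timestamp to embed: SOURCE_DATE_EPOCH if set, else the current time."""
    raw = environ.get("SOURCE_DATE_EPOCH")
    seconds = int(raw) if raw is not None else int(time.time())
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# With the variable pinned to 0, the timestamp is always the Unix epoch.
print(build_timestamp({"SOURCE_DATE_EPOCH": "0"}).isoformat())
```

Passing the environment as a parameter keeps the function easy to test; a real tool would typically just read os.environ.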

@liZe liZe closed this as completed in 3b0ae92 Mar 6, 2022
@liZe
Member

liZe commented Mar 6, 2022

We should now have reproducible PDF generation when SOURCE_DATE_EPOCH is set. Tests are welcome!

@varac
Author

varac commented Mar 6, 2022

@liZe Awesome! Happy to test a new release!

@dkg

dkg commented Jul 20, 2022

Thank you @liZe for taking this reproducibility concern seriously and fixing it! And thanks @varac for noticing it in the first place and reporting it here.

@castedo

castedo commented Oct 23, 2022

SOURCE_DATE_EPOCH does get weasyprint closer to reproducible PDF generation. I see it having an effect in various debug cases. Thank you @liZe for getting it closer!

However, it does not seem weasyprint is quite there yet. Although the test_reproducible unit test passes, the binary output is only constant within a single process. If we insert the following code into the test_reproducible case, we see that the binary data changes from process to process:

    import hashlib
    assert hashlib.md5(stdout1).hexdigest() == "nothing constant across runs"

I believe when this feature is truly implemented, the following should output the same digest for each execution:

echo '<!doctype html><title>foo</title><p>bar</p>' | SOURCE_DATE_EPOCH=0 weasyprint - - | md5sum

but right now, each execution will output a different digest.

This is with weasyprint 57.0.
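
The cross-process check described above can also be scripted. The following is a sketch with hypothetical helper names (output_digest, is_reproducible): it runs a command in separate processes and compares digests of the output. It uses SHA-256 rather than md5sum, which does not matter for this purpose, and the self-check at the bottom uses child Python processes so that it runs without weasyprint installed.

```python
import hashlib
import subprocess
import sys
from typing import Mapping, Optional, Sequence


def output_digest(
    cmd: Sequence[str],
    stdin: bytes = b"",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Run cmd in a fresh process and return the SHA-256 hex digest of its stdout."""
    result = subprocess.run(
        cmd, input=stdin, stdout=subprocess.PIPE, env=env, check=True
    )
    return hashlib.sha256(result.stdout).hexdigest()


def is_reproducible(
    cmd: Sequence[str],
    stdin: bytes = b"",
    env: Optional[Mapping[str, str]] = None,
    runs: int = 2,
) -> bool:
    """Return True if every run of cmd produced byte-identical output."""
    digests = {output_digest(cmd, stdin, env) for _ in range(runs)}
    return len(digests) == 1


# Self-check: one child prints constant bytes, the other prints a random UUID.
stable = is_reproducible([sys.executable, "-c", "print('same bytes')"])
unstable = is_reproducible([sys.executable, "-c", "import uuid; print(uuid.uuid4())"])
print(stable, unstable)
```

To mirror the shell command above, one would call is_reproducible(["weasyprint", "-", "-"], stdin=html_bytes, env={**os.environ, "SOURCE_DATE_EPOCH": "0"}) after importing os, where html_bytes holds the HTML document.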

@castedo

castedo commented Oct 23, 2022

OK, all is good now. I see that 2b05137 has fixed this issue. I have pulled the most recent tip of the tree, and the above test now works. How crazy that I happened to hit this one day after you fixed it on the main branch, and I had no idea!

Thanks for fixing!
