Corruption of PDF #3446

crowbot · 2016-08-08T15:34:03Z

Example at:
https://groups.google.com/a/mysociety.org/d/msg/alaveteli/FDpr12IPCd4/Rb9GZI_ZAwAJ

lizconlan · 2016-09-02T11:43:18Z

Using a Vagrant box, managed to extract usable images from both example files using pdftohtml 0.24.5 and pdftk 2.01 on trusty (pretty slow though). Rebuilding the box on a jessie image to see what happens...

lizconlan · 2016-09-02T14:30:58Z

... jessie gets me pdftohtml 0.26.5 and pdftk 2.02 which also worked on the raw pdf files (which is a shame from the point of view of being able to say "use version x instead"). Something odder happening :(

lizconlan · 2016-09-06T09:19:59Z

ok, haven't managed to replicate the corruption (yet) but have been looking at ghostscript (on jessie, version 9.06) sit for hours doing nothing with both of the unredacted example files so this is starting to look like a ghostscript problem (my working theory at this point is that the corruption occurs when the redacted file is compressed - as yet untested)

lizconlan · 2016-09-15T14:30:32Z

So, this seems to be strongly related to #3447 - in short this looks like a ghostscript problem, to fix either install ghostscript 9.10 or disable it by setting USE_GHOSTSCRIPT_COMPRESSION in your general.yml file to use pdftk for compression instead.

garethrees · 2016-09-15T14:48:32Z

Lets check that this fixes the problem in the NZ env before we consider this closed. Great work on isolating it!

garethrees · 2016-11-07T13:53:40Z

Please reopen if this comes back up

nigeljonez · 2019-07-17T06:31:38Z

@garethrees I don't seem to have the ability to reopen this, but it is still affecting the NZ installation, and I think I may have a root cause.

Taking an affected PDF and running pdftk - output - uncompress < /path/to/in.pdf | pdftk - output - compress > /path/to/out.pdf results in an acceptable PDF without corruption.
(Command based on:

alaveteli/lib/alaveteli_text_masker.rb

Line 49 in c105a06

    
           AlaveteliExternalCommand.run("pdftk", "-", "output", "-", "uncompress", :stdin_string => text)

&

alaveteli/lib/alaveteli_text_masker.rb

Lines 64 to 66 in c105a06

    
             command = ["pdftk", "-", "output", "-", "compress"] 
        
           end 
        
           AlaveteliExternalCommand.run(*(command + [ :stdin_string => text ]))

)

Via replication in rails c for the steps in apply_pdf_masks, I observed that apply_binary_masks was changing byte-wise the uncompressed PDF, stepping through the latter, I have isolated at least (I haven't tried a patched version yet...) the following segment as problematic:

alaveteli/lib/alaveteli_text_masker.rb

Lines 109 to 111 in c105a06

    
           text.gsub!(MySociety::Validate.email_find_regexp) do |email| 
        
             email.gsub(/[^@.]/, 'x') 
        
           end

Looking at our affected PDF, I noticed that this gsub triggers as:

irb(main):073:0> text == uncompressed_text
=> true
irb(main):074:0> text.scan(MySociety::Validate.email_find_regexp)
=> [["[email protected]"]]
irb(main):075:0> text.gsub!(MySociety::Validate.email_find_regexp) do |email|
irb(main):076:1*   email.gsub(/[^@.]/, 'x')
irb(main):077:1> end; nil
=> nil
irb(main):078:0> text == uncompressed_text
=> false

Based on the gsub it therefore appears to be replacing (yet to double check this finding) that portion of the PDF with x's which causes page corruption for non-accessible/typically image PDFs.

In this case, we are lucky in the fact that MySociety::Validate.is_valid_email("[email protected]") returns 0 as we could do:

irb(main):093:0> text = uncompressed_text.dup; nil
=> nil
irb(main):094:0> text.gsub!(MySociety::Validate.email_find_regexp) do |email|
irb(main):095:1*   MySociety::Validate.is_valid_email(email) == 0 ? email : email.gsub(/[^@.]/, 'x')
irb(main):096:1> end; nil
=> nil
irb(main):097:0> text == uncompressed_text
=> true

Which preserves at least this part of the PDF, I'm going to do some more testing as Oliver has a bunch of corrupted PDFs that have built up, this isn't going to fix all corruption but if it can at least fix some, are you open to such a patch?

nigeljonez · 2019-07-17T08:46:09Z

Update, I feel a little foolish as I've just realised that MySociety::Validate.is_valid_email(email) returns 0 if valid, and nil if invalid.

I managed to get my hands on a second attachment that has had corruption issues, and again noticed that:

irb(main):005:0> uncompressed_text.scan(MySociety::Validate.email_find_regexp)
=> [["[email protected]"], ["[email protected]"], ["[email protected]"], ["[email protected]"], ["[email protected]"], ["[email protected]"], ["[email protected]"]]

(example = verisign, don't see a point in their CA addresses getting more spam than they already do).

Interestingly, this seems to be from a security certificate from a signed PDF...

Ironically with this PDF, no other changes are made, which makes me question the effectiveness of the rules as they stand - especially since it doesn't detect an email address on the second page of the PDF. For comparison, the gem 'pdf-reader' can with:

reader = PDF::Reader.new("/tmp/B.pdf")
reader.page(2).text.scan(/(\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b)/)
 => [["[email protected]"]]

hexapdf may be an interesting option to modify PDFs as an 'in-ruby' solution, see gettalong/hexapdf#31, there are limitations but for PDFs it may be better than binary mangling.

garethrees · 2019-07-17T09:10:47Z

Thanks for investigating this.

We'd definitely be open to a patch that improves this, but I'm not sure we have the headspace for swapping out pdftk at the moment. It might be something we want to consider though, as it got dropped (#5025) and replaced with pdftk-java, so IDK how well supported that will be over time.

with this PDF, no other changes are made, which makes me question the effectiveness of the rules as they stand

It seems like we could do something more intelligent with how we do the replacements for sure.

nigeljonez · 2019-07-18T00:48:22Z

I agree, I spent last night playing around with hexapdf and it's fair to say it's not a trivial thing to implement (mainly due to the way content streams are encoded and the fact most/all libraries don't support their modification).

I've uploaded some examples to https://files.jnet.net.nz/alaveteli/3446/ (B.pdf is missing as it wasn't that interesting for this particular case).

A.pdf is a real response from https://fyi.org.nz/request/10545-transgender-prisoners#incoming-37155 (orig/A.pdf is the attachment extracted directly from the raw email, post/A.pdf is a copy produced via the Download link from my 0.34 dev instance (basically identical to what our live site produces)). Corruption can be spotted on Page 7.

C,D,E all show interesting effects of the current binary replacement mechanism, particularly D.pdf appears untouched, while C&E which have Annotations (links) have the URL link replaced but the text component is untouched, a product it seems of the way the PDFs are encoded.

One suggestion from Oliver is (paraphrased) "I wish there was a way for the original requestor to get an untouched copy", another suggestion may be (and potentially at the same time as working on prominence for attachments (#1005)) is for a checkbox admin side that instructs Alaveteli to by-pass text masking on a particular attachment, i.e. in the case of the request where A.pdf originated from, as we know the process is going to corrupt the file, we could tick the box, and any user could get the original copy.

(My mindset is really now towards minimising the impact than solving it)

garethrees · 2019-07-22T08:35:29Z

I wish there was a way for the original requestor to get an untouched copy

Looks like #3033 and #10 are issues covering this.

nigeljonez · 2019-07-22T08:41:43Z

Yeah, we are likely going to go with something like https://gist.github.com/nigeljonez/4204a8f20e0ebaaa103197c8a0a69322 within the theme for attachment related issues as a temporary workaround which has tested okay so far, but I'll keep an eye on those two issues as well as it does relate to another discussion we've had recently.

HelenWDTK · 2024-11-19T10:13:41Z

This issue has been automatically closed due to a lack of discussion or resolution for over 12 months.
Should we decide to revisit this issue in the future, it can be reopened.

crowbot added bug Breaks expected functionality x:new-zealand 1 - new labels Aug 8, 2016

crowbot changed the title ~~Corruption of PDF~~ Corruption of PDF Aug 15, 2016

crowbot added 3 - current sprint and removed 1 - new labels Aug 15, 2016

crowbot self-assigned this Aug 15, 2016

garethrees assigned lizconlan and unassigned crowbot Aug 30, 2016

lizconlan added 4 - now and removed 3 - current sprint labels Sep 8, 2016

lizconlan closed this as completed Sep 15, 2016

lizconlan removed the 3 - current sprint label Sep 15, 2016

garethrees reopened this Sep 15, 2016

garethrees added the 3 - current sprint label Sep 15, 2016

garethrees added awaiting-review and removed 3 - current sprint labels Sep 26, 2016

garethrees added awaiting-deploy and removed awaiting-review labels Oct 24, 2016

garethrees closed this as completed Nov 7, 2016

garethrees removed the awaiting-deploy label Nov 7, 2016

garethrees reopened this Jul 17, 2019

garethrees added the f:request-analysis label Jul 17, 2019

garethrees unassigned lizconlan Jul 12, 2023

HelenWDTK added the stale Issues with no activity for 12 months label Nov 19, 2024

HelenWDTK closed this as not planned Won't fix, can't repro, duplicate, stale Nov 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Corruption of PDF #3446

Corruption of PDF #3446

crowbot commented Aug 8, 2016

lizconlan commented Sep 2, 2016

lizconlan commented Sep 2, 2016

lizconlan commented Sep 6, 2016 •

edited

Loading

lizconlan commented Sep 15, 2016 •

edited

Loading

garethrees commented Sep 15, 2016

garethrees commented Nov 7, 2016

nigeljonez commented Jul 17, 2019

nigeljonez commented Jul 17, 2019

garethrees commented Jul 17, 2019

nigeljonez commented Jul 18, 2019

garethrees commented Jul 22, 2019

nigeljonez commented Jul 22, 2019

HelenWDTK commented Nov 19, 2024

Corruption of PDF #3446

Corruption of PDF #3446

Comments

crowbot commented Aug 8, 2016

lizconlan commented Sep 2, 2016

lizconlan commented Sep 2, 2016

lizconlan commented Sep 6, 2016 • edited Loading

lizconlan commented Sep 15, 2016 • edited Loading

garethrees commented Sep 15, 2016

garethrees commented Nov 7, 2016

nigeljonez commented Jul 17, 2019

nigeljonez commented Jul 17, 2019

garethrees commented Jul 17, 2019

nigeljonez commented Jul 18, 2019

garethrees commented Jul 22, 2019

nigeljonez commented Jul 22, 2019

HelenWDTK commented Nov 19, 2024

lizconlan commented Sep 6, 2016 •

edited

Loading

lizconlan commented Sep 15, 2016 •

edited

Loading