Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Corruption of PDF #3446

Closed
crowbot opened this issue Aug 8, 2016 · 13 comments
Closed

Corruption of PDF #3446

crowbot opened this issue Aug 8, 2016 · 13 comments
Labels
bug Breaks expected functionality f:request-analysis stale Issues with no activity for 12 months x:new-zealand

Comments

@crowbot
Copy link
Member

crowbot commented Aug 8, 2016

Example at:
https://groups.google.com/a/mysociety.org/d/msg/alaveteli/FDpr12IPCd4/Rb9GZI_ZAwAJ

@crowbot crowbot added bug Breaks expected functionality x:new-zealand 1 - new labels Aug 8, 2016
@crowbot crowbot changed the title Corruption of PDF Corruption of PDF Aug 15, 2016
@crowbot crowbot self-assigned this Aug 15, 2016
@garethrees garethrees assigned lizconlan and unassigned crowbot Aug 30, 2016
@lizconlan
Copy link

Using a Vagrant box, managed to extract usable images from both example files using pdftohtml 0.24.5 and pdftk 2.01 on trusty (pretty slow though). Rebuilding the box on a jessie image to see what happens...

@lizconlan
Copy link

... jessie gets me pdftohtml 0.26.5 and pdftk 2.02 which also worked on the raw pdf files (which is a shame from the point of view of being able to say "use version x instead"). Something odder happening :(

@lizconlan
Copy link

lizconlan commented Sep 6, 2016

ok, haven't managed to replicate the corruption (yet) but have been looking at ghostscript (on jessie, version 9.06) sit for hours doing nothing with both of the unredacted example files so this is starting to look like a ghostscript problem (my working theory at this point is that the corruption occurs when the redacted file is compressed - as yet untested)

@lizconlan
Copy link

lizconlan commented Sep 15, 2016

So, this seems to be strongly related to #3447 - in short this looks like a ghostscript problem, to fix either install ghostscript 9.10 or disable it by setting USE_GHOSTSCRIPT_COMPRESSION in your general.yml file to use pdftk for compression instead.

@garethrees
Copy link
Member

Lets check that this fixes the problem in the NZ env before we consider this closed. Great work on isolating it!

@garethrees
Copy link
Member

Please reopen if this comes back up

@nigeljonez
Copy link
Contributor

@garethrees I don't seem to have the ability to reopen this, but it is still affecting the NZ installation, and I think I may have a root cause.

Taking an affected PDF and running pdftk - output - uncompress < /path/to/in.pdf | pdftk - output - compress > /path/to/out.pdf results in an acceptable PDF without corruption.
(Command based on:

AlaveteliExternalCommand.run("pdftk", "-", "output", "-", "uncompress", :stdin_string => text)

&
command = ["pdftk", "-", "output", "-", "compress"]
end
AlaveteliExternalCommand.run(*(command + [ :stdin_string => text ]))
)

Via replication in rails c for the steps in apply_pdf_masks, I observed that apply_binary_masks was changing byte-wise the uncompressed PDF, stepping through the latter, I have isolated at least (I haven't tried a patched version yet...) the following segment as problematic:

text.gsub!(MySociety::Validate.email_find_regexp) do |email|
email.gsub(/[^@.]/, 'x')
end

Looking at our affected PDF, I noticed that this gsub triggers as:

irb(main):073:0> text == uncompressed_text
=> true
irb(main):074:0> text.scan(MySociety::Validate.email_find_regexp)
=> [["[email protected]"]]
irb(main):075:0> text.gsub!(MySociety::Validate.email_find_regexp) do |email|
irb(main):076:1*   email.gsub(/[^@.]/, 'x')
irb(main):077:1> end; nil
=> nil
irb(main):078:0> text == uncompressed_text
=> false

Based on the gsub it therefore appears to be replacing (yet to double check this finding) that portion of the PDF with x's which causes page corruption for non-accessible/typically image PDFs.

In this case, we are lucky in the fact that MySociety::Validate.is_valid_email("[email protected]") returns 0 as we could do:

irb(main):093:0> text = uncompressed_text.dup; nil
=> nil
irb(main):094:0> text.gsub!(MySociety::Validate.email_find_regexp) do |email|
irb(main):095:1*   MySociety::Validate.is_valid_email(email) == 0 ? email : email.gsub(/[^@.]/, 'x')
irb(main):096:1> end; nil
=> nil
irb(main):097:0> text == uncompressed_text
=> true

Which preserves at least this part of the PDF, I'm going to do some more testing as Oliver has a bunch of corrupted PDFs that have built up, this isn't going to fix all corruption but if it can at least fix some, are you open to such a patch?

@nigeljonez
Copy link
Contributor

Update, I feel a little foolish as I've just realised that MySociety::Validate.is_valid_email(email) returns 0 if valid, and nil if invalid.

I managed to get my hands on a second attachment that has had corruption issues, and again noticed that:

irb(main):005:0> uncompressed_text.scan(MySociety::Validate.email_find_regexp)
=> [["[email protected]"], ["[email protected]"], ["[email protected]"], ["[email protected]"], ["[email protected]"], ["[email protected]"], ["[email protected]"]]

(example = verisign, don't see a point in their CA addresses getting more spam than they already do).

Interestingly, this seems to be from a security certificate from a signed PDF...

Ironically with this PDF, no other changes are made, which makes me question the effectiveness of the rules as they stand - especially since it doesn't detect an email address on the second page of the PDF. For comparison, the gem 'pdf-reader' can with:

reader = PDF::Reader.new("/tmp/B.pdf")
reader.page(2).text.scan(/(\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b)/)
 => [["[email protected]"]]

hexapdf may be an interesting option to modify PDFs as an 'in-ruby' solution, see gettalong/hexapdf#31, there are limitations but for PDFs it may be better than binary mangling.

@garethrees
Copy link
Member

Thanks for investigating this.

We'd definitely be open to a patch that improves this, but I'm not sure we have the headspace for swapping out pdftk at the moment. It might be something we want to consider though, as it got dropped (#5025) and replaced with pdftk-java, so IDK how well supported that will be over time.

with this PDF, no other changes are made, which makes me question the effectiveness of the rules as they stand

It seems like we could do something more intelligent with how we do the replacements for sure.

@nigeljonez
Copy link
Contributor

I agree, I spent last night playing around with hexapdf and it's fair to say it's not a trivial thing to implement (mainly due to the way content streams are encoded and the fact most/all libraries don't support their modification).

I've uploaded some examples to https://files.jnet.net.nz/alaveteli/3446/ (B.pdf is missing as it wasn't that interesting for this particular case).

A.pdf is a real response from https://fyi.org.nz/request/10545-transgender-prisoners#incoming-37155 (orig/A.pdf is the attachment extracted directly from the raw email, post/A.pdf is a copy produced via the Download link from my 0.34 dev instance (basically identical to what our live site produces)). Corruption can be spotted on Page 7.

C,D,E all show interesting effects of the current binary replacement mechanism, particularly D.pdf appears untouched, while C&E which have Annotations (links) have the URL link replaced but the text component is untouched, a product it seems of the way the PDFs are encoded.

One suggestion from Oliver is (paraphrased) "I wish there was a way for the original requestor to get an untouched copy", another suggestion may be (and potentially at the same time as working on prominence for attachments (#1005)) is for a checkbox admin side that instructs Alaveteli to by-pass text masking on a particular attachment, i.e. in the case of the request where A.pdf originated from, as we know the process is going to corrupt the file, we could tick the box, and any user could get the original copy.

(My mindset is really now towards minimising the impact than solving it)

@garethrees
Copy link
Member

I wish there was a way for the original requestor to get an untouched copy

Looks like #3033 and #10 are issues covering this.

@nigeljonez
Copy link
Contributor

Yeah, we are likely going to go with something like https://gist.github.com/nigeljonez/4204a8f20e0ebaaa103197c8a0a69322 within the theme for attachment related issues as a temporary workaround which has tested okay so far, but I'll keep an eye on those two issues as well as it does relate to another discussion we've had recently.

@HelenWDTK HelenWDTK added the stale Issues with no activity for 12 months label Nov 19, 2024
@HelenWDTK
Copy link
Contributor

This issue has been automatically closed due to a lack of discussion or resolution for over 12 months.
Should we decide to revisit this issue in the future, it can be reopened.

@HelenWDTK HelenWDTK closed this as not planned Won't fix, can't repro, duplicate, stale Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Breaks expected functionality f:request-analysis stale Issues with no activity for 12 months x:new-zealand
Projects
None yet
Development

No branches or pull requests

5 participants