Corruption of PDF #3446
Comments
Using a Vagrant box, managed to extract usable images from both example files using pdftohtml 0.24.5 and pdftk 2.01 on trusty (pretty slow though). Rebuilding the box on a jessie image to see what happens... |
... jessie gets me pdftohtml 0.26.5 and pdftk 2.02 which also worked on the raw pdf files (which is a shame from the point of view of being able to say "use version x instead"). Something odder happening :( |
OK, I haven't managed to replicate the corruption (yet), but I have been watching ghostscript (on jessie, version 9.06) sit for hours doing nothing with both of the unredacted example files, so this is starting to look like a ghostscript problem. My working theory at this point, as yet untested, is that the corruption occurs when the redacted file is compressed. |
So, this seems to be strongly related to #3447. In short, this looks like a ghostscript problem; to fix it, either install ghostscript 9.10 or disable it by setting |
Let's check that this fixes the problem in the NZ env before we consider this closed. Great work on isolating it! |
Please reopen if this comes back up |
@garethrees I don't seem to have the ability to reopen this, but it is still affecting the NZ installation, and I think I may have a root cause. Taking an affected PDF and running alaveteli/lib/alaveteli_text_masker.rb Line 49 in c105a06
& alaveteli/lib/alaveteli_text_masker.rb Lines 64 to 66 in c105a06
Via replication in alaveteli/lib/alaveteli_text_masker.rb Lines 109 to 111 in c105a06
Looking at our affected PDF, I noticed that this gsub triggers as:
Based on the gsub it therefore appears to be replacing (yet to double check this finding) that portion of the PDF with x's which causes page corruption for non-accessible/typically image PDFs. In this case, we are lucky in the fact that
Which preserves at least this part of the PDF. I'm going to do some more testing, as Oliver has a bunch of corrupted PDFs that have built up. This isn't going to fix all corruption, but if it can at least fix some, are you open to such a patch? |
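To make the failure mode described above concrete, here is a minimal sketch of a length-preserving regex replacement run over raw bytes. It assumes the masking behaves roughly as the comment describes (a gsub that swaps matches for x's); `EMAIL_RE`, `mask_emails` and the sample bytes are hypothetical and are not taken from alaveteli_text_masker.rb.

```ruby
# Hypothetical email pattern; the real rules in Alaveteli may differ.
EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/

# Replace every match with the same number of "x" bytes, so that the
# overall length (and therefore any byte offsets) stays unchanged.
def mask_emails(bytes)
  bytes.b.gsub(EMAIL_RE) { |match| 'x' * match.bytesize }
end

# Made-up binary data (think: an embedded certificate or image stream)
# that happens to contain an email-like run of bytes.
binary_blob = [0xFF, 0xD8, 0x00].pack('C*') +
              'ca-admin@example.com'.b +
              [0x89, 0x01].pack('C*')

masked = mask_emails(binary_blob)

puts "same length:   #{masked.bytesize == binary_blob.bytesize}"
puts "bytes changed: #{masked != binary_blob}"
```

The length is preserved, but the bytes inside the binary data are no longer what the producer wrote, so anything that depends on those exact bytes (a signature, a checksum, an encoded stream) may no longer decode as intended.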
Update, I feel a little foolish as I've just realised that I managed to get my hands on a second attachment that has had corruption issues, and again noticed that:
(example = verisign, don't see a point in their CA addresses getting more spam than they already do). Interestingly, this seems to be from a security certificate from a signed PDF... Ironically with this PDF, no other changes are made, which makes me question the effectiveness of the rules as they stand - especially since it doesn't detect an email address on the second page of the PDF. For comparison, the gem 'pdf-reader' can with:
hexapdf may be an interesting option for modifying PDFs as an 'in-ruby' solution; see gettalong/hexapdf#31. There are limitations, but for PDFs it may be better than binary mangling. |
Thanks for investigating this. We'd definitely be open to a patch that improves this, but I'm not sure we have the headspace for swapping out pdftk at the moment. It might be something we want to consider though, as it got dropped (#5025) and replaced with
It seems like we could do something more intelligent with how we do the replacements for sure. |
I agree. I spent last night playing around with hexapdf and it's fair to say it's not a trivial thing to implement (mainly due to the way content streams are encoded and the fact that most/all libraries don't support their modification). I've uploaded some examples to https://files.jnet.net.nz/alaveteli/3446/ (B.pdf is missing as it wasn't that interesting for this particular case). A.pdf is a real response from https://fyi.org.nz/request/10545-transgender-prisoners#incoming-37155 (orig/A.pdf is the attachment extracted directly from the raw email; post/A.pdf is a copy produced via the Download link from my 0.34 dev instance, basically identical to what our live site produces). Corruption can be spotted on page 7. C, D and E all show interesting effects of the current binary replacement mechanism: D.pdf appears untouched, while C and E, which have annotations (links), have the URL link replaced but the text component left untouched, which seems to be a product of the way the PDFs are encoded. One suggestion from Oliver is (paraphrased) "I wish there was a way for the original requestor to get an untouched copy". Another suggestion (potentially at the same time as working on prominence for attachments (#1005)) is an admin-side checkbox that instructs Alaveteli to bypass text masking on a particular attachment. For example, for the request that A.pdf originated from, since we know the process is going to corrupt the file, we could tick the box and any user could get the original copy. (My mindset is now really towards minimising the impact rather than solving it.) |
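As a rough illustration of the checkbox idea above, the following sketch skips masking when a per-attachment flag is set. Everything here is hypothetical: `Attachment`, `skip_masking` and `served_body` are illustrative names, not part of Alaveteli, and the masker is a stand-in lambda.

```ruby
# Hypothetical attachment record with an admin-controlled flag.
Attachment = Struct.new(:body, :skip_masking, keyword_init: true)

# Return the bytes to serve: the original body when the admin has
# ticked the (hypothetical) bypass box, otherwise the masked body.
def served_body(attachment, masker)
  return attachment.body if attachment.skip_masking

  masker.call(attachment.body)
end

# Stand-in masker for the sketch; the real masking is more involved.
masker = ->(body) { body.gsub('secret', 'xxxxxx') }

masked_copy   = served_body(Attachment.new(body: 'a secret', skip_masking: false), masker)
original_copy = served_body(Attachment.new(body: 'a secret', skip_masking: true), masker)

puts masked_copy
puts original_copy
```

The trade-off is that a bypassed attachment is served with no redaction at all, so the flag would only suit files an admin has already checked by hand.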
Yeah, we are likely going to go with something like https://gist.github.com/nigeljonez/4204a8f20e0ebaaa103197c8a0a69322 within the theme for attachment-related issues as a temporary workaround, which has tested okay so far. I'll keep an eye on those two issues as well, as this does relate to another discussion we've had recently. |
This issue has been automatically closed due to a lack of discussion or resolution for over 12 months. |
Example at:
https://groups.google.com/a/mysociety.org/d/msg/alaveteli/FDpr12IPCd4/Rb9GZI_ZAwAJ