Evaluate Dangerzone's Potential as a Redaction Tool (and add redaction capabilities) #763

Open
deeplow opened this issue Apr 2, 2024 · 3 comments

@deeplow
Contributor

deeplow commented Apr 2, 2024

Dangerzone's goal is protecting the user against malware. However, through the way it works, it also removes metadata, so it can also help with publication security.

The problem

Typical PDF manipulation tools have poorly implemented redaction methods that can be reversed. Because Dangerzone already rasterizes documents, it has nothing to lose. When a black box is applied and the page is then rasterized, the covered information is no longer present in the final output.

This is best put in the paper Story Beyond the Eye: Glyph Positions Break PDF Text Redaction (emphasis added):

Rasterization appears to be an effective defense against deredaction. In many cases this defense is infeasible because it removes searchable text data from the document, however, performing OCR on the document post-redaction can act as a stop-gap for this issue. Rasterization algorithms may also modify or ignore certain glyph shifts, requiring the analyst to perform more reverse engineering to identify the specific rasterization tool used.

We're working on turning Dangerzone into a file viewer, and that could be the perfect chance to add redaction tools.

User Story

As a journalist, I'd like to use Dangerzone to help redact documents, ensuring that redactions cannot be reversed.

How could this work?

User journey:

  1. In view mode, the user draws black boxes over the areas to be redacted
  2. After all redactions are done, the user saves the final document

Technical explanation: the host receives all the rasterized images. As the user adds a black box to a page, the host draws that box onto the corresponding image with an image manipulation module (like Pillow), so the boxes become part of the final image. If we want extra rasterization assurances, we can convert the final PDF through Dangerzone one more time to ensure proper rasterization.
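A minimal sketch of the drawing step, assuming each page arrives on the host as a Pillow image and each redaction is an `(x0, y0, x1, y1)` pixel box. The `apply_redactions` helper and the box format are hypothetical, not part of Dangerzone:

```python
from PIL import Image, ImageDraw


def apply_redactions(page, boxes):
    """Return a copy of `page` with opaque black rectangles over `boxes`.

    `page` is a PIL image of one rasterized page (assumption).
    `boxes` is a list of (x0, y0, x1, y1) pixel coordinates (assumption).
    """
    # Work on an RGB copy so the original image is left untouched and
    # there is no alpha channel that could make a box semi-transparent.
    redacted = page.convert("RGB")
    draw = ImageDraw.Draw(redacted)
    for box in boxes:
        # Overwrite the pixels themselves; nothing is layered on top.
        draw.rectangle(box, fill=(0, 0, 0))
    return redacted
```

Because the pixels under each box are overwritten rather than covered by a separate layer, the returned image holds no trace of the original content inside the boxes.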

Implementation Risks and Unmitigated Risks

We should keep in mind that redaction alone may not be enough to eliminate all unredaction risks. The best advice is never to publish source documents and, if needed, to retype them. I can think of several other ways that redaction could still be bypassed:

  • invisible watermarks: if the purpose is to identify the leaker, then printer dots, space-width variations, etc. could all be used. No redaction can defeat this form of identification. Only document retyping can potentially help there.
  • character width can be used to reverse redactions (related paper)
  • compression artifacts can leave traces of what was hidden. In pre-compressed artifacts like images we cannot help much, as the whole element has to be redacted. However, dangerzone also compresses documents. We could make sure to only do this in the final rasterization (i.e. the one with the redaction boxes).
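To illustrate the last point, here is a rough sketch of "redact first, compress once", assuming the intermediate page is kept as a lossless raster until the boxes are drawn. The `redact_then_compress` helper, the box format, and the JPEG quality value are all assumptions made for illustration:

```python
import io

from PIL import Image, ImageDraw


def redact_then_compress(lossless_page, boxes, quality=75):
    """Paint the redaction boxes, then apply lossy compression exactly once.

    `lossless_page` is assumed to be an uncompressed (or losslessly
    stored) PIL image, so no earlier lossy pass has seen the hidden
    content.
    """
    page = lossless_page.convert("RGB")
    draw = ImageDraw.Draw(page)
    for box in boxes:
        draw.rectangle(box, fill=(0, 0, 0))
    # The only lossy step happens here, after the pixels are overwritten.
    buf = io.BytesIO()
    page.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```

With this ordering, the compressor only ever sees black pixels inside the boxes. That does not help with elements that were already compressed in the source document, as noted above.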
@deeplow
Contributor Author

deeplow commented Apr 2, 2024

If the previewer ends up using PDFs rather than images, we can apparently use fitz for that (linked issue would not affect us if the doc was already rasterized once).

@DeltaEpsilon19498

Could dangerzone convert the text in the pdf to a .txt file which the journalist could redact manually? Things like black boxes still give away the length of the word being redacted. Then, could a tool be used to convert the redacted text into a pdf document with a template that could be standardized across the industry as a "redacted anti-watermark whistleblowing" template? That way, all watermarks could be removed, except if the corporation or government modifies the text itself a little bit depending on which authorized user is reading it.

With corporations already putting invisible watermarks or whatever into their emails, the above idea could help protect sources. One issue is that with the document modified so much, the corporation or government could deny that it is a legitimate document and claim that it is faked. A second issue is, as mentioned, that they could adapt by modifying the words used in the document depending on the authorized viewer. Third, the leaked material might have important images or diagrams that need to be part of the document but which contain undetectable watermarks. And a fourth issue is that readers / viewers of the general public may be too ignorant to understand why these sorts of measures are necessary, causing them to doubt the authenticity of the document or be manipulated by propaganda. So idk. A tool like the one I presented in the first paragraph might still be useful though.
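As a rough illustration of the text-based idea from the first paragraph, the sketch below replaces each sensitive term in already extracted text with a fixed-size placeholder, so the output does not reveal the length of what was removed. The `redact_terms` helper and the placeholder string are hypothetical, and extracting the text from the PDF is assumed to have happened elsewhere:

```python
import re

# Same placeholder for every redaction, regardless of the original length.
PLACEHOLDER = "[REDACTED]"


def redact_terms(text, terms):
    """Replace every occurrence of each term with a fixed-size placeholder."""
    if not terms:
        return text
    # Try longer terms first so "Alice Smith" wins over "Alice".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    return pattern.sub(PLACEHOLDER, text)
```

For example, `redact_terms("Alice met Bob in Paris.", ["Alice", "Bob"])` gives `"[REDACTED] met [REDACTED] in Paris."`. The surrounding wording can still identify a source, so this only addresses the length leak.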

@apyrgio
Copy link
Contributor

apyrgio commented Apr 15, 2024

To the above reservations, I'd add the fact that some documents may have two columns of text, pictures, or formatting elements like tables. If it's a solution that works for 90% of the documents, then we will add some extra mental load to a journalist that is already pressed (given that they are handling a very sensitive document).

Still, allowing users to get back just the text of the document, and then post-process it in any way they like, could be a nice fit for a Dangerzone plugin system. I think we had an issue for this, but I can't find it right now.
