Censor rules are sometimes ineffective #1156

Closed
RichardTaylor opened this issue Oct 24, 2013 · 22 comments
Labels
f:redaction improvement Improves existing functionality (UI tweaks, refactoring, performance, etc) stale Issues with no activity for 12 months transparent-administration

Comments

@RichardTaylor

This was raised, and discussed, at the now closed #33

There is still an issue with long censor rules not always working. I suspect it's an issue related to line breaks or odd characters.

A recent example is at:

https://www.whatdotheyknow.com/admin/requests/173631

I resorted to editing the outgoing message there as censor rule attempts were ineffective. I have left the ineffective censor rules in place, in part as a record of what has been removed.

Simpler rules which don't work can be seen in censor rules I've applied to a test request at:

https://www.whatdotheyknow.com/admin/requests/10375

One of those rules has a "this works" comment; the others don't work, even though I made them by copying and pasting either from the raw message in the admin interface or from the public thread.

@RichardTaylor
Author

I've struggled with censor rules at:

https://www.whatdotheyknow.com/admin/requests/33740

Ideally rule 1724 would have worked; it didn't.

I needed different censor rules for the quoted response to the original.

In relation to part of that I used the censor rule to note the change and edited the outgoing message.

@RichardTaylor
Author

Is it easy to determine whether a rule has had any effect? (Presumably a rule has an effect whenever it matches something in the text it applies to?)

Could we give an "are you sure" warning when an attempt is made to create a rule which has no effect?
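
A minimal sketch of such a check, written in Python for illustration rather than taken from Alaveteli. The function name `rule_matches_anything`, the `is_regex` flag and the plain list of message bodies are all assumptions; a real check would need to look at every text the rule is applied to, including attachments.

```python
import re
from typing import Iterable


def rule_matches_anything(text: str, is_regex: bool, bodies: Iterable[str]) -> bool:
    """Return True if the rule would change at least one message body."""
    # Plain-text rules are escaped so that characters such as "(" or "."
    # are matched literally instead of being treated as regex syntax.
    pattern = re.compile(text if is_regex else re.escape(text))
    return any(pattern.search(body) for body in bodies)


# Made-up message bodies: the second is a quoted copy with different wrapping.
bodies = [
    "Dear Council,\nplease send the report.",
    "> Dear Council,\n> please send the report.",
]

if not rule_matches_anything("Dear Council, please", False, bodies):
    print("Warning: this rule matches nothing in the request. Save anyway?")
```

Here the rule text was typed with a space where the messages have a line break, so the warning is printed.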

@RichardTaylor
Author

This is an issue which I'm encountering every couple of days.

Today the first censor rule I applied at:

https://www.whatdotheyknow.com/admin/requests/109902

covered a paragraph and did remove it from the original outgoing message. However, it didn't work for a copy in a further outgoing message, or where it was quoted in replies. I understand the line breaking is different in those latter occurrences.

I couldn't manage to create rules which would apply to those latter cases. In one case I resorted to editing the raw outgoing message, and in the other I applied line-by-line censor rules.
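
One way to make a single rule survive re-wrapping and quoting is to generate a regex from the pasted text. This is a hedged sketch in Python, not Alaveteli's implementation. The helper name `flexible_pattern` and the sample messages are invented, and it assumes that quoted copies differ only in whitespace and leading ">" markers.

```python
import re


def flexible_pattern(pasted: str) -> re.Pattern:
    """Build a regex that ignores how the pasted text was wrapped or quoted."""
    words = pasted.split()
    # Between words, allow any run of whitespace and ">" quote markers.
    gap = r"[\s>]+"
    return re.compile(gap.join(re.escape(word) for word in words))


# Made-up example: the same sentence, as sent and as quoted in a reply.
original = "My neighbour John Smith of 1 High Street complained."
quoted = "> My neighbour John Smith of 1\n> High Street complained."

rule = flexible_pattern("John Smith of 1 High Street")
print(rule.sub("[name and address removed]", original))
print(rule.sub("[name and address removed]", quoted))
```

The trade-off is that the replacement swallows the line break and quote marker inside the match, so the redacted copy may be wrapped slightly differently from the original.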

@RichardTaylor
Author

Censor rule 1761 at https://www.whatdotheyknow.com/admin/requests/155205 didn't work.

I removed the text from the outgoing message instead (leaving the rule in place as a record).

To remove the material where it was quoted in a reply, I tried copying and pasting the text as it appeared in that reply (rule 1762); that didn't work either. I then created line-by-line censor rules, which did work.

@hsenag
Collaborator

hsenag commented Dec 13, 2013

Another request where a multi-line censor rule didn't work and I spent a while doing it line by line: https://www.whatdotheyknow.com/request/disabled_and_elderly_respite_car

I also had problems with censoring punctuation like "-" but I may just not have tried hard enough.

@RichardTaylor
Author

More examples where multi-line censor rules would have been useful can be seen at:

https://www.whatdotheyknow.com/admin/requests/158698

(again there I edited the raw outgoing messages after using the [ineffective] censor rules to note the original content).

Maybe we need another, additional, way of thinking about this censoring eg. removing from point x to point y in the relevant message, as an option as well as trying to match and replace specific text?
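
A rough sketch of what "remove from point x to point y" could look like, in Python. The function `redact_range` and the idea of storing character offsets are assumptions made for illustration; nothing in this thread says how such a feature would actually be stored or applied.

```python
def redact_range(message: str, start: int, end: int, note: str = "[redacted]") -> str:
    """Replace message[start:end] with a note, regardless of its content."""
    if not 0 <= start <= end <= len(message):
        raise ValueError("range lies outside the message")
    return message[:start] + note + message[end:]


# Made-up message; in an admin interface the offsets might come from a
# text selection rather than from searching for marker strings.
message = "Line one.\nPrivate detail spread\nover two lines.\nLine four."
start = message.index("Private")
end = message.index("Line four")
print(redact_range(message, start, end, "[paragraph removed]\n"))
```

Because nothing has to match, line breaks and odd characters inside the range cannot stop the removal. The cost is that offsets apply to one message only, so quoted copies elsewhere in the thread would each need their own range.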

@RichardTaylor
Author

Another example where I resorted to line-by-line censor rules for the incoming message and to editing the outgoing message:

https://www.whatdotheyknow.com/admin/requests/144185

@RichardTaylor RichardTaylor changed the title Large censor rules are often ineffective Censor rules are somtimes ineffective Nov 5, 2015
@RichardTaylor RichardTaylor changed the title Censor rules are somtimes ineffective Censor rules are sometimes ineffective Nov 5, 2015
@RichardTaylor
Author

I've edited the title of this issue to make it more general.

At
https://www.whatdotheyknow.com/admin/requests/296386
the material we wanted to censor was within a docx; the censor rule didn't appear to be effective against plain text within the docx.

Removal of material from PDF documents is often hard; see for example:

https://www.whatdotheyknow.com/admin/requests/228030

See also the EditingRequests page under WhatDoTheyKnow on the mySociety old internal wiki.

@RichardTaylor
Author

At

https://www.whatdotheyknow.com/admin/requests/272538

We want to redact a number from an Excel .xlsx document. The censor rules aren't working on it, presumably for a reason related to the structure of the file. The number is there in plain text once the .xlsx file is opened/extracted to reveal the plain text files within.
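
Since .xlsx and .docx files are zip archives of XML parts, a rule could in principle be applied to each part after extraction. The following Python sketch shows the idea only. `redact_ooxml` is a hypothetical helper, it assumes the parts are UTF-8, and it will miss text that the application has split across several XML elements.

```python
import io
import re
import zipfile


def redact_ooxml(data: bytes, pattern: str, replacement: str) -> bytes:
    """Apply a regex to every XML part of a .docx/.xlsx and re-zip the result."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            payload = src.read(name)
            if name.endswith(".xml"):
                # Only XML parts are treated as text; images and other
                # binary parts are copied through unchanged.
                text = payload.decode("utf-8")
                payload = re.sub(pattern, replacement, text).encode("utf-8")
            dst.writestr(name, payload)
    return out.getvalue()
```

Usage would be along the lines of `redact_ooxml(open("file.xlsx", "rb").read(), r"0123456789", "[number removed]")`, with the returned bytes written to a new file. Applying a rule to the compressed bytes directly cannot work, because the plain text only exists after decompression.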

@RichardTaylor
Author

I used lots of censor rules at:
https://www.whatdotheyknow.com/admin/requests/26588

An attempt at a multi-line rule failed; I didn't try things like replacing line breaks with spaces, or other modifications, to make it work.

@RichardTaylor
Author

Another example where lots of line-by-line censor rules were (I think) required:

https://www.whatdotheyknow.com/admin/requests/12701

@RichardTaylor
Author

Discussion of better documenting the regex option already present in Alaveteli:

#2056 (comment)

@RichardTaylor
Author

Just adding a +1 following a case on WhatDoTheyKnow where having censor rules work on a .xlsx document would have been useful.

See previous comment on this thread: #1156 (comment)

@garethrees garethrees added f:redaction improvement Improves existing functionality (UI tweaks, refactoring, performance, etc) labels May 29, 2018
@RichardTaylor
Author

Adding a note as today I wanted to invoke a censor rule on a docx document at

https://www.whatdotheyknow.com/admin/requests/489665

the text to redact is present in plain text in the docx document, but it's within directories in the docx structure.

Currently censor rules don't work on such documents.

@garethrees
Member

garethrees commented Mar 27, 2019

In some cases censor rules applied to PDFs corrupt the output PDF.

In a case I was investigating recently, it looks like re-compressing the PDF results in it being created with a different PDF version. Presumably something else in the PDF is incompatible with the new version, so breaks in some viewers:

$ qpdf --stream-data=uncompress no-censor-rules.pdf no-censor-rules.txt
$ qpdf --stream-data=uncompress full-line-rule.pdf full-line-rule.txt

$ diff no-censor-rules.txt full-line-rule.txt | head -n 4
1c1
< %PDF-1.4
---
> %PDF-1.7

Even adding a censor rule that does nothing results in different (corrupt) output:

# Text: `4(n N)`
# Replacement: `4(n N)`

$ sha256sum no-censor-rules.pdf
7c0d4067b0789d00bc53ea75c8c0782c9f6ab7a07fede1c7b9235373c705e71e  no-censor-rules.pdf
$ sha256sum short-no-change-rule.pdf
02a1ccb8a01a0d8118414e44bb4caa4c0f2f02d72af4bfa31ade6e5e6e8f027b  short-no-change-rule.pdf
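
The same two comparisons (header version and checksum) could be automated, for example in a regression test. This is a sketch in Python under stated assumptions: `pdf_version` and `describe_change` are hypothetical helpers, and only the `%PDF-x.y` header line is inspected, not any version declared elsewhere in the file.

```python
import hashlib
import re
from typing import Optional


def pdf_version(data: bytes) -> Optional[str]:
    """Return the version from the %PDF-x.y header, or None if absent."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    return match.group(1).decode("ascii") if match else None


def describe_change(before: bytes, after: bytes) -> dict:
    """Summarise how a rewritten PDF differs from the original."""
    return {
        "bytes_changed": hashlib.sha256(before).digest() != hashlib.sha256(after).digest(),
        "version_before": pdf_version(before),
        "version_after": pdf_version(after),
    }
```

A check like `describe_change(original, censored)` for a rule that replaces text with itself would ideally report no change at all; the output above suggests it currently would not.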

@RichardTaylor
Author

Noting a discussion group thread on which it is stated

(\s+|\n)

works to match newlines in Alaveteli

https://groups.google.com/g/alaveteli-users/c/o20BudmKZeo
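
For illustration, this is how that separator behaves in a hand-written regex rule. The sketch uses Python's `re` module rather than Alaveteli itself, and the sample text is made up. In Python, `\s` already matches a newline, so `\s+` alone would be enough here; the form quoted from the thread is kept as-is.

```python
import re

# Hypothetical text to redact, wrapped differently in two messages.
one_line = "the applicant Jane Doe wrote"
wrapped = "the applicant Jane\nDoe wrote"

# The separator suggested in the thread, placed wherever a break may occur.
rule = re.compile(r"Jane(\s+|\n)Doe")

for body in (one_line, wrapped):
    print(rule.sub("[name removed]", body))
```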

@RichardTaylor
Author

Surprisingly an attempted censor rule on WhatDoTheyKnow

admin/censor_rules/6529/edit

on

admin/requests/831728

didn't work, apparently because it went across two lines of text as wrapped for display on the public page. This is confusing, as there were no line breaks shown when viewing

/admin/outgoing_messages/1258933/edit

Censor rules for just one line of text do work.

This one might be worth looking at to seek to understand the issues with line breaks and censor rules.

@RichardTaylor
Author

I had challenges today with plain text censor rules on plain text emails.

I was focused on getting the job done rather than identifying the issues but I think issues were caused by:

  • Whitespace. Even when I copied and pasted the text to be redacted, anything other than a normal space between words appeared to cause an issue.
  • Characters eg. / and '
  • I had to go line by line as the censor rules didn't work over line breaks.
  • Sometimes I'd make a censor rule and it would undo censoring I'd already done (related: Facility to change order of censor rules #2761)
  • I had the issues described at Make creating multiple censor rules easier #4626 in relation to creating lots of censor rules
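
For the whitespace and character problems in the list above, it can help to see exactly which characters a pasted string contains. A small diagnostic sketch in Python follows; `describe_odd_characters` is a hypothetical helper and the sample string is made up.

```python
import unicodedata


def describe_odd_characters(text: str) -> list:
    """List (index, code point, name) for every character outside printable ASCII."""
    report = []
    for index, char in enumerate(text):
        if not (" " <= char <= "~"):
            # Control characters such as "\r" and "\n" have no Unicode name.
            name = unicodedata.name(char, "UNNAMED CONTROL CHARACTER")
            report.append((index, f"U+{ord(char):04X}", name))
    return report


# Looks like "Jane Doe's file" on screen, but contains a no-break space,
# a curly apostrophe and a Windows line ending.
pasted = "Jane\u00a0Doe\u2019s file\r\n"
for entry in describe_odd_characters(pasted):
    print(entry)
```

A rule typed with an ordinary space and a straight apostrophe would not match this text literally, even though the two look identical in a browser.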

@RichardTaylor
Author

Related: being able to stop displaying "show quoted sections" would help significantly with censor rule creation, as it reduces the amount of material one needs to censor.

#7081

@RichardTaylor
Author

I had a case where I wanted to redact

-0.2 (\() -0.2 (\))

to

-0.2 (x) -0.2 (x)

to remove material from a PDF

Similar rules with different characters in the brackets eg.

-0.2 ($) -0.2 ($)

to

-0.2 (x) -0.2 (x)

did work
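
Two properties of these rule texts can be checked mechanically. The Python sketch below only records observations. It does not establish why the first rule failed, and it makes no claim about how Alaveteli treats either rule internally.

```python
import re

failing_text = rb"-0.2 (\() -0.2 (\))"
working_text = rb"-0.2 ($) -0.2 ($)"
replacement = rb"-0.2 (x) -0.2 (x)"

# Observation 1: only the working rule keeps the byte length unchanged,
# because each "\(" or "\)" is two bytes and each "x" is one.
print(len(failing_text), len(working_text), len(replacement))

# Observation 2: if either text were interpreted as a regex rather than as
# literal text, it would not match itself; escaped, both match.
for text in (failing_text, working_text):
    as_regex = re.search(text, text) is not None
    as_literal = re.search(re.escape(text), text) is not None
    print(text, as_regex, as_literal)
```

If the cause is worth chasing, these suggest two things to test: whether a same-length replacement behaves differently, and whether the backslashes are being consumed somewhere before matching.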

@HelenWDTK
Contributor

This issue has been automatically closed due to a lack of discussion or resolution for over 12 months.
Should we decide to revisit this issue in the future, it can be reopened.

@HelenWDTK HelenWDTK closed this as not planned Won't fix, can't repro, duplicate, stale Nov 19, 2024

5 participants