This repository has been archived by the owner on Feb 15, 2024. It is now read-only.

Add hashed email and phone fields to Data Labs output
We want an anonymised means to identify when the same email and phone
have been used to submit multiple questions. This is done by applying
a SHA256 hash function to the email and phone number fields. When doing
analysis on the data these fields can be compared to identify where a
single person may be skewing the data.

In adding this it is understood that it is valid for users to submit
multiple questions, and that this mechanism for tagging an individual
question submitter is not useful if someone enters a different
email/phone number each time. Rather, this is an additional tool to
help make sense of the submitted data and to help identify trends.

To keep a reasonable degree of anonymity, the values are hashed
with a secret key. This is so that only holders of the secret key can
check whether a particular email address has asked a question. This does
unfortunately mean that we need yet another environment variable. I gave
it a rather generic name, SECRET_KEY, so that it can be re-used
if we need to do anything else with a secret key.
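
As a rough illustration of the analysis described above, the sketch below hashes each email with a key and counts submissions per hash. The `keyed_hash` helper, the example key and the sample submissions are hypothetical; only the `SHA256(field + key)` construction is taken from this commit.

```ruby
require "digest"

# Hypothetical key for illustration; the export reads it from SECRET_KEY.
secret_key = "example-secret-key"

# Same construction as the commit: SHA256 over the field followed by the key.
def keyed_hash(field, secret_key)
  Digest::SHA256.hexdigest(field + secret_key)
end

# Made-up submissions; two of them share an email address.
submissions = [
  { email: "first@example.com", question: "First question?" },
  { email: "second@example.com", question: "Second question?" },
  { email: "first@example.com", question: "Third question?" },
]

hashed = submissions.map do |submission|
  submission.merge(hashed_email: keyed_hash(submission[:email], secret_key))
end

# Count submissions per hashed email; a count above 1 marks a repeat submitter.
counts = hashed.group_by { |s| s[:hashed_email] }.transform_values(&:count)
repeat_hashes = counts.select { |_hashed_email, count| count > 1 }.keys
```

Here `repeat_hashes` holds the hashes seen more than once, without the analyst ever handling the underlying email address.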
kevindew committed May 21, 2020
1 parent 27f0ee6 commit 227d4ee
Showing 5 changed files with 38 additions and 10 deletions.
15 changes: 12 additions & 3 deletions README.md
@@ -59,6 +59,8 @@ The following environment variables should be configured:
storing CSV exports for the third party
- `THIRD_PARTY_EMAIL_RECIPIENTS` - a comma separated list of email addresses
of third party colleagues who will be emailed upon a successful export
+- `SECRET_KEY` - a key that is used as salt to hashing functions to anonymise
+  personally identifiable information
- `SINCE_TIME` (optional) - defaults to "10:00", can be changed to alter the time
exports include data from. When this is a relative time (for example "10:00") it
will be for the previous day, otherwise an absolute time can be set (for example
@@ -80,7 +82,7 @@ found in the Account > API Keys section.
You can test a draft export with:

```
-SMART_SURVEY_API_TOKEN=<api-token> SMART_SURVEY_API_TOKEN_SECRET=<api-token-secret> bundle exec rake file_export
+SECRET_KEY=$(openssl rand -hex 64) SMART_SURVEY_API_TOKEN=<api-token> SMART_SURVEY_API_TOKEN_SECRET=<api-token-secret> bundle exec rake file_export
```

There should now be files created in the `output` directory of this project
@@ -93,19 +95,26 @@ use ISO 8601 formatting such as "2020-05-01 10:00" or "10:00"
For example:

```
-SINCE_TIME=09:00 UNTIL_TIME=11:00 SMART_SURVEY_API_TOKEN=<api-token> SMART_SURVEY_API_TOKEN_SECRET=<api-token-secret> bundle exec rake file_export
+SINCE_TIME=09:00 UNTIL_TIME=11:00 SECRET_KEY=$(openssl rand -hex 64) SMART_SURVEY_API_TOKEN=<api-token> SMART_SURVEY_API_TOKEN_SECRET=<api-token-secret> bundle exec rake file_export
```

To perform the live export you need `SMART_SURVEY_LIVE` to equal `"true"`, for
example:

```
-SMART_SURVEY_LIVE=true SINCE_TIME=09:00 UNTIL_TIME=11:00 SMART_SURVEY_API_TOKEN=<api-token> SMART_SURVEY_API_TOKEN_SECRET=<api-token-secret> bundle exec rake file_export
+SMART_SURVEY_LIVE=true SINCE_TIME=09:00 UNTIL_TIME=11:00 SECRET_KEY=$(openssl rand -hex 64) SMART_SURVEY_API_TOKEN=<api-token> SMART_SURVEY_API_TOKEN_SECRET=<api-token-secret> bundle exec rake file_export
```

This should now have added files to the `output` directory (it will have
overwritten any draft files).

+These examples all use a randomly generated secret key to hash user identifiable
+information for Data Labs. Using a randomly generated key means the hashes
+will change on each run, so hashes cannot be compared between runs. To maintain
+the same hashes as the daily export run you will want to retrieve the secret
+key used in the daily export. This can be retrieved from the concourse pipeline
+with `gds cd secrets get cd-govuk-tools govuk-ask-export/secret-key`.

## Licence

[MIT License](LICENCE)
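
The README note about randomly generated keys can be checked with a small sketch. It assumes the same `SHA256(field + key)` construction as the commit; the `keyed_hash` helper and the email address are hypothetical, and `SecureRandom.hex(64)` stands in for `openssl rand -hex 64`.

```ruby
require "digest"
require "securerandom"

# Same construction as the commit: SHA256 over the field followed by the key.
def keyed_hash(field, secret_key)
  Digest::SHA256.hexdigest(field + secret_key)
end

email = "person@example.com" # hypothetical address

# Two runs that each generate their own key, as the README examples do.
run_one_key = SecureRandom.hex(64)
run_two_key = SecureRandom.hex(64)

# With different keys the same email yields different hashes.
different_keys_match = keyed_hash(email, run_one_key) == keyed_hash(email, run_two_key)

# Re-using a key (for example the one from the daily export) keeps hashes stable.
same_key_match = keyed_hash(email, run_one_key) == keyed_hash(email, run_one_key)

puts "different keys match: #{different_keys_match}"
puts "same key matches: #{same_key_match}"
```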
15 changes: 13 additions & 2 deletions lib/ask_export/csv_builder.rb
@@ -18,11 +18,18 @@ def cabinet_office
     end

     def data_labs
-      build_csv(report.completed_responses,
+      responses = report.completed_responses.map do |response|
+        response.merge(hashed_email: hash(response[:email]),
+                       hashed_phone: hash(response[:phone]))
+      end
+
+      build_csv(responses,
                 :submission_time,
                 :region,
                 :question,
-                :question_format)
+                :question_format,
+                :hashed_email,
+                :hashed_phone)
     end

     def performance_analyst
@@ -54,5 +61,9 @@ def build_csv(responses, *fields)
         responses.each { |row| csv << row.slice(*fields).values }
       end
     end
+
+    def hash(field)
+      Digest::SHA256.hexdigest(field + ENV.fetch("SECRET_KEY"))
+    end
   end
 end
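
The `hash` helper above concatenates the field and the key before hashing, so it would raise `NoMethodError` if a field were ever `nil` (whether that can happen depends on the report data, which this diff does not show). It also shares its name with Ruby's built-in `Object#hash`. A possible alternative, which is not part of this commit, is an HMAC that returns `nil` for blank fields. The name `hmac_hash` is hypothetical, and its output would not match hashes produced by the commit's method.

```ruby
require "openssl"

# Hypothetical alternative to the commit's `hash` method, not part of the change.
# Uses HMAC-SHA256 with the secret as the key instead of plain concatenation.
def hmac_hash(field, secret_key)
  # Leave blank values unhashed rather than raising on nil.
  return nil if field.nil? || field.empty?

  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new("SHA256"), secret_key, field)
end

hashed_phone = hmac_hash("+447123456789", "example-secret-key")
missing_phone = hmac_hash(nil, "example-secret-key")
```

In the export itself the key would presumably still come from `ENV.fetch("SECRET_KEY")`; it is passed as an argument here only to keep the sketch self-contained.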
16 changes: 11 additions & 5 deletions spec/ask_export/csv_builder_spec.rb
@@ -18,11 +18,17 @@
   end

   describe "#data_labs" do
-    it "returns a csv of completed records formatted for data labs" do
-      expect(builder.data_labs).to eq(
-        "submission_time,region,question,question_format\n" \
-        "01/05/2020 09:00:00,Scotland,A question?,\"In writing, to be read out at the press conference\"\n",
-      )
+    let(:secret_key) { SecureRandom.uuid }
+    let(:hashed_email) { Digest::SHA256.hexdigest("[email protected]" + secret_key) }
+    let(:hashed_phone) { Digest::SHA256.hexdigest("+447123456789" + secret_key) }
+
+    it "returns a csv of completed records with hashed emails and phone numbers for Data Labs" do
+      ClimateControl.modify(SECRET_KEY: secret_key) do
+        expect(builder.data_labs).to eq(
+          "submission_time,region,question,question_format,hashed_email,hashed_phone\n" \
+          "01/05/2020 09:00:00,Scotland,A question?,\"In writing, to be read out at the press conference\",#{hashed_email},#{hashed_phone}\n",
+        )
+      end
     end
   end

1 change: 1 addition & 0 deletions spec/integration/drive_export_spec.rb
@@ -19,6 +19,7 @@
THIRD_PARTY_DRIVE_FOLDER: "third-party-folder-id",
THIRD_PARTY_RECIPIENTS: "[email protected]",
OUTPUT_DIR: tmpdir,
+SECRET_KEY: SecureRandom.uuid,
SINCE_TIME: "2020-05-06 20:00",
UNTIL_TIME: "2020-05-07 11:00") { example.run }
end
1 change: 1 addition & 0 deletions spec/integration/file_export_spec.rb
@@ -12,6 +12,7 @@
ClimateControl.modify(SMART_SURVEY_API_TOKEN: "token",
SMART_SURVEY_API_TOKEN_SECRET: "token",
OUTPUT_DIR: tmpdir,
+SECRET_KEY: SecureRandom.uuid,
SINCE_TIME: "2020-05-06 20:00",
UNTIL_TIME: "2020-05-07 11:00") do
Rake::Task["file_export"].invoke
