Before you can start to use BigQuery and send events to it with dfe-analytics, you'll need to set up your project in the Google Cloud Platform (GCP). These steps need to be performed only once, when you set up your Google Cloud project.
Ask in Slack on the #twd_data_insights
channel for someone to help you create
your project in the digital.education.gov.uk
Google Cloud Organisation.
Each team is responsible for managing their project in Google Cloud. Ensure
you've added users with the Owner
role through the IAM section of Google
Cloud.
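If you prefer the command line, a minimal sketch of granting the Owner role with the gcloud CLI might look like this (the user email and project ID are placeholders):

```sh
# Grant the Owner role to a user on your project.
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="user:[email protected]" \
  --role="roles/owner"
```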
You also need to set up your GCP organisation instance with paid billing. This is because dfe-analytics uses streaming, and streaming to BigQuery isn't allowed in the free tier:

```
accessDenied: Access Denied: BigQuery BigQuery: Streaming insert is not allowed in the free tier
```
The following steps can be accomplished without having billing set up; however, there are certain restrictions:

- Streaming data to BigQuery isn't allowed, so you won't be able to use dfe-analytics.
- Tables are limited to 60 days' retention.
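To confirm billing is enabled on your project, you can check from the CLI (a sketch; the project ID is a placeholder):

```sh
# Shows the linked billing account and whether billingEnabled is true.
gcloud billing projects describe YOUR_PROJECT_ID
```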
We use customised roles to give permissions to users who need to use BigQuery. Instructions are provided below and must be followed to create each role. There are two approaches to creating custom roles: one uses the Google Cloud Shell CLI, which is appropriate for advanced users comfortable with command-line interfaces; the other uses the Google Cloud IAM web UI, which requires more manual work, especially when it comes to adding permissions.
Instructions for GCloud CLI
NB: These instructions are appropriate for people who are comfortable running shell commands.
Instructions for GCloud IAM Web UI
NB: Adding permissions to a role is a manual process that requires using the permission browser to add permissions one at a time.
- Go to the IAM section of the Google Cloud Console for your project.
- Go to the Roles section using the sidebar on the left.
- Click "+ Create role" near the top.
- Fill in the details using the information below.
This role is used for analysts or other users who don't need to write to or modify data in BigQuery.
Using the GCloud CLI
```
gcloud iam roles create bigquery_basic_custom --title="BigQuery Basic Custom" --description="Assigned to accounts used by analysts." --permissions=analyticshub.dataExchanges.get,analyticshub.dataExchanges.getIamPolicy,analyticshub.dataExchanges.list,analyticshub.dataExchanges.subscribe,analyticshub.listings.get,analyticshub.listings.getIamPolicy,analyticshub.listings.list,analyticshub.listings.subscribe,bigquery.connections.get,bigquery.dataPolicies.maskedGet,bigquery.datasets.get,bigquery.datasets.getIamPolicy,bigquery.datasets.updateTag,bigquery.jobs.create,bigquery.jobs.get,bigquery.jobs.list,bigquery.jobs.listAll,bigquery.models.export,bigquery.models.getData,bigquery.models.getMetadata,bigquery.models.list,bigquery.readsessions.create,bigquery.readsessions.getData,bigquery.readsessions.update,bigquery.routines.get,bigquery.routines.list,bigquery.savedqueries.create,bigquery.savedqueries.delete,bigquery.savedqueries.get,bigquery.savedqueries.list,bigquery.savedqueries.update,bigquery.tables.createSnapshot,bigquery.tables.export,bigquery.tables.get,bigquery.tables.getData,bigquery.tables.getIamPolicy,bigquery.tables.list,bigquery.tables.restoreSnapshot,datacatalog.entries.get,datacatalog.entries.list,datacatalog.entryGroups.get,datacatalog.entryGroups.list,datacatalog.tagTemplates.get,datacatalog.tagTemplates.getTag,datacatalog.taxonomies.get,datacatalog.taxonomies.list,datalineage.events.get,datalineage.events.list,datalineage.locations.searchLinks,datalineage.processes.get,datalineage.processes.list,datalineage.runs.get,datalineage.runs.list,iam.serviceAccounts.actAs,iam.serviceAccounts.get,iam.serviceAccounts.list,pubsub.topics.get,resourcemanager.projects.get --project=YOUR_PROJECT_ID
```
Using the GCloud IAM Web UI
Field | Value |
---|---|
Title | BigQuery Basic Custom |
Description | Assigned to accounts used by analysts or other users who don't need to write to or modify data in BigQuery. |
ID | bigquery_basic_custom |
Role launch stage | General Availability |
+ Add permissions | See below |
analyticshub.dataExchanges.get
analyticshub.dataExchanges.getIamPolicy
analyticshub.dataExchanges.list
analyticshub.dataExchanges.subscribe
analyticshub.listings.get
analyticshub.listings.getIamPolicy
analyticshub.listings.list
analyticshub.listings.subscribe
bigquery.connections.get
bigquery.dataPolicies.maskedGet
bigquery.datasets.get
bigquery.datasets.getIamPolicy
bigquery.datasets.updateTag
bigquery.jobs.create
bigquery.jobs.get
bigquery.jobs.list
bigquery.jobs.listAll
bigquery.models.export
bigquery.models.getData
bigquery.models.getMetadata
bigquery.models.list
bigquery.readsessions.create
bigquery.readsessions.getData
bigquery.readsessions.update
bigquery.routines.get
bigquery.routines.list
bigquery.savedqueries.create
bigquery.savedqueries.delete
bigquery.savedqueries.get
bigquery.savedqueries.list
bigquery.savedqueries.update
bigquery.tables.createSnapshot
bigquery.tables.export
bigquery.tables.get
bigquery.tables.getData
bigquery.tables.getIamPolicy
bigquery.tables.list
bigquery.tables.restoreSnapshot
datacatalog.entries.get
datacatalog.entries.list
datacatalog.entryGroups.get
datacatalog.entryGroups.list
datacatalog.tagTemplates.get
datacatalog.tagTemplates.getTag
datacatalog.taxonomies.get
datacatalog.taxonomies.list
datalineage.events.get
datalineage.events.list
datalineage.locations.searchLinks
datalineage.processes.get
datalineage.processes.list
datalineage.runs.get
datalineage.runs.list
iam.serviceAccounts.actAs
iam.serviceAccounts.get
iam.serviceAccounts.list
pubsub.topics.get
resourcemanager.projects.get
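However you created it, the role has no effect until it's bound to a user. A sketch of granting the custom basic role to an analyst (the email and project ID are placeholders):

```sh
# Bind the custom role created above to an analyst's account.
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="user:[email protected]" \
  --role="projects/YOUR_PROJECT_ID/roles/bigquery_basic_custom"
```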
This role is used for Dataform SQL developers or other users who need to be able to write to or modify data in BigQuery.
Using the GCloud CLI
```
gcloud iam roles create bigquery_advanced_custom --title="BigQuery Advanced Custom" --description="Assigned to accounts used by Dataform SQL developers who need to be able to write to or modify data in BigQuery." --permissions=analyticshub.dataExchanges.get,analyticshub.dataExchanges.getIamPolicy,analyticshub.dataExchanges.list,analyticshub.dataExchanges.subscribe,analyticshub.listings.get,analyticshub.listings.getIamPolicy,analyticshub.listings.list,analyticshub.listings.subscribe,aiplatform.notebookRuntimeTemplates.apply,aiplatform.notebookRuntimeTemplates.get,aiplatform.notebookRuntimeTemplates.getIamPolicy,aiplatform.notebookRuntimeTemplates.list,aiplatform.notebookRuntimes.assign,aiplatform.notebookRuntimes.get,aiplatform.notebookRuntimes.list,aiplatform.operations.list,bigquery.config.get,bigquery.connections.create,bigquery.connections.delete,bigquery.connections.get,bigquery.connections.getIamPolicy,bigquery.connections.list,bigquery.connections.update,bigquery.connections.updateTag,bigquery.connections.use,bigquery.datasets.create,bigquery.datasets.delete,bigquery.datasets.get,bigquery.datasets.getIamPolicy,bigquery.datasets.update,bigquery.datasets.updateTag,bigquery.jobs.create,bigquery.jobs.delete,bigquery.jobs.get,bigquery.jobs.list,bigquery.jobs.listAll,bigquery.jobs.update,bigquery.models.create,bigquery.models.delete,bigquery.models.export,bigquery.models.getData,bigquery.models.getMetadata,bigquery.models.list,bigquery.models.updateData,bigquery.models.updateMetadata,bigquery.models.updateTag,bigquery.readsessions.create,bigquery.readsessions.getData,bigquery.readsessions.update,bigquery.routines.create,bigquery.routines.delete,bigquery.routines.get,bigquery.routines.list,bigquery.routines.update,bigquery.routines.updateTag,bigquery.savedqueries.create,bigquery.savedqueries.delete,bigquery.savedqueries.get,bigquery.savedqueries.list,bigquery.savedqueries.update,bigquery.tables.create,bigquery.tables.createSnapshot,bigquery.tables.delete,bigquery.tables.deleteSnapshot,bigquery.tables.export,bigquery.tables.get,bigquery.tables.getData,bigquery.tables.getIamPolicy,bigquery.tables.list,bigquery.tables.restoreSnapshot,bigquery.tables.setCategory,bigquery.tables.update,bigquery.tables.updateData,bigquery.tables.updateTag,datacatalog.categories.fineGrainedGet,datacatalog.entries.get,datacatalog.entries.list,datacatalog.entryGroups.get,datacatalog.entryGroups.list,datacatalog.tagTemplates.get,datacatalog.tagTemplates.getTag,datacatalog.taxonomies.get,datacatalog.taxonomies.list,dataform.compilationResults.create,dataform.compilationResults.get,dataform.compilationResults.list,dataform.compilationResults.query,dataform.locations.get,dataform.locations.list,dataform.releaseConfigs.create,dataform.releaseConfigs.delete,dataform.releaseConfigs.get,dataform.releaseConfigs.list,dataform.releaseConfigs.update,dataform.repositories.commit,dataform.repositories.computeAccessTokenStatus,dataform.repositories.create,dataform.repositories.delete,dataform.repositories.fetchHistory,dataform.repositories.fetchRemoteBranches,dataform.repositories.get,dataform.repositories.getIamPolicy,dataform.repositories.list,dataform.repositories.queryDirectoryContents,dataform.repositories.readFile,dataform.repositories.setIamPolicy,dataform.repositories.update,dataform.workflowConfigs.create,dataform.workflowConfigs.delete,dataform.workflowConfigs.get,dataform.workflowConfigs.list,dataform.workflowConfigs.update,dataform.workflowInvocations.cancel,dataform.workflowInvocations.create,dataform.workflowInvocations.delete,dataform.workflowInvocations.get,dataform.workflowInvocations.list,dataform.workflowInvocations.query,dataform.workspaces.commit,dataform.workspaces.create,dataform.workspaces.delete,dataform.workspaces.fetchFileDiff,dataform.workspaces.fetchFileGitStatuses,dataform.workspaces.fetchGitAheadBehind,dataform.workspaces.get,dataform.workspaces.getIamPolicy,dataform.workspaces.installNpmPackages,dataform.workspaces.list,dataform.workspaces.makeDirectory,dataform.workspaces.moveDirectory,dataform.workspaces.moveFile,dataform.workspaces.pull,dataform.workspaces.push,dataform.workspaces.queryDirectoryContents,dataform.workspaces.readFile,dataform.workspaces.removeDirectory,dataform.workspaces.removeFile,dataform.workspaces.reset,dataform.workspaces.searchFiles,dataform.workspaces.setIamPolicy,dataform.workspaces.writeFile,datalineage.events.get,datalineage.events.list,datalineage.locations.searchLinks,datalineage.processes.get,datalineage.processes.list,datalineage.runs.get,datalineage.runs.list,iam.serviceAccounts.actAs,iam.serviceAccounts.get,iam.serviceAccounts.list,logging.buckets.get,logging.buckets.list,logging.exclusions.get,logging.exclusions.list,logging.links.get,logging.links.list,logging.locations.get,logging.locations.list,logging.logEntries.list,logging.logMetrics.get,logging.logMetrics.list,logging.logServiceIndexes.list,logging.logServices.list,logging.logs.list,logging.operations.get,logging.operations.list,logging.queries.create,logging.queries.delete,logging.queries.get,logging.queries.list,logging.queries.listShared,logging.queries.update,logging.sinks.get,logging.sinks.list,logging.usage.get,logging.views.get,logging.views.list,pubsub.topics.get,resourcemanager.projects.get --project=YOUR_PROJECT_ID
```
Using the GCloud IAM Web UI
Field | Value |
---|---|
Title | BigQuery Advanced Custom |
Description | Assigned to accounts used by Dataform SQL developers who need to be able to write to or modify data in BigQuery. |
ID | bigquery_advanced_custom |
Role launch stage | General Availability |
+ Add permissions | See below |
analyticshub.dataExchanges.get
analyticshub.dataExchanges.getIamPolicy
analyticshub.dataExchanges.list
analyticshub.dataExchanges.subscribe
analyticshub.listings.get
analyticshub.listings.getIamPolicy
analyticshub.listings.list
analyticshub.listings.subscribe
aiplatform.notebookRuntimeTemplates.apply
aiplatform.notebookRuntimeTemplates.get
aiplatform.notebookRuntimeTemplates.getIamPolicy
aiplatform.notebookRuntimeTemplates.list
aiplatform.notebookRuntimes.assign
aiplatform.notebookRuntimes.get
aiplatform.notebookRuntimes.list
aiplatform.operations.list
bigquery.config.get
bigquery.connections.create
bigquery.connections.delete
bigquery.connections.get
bigquery.connections.getIamPolicy
bigquery.connections.list
bigquery.connections.update
bigquery.connections.updateTag
bigquery.connections.use
bigquery.datasets.create
bigquery.datasets.delete
bigquery.datasets.get
bigquery.datasets.getIamPolicy
bigquery.datasets.update
bigquery.datasets.updateTag
bigquery.jobs.create
bigquery.jobs.delete
bigquery.jobs.get
bigquery.jobs.list
bigquery.jobs.listAll
bigquery.jobs.update
bigquery.models.create
bigquery.models.delete
bigquery.models.export
bigquery.models.getData
bigquery.models.getMetadata
bigquery.models.list
bigquery.models.updateData
bigquery.models.updateMetadata
bigquery.models.updateTag
bigquery.readsessions.create
bigquery.readsessions.getData
bigquery.readsessions.update
bigquery.routines.create
bigquery.routines.delete
bigquery.routines.get
bigquery.routines.list
bigquery.routines.update
bigquery.routines.updateTag
bigquery.savedqueries.create
bigquery.savedqueries.delete
bigquery.savedqueries.get
bigquery.savedqueries.list
bigquery.savedqueries.update
bigquery.tables.create
bigquery.tables.createSnapshot
bigquery.tables.delete
bigquery.tables.deleteSnapshot
bigquery.tables.export
bigquery.tables.get
bigquery.tables.getData
bigquery.tables.getIamPolicy
bigquery.tables.list
bigquery.tables.restoreSnapshot
bigquery.tables.setCategory
bigquery.tables.update
bigquery.tables.updateData
bigquery.tables.updateTag
datacatalog.categories.fineGrainedGet
datacatalog.entries.get
datacatalog.entries.list
datacatalog.entryGroups.get
datacatalog.entryGroups.list
datacatalog.tagTemplates.get
datacatalog.tagTemplates.getTag
datacatalog.taxonomies.get
datacatalog.taxonomies.list
dataform.compilationResults.create
dataform.compilationResults.get
dataform.compilationResults.list
dataform.compilationResults.query
dataform.locations.get
dataform.locations.list
dataform.releaseConfigs.create
dataform.releaseConfigs.delete
dataform.releaseConfigs.get
dataform.releaseConfigs.list
dataform.releaseConfigs.update
dataform.repositories.commit
dataform.repositories.computeAccessTokenStatus
dataform.repositories.create
dataform.repositories.delete
dataform.repositories.fetchHistory
dataform.repositories.fetchRemoteBranches
dataform.repositories.get
dataform.repositories.getIamPolicy
dataform.repositories.list
dataform.repositories.queryDirectoryContents
dataform.repositories.readFile
dataform.repositories.setIamPolicy
dataform.repositories.update
dataform.workflowConfigs.create
dataform.workflowConfigs.delete
dataform.workflowConfigs.get
dataform.workflowConfigs.list
dataform.workflowConfigs.update
dataform.workflowInvocations.cancel
dataform.workflowInvocations.create
dataform.workflowInvocations.delete
dataform.workflowInvocations.get
dataform.workflowInvocations.list
dataform.workflowInvocations.query
dataform.workspaces.commit
dataform.workspaces.create
dataform.workspaces.delete
dataform.workspaces.fetchFileDiff
dataform.workspaces.fetchFileGitStatuses
dataform.workspaces.fetchGitAheadBehind
dataform.workspaces.get
dataform.workspaces.getIamPolicy
dataform.workspaces.installNpmPackages
dataform.workspaces.list
dataform.workspaces.makeDirectory
dataform.workspaces.moveDirectory
dataform.workspaces.moveFile
dataform.workspaces.pull
dataform.workspaces.push
dataform.workspaces.queryDirectoryContents
dataform.workspaces.readFile
dataform.workspaces.removeDirectory
dataform.workspaces.removeFile
dataform.workspaces.reset
dataform.workspaces.searchFiles
dataform.workspaces.setIamPolicy
dataform.workspaces.writeFile
datalineage.events.get
datalineage.events.list
datalineage.locations.searchLinks
datalineage.processes.get
datalineage.processes.list
datalineage.runs.get
datalineage.runs.list
iam.serviceAccounts.actAs
iam.serviceAccounts.get
iam.serviceAccounts.list
logging.buckets.get
logging.buckets.list
logging.exclusions.get
logging.exclusions.list
logging.links.get
logging.links.list
logging.locations.get
logging.locations.list
logging.logEntries.list
logging.logMetrics.get
logging.logMetrics.list
logging.logServiceIndexes.list
logging.logServices.list
logging.logs.list
logging.operations.get
logging.operations.list
logging.queries.create
logging.queries.delete
logging.queries.get
logging.queries.list
logging.queries.listShared
logging.queries.update
logging.sinks.get
logging.sinks.list
logging.usage.get
logging.views.get
logging.views.list
pubsub.topics.get
resourcemanager.projects.get
This role is assigned to the service account used by the application connecting to Google Cloud to append data to the events tables.
Using the GCloud CLI
```
gcloud iam roles create bigquery_appender_custom --title="BigQuery Appender Custom" --description="Assigned to service accounts used to append data to events tables." --permissions=bigquery.datasets.get,bigquery.tables.get,bigquery.tables.updateData --project=YOUR_PROJECT_ID
```
Using the GCloud IAM Web UI
Field | Value |
---|---|
Title | BigQuery Appender Custom |
Description | Assigned to service accounts used to append data to events tables. |
ID | bigquery_appender_custom |
Role launch stage | General Availability |
+ Add permissions | See below |
bigquery.datasets.get
bigquery.tables.get
bigquery.tables.updateData
We use a BigQuery 'policy tag' to label some fields in some tables in BigQuery as 'hidden', restricting access to these fields and masking their data from users without access. Policy tags exist within a group known as a 'taxonomy'.
To create the 'hidden' policy tag required by dfe-analytics:
- Enable the "BigQuery Data Policy API": search for this from the 'Enable APIs and services' screen, accessible from the 'Enabled APIs and services' screen within the 'APIs and services' section of GCP, and click 'Enable'.
- Open BigQuery, open the 'Policy tags' screen and click 'Create taxonomy'.
- Use this screen to create a policy tag named 'hidden' within a taxonomy named something like 'project-restricted-access' (replacing 'project' with something meaningful to your GCP project). Ensure the taxonomy is within the europe-west2 (London) region.
- Click the 'Manage data policies' button to open the Masking rules screen. Under 'Data policy name 1' type 'hidden' and under 'Masking rule 1' select 'Hash (SHA256)'. Click Submit.
dfe-analytics inserts events into a table in BigQuery with a pre-defined schema. Access is given using a service account that has access to append data to the given events table. The recommended setup is to have a separate dataset and service account for each application/environment combination in your project.
For example, let's say you have the applications publish and find in your project, and use development, qa, staging and production environments.
You should create a separate dataset for each combination of the above, as well
as a separate service account that has access to append data to events in only
one dataset. The following table illustrates how this might look for this
example:
Application | Environment | BigQuery Dataset | Service Account |
---|---|---|---|
publish | development | publish_events_development | [email protected] |
publish | qa | publish_events_qa | [email protected] |
publish | staging | publish_events_staging | [email protected] |
publish | production | publish_events_production | [email protected] |
find | development | find_events_development | [email protected] |
find | qa | find_events_qa | [email protected] |
find | staging | find_events_staging | [email protected] |
find | production | find_events_production | [email protected] |
This approach helps prevent sending events to the wrong dataset, and reduces the risk should a secret key for one of these accounts be leaked.
NB: It may be easier to perform these instructions with two browser tabs open, one for BigQuery and the other for IAM.
Start by creating a dataset.
- Open your project's BigQuery instance and go to the SQL Workspace section.
- Click the three dots next to the project name, then "Create dataset".
- Name it something like APPLICATIONNAME_events_ENVIRONMENT, as per the above examples (e.g. publish_events_development), and set the location to europe-west2 (London).
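If you'd rather script this, a sketch of creating the same dataset with the bq CLI (the project and dataset names are placeholders):

```sh
# Create the dataset in the London region.
bq --location=europe-west2 mk --dataset YOUR_PROJECT_ID:publish_events_development
```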
Once the dataset is ready, you need to create the events table in it:
- Select your new dataset and open a new query execution tab.
- Copy the contents of create-events-table.sql into the query editor.
- Edit your project and dataset names in the query editor.
- Run the query to create a blank events table.
- Label the hidden_DATA field with the 'hidden' policy tag to restrict access to it: Navigate to the newly created table in BigQuery using the left hand sidebar. Click 'Edit Schema'. Expand the 'hidden_DATA' field and select the checkbox next to the 'value' element within it. Click 'Add policy tag' and select the 'hidden' policy tag in the taxonomy for your project. Click Save.
BigQuery allows you to copy a table to a new dataset, so now is a good time to create all the datasets you need and copy the blank events table to each of them.
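For example, a sketch of copying the blank table to another environment's dataset with the bq CLI (dataset names are placeholders):

```sh
# Copy the blank events table into another environment's dataset.
bq cp YOUR_PROJECT_ID:publish_events_development.events YOUR_PROJECT_ID:publish_events_qa.events
```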
Create a service account that will be given permission to append data to tables in the new dataset.
- Go to IAM and Admin settings > Create service account
- Name it like "Appender NAME_OF_SERVICE ENVIRONMENT" e.g. "Appender ApplyForQTS Development".
- Add a description, like "Used for appending data from development environments."
- Copy the email address using the button next to it. You'll need this in the next step to give this account access to your dataset.
- Click the "CREATE AND CONTINUE" button.
- Click "DONE", skipping the steps to grant roles and user access to this account. Access will be given to the specific dataset in the next step.
Ensure you have the email address of the service account handy for this.
- Go to the events table created in step 2 above and click "SHARING" > "Permissions" near the top right.
- Click "ADD PRINCIPAL".
- Paste the email address of the service account you created into the "New principals" box.
- Select the "BigQuery Appender Custom" role you created previously.
- Click "SAVE" to finish.
External applications connecting to Google Cloud tend to use service account keys to access Google Cloud resources. However, service account keys are powerful credentials, and can present a security risk if they are not managed correctly. Workload Identity Federation eliminates the security risk associated with service account keys.
With Workload Identity Federation (WIF), you can use Identity and Access Management (IAM) to grant external identities IAM roles, giving them direct access to Google Cloud resources. You can also grant access through service account impersonation.
With dfe-analytics our strong preference is to use WIF where possible. Where WIF is not possible, OAuth should be considered. The use of service account API keys is discouraged.
The diagram below shows how dfe-analytics uses WIF to connect from an Azure client to BigQuery.
![[azure-gcp-wif.svg]]
The steps below outline how to set up WIF for service accounts using either gcloud shell scripts or the gcloud console.
The client process connecting to GCP should have WIF enabled. TS DevOps provide this through terraform configuration.
If the client process is enabled for WIF, then it will have the following properties per environment (namespace):
The following environment variables will be set:
AZURE_CLIENT_ID
AZURE_FEDERATED_TOKEN_FILE
Within Azure a managed identity will also exist for each namespace. The managed identity will have the text gcp-wif within its name.
Please take note of the Managed Identity Object ID for each namespace (environment). This is a UUID that will be required in later steps.
If WIF is not enabled, contact TS DevOps in the #teacher_services_infra channel to get this enabled.
For each project a workload identity pool with the name azure-cip-identity-pool should exist. If it does not exist, one can be created either with the create gcp workload identity pool gcloud script or from the IAM gcloud console, using the attributes specified in the gcloud script.
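The linked script isn't reproduced here, but a minimal sketch of creating such a pool with gcloud (the display name is an assumption; defer to the script's attributes):

```sh
# Create the workload identity pool used for Azure federation.
gcloud iam workload-identity-pools create azure-cip-identity-pool \
  --project=YOUR_PROJECT_ID \
  --location="global" \
  --display-name="Azure CIP identity pool"
```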
For each project a workload identity pool provider with the name azure-cip-oidc-provider should exist. If it does not exist, one can be created either with the create gcp workload identity pool provider gcloud script or from the IAM gcloud console, using the attributes specified in the gcloud script.
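Again as a sketch only; the issuer URI and attribute mapping below are assumptions based on a typical Azure AD OIDC setup, so defer to the linked script for the exact attributes:

```sh
# Create an OIDC provider in the pool, trusting tokens issued by the Azure AD tenant.
gcloud iam workload-identity-pools providers create-oidc azure-cip-oidc-provider \
  --project=YOUR_PROJECT_ID \
  --location="global" \
  --workload-identity-pool="azure-cip-identity-pool" \
  --issuer-uri="https://login.microsoftonline.com/YOUR_TENANT_ID/v2.0" \
  --attribute-mapping="google.subject=assertion.sub"
```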
A service account with the correct permissions on the events table should exist.
If it does not exist, follow the steps above.
The service account defined in step 4 above should be granted access using service account impersonation.
If this has not been done, access can be granted either with the update wif service account permissions gcloud script or from the IAM gcloud console, by navigating to the "GRANT ACCESS" window. Use the attributes specified in the gcloud script. Note that the subject must be set to the Managed Identity Object ID from Azure for each environment (see Step 1 above).
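A sketch of this impersonation grant from the CLI; the service account email, project number and Managed Identity Object ID are placeholders:

```sh
# Allow the Azure managed identity (identified by its Object ID) to impersonate the service account.
gcloud iam service-accounts add-iam-policy-binding \
  [email protected] \
  --role="roles/iam.workloadIdentityUser" \
  --member="principal://iam.googleapis.com/projects/YOUR_PROJECT_NUMBER/locations/global/workloadIdentityPools/azure-cip-identity-pool/subject/MANAGED_IDENTITY_OBJECT_ID"
```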
Download the JSON WIF credentials file and set the following environment variable to the content of this file:
GOOGLE_CLOUD_CREDENTIALS
Download the JSON WIF credentials file with either the create wif client credentials gcloud script or from the IAM gcloud console, by navigating to the "CONNECTED SERVICE ACCOUNTS" tab. Use the attributes specified in the gcloud script.
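A sketch of generating that credentials file from the CLI; the token file path is an assumption (it should match the AZURE_FEDERATED_TOKEN_FILE variable mentioned above), and the project number and service account email are placeholders:

```sh
# Generate a WIF credential configuration file for the application to use.
gcloud iam workload-identity-pools create-cred-config \
  projects/YOUR_PROJECT_NUMBER/locations/global/workloadIdentityPools/azure-cip-identity-pool/providers/azure-cip-oidc-provider \
  [email protected] \
  --credential-source-file=/var/run/secrets/azure/tokens/azure-identity-token \
  --output-file=credentials.json
```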