Skip to content

Latest commit

 

History

History
132 lines (107 loc) · 8.52 KB

README.md

File metadata and controls

132 lines (107 loc) · 8.52 KB

sync_surveycto

Overview

Extract and load.

Initial setup

Start setting up the GitHub repository

  1. In GitHub, click "Use this template", then "Create a new repository".
  2. Enter a repository name, such as syncsurveycto-xyz, where xyz is the name of the SurveyCTO server to be synced, and click "Create repository".
  3. Clone the new repo to a directory on your machine of choice.
  4. Rename syncsurveycto-rsurveycto.Rproj to match the new repo's name.
  5. In the first line of README.md, change the two instances of "agency-fund/syncsurveycto-rsurveycto" to match the new repo's organization and name.
  6. In the repo's main directory, create a folder called secrets.
  7. Start an R session in the new repo's main directory. You should see a message that a project was loaded by renv.
  8. In R, run the following commands to upgrade the repo to the latest version of renv, restore the local project library, update the packages, and record the state of the library.
    renv::upgrade()
    renv::restore()
    renv::update()
    renv::snapshot()

Set up SurveyCTO

  1. In the secrets folder, create a file called scto_auth.txt containing the server name on the first line, username on the second, and password on the third. This username must have permission to download data and "allow server API access" must be enabled.

Set up the data warehouse

Google BigQuery

  1. If you haven't already, create a Google Cloud project named something like xyz-raw, where xyz is the name of your organization.
  2. Select the project and enable the BigQuery API.
  3. Create a service account named syncsurveycto-user.
  4. Create a JSON key for the service account, and download the JSON file to the secrets folder.
  5. For the project, give the service account (whose email address will be something like [email protected]) the role BigQuery User.
  6. Create two BigQuery datasets named surveycto and surveycto_dev in the desired region.
  7. For each of the two datasets, click Share, then Manage Permissions, then Add Principal, then add the information below (changing xyz-raw as appropriate), and click Save.
  8. Update the params/warehouse.yaml file, changing xyz-raw as appropriate:
    • auth_file: Name of the JSON file for the service account
    • project (for prod and dev environments): xyz-raw

Postgres

  1. Seriously consider not using Postgres as a data warehouse, and instead using BigQuery.
  2. That is all for now.

Finish setting up the GitHub repository

  1. In the new GitHub repo, click Settings, then "Secrets and variables", then Actions.
  2. Click "New repository secret", enter the information below, then click "Add secret".
    • Name: SCTO_AUTH
    • Secret: Content of scto_auth.txt, with no trailing line break.
  3. Once again, click "New repository secret", enter the information below, then click "Add secret".
    • Name: WH_AUTH
    • Secret: Content of the auth_file specified in params/warehouse.yaml.
  4. In the git repo on your local machine, make a git commit and push the changes to GitHub.
  5. In the GitHub repo, go to Actions and ensure that a GitHub Actions run has started and that it completes without error.
  6. Once the run completes, click on sync_surveycto to see the job details, and ensure that the final line of the "Run script" step says "No ids to sync."

Sync modes

Sync mode Airbyte equivalent Supported for forms Supported for datasets
overwrite Full Refresh Overwrite ✓*
append Full Refresh Append
incremental Incremental Append -
deduped Incremental Append Deduped ✓* -

* Recommended

Add a form or dataset to the pipeline

  1. On your local machine, ensure you are not on the main branch.
  2. Update params/surveycto.yaml by adding lines to the streams list of the dev section, as indicated below. Take care with indentation.
    - id: best_id_ever
      sync_mode: chosen_sync_mode
      # review_status: approved # optional for forms, not allowed for datasets
    
    The default value for review_status for forms is approved. To specify multiple values, use an underscore-separated string, e.g., approved_pending.
  3. In the terminal, run Rscript code/main.R and ensure that the syncs succeed.
  4. Go to the part of your warehouse specified in the dev environment of the params/warehouse.yaml file and ensure that the columns, number of rows, and content of the new table(s) look(s) right.
  5. Carefully copy the new lines from the dev section and paste them into the streams list of the prod section. Remember, this is Sparta.
  6. Make a git commit and push the changes to GitHub.
  7. On GitHub, create a sensibly named pull request and add someone as a reviewer.
  8. Ensure that the syncs initiated by the pull request and run on GitHub Actions succeed.
  9. Ensure that the table(s) in dev still look(s) right.
  10. Wait for the reviewer to approve and merge the pull request.
  11. Once the pull request is merged, ensure that the syncs run on GitHub Actions succeed.
  12. Ensure that the table(s) in prod look(s) right.

Remove a form or dataset from the pipeline

  1. On your local machine, ensure you are not on the main branch.
  2. Update params/surveycto.yaml by deleting or commenting out the relevant lines from the streams lists of the dev and the prod sections.
  3. In the terminal, run Rscript code/main.R and ensure that the syncs succeed.
  4. Make a git commit and push the changes to GitHub.
  5. On GitHub, create a sensibly named pull request and add someone as a reviewer.
  6. Ensure that the syncs initiated by the pull request and run on GitHub Actions succeed.
  7. Wait for the reviewer to approve and merge the pull request.
  8. Once the pull request is merged, ensure that the syncs run on GitHub Actions succeed.
  9. The table(s) for the form or dataset will remain in BigQuery, but will not be updated.

Update packages used by the pipeline

  1. On your local machine, ensure you are not on the main branch.
  2. In R, run renv::update() to update all packages or renv::update('agency-fund/syncsurveycto') to update syncsurveycto.
  3. In R, run renv::snapshot().
  4. If the renv.lock file has changed, continue with the steps below. If not, stop here.
  5. In the terminal, run Rscript code/main.R and ensure that the syncs succeed and that the tables in dev still look right.
  6. Make a git commit and push the changes to GitHub.
  7. On GitHub, create a sensibly named pull request and add someone as a reviewer.
  8. Ensure that the syncs initiated by the pull request and run on GitHub Actions succeed.
  9. Ensure that the tables in dev still look right.
  10. Wait for the reviewer to approve and merge the pull request.
  11. Once the pull request is merged, ensure that the syncs run on GitHub Actions succeed.
  12. Ensure that the tables in prod still look right.

Update the schedule on which the pipeline runs

  1. On your local machine, ensure you are not on the main branch.
  2. In the .github/workflows/sync_surveycto.yaml file, edit the cron item. See details here.
  3. Make a git commit and push changes to GitHub.
  4. On GitHub, create a sensibly named pull request and add someone as a reviewer.
  5. Ensure that the syncs initiated by the pull request and run on GitHub Actions succeed.
  6. Wait for the reviewer to approve and merge the pull request.
  7. Once the pull request is merged, ensure that the syncs run on GitHub Actions succeed and that they run on the intended schedule.