Populate raw schema in development VM #89

redshiftzero · 2016-11-30T21:41:03Z

This PR adds population of the raw schema to the Ansible playbook that provisions the development VM, so that the feature generation and machine learning code can be run more easily in the development VM, i.e. without having to run the crawler (which can take a while) or connect to the production database (A Bad Idea). The data populating the raw schema in the VM is derived from our real data (with some anonymization). I have also added the notebook where I construct this dataset, for future reference / modification.

Upon request, I have also created a merged version of the data used to populate the individual tables, in a single file roles/crawler/files/raw-data/test_data.csv, so people can play with it without needing to worry about joins.

This PR also bumps the version of Tor Browser, since the download link in our Ansible play was out of date and was 404ing.

coveralls · 2016-11-30T21:55:46Z

Coverage remained the same at 72.727% when pulling 5a73245 on populate-raw-schema-in-vm into b183c0c on master.

conorsch · 2016-11-30T23:24:24Z

roles/crawler/tasks/configure-databases.yml

+    always_run: true
+    changed_when: false
+
+  - debug: var=raw_schema_population_result.results


Do you still want this debug line in here? FYI recent versions of Ansible support the verbosity parameter on debug tasks, so the associated message will only display with e.g. -vvv if verbosity: 3. http://docs.ansible.com/debug_module.html
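
For reference, a minimal sketch of the verbosity suggestion, assuming Ansible 2.1 or later (where the debug module accepts the verbosity parameter). The variable name is the one from the diff above; the threshold of 3 is only the example value mentioned in the comment, not something this PR settled on.

```yaml
# Sketch only: keeps the debug task but hides its output by default.
# Assumes Ansible >= 2.1, where the debug module gained `verbosity`.
- debug:
    var: raw_schema_population_result.results
    verbosity: 3  # printed only when the play is run with -vvv or higher
```

With this form the task is skipped on normal runs and the registered results are printed only when the playbook is invoked with -vvv.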

Hmm okay, thanks for pointing that out. It's not necessary, so I will remove this real quick.

Snipped that line out

coveralls · 2016-11-30T23:35:56Z

Changes Unknown when pulling d8c2c2d on populate-raw-schema-in-vm into master.

redshiftzero added 3 commits November 29, 2016 16:04

Add CSV files to populate each table in the raw schema

45825e3

Add merged version of the test data for funsies

09ad450

Add notebook showing how the testing dataset was constructed

4731ab5

conorsch reviewed Nov 30, 2016

redshiftzero added 2 commits November 30, 2016 15:27

Add tasks to populate raw schema with test data

d18c326

Bump version of TB

d8c2c2d

redshiftzero force-pushed the populate-raw-schema-in-vm branch from 5a73245 to d8c2c2d Compare November 30, 2016 23:28

redshiftzero merged commit 7538b1b into master Nov 30, 2016

conorsch mentioned this pull request Nov 30, 2016

Simplifies group membership update task #90

Merged

psivesely deleted the populate-raw-schema-in-vm branch February 7, 2017 00:27
