Release 0.5.1
Major changes: performance improvements in the BigQuery dataset (including materialized views and partitioning), and a refactored IndexerPipeline that is easier to understand
BigQuery:
- Added partitioning to the raw fact tables (document, sentiment, webresource, wrsocialcount) and to some stats tables that have daily snapshots (statstoryimpact, stattopic)
- reload_metadata_template.sh: added script to populate the metadata "topic" table
- Added topic table to store blocked topics
- Materialized many former views as tables for faster querying: statdomainopinions, statstoryrank, stattopstory7d, stattoptopic7d, stattoptopic7dsentiment
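
As an illustration of how a single daily snapshot in one of the partitioned tables can be addressed, the sketch below builds a BigQuery partition decorator of the form table$YYYYMMDD. It assumes the tables are partitioned by day and addressed through decorators; the class PartitionDecoratorSketch, the method forDay, and the dataset name "mydataset" are hypothetical and are not part of this release.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only; not a class shipped in this release.
final class PartitionDecoratorSketch {
    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, e.g. 20170301.
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    // Returns "<table>$<yyyyMMdd>", the decorator form that addresses one daily partition.
    static String forDay(String table, LocalDate day) {
        if (table == null || table.isEmpty()) {
            throw new IllegalArgumentException("table must not be empty");
        }
        if (table.indexOf('$') >= 0) {
            throw new IllegalArgumentException("table already carries a decorator: " + table);
        }
        return table + "$" + day.format(DAY);
    }

    public static void main(String[] args) {
        // "mydataset" is an assumed dataset name; statstoryimpact is one of the partitioned stats tables.
        System.out.println(forDay("mydataset.statstoryimpact", LocalDate.of(2017, 3, 1)));
    }
}
```

Running main prints mydataset.statstoryimpact$20170301.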
Dataflow pipelines:
- Refactored IndexerPipeline so that the code is easier to read
- Added Reshuffle transform: breaks up fused steps, e.g. when a fused stage runs out of memory (OOM)
- Added SplitAB transform: divides a PCollection into A and B branches by a defined ratio
- Added PartitionedTableRef: helper for writing into partitioned BigQuery tables
- Added a write to Bigtable to store dead letter queue records (bad data)
- Added integration with CloudNLP to extract entities and store them in BigQuery
- Added tutorial package and OpinionAnalysisPipeline class as the basis for a future tutorial
- StatsCalcPipeline: added statements that calculate the new stats tables backing the views
- config.properties: added an override file to control the sirocco-sa config
- custom-idioms-en.csv: added override file for custom dictionaries
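
To make the idea behind the SplitAB transform concrete, here is a minimal plain-Java sketch of a ratio-based split. It is not the SplitAB implementation: it works on an in-memory List rather than a PCollection, and it assumes a per-element random draw as the splitting rule, which the release notes do not confirm. The names SplitABSketch, Branches, split, ratioA, and seed are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of a ratio-based A/B split; not the SplitAB transform itself.
final class SplitABSketch {
    // Holds the two output branches.
    static final class Branches<T> {
        final List<T> a = new ArrayList<>();
        final List<T> b = new ArrayList<>();
    }

    // Sends each element to branch A with probability ratioA, otherwise to branch B.
    // A fixed seed makes the assignment reproducible across runs.
    static <T> Branches<T> split(List<T> input, double ratioA, long seed) {
        // Written as a negated range check so that NaN is rejected as well.
        if (!(ratioA >= 0.0 && ratioA <= 1.0)) {
            throw new IllegalArgumentException("ratioA must be in [0, 1]");
        }
        Random random = new Random(seed);
        Branches<T> out = new Branches<>();
        for (T element : input) {
            // nextDouble() is in [0, 1), so ratioA = 1.0 routes everything to A and 0.0 to B.
            if (random.nextDouble() < ratioA) {
                out.a.add(element);
            } else {
                out.b.add(element);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8");
        Branches<String> result = split(docs, 0.25, 42L);
        System.out.println("A: " + result.a);
        System.out.println("B: " + result.b);
    }
}
```

A random draw gives branch sizes that only approximate the ratio on small inputs; the two branches always add up to the full input.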