Release 0.5.1 · GoogleCloudPlatform/dataflow-opinion-analysis

Major changes: performance improvements in the BigQuery dataset (including materialized views and partitioning), and a refactored IndexerPipeline that is easier to understand

BigQuery:

Added partitioning to the raw fact tables (document, sentiment, webresource, wrsocialcount) and to some stats tables that hold daily snapshots (statstoryimpact, stattopic)
reload_metadata_template.sh: added script to populate the metadata "topic" table
Added topic table to store blocked topics
Materialized many former views as tables for faster querying: statdomainopinions, statstoryrank, stattopstory7d, stattoptopic7d, stattoptopic7dsentiment
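
For the partitioned tables listed above, a single day's data can be addressed with BigQuery's partition decorator, which has the form table$YYYYMMDD. The following is a minimal Java sketch, assuming the tables use day-level partitions. PartitionDecorator and forDay are hypothetical names chosen for illustration and are not part of this release.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper (not from the repository): builds day-partition table names. */
final class PartitionDecorator {
    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, e.g. 20180305.
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    private PartitionDecorator() {}

    /** Appends the "$YYYYMMDD" partition decorator to a table name. */
    static String forDay(String table, LocalDate day) {
        if (table == null || table.isEmpty()) {
            throw new IllegalArgumentException("table name must not be empty");
        }
        return table + "$" + DAY.format(day);
    }
}
```

Calling PartitionDecorator.forDay("statstoryimpact", LocalDate.of(2018, 3, 5)) yields the name of the partition holding that day's snapshot.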

Dataflow pipelines:

Refactored IndexerPipeline so that code is easier to read
Added Reshuffle transform: breaks up fused steps, e.g. when running into out-of-memory errors (OOMs)
Added SplitAB transform: divides a PCollection into A and B branches by a defined ratio
Added PartitionedTableRef: helper for writing into partitioned BigQuery tables
Added a write to Bigtable to store the dead letter queue (bad data)
Added integration with CloudNLP to get its entities and store them in BigQuery
Added tutorial package and OpinionAnalysisPipeline class as the basis for a future tutorial
StatsCalcPipeline: added statements that calculate the new stats tables backing the views
config.properties: added an override file to control the sirocco-sa config
custom-idioms-en.csv: added override file for custom dictionaries
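
The idea behind splitting a collection into A and B branches by a ratio, as SplitAB does, can be sketched in plain Java outside of Dataflow. These notes do not say how SplitAB assigns elements, so the deterministic hash bucketing below is an assumption, and RatioSplit, Branches, toBranchA and split are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not the SplitAB source): routes items to A or B by ratio. */
final class RatioSplit {
    /** Simple holder for the two output branches. */
    static final class Branches {
        final List<String> a = new ArrayList<>();
        final List<String> b = new ArrayList<>();
    }

    private RatioSplit() {}

    /** Returns true when the key falls into one of the first ratioA buckets. */
    static boolean toBranchA(String key, int ratioA, int ratioB) {
        if (ratioA < 0 || ratioB < 0 || ratioA + ratioB == 0) {
            throw new IllegalArgumentException("ratios must be non-negative and not both zero");
        }
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(key.hashCode(), ratioA + ratioB);
        return bucket < ratioA;
    }

    /** Splits items into two branches, preserving input order within each branch. */
    static Branches split(List<String> items, int ratioA, int ratioB) {
        Branches out = new Branches();
        for (String item : items) {
            if (toBranchA(item, ratioA, ratioB)) {
                out.a.add(item);
            } else {
                out.b.add(item);
            }
        }
        return out;
    }
}
```

Hashing the element makes the assignment repeatable across runs. A random draw per element would also honor the ratio on average, but would route the same element differently from run to run.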
