Release 0.5.1

@datancoffee datancoffee released this 05 Nov 21:55
· 21 commits to master since this release

Major changes: performance improvements in the BigQuery dataset (including materialized views and partitioning) and a refactored IndexerPipeline that is easier to understand

BigQuery:

  • Added partitioning to the raw fact tables (document, sentiment, webresource, wrsocialcount) and to some stats tables that hold daily snapshots (statstoryimpact, stattopic)
  • reload_metadata_template.sh: added script to populate the metadata "topic" table
  • Added topic table to store blocked topics
  • Materialized many former views to tables for faster querying: statdomainopinions, statstoryrank, stattopstory7d, stattoptopic7d, stattoptopic7dsentiment
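
As an illustration of how a daily snapshot can be addressed in a partitioned table, here is a minimal sketch that assumes the tables are day-partitioned and addressed with BigQuery's `table$YYYYMMDD` partition decorator. The release notes do not say which partitioning scheme is used, so the class name `PartitionDecorator`, the method `forDay`, and the example date are hypothetical and not taken from the project.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the project: builds a partition-decorated
// table name for one day's snapshot.
class PartitionDecorator {

    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, which matches the
    // suffix format of a daily partition decorator.
    static String forDay(String table, LocalDate day) {
        return table + "$" + day.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        // The date is an arbitrary example value.
        System.out.println(forDay("statstoryimpact", LocalDate.of(2020, 1, 31)));
    }
}
```

Writing to or querying a single decorated partition limits the work to one day's data, which is presumably where much of the gain for snapshot tables such as statstoryimpact and stattopic comes from.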

Dataflow pipelines:

  • Refactored IndexerPipeline so that code is easier to read
  • Added Reshuffle transform: breaks up fused steps, e.g. when running into OOMs
  • Added SplitAB transform: divides a PCollection into A and B branches by a defined ratio
  • Added PartitionedTableRef: Helper for writing into partitioned BigQuery tables
  • Added writes to Bigtable to store the dead letter queue (bad data)
  • Added integration with CloudNLP to get its entities and store them in BigQuery
  • Added tutorial package and OpinionAnalysisPipeline class as the basis for a future tutorial
  • StatsCalcPipeline: added calculation statements for the new stats tables that back the views
  • config.properties: added an override file to control the sirocco-sa config
  • custom-idioms-en.csv: added override file for custom dictionaries
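
The idea behind a ratio-based A/B split such as SplitAB can be sketched in plain Java without any Dataflow dependency. This is not the project's implementation: the class `RatioSplit`, its `split` method, and the random per-element assignment are assumptions made for illustration only. In a real pipeline the same decision would be made per element inside a transform.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not the project's SplitAB: assigns each element to
// branch A or B so that, on average, A receives ratioA / (ratioA + ratioB).
class RatioSplit {

    // Returns a two-element list: index 0 is branch A, index 1 is branch B.
    static <T> List<List<T>> split(List<T> items, int ratioA, int ratioB, Random rnd) {
        if (ratioA < 0 || ratioB < 0 || ratioA + ratioB == 0) {
            throw new IllegalArgumentException("ratios must be non-negative and not both zero");
        }
        List<T> a = new ArrayList<>();
        List<T> b = new ArrayList<>();
        for (T item : items) {
            // Draw a number in [0, ratioA + ratioB); values below ratioA go to A.
            if (rnd.nextInt(ratioA + ratioB) < ratioA) {
                a.add(item);
            } else {
                b.add(item);
            }
        }
        return Arrays.asList(a, b);
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            items.add(i);
        }
        // A 3:1 split sends roughly three quarters of the elements to branch A.
        List<List<Integer>> branches = split(items, 3, 1, new Random());
        System.out.println("A=" + branches.get(0).size() + " B=" + branches.get(1).size());
    }
}
```

Because the assignment is random, the branch sizes only approximate the ratio. Every element lands in exactly one branch, so the two sizes always add up to the input size.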