
Snowplow Google Cloud Storage Loader

Introduction

Cloud Storage Loader is a Dataflow job which dumps events from an input PubSub subscription into a Cloud Storage bucket.

Partitioning by schema

At Snowplow we use the self-describing JSON format to keep data definitions well defined and type-spec'd. When the incoming data (inputSubscription) consists of self-describing JSON, the loader can route each event to a directory named after its schema. To enable partitioning, set partitionedOutputDirectory to the bucket (or subdirectory) where partitioned data should be stored; data that cannot be partitioned is still written to outputDirectory. Within partitionedOutputDirectory, events are stored under date (dateFormat) and schema subdirectories, whereas unpartitioned data is stored in outputDirectory under date subdirectories only.
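As a purely illustrative sketch (the event payload and bucket paths below are invented examples, not output guaranteed by the loader), a self-describing JSON event references its schema alongside the data, which is what makes schema-based partitioning possible:

{
  "schema": "iglu:com.acme/checkout_started/jsonschema/1-0-0",
  "data": { "orderId": "1234", "total": 42.0 }
}

With partitioning enabled, such an event would land under a date- and schema-named path inside partitionedOutputDirectory (for example, something like gs://[BUCKET]/[SUBDIR]/2021/01/15/10/com.acme.checkout_started/), while data that is not partitioned stays under outputDirectory/2021/01/15/10/.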

Building

Zip archive

To build the zip archive, run:

sbt universal:packageBin
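The archive is produced by sbt-native-packager; assuming its default layout (the version number and script name below are the conventional defaults and may differ), it ends up under target/universal/ and bundles a launcher script:

unzip target/universal/snowplow-google-cloud-storage-loader-0.5.6.zip
./snowplow-google-cloud-storage-loader-0.5.6/bin/snowplow-google-cloud-storage-loader --help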

Docker image

To build a Docker image, run:

sbt docker:publishLocal

Running

Through a Docker container

You can find the image on Docker Hub.

A container can be run as follows:

docker run \
  -v $PWD/config:/snowplow/config \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \ # if running outside GCP
  snowplow/snowplow-google-cloud-storage-loader:0.5.6 \
  --runner=DataflowRunner \
  --jobName=[JOB-NAME] \
  --project=[PROJECT] \
  --streaming=true \
  --zone=[ZONE] \
  --inputSubscription=projects/[PROJECT]/subscriptions/[SUBSCRIPTION] \
  --outputDirectory=gs://[BUCKET] \
  --outputFilenamePrefix=output \ # optional
  --shardTemplate=-W-P-SSSSS-of-NNNNN \ # optional
  --outputFilenameSuffix=.txt \ # optional
  --windowDuration=5 \ # optional, in minutes
  --compression=none \ # optional, gzip, bz2 or none
  --numShards=1 \ # optional
  --dateFormat=YYYY/MM/dd/HH/ \ # optional
  --labels={\"label\": \"value\"} \ #OPTIONAL
  --partitionedOuptutDirectory=gs://[BUCKET]/[SUBDIR] # optional

To display the help message:

docker run snowplow/snowplow-google-cloud-storage-loader:0.5.6 \
  --help

To display documentation about Cloud Storage Loader-specific options:

docker run snowplow/snowplow-google-cloud-storage-loader:0.5.6 \
  --help=com.snowplowanalytics.storage.googlecloudstorage.loader.Options

Additional information

A full list of all the Beam CLI options can be found at: https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options.
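For instance, worker sizing and placement can be tuned by appending generic Dataflow options such as the following to the docker run invocation above (the values are illustrative placeholders to adapt, not recommendations from this project):

  --region=[REGION] \
  --maxNumWorkers=2 \
  --workerMachineType=n1-standard-1 \
  --diskSizeGb=30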

Testing

To run the tests:

sbt test

REPL

To experiment with the current codebase in the Scio REPL, simply run:

sbt repl/run

Find out more

Technical Docs | Setup Guide | Roadmap | Contributing

Copyright and license

Copyright 2018-2021 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
