Skip to content

Commit

Permalink
Add bigquery example (flyteorg#521)
Browse files Browse the repository at this point in the history
* Add bigquery example

Signed-off-by: Kevin Su <[email protected]>

* Add makefile

Signed-off-by: Kevin Su <[email protected]>

* Updated example

Signed-off-by: Kevin Su <[email protected]>

* Updated comment

Signed-off-by: Kevin Su <[email protected]>

* Updated requirement

Signed-off-by: Kevin Su <[email protected]>

* Updated dependency

Signed-off-by: Kevin Su <[email protected]>

* use annotated

Signed-off-by: Kevin Su <[email protected]>
  • Loading branch information
pingsutw authored Apr 11, 2022
1 parent d0a6ede commit a2b462b
Show file tree
Hide file tree
Showing 10 changed files with 354 additions and 2 deletions.
5 changes: 3 additions & 2 deletions cookbook/docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -129,8 +129,9 @@ class CustomSorter(FileNameSortKey):
## GCP
# TODO
## External Services
"hive.py",
"snowflake.py",
"hive.py"
"snowflake.py"
"bigquery.py"
# Extending Flyte
"backend_plugins.py", # NOTE: for some reason this needs to be listed first here to show up last on the TOC
"run_custom_types.py",
Expand Down
2 changes: 2 additions & 0 deletions cookbook/integrations/gcp/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
include ../common/common.mk
include ../common/parent.mk
18 changes: 18 additions & 0 deletions cookbook/integrations/gcp/bigquery/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
PREFIX=bigquery
include ../../../common/common.mk
include ../../../common/leaf.mk

.PHONY: docker_build
docker_build: ;

.PHONY: docker_push
docker_push: ;

.PHONY: register
register: ;

.PHONY: serialize
serialize: ;

.PHONY: docker_push
docker_push: ;
23 changes: 23 additions & 0 deletions cookbook/integrations/gcp/bigquery/README.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
BigQuery
====

Flyte backend can be connected with BigQuery service. Once enabled it can allow you to query a BigQuery table.
This section will provide how to use the BigQuery Plugin using flytekit python.

Installation
------------

To use the flytekit bigquery plugin simply run the following:

.. prompt:: bash

pip install flytekitplugins-bigquery

No Need of a dockerfile
------------------------
This plugin is purely a spec. Since SQL is completely portable there is no need to build a Docker container.


Configuring the backend to get bigquery working
---------------------------------------------
- BigQuery plugins are `enabled in flytepropeller's config <https://docs.flyte.org/en/latest/deployment/plugin_setup/gcp/bigquery.html#deployment-plugin-setup-gcp-bigquery>`_
Empty file.
69 changes: 69 additions & 0 deletions cookbook/integrations/gcp/bigquery/bigquery.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
"""
BigQuery Query
############
This example shows how to use a Flyte BigQueryTask to execute a query.
"""
try:
from typing import Annotated
except ImportError:
from typing_extensions import Annotated

import pandas as pd
from flytekit import StructuredDataset, kwtypes, task, workflow
from flytekitplugins.bigquery import BigQueryConfig, BigQueryTask

# %%
# This is the world's simplest query. Note that in order for registration to work properly, you'll need to give your
# BigQuery task a name that's unique across your project/domain for your Flyte installation.
bigquery_task_no_io = BigQueryTask(
name="sql.bigquery.no_io",
inputs={},
query_template="SELECT 1",
task_config=BigQueryConfig(ProjectID="flyte", Location="us-west1-b"),
)


@workflow
def no_io_wf():
return bigquery_task_no_io()


# %%
# Of course, in real world applications we are usually more interested in using BigQuery to query a dataset.
# In this case we use crypto_dogecoin data which is public dataset in BigQuery.
# `here <https://console.cloud.google.com/bigquery?project=bigquery-public-data&page=table&d=crypto_dogecoin&p=bigquery-public-data&t=transactions>`__
#
# Let's look out how we can parameterize our query to filter results for a specific transaction version, provided as a user input
# specifying a version.

DogeCoinDataset = Annotated[
StructuredDataset, kwtypes(hash=str, size=int, block_number=int)
]

bigquery_task_templatized_query = BigQueryTask(
name="sql.bigquery.w_io",
# Define inputs as well as their types that can be used to customize the query.
inputs=kwtypes(version=int),
output_structured_dataset_type=DogeCoinDataset,
task_config=BigQueryConfig(ProjectID="flyte", Location="us-west1-b"),
query_template="SELECT * FROM `bigquery-public-data.crypto_dogecoin.transactions` WHERE @version = 1 LIMIT 10;",
)


# %%
# StructuredDataset transformer can convert query result to pandas dataframe here.
# We can also change "pandas.dataframe" to "pyarrow.Table", and convert result to Arrow table.
@task
def convert_bq_table_to_pandas_dataframe(sd: DogeCoinDataset) -> pd.DataFrame:
return sd.open(pd.DataFrame).all()


@workflow
def full_bigquery_wf(version: int) -> pd.DataFrame:
sd = bigquery_task_templatized_query(version=version)
return convert_bq_table_to_pandas_dataframe(sd=sd)


# %%
# Check query result on bigquery console: ``https://console.cloud.google.com/bigquery``
2 changes: 2 additions & 0 deletions cookbook/integrations/gcp/bigquery/requirements.in
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
-r ../../../common/requirements-common.in
flytekitplugins-bigquery==0.30.1
217 changes: 217 additions & 0 deletions cookbook/integrations/gcp/bigquery/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,217 @@
#
# This file is autogenerated by pip-compile with python 3.9
# To update, run:
#
# pip-compile requirements.in
#
arrow==1.2.1
# via jinja2-time
binaryornot==0.4.4
# via cookiecutter
cachetools==5.0.0
# via google-auth
certifi==2021.10.8
# via requests
chardet==4.0.0
# via binaryornot
charset-normalizer==2.0.6
# via requests
checksumdir==1.2.0
# via flytekit
click==7.1.2
# via
# cookiecutter
# flytekit
cloudpickle==2.0.0
# via flytekit
cookiecutter==1.7.3
# via flytekit
croniter==1.0.15
# via flytekit
cycler==0.10.0
# via matplotlib
dataclasses-json==0.5.6
# via flytekit
decorator==5.1.0
# via retry
deprecated==1.2.13
# via flytekit
diskcache==5.2.1
# via flytekit
docker-image-py==0.1.12
# via flytekit
docstring-parser==0.11
# via flytekit
flyteidl==0.22.0
# via flytekit
flytekit==0.30.1
# via
# -r ../../../common/requirements-common.in
# flytekitplugins-bigquery
flytekitplugins-bigquery==0.30.1
# via -r requirements.in
google-api-core[grpc]==2.5.0
# via
# google-cloud-bigquery
# google-cloud-core
google-auth==2.6.0
# via
# google-api-core
# google-cloud-core
google-cloud-bigquery==2.32.0
# via flytekitplugins-bigquery
google-cloud-core==2.2.2
# via google-cloud-bigquery
google-crc32c==1.3.0
# via google-resumable-media
google-resumable-media==2.2.1
# via google-cloud-bigquery
googleapis-common-protos==1.54.0
# via
# google-api-core
# grpcio-status
grpcio==1.43.0
# via
# flytekit
# google-api-core
# google-cloud-bigquery
# grpcio-status
grpcio-status==1.43.0
# via google-api-core
idna==3.2
# via requests
importlib-metadata==4.8.1
# via keyring
jinja2==3.0.3
# via
# cookiecutter
# jinja2-time
jinja2-time==0.2.0
# via cookiecutter
keyring==23.2.1
# via flytekit
kiwisolver==1.3.2
# via matplotlib
markupsafe==2.0.1
# via jinja2
marshmallow==3.13.0
# via
# dataclasses-json
# marshmallow-enum
# marshmallow-jsonschema
marshmallow-enum==1.5.1
# via dataclasses-json
marshmallow-jsonschema==0.12.0
# via flytekit
matplotlib==3.4.3
# via -r ../../../common/requirements-common.in
mypy-extensions==0.4.3
# via typing-inspect
natsort==7.1.1
# via flytekit
numpy==1.21.2
# via
# matplotlib
# pandas
# pyarrow
packaging==21.3
# via google-cloud-bigquery
pandas==1.3.3
# via flytekit
pillow==8.3.2
# via matplotlib
poyo==0.5.0
# via cookiecutter
proto-plus==1.20.0
# via google-cloud-bigquery
protobuf==3.19.4
# via
# flyteidl
# flytekit
# google-api-core
# google-cloud-bigquery
# googleapis-common-protos
# grpcio-status
# proto-plus
py==1.10.0
# via retry
pyarrow==6.0.1
# via flytekit
pyasn1==0.4.8
# via
# pyasn1-modules
# rsa
pyasn1-modules==0.2.8
# via google-auth
pyparsing==2.4.7
# via
# matplotlib
# packaging
python-dateutil==2.8.1
# via
# arrow
# croniter
# flytekit
# google-cloud-bigquery
# matplotlib
# pandas
python-json-logger==2.0.2
# via flytekit
python-slugify==5.0.2
# via cookiecutter
pytimeparse==1.1.8
# via flytekit
pytz==2018.4
# via
# flytekit
# pandas
regex==2021.10.8
# via docker-image-py
requests==2.26.0
# via
# cookiecutter
# flytekit
# google-api-core
# google-cloud-bigquery
# responses
responses==0.14.0
# via flytekit
retry==0.9.2
# via flytekit
rsa==4.8
# via google-auth
six==1.16.0
# via
# cookiecutter
# cycler
# google-auth
# grpcio
# python-dateutil
# responses
sortedcontainers==2.4.0
# via flytekit
statsd==3.3.0
# via flytekit
text-unidecode==1.3
# via python-slugify
typing-extensions==3.10.0.2
# via
# flytekit
# typing-inspect
typing-inspect==0.7.1
# via dataclasses-json
urllib3==1.26.7
# via
# flytekit
# requests
# responses
wheel==0.37.0
# via
# -r ../../../common/requirements-common.in
# flytekit
wrapt==1.13.1
# via
# deprecated
# flytekit
zipp==3.6.0
# via importlib-metadata
4 changes: 4 additions & 0 deletions cookbook/integrations/gcp/bigquery/sandbox.config
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
[sdk]
workflow_packages=bigquery
python_venv=flytekit_venv

16 changes: 16 additions & 0 deletions cookbook/integrations/gcp/in_container.mk
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
SERIALIZED_PB_OUTPUT_DIR := /tmp/output

.PHONY: clean
clean:
rm -rf $(SERIALIZED_PB_OUTPUT_DIR)/*

$(SERIALIZED_PB_OUTPUT_DIR): clean
mkdir -p $(SERIALIZED_PB_OUTPUT_DIR)

.PHONY: serialize
serialize: $(SERIALIZED_PB_OUTPUT_DIR)
pyflyte --config /root/sandbox.config serialize workflows -f $(SERIALIZED_PB_OUTPUT_DIR)

.PHONY: fast_serialize
fast_serialize: $(SERIALIZED_PB_OUTPUT_DIR)
pyflyte --config /root/sandbox.config serialize fast workflows -f $(SERIALIZED_PB_OUTPUT_DIR)

0 comments on commit a2b462b

Please sign in to comment.