Skip to content

Commit

Permalink
✨ add tutorial for sourcing data from sql
Browse files Browse the repository at this point in the history
🚨 remove trailing whitespace

✏️ fix spelling and capitalization

Co-authored-by: Giovanni Santia <[email protected]>

✨ add note about considering not querying sql directly

thanks @nicholasvk for the suggestion

:rotating_light: remove more trailing whitespace
  • Loading branch information
chris-s-friedman committed Dec 8, 2021
1 parent 26bee0c commit 75afaa0
Show file tree
Hide file tree
Showing 2 changed files with 74 additions and 1 deletion.
2 changes: 1 addition & 1 deletion docs/source/tutorial/sourcing_data/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,4 +6,4 @@ Sourcing Data
:maxdepth: 2

study_creator.rst

query_sql
73 changes: 73 additions & 0 deletions docs/source/tutorial/sourcing_data/query_sql.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
======================================================
Source data stored in a PostgreSQL Database
======================================================

Scenario
========

You would like to extract data directly from a PostgreSQL Database.

.. note::
Before deciding that you'd like to query a SQL database directly, instead,
you may want to consider separately querying your database, saving the
result of that query to a file, and then using that file as source for an
extract config.

By querying a database directly

* It's no longer easy to audit the data used as the source for an
ingest.
* It's much harder, if not impossible, for non-technical users to know
what data is used for an ingest
* Using a database is inherently volatile e.g. rows may be added/
deleted/ updated.

Because of the above points, it's important to consider if your needs
wouldn't be better served by using a static result of your query as
source data for an ingest instead of directly querying the SQL database.


.sql File Solution
======================

A convenient way to load data from a postgres database is by saving your SQL
query in a file, loading that file as source data, then connecting to the
database and running your query inside of source_data_read_function.

``source_data_read_func`` will be used to parse the data from the
``source_data_url``.

As an example, say you save your query in the file at
``my-ingest-package/data/my_query.sql``:

.. code-block:: sql
SELECT *
FROM participant p
WHERE p.study_id = 'SD_ME0WME0W'
Your extract config to pull data from your database could then look like:

.. code-block:: Python
import pandas as pd
import psycopg2
source_data_url = "file://my-ingest-package/data/my_query.sql
db_url = "postgresql://username:[email protected]:5432/postgres"
source_data_read_params = {"connection_url": db_url}
def source_data_read_func(filepath, connection_url, **kwargs):
with open(filepath, "r") as f:
query = f.read().replace("\n", " ")
f.close
conn = psycopg2.connect(connection_url)
df = pd.read_sql(query, conn)
return df
The resulting dataframe would then be the resulting data that comes from your
query. You can then continue on writing your extract config as you normally
would.

0 comments on commit 75afaa0

Please sign in to comment.