Commit 515c5e0: Update README.md
ryw authored Sep 11, 2024 (1 parent: 0907b7c)
Showing 1 changed file with 29 additions and 21 deletions: README.md

# pg_auto_dw

<img src="https://tembo.io/_astro/graphs.CNZLRuSs_Z1YDvaO.webp" style="border-radius: 30px; width: 600px; height: auto;">

[Open-source](LICENSE) PostgreSQL Extension for Automated Data Warehouse Creation

From [@ryw](https://github.com/ryw) 4-18-24:
> This project attempts to implement an idea I can't shake - an auto-data warehouse extension that uses LLM to inspect operational Postgres schemas, and sets up automation to create a well-formed data warehouse (whether it's Data Vault, Kimball format, etc. I don't care - just something better than a dumb dev like me would build as a DW - a pile of ingested tables, and ad-hoc derivative tables). I don't know if this project will work, but kind of fun to start something without certainty of success. But I have wanted this badly for years as a dev + data engineer.
## Project Vision

To create an open-source extension that automates the data warehouse build. We aim to do this within a structured environment that incorporates best practices and harnesses the capabilities of large language model (LLM) technologies.

**Goals:** This extension will enable users to:
All these capabilities will be delivered through a [small set of intuitive functions](extension/docs/sql_functions/readme.md).
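As a rough sketch of that interface, the calls below use function names that appear in the demos and diagrams elsewhere in this README; the signatures and return shapes are illustrative, not final:

```SQL
/* Illustrative calls only: informative functions inspect sources,
   interactive functions drive the build. Signatures are not final. */
SELECT auto_dw.health();        -- overall extension status
SELECT auto_dw.source_table();  -- source tables and their readiness
SELECT auto_dw.source_column(); -- column-level detail and issues
SELECT auto_dw.go();            -- build the data warehouse
```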

## Principles

* Build in public
* Public repo
* Invite attention and scrutiny: release every week or two, with a blog post or tweet highlighting the work
* Ship product + demo video + documentation

## Data Vault

We are starting with automation to facilitate a data vault implementation for our data warehouse. This will be a rudimentary raw vault setup, but we hope it will lead to substantial downstream business models.
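For orientation, a raw vault pairs hubs (holding business keys) with satellites (holding descriptive attributes). The DDL below is a generic, hypothetical illustration of that shape, not output generated by the extension:

```SQL
/* Hypothetical raw vault sketch: a hub for the customer business key
   and a satellite for its descriptive attributes. */
CREATE TABLE hub_customer (
    hub_customer_hk CHAR(64)    PRIMARY KEY,  -- hash of the business key
    customer_bk     TEXT        NOT NULL,     -- source business key
    load_ts         TIMESTAMPTZ NOT NULL,
    record_source   TEXT        NOT NULL
);

CREATE TABLE sat_customer (
    hub_customer_hk CHAR(64)    NOT NULL REFERENCES hub_customer,
    load_ts         TIMESTAMPTZ NOT NULL,
    hash_diff       CHAR(64)    NOT NULL,     -- change-detection hash
    zip             VARCHAR(5),
    record_source   TEXT        NOT NULL,
    PRIMARY KEY (hub_customer_hk, load_ts)
);
```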

## Timeline

We're currently working on a timeline to define points of success and ensure the smooth integration of new contributors to our project. This includes creating milestones, contributor guidelines, and hosting activities such as webinars and meetups. Stay tuned!

## Installation

We are currently developing a new extension, starting with an initial set of defined [functions](extension/docs/sql_functions/readme.md) and implementing a subset of these functions in a mockup extension. This mockup version features skeletal implementations of some functions, designed just to demonstrate our envisioned capabilities as seen in the demo below. Our demo is divided into two parts: Act 1 and Act 2. If you follow along, I hope this will offer a glimpse of what to expect in the weeks ahead.

If you’re interested in exploring this preliminary version, please follow these steps:
3) Run this Codebase

## Demo: Act 1 - "1-Click Build"

> **Note:** Only use the code presented below; any deviation may cause errors. This demo is for illustrative purposes only and is currently tested on PGRX using the default PostgreSQL 13 instance.

We want to make building a data warehouse easy. If the source tables are well-structured and appropriately named, constructing a data warehouse can be achieved with a single call to the extension.

1. **Install Extension**

```SQL
/* Installing Extension - Installs and creates sample source tables. */
CREATE EXTENSION pg_auto_dw CASCADE;
```

> **Note:** Installing this extension also creates a couple of sample source tables in the `public` schema and installs the `pgcrypto` extension.

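The `pgcrypto` dependency is presumably there for hash-key generation, which Data Vault modeling relies on. A minimal sketch (illustrative only; this is not necessarily how the extension computes its keys):

```SQL
/* Derive a deterministic hash key from a business key
   using pgcrypto's digest() (illustrative). */
SELECT encode(digest('customer-1001', 'sha256'), 'hex') AS hub_customer_hk;
```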
2. **Build Data Warehouse**

```SQL
/* Build me a Data Warehouse for tables that are Ready to Deploy */
SELECT auto_dw.go();
```

> **Note:** This will return a build ID and some helpful function tips. Do not run those tips yet; they are for illustrative purposes, previewing future functionality.

3. **Data Warehouse Built**

```SQL
/* Data Warehouse Built - No More Code Required */
```
```mermaid
flowchart LR
Start(("Start")) --> ext["Install Extension\nCREATE EXTENSION pg_auto_dw"]
ext -- &#10711; --> build["Build Data Warehouse\nauto_dw.go()"]
build -- &#10711; --> DW[("DW Created")]
DW --> Done(("Done"))
style Start stroke-width:1px,fill:#FFFFFF,stroke:#000000
style ext color:none,fill:#FFFFFF,stroke:#000000
style build fill:#e3fae3,stroke:#000000
style DW fill:#FFFFFF,stroke:#000000
style Done stroke-width:4px,fill:#FFFFFF,stroke:#000000
```

## Demo: Act 2 - “Auto Data Governance”

Sometimes a little push-back when creating a data warehouse is healthy: it supports appropriate data governance. In this instance, a table was not ready to deploy to the data warehouse because one of its columns may contain sensitive data and must be handled appropriately. Auto DW’s engine recognizes that the attribute is useful for analysis but may also need to be treated as sensitive. In this script the user will:

1) **Identify a Skipped Table**

```SQL
/* Identify source tables that were skipped and not integrated into the data warehouse. */
SELECT schema, "table", status, status_response
FROM auto_dw.source_table()
WHERE status_code = 'SKIP';
```

> **Note:** Running this code shows which table was skipped, along with a high-level reason. You should see the following status_response: “Source Table was skipped as column(s) need additional context. Please run the following SQL query for more information: SELECT schema, table, column, status, status_response FROM auto_dw.source_status_detail() WHERE schema = 'public' AND table = 'customers'.”

2) **Identify the Root Cause**

```SQL
/* Identify the source table column that caused the problem, understand the issue, and potential solution. */
SELECT schema, "table", "column", status, confidence_level, status_response
FROM auto_dw.source_column()
WHERE schema = 'public' AND "table" = 'customer';
```

> **Note:** Running this code identifies the skipped column and gives the reason in status_response. You should see the following output: “Requires Attention: Column cannot be appropriately categorized as it may contain sensitive data. Specifically, if the zip is an extended zip it may be considered PII.”

3) **Decide to Institute Some Data Governance Best Practices**

```SQL
/* Altering column length restricts the acceptance of extended ZIP codes.*/
ALTER TABLE customer ALTER COLUMN zip TYPE VARCHAR(5);
```

> **Note:** Here the choice was up to the user to make a change that helps the LLM understand data sensitivity. Limiting the type to VARCHAR(5) signals that this column cannot hold extended ZIP codes, and therefore will not contain sensitive information in the future.

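With the column amended, a natural follow-up, per the process flow, is to re-run the build so the previously skipped table can deploy (illustrative):

```SQL
/* Re-run the build; the amended customer table should now
   deploy rather than be skipped (illustrative). */
SELECT auto_dw.go();
```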
```mermaid
flowchart LR
Start(("Start")) --> tbl["Identify a Skipped Table\nauto_dw.source_table()"]
tbl --> col["Identify the Root Cause\nauto_dw.source_column()"]
col --> DW[("Institute Data Governance\nBest Practices")]
DW --> Done(("Done"))
style Start stroke-width:1px,fill:#FFFFFF,stroke:#000000
style tbl color:none,fill:#edf5ff,stroke:#000000
style col fill:#edf5ff,stroke:#000000
style DW fill:#FFFFFF,stroke:#000000
style Done stroke-width:4px,fill:#FFFFFF,stroke:#000000
```

**Auto DW Process Flow:** The script highlighted in Act 2 demonstrates that there are several approaches to successfully implementing a data warehouse when using this extension. Below is a BPMN diagram that illustrates these various paths.

```mermaid
flowchart LR
subgraph functions_informative["Informative Functions"]
review --> data_gov --> more_auto{"More\nAutomations?"}
more_auto --> |no| done(("Done"))
more_auto --> |yes| start_again(("Restart"))
classDef standard fill:#FFFFFF,stroke:#000000
classDef informative fill:#edf5ff,stroke:#000000
classDef interactive fill:#e3fae3,stroke:#000000
class start,command,split,join,review standard
class to_gov,gov,more_auto,start_again standard
class health,source_tables,source_column informative
class source_include,update_context,go interactive
style done stroke-width:4px,fill:#FFFFFF,stroke:#000000
```
