update assets_modern_data_stack example readme #11122 (merged)
`examples/assets_modern_data_stack/README.md` — 128 changes: 117 additions & 11 deletions
# Dagster + Modern Data Stack starter kit

This starter kit shows how to use Dagster's [Software-Defined Assets](https://docs.dagster.io/concepts/assets/software-defined-assets) alongside Modern Data Stack tools (specifically, [Airbyte](https://github.com/airbytehq/airbyte) and [dbt](https://github.com/dbt-labs/dbt-core)).

<p align="center">
<img width="500" alt="Screen Shot 2022-11-17 at 11 50 20 PM" src="https://user-images.githubusercontent.com/4531914/202649416-b727405a-f96c-4531-95ff-29b9f9bf53d2.png">
</p>

## Prerequisites

To complete the steps in this guide, you'll need:

- A Postgres database
- An [Airbyte](https://airbyte.com/) connection that's set up from Postgres to Postgres

You can follow the [Set up data and connections](#set-up-data-and-connections) section below to manually seed the source data and set up the connection.

### Using environment variables to handle secrets

Dagster supports using environment variables to handle sensitive information. You can define configuration options that read from environment variables, which also lets you parameterize your pipeline without modifying code.

In this example, we ingest data by reading from an [Airbyte connection](https://airbytehq.github.io/understanding-airbyte/connections/) that syncs data from Postgres to Postgres. To kick off runs successfully, you'll need the following environment variables to configure the connection:
- Airbyte
- `AIRBYTE_CONNECTION_ID`
- `AIRBYTE_HOST`
- `AIRBYTE_PORT`
- Postgres
- `PG_USERNAME`
- `PG_PASSWORD`
- `PG_HOST`
- `PG_PORT`
- `PG_SOURCE_DATABASE`
- `PG_DESTINATION_DATABASE`

You can find all the configurations in [`assets_modern_data_stack/utils/constants.py`](./assets_modern_data_stack/utils/constants.py).
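
As a rough sketch, the configuration might read these variables like so. This is only an illustration of the mechanism: the names of the dictionaries and the default values below are assumptions, and the real `constants.py` may be organized differently.

```python
import os

# Hypothetical sketch of how the connection settings could be read from the
# environment; see assets_modern_data_stack/utils/constants.py for the real code.
AIRBYTE_CONNECTION_ID = os.getenv("AIRBYTE_CONNECTION_ID", "")
AIRBYTE_CONFIG = {
    "host": os.getenv("AIRBYTE_HOST", "localhost"),
    "port": os.getenv("AIRBYTE_PORT", "8000"),
}
PG_CONFIG = {
    "username": os.getenv("PG_USERNAME", "postgres"),
    "password": os.getenv("PG_PASSWORD", "password"),
    "host": os.getenv("PG_HOST", "localhost"),
    "port": int(os.getenv("PG_PORT", "5432")),
    "source_database": os.getenv("PG_SOURCE_DATABASE", "postgres"),
    "destination_database": os.getenv("PG_DESTINATION_DATABASE", "postgres_replica"),
}
```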

You can declare environment variables in various ways:
- **Local development**: [Using `.env` files to load env vars into local environments](https://docs.dagster.io/guides/dagster/using-environment-variables-and-secrets#declaring-environment-variables)
- **Dagster Cloud**: [Using the Dagster Cloud UI](https://docs.dagster.io/master/dagster-cloud/developing-testing/environment-variables-and-secrets#using-the-dagster-cloud-ui) to manage environment variables
- **Dagster Open Source**: How environment variables are set for Dagster projects deployed on your infrastructure depends on where Dagster is deployed. Read about how to declare environment variables [here](https://docs.dagster.io/master/guides/dagster/using-environment-variables-and-secrets#declaring-environment-variables).

Check out [Using environment variables and secrets guide](https://docs.dagster.io/guides/dagster/using-environment-variables-and-secrets) for more info and examples.
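
For local development, the `.env` approach boils down to something like the following minimal loader. This sketch only illustrates the mechanism; python-dotenv and Dagster's local tooling handle this for you, and the function below is not part of the example's code.

```python
import os

def load_dotenv(path=".env"):
    """Minimal illustration of .env loading: KEY=VALUE lines become env vars.
    Existing environment variables are not overwritten."""
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                # Skip blanks, comments, and malformed lines.
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip().strip('"'))
    except FileNotFoundError:
        pass  # no .env file is fine; fall back to the real environment

load_dotenv()
```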

## Getting started

### Option 1: Deploying it on Dagster Cloud

The easiest way to spin up your Dagster project is to use [Dagster Serverless](https://docs.dagster.io/dagster-cloud/deployment/serverless). It provides out-of-the-box CI/CD and native branching that make development and deployment easy.

Check out [Dagster Cloud](https://dagster.io/cloud) to get started.

### Option 2: Running it locally

Bootstrap your own Dagster project with this example:

```bash
dagster project from-example --name my-dagster-project --example assets_modern_data_stack
```

To install this example and its Python dependencies, run:

```bash
pip install -e ".[dev]"
```

Then, start the Dagit web server:

```bash
dagit
```

Open http://localhost:3000 with your browser to see the project.

If you try to kick off a run immediately, it will fail, as there is no source data to ingest/transform, nor is there an active Airbyte connection. To get everything set up properly, read on.

### Setting up services locally

#### Postgres

To keep things running on a single machine, we'll use a local Postgres instance as both the source and the destination for our data. You can imagine the "source" database as some online transactional database, and the "destination" as a data warehouse (something like Snowflake).

```bash
$ docker run --name mds-demo -p 5432:5432 -e POSTGRES_PASSWORD=password -d postgres
$ PGPASSWORD=password psql -h localhost -p 5432 -U postgres -d postgres -c "CREATE DATABASE postgres_replica;"
```

#### Airbyte

Now, you'll want to get Airbyte running locally. The full instructions can be found [here](https://docs.airbyte.com/deploying-airbyte/local-deployment), but if you just want to run some commands (in a separate terminal):

```bash
$ git clone https://github.com/airbytehq/airbyte.git
$ cd airbyte
$ docker-compose up
```

Once you've done this, you should be able to go to http://localhost:8000, and see Airbyte's UI.
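
As a quick sanity check before moving on, you can probe the UI from Python. This helper is not part of the example's code, and it assumes the default localhost:8000 address used above.

```python
import urllib.request

def airbyte_is_up(url="http://localhost:8000", timeout=3):
    """Return True if the local Airbyte UI answers with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        # Connection refused, DNS failure, timeout, etc.
        return False
```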

#### Set up data and connections

Now, you'll want to seed some data into the empty database you just created, and create an Airbyte connection between the source and destination databases.

Once the connection is created, you should see output like:

```
Created Airbyte Connection: c90cb8a5-c516-4c1a-b243-33dfe2cfb9e8
```

This connection id is specific to your local setup, so you'll need to update `constants.py` with this value. Once you've updated your `constants.py` file, you're good to go!
## Learning more

### Changing the code locally

When developing pipelines locally, be sure to click the **Reload definitions** button in the Dagster UI after you change the code. This ensures that Dagster picks up the latest changes you made.

You can reload the code using the **Deployment** page:
<details><summary>👈 Expand to view the screenshot</summary>

<p align="center">
<img height="500" src="https://raw.githubusercontent.com/dagster-io/dagster/master/docs/next/public/images/quickstarts/basic/more-reload-code.png" />
</p>

</details>

Or from the left nav or on each job page:
<details><summary>👈 Expand to view the screenshot</summary>

<p align="center">
<img height="500" src="https://raw.githubusercontent.com/dagster-io/dagster/master/docs/next/public/images/quickstarts/basic/more-reload-left-nav.png" />
</p>

</details>

### Running daemon locally

If you're running Dagster locally and trying to set up schedules, you will see a warning that your daemon isn’t running.

<details><summary>👈 Expand to learn how to set up a local daemon</summary>

<p align="center">
<img height="500" src="https://raw.githubusercontent.com/dagster-io/dagster/yuhan/11-11-quickstart_1/_add_quickstart_basic_etl_as_the_very_basic_template/docs/next/public/images/quickstarts/basic/step-3-3-daemon-warning.png?raw=true" />
</p>

If you want to enable Dagster [Schedules](https://docs.dagster.io/concepts/partitions-schedules-sensors/schedules) for your jobs, start the [Dagster Daemon](https://docs.dagster.io/deployment/dagster-daemon) process in the same folder as your `workspace.yaml` file, but in a different shell or terminal.

The `$DAGSTER_HOME` environment variable must be set to a directory for the daemon to work. Note: using directories within `/tmp` may cause issues. See [Dagster Instance default local behavior](https://docs.dagster.io/deployment/dagster-instance#default-local-behavior) for more details.
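
One way to set this up is to point `$DAGSTER_HOME` at a persistent directory in your home folder. The directory name below is just an example, not something the project requires:

```python
import os
import pathlib

# Point DAGSTER_HOME at a persistent directory outside /tmp so the daemon
# and Dagit share the same instance state across restarts.
dagster_home = pathlib.Path.home() / ".dagster_home"
dagster_home.mkdir(exist_ok=True)
os.environ["DAGSTER_HOME"] = str(dagster_home)
```

In practice you'd export this in your shell profile (e.g. `export DAGSTER_HOME=~/.dagster_home`) so that both `dagit` and `dagster-daemon` see the same value.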

In this case, go to the project root directory and run:
```bash
dagster-daemon run
```

Once your Dagster Daemon is running, the schedules that are turned on will start running.

<p align="center">
<img height="500" src="https://raw.githubusercontent.com/dagster-io/dagster/master/docs/next/public/images/quickstarts/basic/step-3-4-daemon-on.png?raw=true" />
</p>

</details>

### Adding new Python dependencies

You can specify new Python dependencies in `setup.py`.
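
Runtime dependencies go in `install_requires` and development-only tools in `extras_require`. The lists below are illustrative, not a copy of the example's actual `setup.py`; check that file for the real dependency set.

```python
# Illustrative dependency lists that setup.py would pass to setuptools.setup();
# "pandas" here stands in for whatever new dependency you are adding.
install_requires = [
    "dagster",
    "dagster-airbyte",
    "dagster-dbt",
    "pandas",  # <- a newly added runtime dependency goes here
]
extras_require = {
    "dev": ["dagit", "pytest"],
}
```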

### Testing

Tests are in the `assets_modern_data_stack_tests` directory and you can run tests using `pytest`:

```bash
pytest assets_modern_data_stack_tests
```
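
A test in that directory is just a plain `pytest`-style function. A hypothetical example follows; the helper under test is invented for illustration and is not part of the project:

```python
# Hypothetical example of a pytest-style test; the function under test is
# invented for illustration only.
def normalize_column_names(columns):
    """Lower-case column names and replace spaces with underscores."""
    return [c.strip().lower().replace(" ", "_") for c in columns]

def test_normalize_column_names():
    assert normalize_column_names(["Order ID", " User Name "]) == ["order_id", "user_name"]
```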