This starter kit shows how to build Dagster's Software-Defined Assets alongside Modern Data Stack tools (specifically, Airbyte and dbt).
To complete the steps in this guide, you'll need:
- A Postgres database
- An Airbyte connection that's set up from Postgres to Postgres
You can follow the Set up data and connections section below to manually seed the source data and set up the connection.
Dagster supports using environment variables to handle sensitive information. You can define various configuration options in terms of environment variables, which also lets you parameterize your pipeline without modifying code.
In this example, we ingest data by reading from an Airbyte connection that syncs data from Postgres to Postgres. To kick off runs successfully, you'll need the following environment variables to configure the connection:
- Airbyte
  - `AIRBYTE_CONNECTION_ID`
  - `AIRBYTE_HOST`
  - `AIRBYTE_PORT`
- Postgres
  - `PG_USERNAME`
  - `PG_PASSWORD`
  - `PG_HOST`
  - `PG_PORT`
  - `PG_SOURCE_DATABASE`
  - `PG_DESTINATION_DATABASE`
You can find all the configurations in `assets_modern_data_stack/utils/constants.py`.
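As a rough illustration, configuration like this is typically read from the environment with `os.getenv`. The sketch below mirrors the variables listed above; the actual contents of `constants.py` may differ:

```python
# A minimal sketch of env-var-driven configuration, in the spirit of
# assets_modern_data_stack/utils/constants.py (the real file may differ).
import os

AIRBYTE_CONNECTION_ID = os.getenv("AIRBYTE_CONNECTION_ID")

AIRBYTE_CONFIG = {
    "host": os.getenv("AIRBYTE_HOST", "localhost"),
    "port": os.getenv("AIRBYTE_PORT", "8000"),
}

PG_SOURCE_CONFIG = {
    "username": os.getenv("PG_USERNAME", "postgres"),
    "password": os.getenv("PG_PASSWORD", "password"),
    "host": os.getenv("PG_HOST", "localhost"),
    "port": int(os.getenv("PG_PORT", "5432")),
    "database": os.getenv("PG_SOURCE_DATABASE", "postgres"),
}

PG_DESTINATION_CONFIG = {
    **PG_SOURCE_CONFIG,
    "database": os.getenv("PG_DESTINATION_DATABASE", "postgres_replica"),
}
```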
You can declare environment variables in various ways:
- Local development: Using `.env` files to load env vars into local environments (see the sketch after this list)
- Dagster Cloud: Using the Dagster Cloud UI to manage environment variables
- Dagster Open Source: How environment variables are set for Dagster projects deployed on your infrastructure depends on where Dagster is deployed. Read about how to declare environment variables here.
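For local development, for example, you can keep the variables above in a `.env` file at the project root and load them before Dagster reads them. A minimal sketch, assuming the `python-dotenv` package is installed (newer Dagster versions can also load `.env` files automatically in local development):

```python
# Load variables from a local .env file into the process environment.
# Assumes python-dotenv is installed (pip install python-dotenv).
#
# Example .env contents (placeholder values):
#   AIRBYTE_HOST=localhost
#   AIRBYTE_PORT=8000
#   PG_USERNAME=postgres
#   PG_PASSWORD=password
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
```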
Check out the Using environment variables and secrets guide for more info and examples.
The easiest way to spin up your Dagster project is to use Dagster Serverless. It provides out-of-the-box CI/CD and native branching that make development and deployment easy.
Check out Dagster Cloud to get started.
To install this example and its Python dependencies, run:
```bash
pip install -e ".[dev]"
```
Then, start the Dagit web server:
```bash
dagit
```
Open http://localhost:3000 in your browser to see the project.
If you try to kick off a run immediately, it will fail, as there is no source data to ingest/transform, nor is there an active Airbyte connection. To get everything set up properly, read on.
To keep things running on a single machine, we'll use a local Postgres instance as both the source and the destination for our data. You can think of the "source" database as an online transactional database, and the "destination" as a data warehouse (something like Snowflake).
To get a postgres instance with the required source and destination databases running on your machine, you can run:
```bash
$ docker pull postgres
$ docker run --name mds-demo -p 5432:5432 -e POSTGRES_PASSWORD=password -d postgres
$ PGPASSWORD=password psql -h localhost -p 5432 -U postgres -d postgres -c "CREATE DATABASE postgres_replica;"
```
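If you'd like to confirm that both databases are reachable before moving on, a quick sanity check might look like the following (assuming `psycopg2-binary` is installed; it's not part of this project's dependencies):

```python
# Connect to both the source and destination databases created above and
# print a confirmation. Credentials match the docker run command above.
import psycopg2

for dbname in ("postgres", "postgres_replica"):
    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        user="postgres",
        password="password",
        dbname=dbname,
    )
    print(f"Connected to {dbname}")
    conn.close()
```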
Now, you'll want to get Airbyte running locally. The full instructions can be found here, but if you just want to run some commands (in a separate terminal):
```bash
$ git clone https://github.com/airbytehq/airbyte.git
$ cd airbyte
$ docker-compose up
```
Once you've done this, you should be able to go to http://localhost:8000 and see Airbyte's UI.
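To check from code that the server is up, you can hit Airbyte's health endpoint. A sketch using `requests`; the endpoint path is the standard Airbyte OSS one, but verify it against your Airbyte version:

```python
# Ping the local Airbyte server's health endpoint and print the response.
import requests

resp = requests.get("http://localhost:8000/api/v1/health")
resp.raise_for_status()
print(resp.json())  # e.g. {"available": true}
```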
Now, you'll want to seed some data into the empty database you just created, and create an Airbyte connection between the source and destination databases.
There's a script provided that should handle this all for you, which you can run with:
```bash
$ python -m assets_modern_data_stack.utils.setup_airbyte
```
At the end of this output, you should see something like:
```
Created Airbyte Connection: c90cb8a5-c516-4c1a-b243-33dfe2cfb9e8
```
This connection id is specific to your local setup, so you'll need to update `constants.py` with this value. Once you've updated your `constants.py` file, you're good to go!
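For context, here's a hedged sketch of how a connection id like this is typically consumed via the `dagster-airbyte` integration; the connection id and table names below are hypothetical stand-ins for whatever your connection actually syncs:

```python
# Build software-defined assets from an existing Airbyte connection.
# The connection id and destination tables here are placeholders.
from dagster_airbyte import build_airbyte_assets

airbyte_assets = build_airbyte_assets(
    connection_id="c90cb8a5-c516-4c1a-b243-33dfe2cfb9e8",  # use your own id
    destination_tables=["orders", "users"],  # hypothetical table names
)
```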
When developing pipelines locally, be sure to click the Reload definitions button in the Dagster UI after you change the code. This ensures that Dagster picks up the latest changes you made. You can reload the code from the Deployment page, from the left nav, or on each job page.
If you're running Dagster locally and trying to set up schedules, you will see a warning that your daemon isn’t running.
If you want to enable Dagster Schedules for your jobs, start the Dagster Daemon process in the same folder as your `workspace.yaml` file, but in a different shell or terminal.
The `$DAGSTER_HOME` environment variable must be set to a directory for the daemon to work. Note: using directories within /tmp may cause issues. See Dagster Instance default local behavior for more details.
In this case, go to the project root directory and run:
```bash
dagster-daemon run
```
Once your Dagster Daemon is running, the schedules that are turned on will start running.
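For reference, a schedule in Dagster is just a `ScheduleDefinition` attached to a job. Here's a minimal, self-contained sketch; the job below is a toy stand-in, not one from this project:

```python
# A toy job plus a schedule that runs it daily at 6:00 AM once the
# daemon is running and the schedule is turned on.
from dagster import ScheduleDefinition, job, op

@op
def say_hello():
    return "hello"

@job
def hello_job():
    say_hello()

daily_schedule = ScheduleDefinition(job=hello_job, cron_schedule="0 6 * * *")
```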
You can specify new Python dependencies in `setup.py`.
Tests are in the `assets_modern_data_stack_tests` directory and you can run tests using `pytest`:
```bash
pytest assets_modern_data_stack_tests
```