This is intended as a developer's guide to modifying and using dbt. It is not intended as a guide for end users of dbt (though if it is helpful, that's great!) and assumes a certain level of familiarity with Python concepts such as virtualenvs, `pip`, module/filesystem layouts, etc. It also assumes you are using macOS or Linux and are comfortable with the command line.
Before you can develop dbt effectively, you should set up the following:
We strongly recommend setting up pyenv and its pyenv-virtualenv plugin. This setup will make it much easier for you to manage multiple Python projects in the medium to long term.
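If you don't already have them, one common way to get both on macOS is via homebrew. This is a minimal sketch, assuming the standard `pyenv` and `pyenv-virtualenv` formulas and a bash profile; the pyenv README covers other install methods and shells:

```
# macOS, with homebrew (one common path; see the pyenv docs for alternatives)
brew install pyenv pyenv-virtualenv
# Add the init hooks so pyenv and pyenv-virtualenv can manage your shell's PATH
echo 'eval "$(pyenv init -)"' >> ~/.bash_profile
echo 'eval "$(pyenv virtualenv-init -)"' >> ~/.bash_profile
```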
By default, pyenv has only one Python version installed: the system Python, the one that comes with your OS. You don't want that. Instead, use `pyenv install 3.6.5` to install a more recent version. dbt supports up to Python 3.6 at the time of writing (and will soon support Python 3.7).
To get a full (very long!) list of versions available, you can do `pyenv install -l` and look for the versions defined by numbers alone - the others are variants of Python and outside the scope of this document.
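Putting that together, the setup looks roughly like this:

```
# List everything pyenv can install (long!), then install a plain CPython version
pyenv install -l
pyenv install 3.6.5
pyenv versions   # 3.6.5 should now show up alongside 'system'
```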
Docker and docker-compose are both used in testing. For macOS, the easiest thing to do is to go here and download it. You'll need to make an account. On Linux, you can use one of the packages here. We recommend installing from docker.com instead of from your package manager. On Linux you also have to install docker-compose separately; follow these instructions.
You will also need `git` in order to get dbt and contribute code. On macOS, the best way to get that is to just install Xcode.
You will need a GitHub account fully configured with SSH to contribute to dbt. GitHub has an excellent guide on how to set up SSH -- we strongly recommend you follow their guide if you are unfamiliar with SSH.
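As a quick sanity check once you've followed their guide, something like the following should work. The key type and email below are just placeholders; use whatever GitHub's guide recommends:

```
# Generate a key if you don't have one yet, then add the public key to your GitHub account
ssh-keygen -t rsa -b 4096 -C "[email protected]"
# Test that GitHub accepts the key
ssh -T [email protected]
```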
Now clone dbt to wherever you'd like. For example:
```
mkdir -p ~/git/
cd ~/git
git clone [email protected]:fishtown-analytics/dbt.git
```
But it really does not matter where you put it as long as you remember it.
Set up a fresh virtualenv with pyenv-virtualenv for dbt:
```
pyenv virtualenv 3.6.5 dbt36
cd ~/git/dbt
pyenv local dbt36
```
This makes a new virtualenv based on Python 3.6.5 named `dbt36`, and tells pyenv that when you're in the `dbt` directory it should automatically use that virtualenv.
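You can sanity-check that this worked (the exact output will vary, but both commands should point at the new virtualenv):

```
cd ~/git/dbt
pyenv version      # should report dbt36 (set by the .python-version file pyenv local wrote)
python --version   # should report Python 3.6.5
```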
For testing, and later in the examples in this document, you may want to have `psql` available so you can poke around in the database and see what happened. We recommend that you use homebrew for that on macOS, and your package manager on Linux. You can install any version of the postgres client that you'd like. So on macOS, with homebrew set up:
```
brew install postgresql
```
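On Linux the package name depends on your distribution; on Debian/Ubuntu, for example, something like this gets you the client without a full server:

```
# Debian/Ubuntu (package names differ on other distros)
sudo apt-get install postgresql-client
```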
Getting the dbt integration tests set up in your local environment will be very helpful as you start to make changes to your local version of dbt. The section that follows outlines some helpful tips for setting up the test environment.
A short list of tools used in dbt testing that will be helpful to your understanding:
- virtualenv to manage dependencies and stuff
- tox to manage virtualenvs across python versions
- nosetests to discover/run tests
- make - but don't worry too much, nobody really understands how make works and our Makefile is super simple
- pep8 for code linting
- CircleCI and Appveyor for continuous integration
If you're unfamiliar with any or all of these, that's fine! You really do not have to have a deep understanding of any of these to get by.
Our test environment goes like this:
- CircleCI and Appveyor run `tox`
- `make test` runs `docker-compose`
- `docker-compose` runs `tox`
- `tox` sets up virtualenvs for each distinct set of tests and runs `nosetests`
- `nosetests` finds all the appropriate tests and runs them
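If you're curious, you can also poke at the pieces directly; the tox environment names are whatever `tox.ini` defines, so check there first:

```
# Run the full suite locally (the Makefile wires this through docker-compose):
make test
# Or list the tox environments defined in tox.ini and run one with `tox -e <name>`
tox -l
```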
The basics should work with essentially no further setup. In the terminal, `cd` to the directory where you cloned dbt. So, for example, if you cloned dbt to `~/git/dbt`:

```
cd ~/git/dbt
```
Then you'll want to make a test.env file. Fortunately, there's a sample which is fine for our purposes:
```
cp test.env.sample test.env
```
If you want to test snowflake/bigquery/redshift locally you'll need to get credentials and add them to this file. But, to start, you can just focus on postgres tests. They have the best coverage, are the fastest, and are the easiest to set up.
To run the unit tests, use `make test-unit` - it will run the unit tests on Python 2.7 and 3.6, as well as the pep8 linter.
To run the postgres+python 3.6 integration tests, you'll have to do one extra step of setting up the database:
```
docker-compose up -d database
PGHOST=localhost PGUSER=root PGPASSWORD=password PGDATABASE=postgres bash test/setup_db.sh
```
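If the setup script can't connect, it's worth checking that the database container actually came up:

```
# The database service should show as 'Up'
docker-compose ps
```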
And then to actually run them, you can do `make test-quick`.
If you want to see what exactly is getting run by these commands, look at the `Makefile`. Note that the commands start with an `@`, which you can ignore - it's just makefile magic. If you want to see what the involved `tox` commands are doing, look at the corresponding `tox.ini` section - hopefully it's pretty self-explanatory.
When a contributor to dbt pushes code, GitHub will trigger a series of CI builds on CircleCI and Appveyor (Windows) to test all of dbt's code. The CI builds trigger all the integration tests, not just postgres+python3.6.
The Snowflake tests take a very long time to run (about an hour), so don't just sit around waiting, it'll be a while!
If you open a PR as a non-contributor, these tests won't run automatically. Someone from the dbt team will reach out to you and get them running after reviewing your code.
Sometimes you're going to have to pretend to be an end user to reproduce bugs and the like, which means manually setting up some of the things the test harness normally takes care of for you.
First, from the `dbt` directory, install dbt in 'editable' mode. There are a couple of ways to do it, but I'm in the habit of `pip install -e .`, which tells pip to install the package in the current directory in "editable" mode. What's cool about this mode is that any changes you make to the current dbt directory will be reflected immediately in your next `dbt` run.
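For example (the exact version output will depend on your checkout):

```
cd ~/git/dbt
pip install -e .
# The dbt entry point on your PATH should now resolve to this checkout
which dbt
dbt --version
```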
Now you'll also need a 'dbt profile' so dbt can tell how to connect to your database. By default, this file belongs at `~/.dbt/profiles.yml`, so `mkdir ~/.dbt` and then open your favorite text editor and write out something like this to `~/.dbt/profiles.yml`:
```
config:
  send_anonymous_usage_stats: False
  use_colors: True

talk:
  outputs:
    default:
      type: postgres
      threads: 4
      host: localhost
      port: 5432
      user: root
      pass: password
      dbname: postgres
      schema: dbt_postgres
  target: default
```
There's a sample you can look at in the `dbt` folder (`sample.profiles.yml`) but it's got a lot of extra stuff and, as a developer, you really probably only want to test against your local postgres container. The basic idea is that there are multiple 'profiles' (`talk`, in this case), within each profile there are one or more 'targets' (`default`, in this case), and each profile has a default target. You can specify which profile you want to use with the `--profile` flag, and which target with the `--target` flag. If you want to be really snazzy, dbt project files actually specify their profile, and if you match up your dbt project's `profile` key with your `profiles.yml` profile names you don't have to use `--profile` (and if you like your profile's default target, no need for `--target` either).
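So, assuming the profile above, an explicit invocation would look something like this (both flags are optional when the defaults line up as described):

```
# Explicitly pick the 'talk' profile and its 'default' target
dbt run --profile talk --target default
```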
There is a very simple project that is a nice example of dbt's capabilities; you can get it from GitHub:

```
cd ~/src/fishtown
git clone [email protected]:fishtown-analytics/talk.git
cd talk
git checkout use-postgres
```
The `use-postgres` branch configures the project to use your local postgres (instead of the default, Snowflake). You should poke around in this project a bit, particularly the `models` directory.
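A quick way to get oriented, assuming the project layout described in the rest of this walkthrough (the `profile` key in `dbt_project.yml` is what needs to match a profile name in `~/.dbt/profiles.yml`):

```
cd ~/src/fishtown/talk
ls models/            # the models dbt will build
ls data/              # the seed data
cat dbt_project.yml   # project config, including the 'profile' key
```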
Before doing anything, let's check the database out:
```
> PGHOST=localhost PGUSER=root PGPASSWORD=password PGDATABASE=postgres psql
psql (10.4)
Type "help" for help.

postgres=# \dn
  List of schemas
  Name  |  Owner
--------+----------
 public | postgres
(1 row)

postgres=# \q
```
`\dn` lists schemas in postgres. You can see that we just have the default "public" schema, so we haven't done anything yet.
If you compile your models with `dbt compile` you should see something like this:
```
> dbt compile
Found 2 models, 0 tests, 0 archives, 0 analyses, 59 macros, 1 operations, 1 seed files
09:49:57 | Concurrency: 2 threads (target='default')
09:49:57 |
09:49:57 | Done.
```
So what does that mean? Well:
- `2 models` refers to the contents of the `models` directory
- `59 macros` are the builtin global macros defined by dbt itself
- `1 operations` is the catalog generation operation that runs by default
- `1 seed files` refers to the seed data in `data/moby_dick.csv`
It will create two new folders: one named `dbt_modules`, which is empty for this case, and one named `target`, which has a few things in it:
- A folder named `compiled`, created by dbt looking at your models and your database schema and filling in references (so `models/moby_dick_base.sql` becomes `target/compiled/talk/moby_dick_base.sql` by replacing the `from {{ ref('moby_dick') }}` with `from "dbt_postgres".moby_dick`)
- A file named `graph.gpickle`, which is your project's dependency/reference graph as understood by the `networkx` library.
- A file named `catalog.json`, which is the data dbt has collected about your project (macros used, models/seeds used, and parent/child reference maps)
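You can inspect these artifacts directly; for example (paths as described above; `catalog.json` is plain JSON, so any JSON tool works):

```
ls target/
# The compiled model, with the ref() call resolved to a real relation
cat target/compiled/talk/moby_dick_base.sql
# Pretty-print the collected project data
python -m json.tool target/catalog.json | head
```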
Next, load the seed file into the database with `dbt seed`; it'll look like this:
```
> dbt seed
Found 2 models, 0 tests, 0 archives, 0 analyses, 59 macros, 1 operations, 1 seed files
10:40:46 | Concurrency: 2 threads (target='default')
10:40:46 |
10:40:46 | 1 of 1 START seed file dbt_postgres.moby_dick........................ [RUN]
10:40:47 | 1 of 1 OK loaded seed file dbt_postgres.moby_dick.................... [INSERT 17774 in 0.44s]
10:40:47 |
10:40:47 | Finished running 1 seeds in 0.65s.
Completed successfully
```
If you go into postgres now, you can see that you have a new schema ('dbt_postgres'), a new table in that schema ('moby_dick'), and a bunch of stuff in that table:
```
> PGHOST=localhost PGUSER=root PGPASSWORD=password PGDATABASE=postgres psql
psql (10.4)
Type "help" for help.

postgres=# \dn
    List of schemas
     Name     |  Owner
--------------+----------
 dbt_postgres | root
 public       | postgres
(2 rows)

postgres=# \dt dbt_postgres.*
            List of relations
    Schema    |   Name    | Type  | Owner
--------------+-----------+-------+-------
 dbt_postgres | moby_dick | table | root
(1 row)

postgres=# select count(*) from dbt_postgres.moby_dick;
 count
-------
 17774
(1 row)

postgres=# \q
```
If you run `dbt run` now, you'll see something like this:
```
> dbt run
Found 2 models, 0 tests, 0 archives, 0 analyses, 59 macros, 1 operations, 1 seed files
10:19:41 | Concurrency: 2 threads (target='default')
10:19:41 |
10:19:41 | 1 of 2 START view model dbt_postgres.moby_dick_base.................. [RUN]
10:19:41 | 1 of 2 OK created view model dbt_postgres.moby_dick_base............. [CREATE VIEW in 0.05s]
10:19:41 | 2 of 2 START table model dbt_postgres.word_count..................... [RUN]
10:19:42 | 2 of 2 OK created table model dbt_postgres.word_count................ [SELECT 27390 in 0.19s]
10:19:42 |
10:19:42 | Finished running 1 view models, 1 table models in 0.53s.
Completed successfully
Done. PASS=2 ERROR=0 SKIP=0 TOTAL=2
```
So, some of the same information, and then you can see that dbt created a view (`moby_dick_base`) and a table (`word_count`). If you go into postgres, you'll be able to see that!
If you want to inspect the result, you can do so via psql:
```
> PGHOST=localhost PGUSER=root PGPASSWORD=password PGDATABASE=postgres psql
psql (10.4)
Type "help" for help.

postgres=# \dt dbt_postgres.*
            List of relations
    Schema    |    Name    | Type  | Owner
--------------+------------+-------+-------
 dbt_postgres | moby_dick  | table | root
 dbt_postgres | word_count | table | root
(2 rows)

postgres=# select * from dbt_postgres.word_count order by ct desc limit 10;
 word |  ct
------+-------
 the  | 13394
      | 12077
 of   |  6368
 and  |  5846
 to   |  4382
 a    |  4377
 in   |  3767
 that |  2753
 his  |  2406
 I    |  1826
(10 rows)
```
It's pretty much what you'd expect - the most common words are "the", "of", "and", etc. (The empty string probably should not be there, but this is just a toy example!)
So what happened here? First, `dbt seed` loaded the data in the csv file into postgres. Then `dbt compile` built out a sort of plan for how everything is linked together by looking up references and macros and the current state of the database. And finally, `dbt run` ran the compiled SQL to generate the word_count table.
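In other words, the whole loop from an empty schema to the word_count table is just:

```
dbt seed      # load data/moby_dick.csv into the dbt_postgres schema
dbt compile   # resolve refs and macros into target/compiled/
dbt run       # execute the compiled SQL to build moby_dick_base and word_count
```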