Bootstrap a new dbt project (with Apache Spark) with enterprise grade development configuration.
-
Create Project directory and clone the repo Pre-requisites
- You must have python >= 3.8, say under C:\tools\python38
- You must have installed and configured Apache Spark 3.3.x. See gist for instruction >
md C:\Projects cd C:\Projects git clone https://github.com/skadyan/dbt_201.git cd dbt_201
-
Create Virtual Environment and setup dependencies. Make sure you have run
pre-commit install
C:\tools\python38\python -m venv .venv .venv\Scripts\activate python -m pip install pip --upgrade python -m pip install -r requirements-dev.txt pre-commit install
If you see SSL certificated expired or related issues, add --trusted-host option to bypass the SSL verification
Collecting sasl==0.3.1 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1131)'))': /pythonlibs/archived/sasl-0.3.1-cp38-cp38-win_amd64.whl ERROR: Could not install packages due to an OSError: HTTPSConnectionPool(host='download.lfd.uci.edu', port=443): Max retries exceeded with url: /pythonlibs/archived/sasl-0.3.1-cp38-cp38-win_amd64.whl (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1131)')))
python -m pip install -r requirements-dev.txt --trusted-host download.lfd.uci.edu
-
Create dbt profile as following C:\Users\%USERNAME%\.dbt\profiles.yml
dbt_201: outputs: dev: host: localhost method: thrift port: 10000 schema: dbt_201_elt threads: 4 type: spark target: dev
-
Install dbt packages
dbt deps
-
Test Profile and dbt connectivity
dbt debug
It should produce something similar as below.
(.venv) C:\Projects\dbt_201>dbt debug 19:03:20 Running with dbt=1.3.1 dbt version: 1.3.1 python version: 3.8.10 python path: C:\Projects\dbt_201\.venv\Scripts\python.exe os info: Windows-10-10.0.22621-SP0 Using profiles.yml file at C:\Users\skadyan\.dbt\profiles.yml Using dbt_project.yml file at C:\Projects\dbt_201\dbt_project.yml Configuration: profiles.yml file [OK found and valid] dbt_project.yml file [OK found and valid] Required dependencies: - git [OK found] Connection: host: localhost port: 10000 cluster: None endpoint: None schema: dbt_201_elt organization: 0 Connection test: [OK connection ok] All checks passed! (.venv) C:\Projects\dbt_201>
-
Code Quality Checks
sqlfluff lint
You may run
sqlfluff fix
command to apply auto-fix. Modified files must be reviewed manually before committing them.Alternatively, you may run below command as well to produce a summary instead of too verbose output
diff-quality --violations sqlfluff
It produces something similar
------------- Diff Quality Quality Report: sqlfluff Diff: origin/main...HEAD, staged and unstaged changes ------------- models/example/my_first_dbt_model.sql (100%) models/example/my_second_dbt_model.sql (100%) ------------- Total: 16 lines Violations: 0 lines % Quality: 100% -------------
-
Commit will run the hooks automatically. Output will look like.
(.venv) C:\Projects\dbt_201>git commit -am "minor fix" sqlfluff-lint........................................(no files to check)Skipped trim trailing whitespace.................................................Passed fix end of files.........................................................Passed check yaml...............................................................Passed fix requirements.txt.................................(no files to check)Skipped flake8...............................................(no files to check)Skipped [main 1c248e7] minor fix 1 file changed, 4 insertions(+)
Note:- If you prefer to do experiments in faster iterative cycle, you may use duckdb (an in-memory/embedded database). To use this, use below dbt profile C:\Users\%USERNAME%\.dbt\profiles.yml
dbt_201:
outputs:
dev_spark:
host: localhost
method: thrift
port: 10000
schema: dbt_201_elt
threads: 4
type: spark
dev_duckdb:
type: duckdb
path: './target/mydb.duckdb'
#optional fields
schema: dbt_201_elt
target: dev_duckdb