feat: Unit tests for dbt models (openedx#112)
* feat: new unit tests for dbt models and macros
saraburns1 authored Aug 7, 2024
1 parent f8ba41c commit e231fab
Showing 83 changed files with 70,373 additions and 285 deletions.
14 changes: 8 additions & 6 deletions .github/workflows/coverage.yml
@@ -1,6 +1,6 @@
# Check documentation coverage

name: dbt Docs Coverage
name: dbt Tests & Coverage

on:
push:
@@ -17,7 +17,7 @@ env:

jobs:
build:
name: Build docs and check coverage
name: Check coverage & run tests
runs-on: ubuntu-latest
permissions:
contents: "read"
@@ -46,10 +46,12 @@ jobs:
tutor local do load-xapi-test-data
- name: Check dbt tests
run: |
dbt run
dbt test
mv unit-test-seeds ci-seeds
dbt seed --full-refresh --selector all_tests
dbt run --full-refresh --selector all_tests
dbt test --selector all_tests
mv ci-seeds unit-test-seeds
- name: Check docs coverage
run: |
dbt run
dbt docs generate
dbt-coverage compute doc --cov-fail-under 1.0
dbt-coverage compute doc --cov-fail-under 1.0 --model-path-filter models/
2 changes: 1 addition & 1 deletion .gitignore
@@ -3,4 +3,4 @@ target/
dbt_packages/
logs/
coverage.json

.DS_Store
2 changes: 1 addition & 1 deletion Makefile
@@ -2,4 +2,4 @@ format:
sqlfmt models macros

coverage:
dbt-coverage compute doc --cov-fail-under 0.9
dbt-coverage compute doc --cov-fail-under 1.0
15 changes: 15 additions & 0 deletions README.rst
@@ -33,6 +33,21 @@ Running dbt

``dbt run`` will compile and create the models defined in the "aspects" dbt project. By default, dbt will look in the ``xapi`` schema to find source tables. The ``XAPI_SCHEMA`` environment variable can be used to specify a different schema.

Testing
*******
As of `dbt v1.8 <https://docs.getdbt.com/reference/resource-properties/unit-tests>`_, models can be tested with unit tests in addition to the existing data tests. Unit tests validate a model's SQL logic by building the model from a known-good dataset and comparing the result to a provided 'expected' dataset. This is especially useful when updating a model, to confirm that its output has not changed.

The ``unit_tests.yaml`` file in each model directory contains the unit tests for the models in that directory.
The ``unit-test-seeds`` directory contains all seed data CSV files: one file for each base table (event_sink and xapi) and one for each 'expected' dataset.

``dbt test`` will only run data & generic tests (NOT unit tests). This is the default mode.

``dbt test --selector unit_tests`` will run all unit tests.
These require tables to be seeded first. To do this, add 'unit-test-seeds' to ``seed-paths:`` in ``dbt_project.yml`` and run ``dbt seed --full-refresh && dbt run --full-refresh``.

``dbt test --selector all_tests`` will run all data/generic/unit tests.


More Help
=========

11 changes: 9 additions & 2 deletions dbt_project.yml
@@ -11,10 +11,10 @@ profile: "aspects"
# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
model-paths: ["models",'ci-models']
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
seed-paths: ["seeds","ci-seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

@@ -34,3 +34,10 @@ models:
# Config indicated by + and applies to all files under models/example/
enrollment:
+materialized: view

# These are for unit test seeds. They will be used when 'unit-test-seeds' is added
# to seed-paths above or when CI tests run
seeds:
aspects:
+column_types:
event_id: UUID
13 changes: 12 additions & 1 deletion macros/get_custom_schema.sql
@@ -1,3 +1,14 @@
{% macro generate_schema_name(custom_schema_name, node) -%}
{{ generate_schema_name_for_env(custom_schema_name, node) }}

{%- set default_schema = target.schema -%}
{%- if (
target.name == "prod" or node.resource_type == "seed"
) and custom_schema_name is not none -%}

{{ custom_schema_name | trim }}

{%- else -%} {{ default_schema }}

{%- endif -%}

{%- endmacro %}
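
The branching in the new `generate_schema_name` macro can be restated as a small Python function. This is only an illustrative sketch of the decision logic visible in the diff, not how dbt evaluates the macro. The function and parameter names (`resolve_schema_name`, `target_name`, `target_schema`, `resource_type`) are hypothetical stand-ins for `target.name`, `target.schema`, and `node.resource_type`.

```python
from typing import Optional


def resolve_schema_name(
    custom_schema_name: Optional[str],
    target_name: str,
    target_schema: str,
    resource_type: str,
) -> str:
    """Mirror the macro's if/else: the custom schema is used only for
    the "prod" target or for seeds, and only when one is configured."""
    use_custom = (
        target_name == "prod" or resource_type == "seed"
    ) and custom_schema_name is not None
    if use_custom:
        # Equivalent of `{{ custom_schema_name | trim }}`
        return custom_schema_name.strip()
    # Equivalent of `{{ default_schema }}` (target.schema)
    return target_schema
```

Under this reading, seeds are placed in their configured schema on every target, while models outside `prod` fall back to the target's default schema.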
2 changes: 1 addition & 1 deletion macros/get_problem_id.sql
@@ -3,6 +3,6 @@
-- or is followed by '/answer' or '/hint'
{% macro get_problem_id(object_id) %}
regexpExtract(
object_id, 'xblock/([\w\d-\+:@]*@problem\+block@[\w\d][^_]*)(_\d_\d)?', 1
object_id, 'xblock/([\w\d-\+:@]*@problem\+block@[\w\d][^_\/]*)(_\d_\d)?', 1
)
{% endmacro %}
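
The effect of adding `\/` to the excluded character class can be checked outside ClickHouse with a rough Python sketch. Two assumptions apply. First, Python's `re` rejects the `\d-\+` range in the original class, so the class is rewritten here as `[\w\-+:@]`, which is assumed to match the same characters. Second, the sample object id is made up for illustration.

```python
import re

# Old and new patterns from the diff, with the first character class
# rewritten for Python's `re` (assumed equivalent to the ClickHouse one).
OLD_PATTERN = r"xblock/([\w\-+:@]*@problem\+block@[\w\d][^_]*)(_\d_\d)?"
NEW_PATTERN = r"xblock/([\w\-+:@]*@problem\+block@[\w\d][^_/]*)(_\d_\d)?"


def extract_problem_id(pattern: str, object_id: str) -> str:
    """Return capture group 1, or an empty string when nothing matches."""
    match = re.search(pattern, object_id)
    return match.group(1) if match else ""


# Hypothetical object id with a trailing '/answer' segment.
SAMPLE = (
    "http://localhost:18000/xblock/"
    "block-v1:Org+Course+Run+type@problem+block@abc123/answer"
)

old_result = extract_problem_id(OLD_PATTERN, SAMPLE)
new_result = extract_problem_id(NEW_PATTERN, SAMPLE)
```

With the old pattern, `[^_]*` runs through the slash and keeps `/answer` in the extracted id. With the new pattern, the match stops at the slash, so only the block id is returned.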
18 changes: 9 additions & 9 deletions models/base/schema.yml
@@ -5,29 +5,29 @@ models:
description: "A materialized view for xAPI events"
columns:
- name: event_id
data_type: uuid
data_type: UUID
description: "The unique identifier for the event"
- name: verb_id
data_type: string
data_type: String
description: "The xAPI verb identifier"
- name: actor_id
data_type: string
data_type: String
description: "The xAPI actor identifier"
- name: object_id
data_type: string
data_type: String
description: "The xAPI object identifier"
- name: course_id
data_type: string
data_type: String
description: "The fully-qualified course identifier URL"
- name: course_key
data_type: String
description: "The course key for the course"
- name: org
data_type: string
data_type: String
description: "The organization that the course belongs to"
- name: emission_time
data_type: datetime64(6)
data_type: DateTime64(6)
description: "The time the event was emitted"
- name: event
data_type: string
description: "The xAPI event as a string"
data_type: String
description: "The xAPI event as a String"
14 changes: 14 additions & 0 deletions models/base/unit_tests.yaml
@@ -0,0 +1,14 @@
unit_tests:
- name: test_xapi_events_all_parsed
model: xapi_events_all_parsed
config:
tags: 'ci'
given:
- input: source("xapi", "xapi_events_all")
format: sql
rows: |
select * from xapi.xapi_events_all
expect:
format: sql
rows: |
select * from xapi_events_all_parsed_expected
37 changes: 3 additions & 34 deletions models/courses/course_block_names.sql
@@ -12,45 +12,14 @@
],
primary_key="location",
layout="COMPLEX_KEY_SPARSE_HASHED()",
lifetime="120",
lifetime=env_var("ASPECTS_BLOCK_NAME_CACHE_LIFETIME", "120"),
source_type="clickhouse",
connection_overrides={
"host": "localhost",
},
)
}}

with
most_recent_course_blocks as (
select
location,
display_name,
toString(section)
|| ':'
|| toString(subsection)
|| ':'
|| toString(unit)
|| ' - '
|| display_name as display_name_with_location,
JSONExtractInt(xblock_data_json, 'section') as section,
JSONExtractInt(xblock_data_json, 'subsection') as subsection,
JSONExtractInt(xblock_data_json, 'unit') as unit,
JSONExtractBool(xblock_data_json, 'graded') as graded,
`order` as course_order,
course_key,
dump_id,
time_last_dumped,
row_number() over (
partition by location order by time_last_dumped desc
) as rn
from {{ source("event_sink", "course_blocks") }}
)
select
location,
display_name as block_name,
course_key,
graded,
course_order,
display_name_with_location
from most_recent_course_blocks
where rn = 1
location, block_name, course_key, graded, course_order, display_name_with_location
from {{ ref("most_recent_course_blocks") }}
8 changes: 6 additions & 2 deletions models/courses/course_names.sql
@@ -10,7 +10,7 @@
],
primary_key="course_key",
layout="COMPLEX_KEY_HASHED()",
lifetime="120",
lifetime=env_var("ASPECTS_COURSE_NAME_CACHE_LIFETIME", "120"),
source_type="clickhouse",
connection_overrides={
"host": "localhost",
@@ -24,7 +24,11 @@ with
from {{ source("event_sink", "course_overviews") }}
group by org, course_key
)
select course_key, display_name, splitByString('+', course_key)[-1] as course_run, org
select
course_key,
display_name as course_name,
splitByString('+', course_key)[-1] as course_run,
org
from {{ source("event_sink", "course_overviews") }} co
inner join
most_recent_overviews mro
4 changes: 1 addition & 3 deletions models/courses/dim_course_blocks.sql
@@ -22,6 +22,4 @@ select
else regexpExtract(block_id, '@([^+]+)\+block@', 1)
end as block_type
from {{ ref("course_block_names") }} blocks
join
{{ ref("course_names") }} courses on blocks.course_key = courses.course_key
settings join_algorithm = 'direct'
join {{ ref("course_names") }} courses on blocks.course_key = courses.course_key
37 changes: 37 additions & 0 deletions models/courses/most_recent_course_blocks.sql
@@ -0,0 +1,37 @@
{{
config(
materialized="materialized_view",
schema=env_var("ASPECTS_EVENT_SINK_DATABASE", "event_sink"),
engine=get_engine("ReplacingMergeTree()"),
order_by="(location)",
post_hook="OPTIMIZE TABLE {{ this }} {{ on_cluster() }} FINAL",
)
}}

select
location,
display_name as block_name,
toString(section)
|| ':'
|| toString(subsection)
|| ':'
|| toString(unit)
|| ' - '
|| display_name as display_name_with_location,
JSONExtractInt(xblock_data_json, 'section') as section,
JSONExtractInt(xblock_data_json, 'subsection') as subsection,
JSONExtractInt(xblock_data_json, 'unit') as unit,
JSONExtractBool(xblock_data_json, 'graded') as graded,
order as course_order,
course_key,
dump_id,
time_last_dumped
from {{ source("event_sink", "course_blocks") }} course_blocks
join
(
select location, max(time_last_dumped) as max_time_last_dumped
from {{ source("event_sink", "course_blocks") }}
group by location
) latest_course_blocks
on course_blocks.location = latest_course_blocks.location
and course_blocks.time_last_dumped = latest_course_blocks.max_time_last_dumped
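
The self-join above keeps, for each `location`, only the rows whose `time_last_dumped` equals the maximum for that location. A minimal Python sketch of the same selection follows, assuming rows are plain dicts with `location` and `time_last_dumped` keys. The function name and the sample rows are hypothetical, and this is not how ClickHouse executes the query.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def most_recent_rows(rows: List[Row]) -> List[Row]:
    """Keep rows whose time_last_dumped is the max for their location."""
    # Equivalent of the grouped subquery:
    #   select location, max(time_last_dumped) ... group by location
    latest: Dict[str, Any] = {}
    for row in rows:
        loc = row["location"]
        if loc not in latest or row["time_last_dumped"] > latest[loc]:
            latest[loc] = row["time_last_dumped"]
    # Equivalent of the join on (location, time_last_dumped)
    return [r for r in rows if r["time_last_dumped"] == latest[r["location"]]]


# Made-up sample data: block "a" was dumped twice, block "b" once.
SAMPLE_ROWS = [
    {"location": "a", "time_last_dumped": "2024-01-01 00:00:00", "block_name": "old"},
    {"location": "a", "time_last_dumped": "2024-02-01 00:00:00", "block_name": "new"},
    {"location": "b", "time_last_dumped": "2024-01-15 00:00:00", "block_name": "only"},
]

recent = most_recent_rows(SAMPLE_ROWS)
```

Unlike the `row_number()` filter this model replaces, a join on the maximum timestamp returns every tied row when two rows for the same location share the same `time_last_dumped`. The sketch reproduces that behavior.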
57 changes: 47 additions & 10 deletions models/courses/schema.yml
@@ -23,29 +23,51 @@ models:
data_type: String
description: "The block's name"
- name: section_number
data_type: string
data_type: String
description: "The section this block belongs to, formatted as <section location>:0:0"
- name: subsection_number
data_type: string
data_type: String
description: "The subsection this block belongs to, formatted as <section location>:<subsection location>:0"
- name: hierarchy_location
data_type: string
data_type: String
description: "The full section:subsection:unit hierarchy in which this block belongs"
- name: display_name_with_location
data_type: String
description: "The block's display name with section, subsection, and unit prepended to the name. This provides additional context when looking at block names and can help data consumers understand which block they are analyzing"
- name: course_order
data_type: Int32
description: "The sort order of this block in the course across all course blocks"
- name: graded
data_type: Boolean
description: "Whether the block is graded"
- name: block_type
data_type: String
description: "The type of block. This can be a section, subsection, unit, or the block type"

- name: course_block_names
description: "A table of course blocks with their names"
columns:
- name: location
data_type: String
description: "The location of the block"
- name: block_name
data_type: String
description: "The name of the block"
- name: course_key
data_type: String
description: "The course which the block belongs to"
- name: graded
data_type: Boolean
description: "Whether the block is graded"
- name: display_name_with_location
data_type: String
description: "The block's display name with section, subsection, and unit prepended to the name. This provides additional context when looking at block names and can help data consumers understand which block they are analyzing"
- name: course_order
data_type: Int32
description: "The sort order of this block in the course across all course blocks"

- name: course_block_names
description: "A table of course blocks with their names"
- name: most_recent_course_blocks
description: "A materialized view of course blocks with their display names and additional metadata. Only stores the most recent row per block location."
columns:
- name: location
data_type: String
@@ -65,6 +87,21 @@ models:
- name: course_order
data_type: Int32
description: "The sort order of this block in the course across all course blocks"
- name: section
data_type: Int32
description: "The section number that this block falls under in the course. Starts at 1."
- name: subsection
data_type: Int32
description: "The subsection number that this block falls under in the section. Starts at 1."
- name: unit
data_type: Int32
description: "The unit number that this block falls under in the subsection. Starts at 1."
- name: dump_id
data_type: UUID
description: "The UUID of the event sink run that published this block to ClickHouse. When a course is published all blocks inside it are sent with the same dump_id."
- name: time_last_dumped
data_type: String
description: "The Datetime of the event sink run that published this block to ClickHouse. When a course is published all blocks inside it are sent with the same time_last_dumped."

- name: course_names
description: "A table of courses with their names"
@@ -104,13 +141,13 @@ models:
data_type: String
description: "The block's name"
- name: section_number
data_type: string
data_type: String
description: "The section this block belongs to, formatted as <section location>:0:0"
- name: subsection_number
data_type: string
data_type: String
description: "The subsection this block belongs to, formatted as <section location>:<subsection location>:0"
- name: hierarchy_location
data_type: string
data_type: String
description: "The full section:subsection:unit hierarchy in which this block belongs"
- name: display_name_with_location
data_type: String
@@ -122,10 +159,10 @@ models:
data_type: String
description: "The type of block. This can be a section, subsection, unit, or the block type"
- name: section_with_name
data_type: string
data_type: String
description: "The name of the section this block belongs to, with section_number prepended"
- name: subsection_with_name
data_type: string
data_type: String
description: "The name of the section this subsection belongs to, with subsection_number prepended"
- name: course_order
data_type: Int32