Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add deduplication macro #512

Merged
merged 9 commits into from
Mar 22, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,7 @@
# dbt-utils v0.8.3
## New features
- A macro for deduplicating data ([#335](https://github.com/dbt-labs/dbt-utils/issues/335), [#512](https://github.com/dbt-labs/dbt-utils/pull/512))

# dbt-utils v0.8.2
## Fixes
- Fix union_relations error from [#473](https://github.com/dbt-labs/dbt-utils/pull/473) when no include/exclude parameters are provided ([#505](https://github.com/dbt-labs/dbt-utils/issues/505), [#509](https://github.com/dbt-labs/dbt-utils/pull/509))
Expand Down Expand Up @@ -32,7 +36,7 @@

# dbt-utils v0.8.0
## 🚨 Breaking changes
- dbt ONE POINT OH is here! This version of dbt-utils requires _any_ version (minor and patch) of v1, which means far less need for compatibility releases in the future.
- dbt ONE POINT OH is here! This version of dbt-utils requires _any_ version (minor and patch) of v1, which means far less need for compatibility releases in the future.
- The partition column in the `mutually_exclusive_ranges` test is now always called `partition_by_col`. This enables compatibility with `--store-failures` when multiple columns are concatenated together. If you have models built on top of the failures table, update them to reflect the new column name. ([#423](https://github.com/dbt-labs/dbt-utils/issues/423), [#430](https://github.com/dbt-labs/dbt-utils/pull/430))

## Contributors:
Expand Down
17 changes: 16 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@ For compatibility details between versions of dbt-core and dbt-utils, [see this

- [SQL generators](#sql-generators)
- [date_spine](#date_spine-source)
- [dedupe](#dedupe-source)
- [haversine_distance](#haversine_distance-source)
- [group_by](#group_by-source)
- [star](#star-source)
Expand Down Expand Up @@ -706,6 +707,20 @@ This macro returns the sql required to build a date spine. The spine will includ
}}
```

#### deduplicate ([source](macros/sql/deduplicate.sql))
This macro returns the sql required to remove duplicate rows from a model or source.

**Usage:**

```
{{ dbt_utils.deduplicate(
relation=source('my_source', 'my_table'),
group_by="user_id, cast(timestamp as day)",
order_by="timestamp desc"
)
}}
```

#### haversine_distance ([source](macros/sql/haversine_distance.sql))
This macro calculates the [haversine distance](http://daynebatten.com/2015/09/latitude-longitude-distance-sql/) between a pair of x/y coordinates.

Expand Down Expand Up @@ -748,7 +763,7 @@ group by 1,2,3
```

#### star ([source](macros/sql/star.sql))
This macro generates a comma-separated list of all fields that exist in the `from` relation, excluding any fields listed in the `except` argument. The construction is identical to `select * from {{ref('my_model')}}`, replacing star (`*`) with the star macro. This macro also has an optional `relation_alias` argument that will prefix all generated fields with an alias (`relation_alias`.`field_name`).
This macro generates a comma-separated list of all fields that exist in the `from` relation, excluding any fields listed in the `except` argument. The construction is identical to `select * from {{ref('my_model')}}`, replacing star (`*`) with the star macro. This macro also has an optional `relation_alias` argument that will prefix all generated fields with an alias (`relation_alias`.`field_name`).

The macro also has optional `prefix` and `suffix` arguments. When one or both are provided, they will be concatenated onto each field's alias in the output (`prefix` ~ `field_name` ~ `suffix`). NB: This prevents the output from being used in any context other than a select statement.

Expand Down
3 changes: 3 additions & 0 deletions integration_tests/data/sql/data_deduplicate.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
user_id,event,version
1,play,1
1,play,2
2 changes: 2 additions & 0 deletions integration_tests/data/sql/data_deduplicate_expected.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
user_id,event,version
1,play,2
9 changes: 7 additions & 2 deletions integration_tests/models/sql/schema.yml
Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,7 @@ models:
tests:
- dbt_utils.equality:
compare_model: ref('data_pivot_expected')

- name: test_pivot_apostrophe
tests:
- dbt_utils.equality:
Expand Down Expand Up @@ -137,8 +137,13 @@ models:
tests:
- dbt_utils.equality:
compare_model: ref('data_union_expected')

- name: test_get_relations_by_pattern
tests:
- dbt_utils.equality:
compare_model: ref('data_union_events_expected')

- name: test_dedupe
tests:
- dbt_utils.equality:
compare_model: ref('data_deduplicate_expected')
7 changes: 7 additions & 0 deletions integration_tests/models/sql/test_deduplicate.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
with deduped as (

{{ dbt_utils.deduplicate(ref('data_deduplicate'), group_by='user_id', order_by='version desc') | indent }}

)

select * from deduped
46 changes: 46 additions & 0 deletions macros/sql/deduplicate.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
{%- macro deduplicate(relation, group_by, order_by=none) -%}
{{ return(adapter.dispatch('deduplicate', 'dbt_utils')(relation, group_by, order_by=order_by)) }}
{% endmacro %}

{%- macro default__deduplicate(relation, group_by, order_by=none) -%}

select
{{ dbt_utils.star(relation, relation_alias='deduped') | indent }}
from (
select
_inner.*,
row_number() over (
partition by {{ group_by }}
{% if order_by is not none -%}
order by {{ order_by }}
{%- endif %}
) as rn
from {{ relation }} as _inner
) as deduped
where deduped.rn = 1

{%- endmacro -%}

{#
-- It is more performant to deduplicate using `array_agg` with a limit
-- clause in BigQuery:
-- https://github.com/dbt-labs/dbt-utils/issues/335#issuecomment-788157572
#}
{%- macro bigquery__deduplicate(relation, group_by, order_by=none) -%}

select
{{ dbt_utils.star(relation, relation_alias='deduped') | indent }}
from (
select
array_agg (
original
{% if order_by is not none -%}
order by {{ order_by }}
{%- endif %}
limit 1
)[offset(0)] as deduped
from {{ relation }} as original
group by {{ group_by }}
)

{%- endmacro -%}