Deduplication macro #335

blake-enyart · 2021-02-12T16:22:15Z

Describe the feature

It would be great to have deduplication macro which utilizes something such as the surrogate_key macro to deduplicate on. The basis for this would be something such as: #29

Describe alternatives you've considered

At this point, the only alternative is writing this in full SQL query and this feels like a pretty common repetitive type of query that is prime for integrating into a macro

Additional context

Is this feature database-specific? Which database(s) is/are relevant? Please include any other relevant context here.
I work mainly in Snowflake, but this could be generalized. Perhaps a first draft in Snowflake and expands outward?

Who will this benefit?

It is very common practice to deduplicate data on ingestion in addition to type casting which will greatly expedite new data source ingestion

Are you interested in contributing this feature?

I would need a hand to get started on this. I can leverage this PR as a reference, #29, but would greatly appreciate any support and perspective provided here.

rwatts3 · 2021-02-19T17:47:19Z

Here is a macro I use to dedupe using BigQuery, not sure if its relevant or could help.

{%- macro deduped_dataset(database, table, key="id") -%}
  select deduped.*
    from (
      select
        _inner.*,
        row_number() over (partition by {{ key }} order by updated_at desc ) as rn
      from {{ source(database, table) }} as _inner
    ) as deduped
  where deduped.rn = 1
{%- endmacro -%}

switzer · 2021-03-01T18:10:33Z

We've had resource issues using row_number() vs. using array_agg. Our dedupe macro is as follows:

{% macro dedupe_source(source, group_by, order_by = None) %}
  select
    event.*
  from (
    select
      array_agg (
        {{ source }}
        {% if order_by != None %}
        order by {{ order_by }}
        {% endif %}
        limit 1
      )[offset(0)] event
    from {{ source }}
    group by {{ group_by }}
  )
{% endmacro %}

It is more performant (can handle more rows, not necessarily faster). You may want to consider going down this path. Adding the LIMIT 1 allows BQ to vastly reduce the dataset it needs to calculate the query.

In addition, rather than passing in a dataset and table, it would be better to pass in a source, so that this can be used as a separate CTE. Of course you could call dedupe( source(db, table) ) if you like, so this would supersede existing functionality.

Also, we use this macro to dedupe based on a configurable column AND order by - so that you can select which row to select.

blake-enyart · 2021-03-03T20:27:05Z

Hi @switzer and @rwatts3 !

Thank you both for sharing out your dedupe methodologies. I'm working on standing up some client projects right now and hoping to utilize these shortly :)

thiklock · 2021-09-23T22:12:28Z

Hi @blake-enyart , @switzer and @rwatts3 !

I was trying to apply this method but to many columns (4). but I got some errors.

I was wondering if you managed to find a solution for deduping based on many columns.

switzer · 2021-09-24T02:44:17Z

There is no limit to the number of columns to deduplicate across - just add a comma list in the string. For example, you can call the macro as follows:

with
source as (
  select * from table
),
optimised_source as (

    {{ dedupe_source('source', 'col1, col2, col3, col4', 'col9, col8, col7 desc' ) }}

)

...

keurcien · 2022-02-16T10:31:40Z

Great idea! Using Airbyte in incremental mode without normalization, I could see a great benefit in having such a macro dedicated to Airbyte's raw data. Thanks for sharing your snippet @switzer

joellabes · 2022-02-17T00:13:08Z

👋 Hey team! I'd welcome a PR that implemented this.

It will need to be cross-database compatible by the time we merge it in - I don't know offhand whether all four of the supported adapters (Snowflake, BQ, Redshift, Postgres) have array_agg or not. If they don't, falling back to the row_number based approach seems reasonable - also remember that there are integration tests that can validate that everything works correctly across all warehouses so you don't have to have access to them all individually.

Happy to help out with any questions along the way 🚀

judahrand · 2022-03-06T18:08:08Z

@joellabes I thought I'd take a crack at this. It turns out that:

Redshift does not have an ARRAY_AGG function.
Snowflake and Postgres do not have a LIMIT clause available in their ARRAY_AGG implementations.

For these reasons I've ended up implementing the ROW_NUMBER approach as the default and the ARRAY_AGG as a BigQuery specific overload.

It was also necessary for the macro to take a relation as an argument because most of the databases do not have a SELECT * EXCEPT (...) clause and we introduce a new column with ROW_NUMBER. Therefore, we have to use dbt_utils.star to get the select statement from the relation.

dbeatty10 · 2022-05-11T20:41:58Z

Fixed by #512

blake-enyart added enhancement New feature or request triage labels Feb 12, 2021

joellabes added good first issue and removed triage labels Feb 17, 2022

judahrand mentioned this issue Mar 6, 2022

Add deduplication macro #512

Merged

15 tasks

dbeatty10 closed this as completed May 11, 2022

dbeatty10 mentioned this issue Jul 27, 2023

Methods to achieve null safety for deduplicate #815

Draft

4 tasks

dbeatty10 mentioned this issue Jul 2, 2024

Update BQ deduplicate macro to support partition pruning downstream #929

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deduplication macro #335

Deduplication macro #335

blake-enyart commented Feb 12, 2021

rwatts3 commented Feb 19, 2021

switzer commented Mar 1, 2021 •

edited

Loading

blake-enyart commented Mar 3, 2021

thiklock commented Sep 23, 2021

switzer commented Sep 24, 2021

keurcien commented Feb 16, 2022

joellabes commented Feb 17, 2022

judahrand commented Mar 6, 2022 •

edited

Loading

dbeatty10 commented May 11, 2022

Deduplication macro #335

Deduplication macro #335

Comments

blake-enyart commented Feb 12, 2021

Describe the feature

Describe alternatives you've considered

Additional context

Who will this benefit?

Are you interested in contributing this feature?

rwatts3 commented Feb 19, 2021

switzer commented Mar 1, 2021 • edited Loading

blake-enyart commented Mar 3, 2021

thiklock commented Sep 23, 2021

switzer commented Sep 24, 2021

keurcien commented Feb 16, 2022

joellabes commented Feb 17, 2022

judahrand commented Mar 6, 2022 • edited Loading

dbeatty10 commented May 11, 2022

switzer commented Mar 1, 2021 •

edited

Loading

judahrand commented Mar 6, 2022 •

edited

Loading