
[CT-3194] [Feature] add another way to calculate source freshness - via a SQL query #8797

Open
3 tasks done
graciegoheen opened this issue Oct 9, 2023 · 4 comments
Labels: enhancement (New feature or request)


@graciegoheen
Contributor

graciegoheen commented Oct 9, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

I think that for external tables, the information_schema.tables.last_altered column is the time of the last refresh (or last ALTER), not the last time a file came in.

So, for example, if an external table has stale data and an ALTER TABLE ... REFRESH is performed but no new files are added (because, say, the pipeline dropping files into the S3 bucket is broken), I think the last_altered column will have the timestamp of when the ALTER TABLE ... REFRESH was performed. If this feature will cover external tables and my understanding of the metadata is correct, then this might lead to the dbt metadata reporting that the external table is fresher than it actually is, thus obscuring an issue (a broken pipeline).

The directory (if enabled) of the external stage would have the list of files and when they were added to the stage. So I'm pretty much just wondering whether this feature will be a behind-the-scenes query of the information_schema, or whether source freshness can be a user-defined query, as originally outlined in #7012.

Originally posted by @sp-tkerlavage in dbt-labs/dbt-snowflake#785 (comment)

We currently support two ways to generate source freshness (via warehouse metadata tables and via a loaded_at_field). We should support a third way to generate source freshness: via a loaded_at_query.

Option 1: freshness config added (get freshness from warehouse metadata tables)

sources:
  - name: my_source
    freshness:
      warn_after:
        count: 1
        period: day
      error_after:
        count: 5
        period: day
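
For context, the metadata-based path gets freshness from the warehouse's own catalog rather than from the table's rows. A rough illustration of that kind of lookup on Snowflake (not dbt's exact implementation; the database, schema, and table names here are placeholders):

SELECT table_name,
    last_altered,
    CONVERT_TIMEZONE('UTC', last_altered) last_altered_utc
FROM my_database.information_schema.tables
WHERE table_schema = 'MY_SCHEMA'
    AND table_name = 'MY_SOURCE_TABLE'

As the quoted comment above points out, for external tables last_altered can advance on an ALTER TABLE ... REFRESH even when no new files have actually landed.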

Option 2: freshness config added with loaded_at_field (get freshness from select max(loaded_at_field) … from this …)

sources:
  - name: my_source
    freshness:
      warn_after:
        count: 1
        period: day
      error_after:
        count: 5
        period: day
    loaded_at_field:

Option 3: freshness config added with a custom definition of "when does dbt consider a source to be fresh" (get the equivalent of max(loaded_at_field) by executing a custom query)

sources:
  - name: my_source
    freshness:
      warn_after:
        count: 1
        period: day
      error_after:
        count: 5
        period: day
      collect_freshness_calc: my_freshness_calc # config name TBD

{% freshness my_freshness_calc(source, table) %}

-- some SQL that returns a structure similar to our current query

{% endfreshness %}

The built-in one is collect_freshness: https://github.com/dbt-labs/dbt-adapters/blob/6c41bedf27063eda64375845db6ce5f7535ef6aa/dbt/include/global_project/macros/adapters/freshness.sql#L4-L16
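
For illustration, a user-defined freshness query would presumably need to return the same shape the built-in collect_freshness query produces (a max_loaded_at timestamp plus a snapshotted_at timestamp). A minimal sketch for the Snowflake external-table case from the quoted comment, with placeholder stage and path names (this is an assumed contract, not a settled design):

SELECT max(last_modified) as max_loaded_at,
    current_timestamp() as snapshotted_at
FROM DIRECTORY( '@my_database.my_schema.my_stage' )
WHERE relative_path LIKE 'my_source_table/%'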

Describe alternatives you've considered

No response

Who will this benefit?

Folks using external tables as sources

Are you interested in contributing this feature?

No response

Anything else?

No response

@graciegoheen added the enhancement label Oct 9, 2023
@github-actions bot changed the title to "[CT-3194] [Feature] add another way to calculate source freshness - via a SQL query" Oct 9, 2023
@graciegoheen
Contributor Author

  • What happens if you supply both a loaded_at_field and a loaded_at_query?
    • throw an error?
    • use the config "closest" to the source table definition?
  • We should consider the following case:
version: 2

sources:
  - name: jaffle_shop
    database: raw
    freshness: # default freshness
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _etl_loaded_at

    tables:
      - name: orders
        loaded_at_query: ...

      - name: customers # this will use the freshness defined above
...

@graciegoheen
Contributor Author

graciegoheen commented Oct 10, 2023

From @sp-tkerlavage

In Snowflake, an external stage can have a directory table, which essentially stores a list of the files in the external storage (S3 bucket, etc.) along with other useful information like relative_path, url, size, MD5, and last_modified. https://docs.snowflake.com/en/user-guide/data-load-dirtables

This directory table can be automatically refreshed when files are added via SQS https://docs.snowflake.com/en/user-guide/data-load-dirtables-auto
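
For reference, the directory table is enabled on the stage itself and can also be refreshed manually; a short sketch of the relevant statements (the stage name is a placeholder, and auto-refresh additionally requires the S3 event notification setup described in the linked doc):

ALTER STAGE my_database.my_schema.my_stage SET DIRECTORY = ( ENABLE = TRUE );

ALTER STAGE my_database.my_schema.my_stage REFRESH;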

The directory is not a standalone object; it's just a layer on top of a stage. It can be queried like this:

SELECT relative_path, 
    last_modified, 
    CONVERT_TIMEZONE('UTC', last_modified) last_modified_utc
FROM DIRECTORY( '@database.schema.stage_name' )

However, you would have to parse the relative path to extract the name of the source table. So if, for example, you have a bucket partitioned like this: <bucket>/source/source_table/yyyy/mm/dd/hh, you would have to do something like this:

SELECT relative_path, 
    SPLIT_PART(relative_path, '/', 1) source_database,
    SPLIT_PART(relative_path, '/', 2) source_table,
    SPLIT_PART(relative_path, '/', 3) year,
    ...
    last_modified, 
    CONVERT_TIMEZONE('UTC', last_modified) last_modified_utc
FROM DIRECTORY( '@prod_cdc.meta.dms' )

I actually have an incremental model for this directory-parsing logic in my own project, so that may be a cleaner approach. However, I'm not sure whether ref'ing a model in a freshness check is even possible, and even if it is, the model would have to be run first. But again, I'm not entirely clear on the parsing/execution order, so it may not be possible to ref a model in a freshness check at all.
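
For illustration, a minimal sketch of what such an incremental model might look like (the stage name, path layout, and unique_key are assumptions, not the actual project code being described):

{{ config(materialized='incremental', unique_key='relative_path') }}

-- hypothetical model: one row per file in the stage's directory table
SELECT relative_path,
    SPLIT_PART(relative_path, '/', 1) source_database,
    SPLIT_PART(relative_path, '/', 2) source_table,
    CONVERT_TIMEZONE('UTC', last_modified) last_modified_utc
FROM DIRECTORY( '@prod_cdc.meta.dms' )
{% if is_incremental() %}
WHERE CONVERT_TIMEZONE('UTC', last_modified) > (SELECT max(last_modified_utc) FROM {{ this }})
{% endif %}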

@dataders
Contributor

In conversation with @graciegoheen, we decided that we aren't currently aware of enough distinct use cases for custom SQL-defined freshness beyond supporting freshness of external tables in dbt-snowflake.

For now, I'm closing this in favor of dbt-labs/dbt-snowflake#1061. If we come across more use cases in the future, we'll address them as they come up. Perhaps this framing will be useful then.

@dataders closed this as not planned May 28, 2024
@graciegoheen
Contributor Author

graciegoheen commented Nov 18, 2024

I'm going to re-open this, as I've heard from a couple of folks that they use side control tables to track updates and the status of their main, large tables.

For example:

select max(start_ts)
from audit_tables
where table_name='my_source_table'
and status='completed'
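
For illustration, under the proposal above that audit-table lookup might hang off the source definition roughly like this (loaded_at_query is still the hypothetical config name from this issue, and the table and column names are taken from the example query):

sources:
  - name: my_source
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: my_source_table
        loaded_at_query: |
          select max(start_ts)
          from audit_tables
          where table_name = 'my_source_table'
          and status = 'completed'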

@graciegoheen reopened this Nov 18, 2024