
Implementing metadata freshness checks #481

Merged 5 commits into main on Oct 23, 2023
Conversation

benc-db (Collaborator) commented Oct 13, 2023

Description

Addresses the 'metadata freshness changes' part of dbt-labs/dbt-core#8307
Doing this in a way that works for both the Hive metastore and Unity Catalog was interesting. I'm a little concerned about the accuracy and performance of the Hive implementation, but it's the best I can do, and I don't have a good way to insert a check for the Hive metastore earlier in the call chain.

@dataders any concern that this mechanism doesn't work when someone has a view as a source (as is the case in my bigger transform jobs)? I didn't see any logic that only calls down to this path when the relation is a table, and both using the information schema this way and DESCRIBE HISTORY are table-only features. We could look at information_schema.views, but last_altered doesn't carry the same meaning for views as it does for tables.
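For reviewers without the diff open, here is the gist of the two paths, sketched in Python (names are illustrative; the real implementation lives in the adapter's SQL macros):

```python
from typing import Sequence


def last_modified_sql(relations: Sequence) -> str:
    """Sketch of the two query shapes; assumes each relation has .database
    (the catalog), .schema, .identifier, and an is_hive_metastore() helper."""
    if all(r.is_hive_metastore() for r in relations):
        # Hive metastore has no information_schema, so aggregate each
        # table's Delta transaction log via DESCRIBE HISTORY instead.
        # This is the path with the accuracy/perf concerns noted above.
        selects = [
            f"select '{r.schema}' as schema, '{r.identifier}' as identifier, "
            f"max(timestamp) as last_modified "
            f"from (describe history {r.schema}.{r.identifier})"
            for r in relations
        ]
        return "\nunion all\n".join(selects)
    # Unity Catalog exposes last_altered directly in information_schema.
    catalog = relations[0].database
    names = ", ".join(f"'{r.identifier}'" for r in relations)
    return (
        "select table_schema as schema, table_name as identifier, "
        "last_altered as last_modified "
        f"from {catalog}.information_schema.tables "
        f"where table_name in ({names})"  # schema filter elided for brevity
    )
```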

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

benc-db (Collaborator, Author) commented Oct 13, 2023

BTW, there may be additional tests required: the guide calls out two more functional tests. But since those tests contain a lot of Snowflake-specific comments (and so far exist only in the dbt-snowflake adapter), I'm waiting until they're part of core before implementing them here. I wanted to get this code up for people to look at while I'm away.

@@ -115,3 +125,13 @@ def matches(
@classproperty
def get_relation_type(cls) -> Type[DatabricksRelationType]:
return DatabricksRelationType

def information_schema(self, view_name=None) -> InformationSchema:
benc-db (Collaborator, Author) commented:
This is mostly copied from core, but I had to override it so that it would use an InformationSchema that quotes with "`", since InformationSchema inherits from BaseRelation.
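A minimal sketch of what the override looks like (based on this description, not the verbatim diff; InformationSchema.from_relation is the dbt-core hook being reused):

```python
from dataclasses import dataclass
from typing import Optional

from dbt.adapters.base.relation import InformationSchema


@dataclass(frozen=True, eq=False, repr=False)
class DatabricksInformationSchema(InformationSchema):
    # Databricks quotes identifiers with backticks rather than double quotes.
    quote_character: str = "`"


# On DatabricksRelation: same shape as BaseRelation.information_schema,
# except it returns the backtick-quoting subclass.
def information_schema(self, view_name: Optional[str] = None) -> InformationSchema:
    return DatabricksInformationSchema.from_relation(self, view_name)
```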

project.adapter.drop_schema(relation)

def test_get_relation_last_modified(self, project, set_env_vars, custom_schema):
project.run_sql(
benc-db (Collaborator, Author) commented:
This test is currently hand-copied from dbt-snowflake. Nothing I do in the test is specific to Databricks, so hopefully it gets pulled into core.

benc-db (Collaborator, Author) commented:
Essentially the test just says: were we able to run the metadata commands without a warning or error? We definitely need another test that asserts that something useful happens. Will add after I'm back.

benc-db (Collaborator, Author) commented:
Updated the test to ensure the freshness checks pass, as opposed to merely not producing a warning or error. @dataders, might be worth considering this form for core; to me at least it's more obvious what the test is doing (compared to the probe function). Also, this file did not match the format I found in the documentation for sources.json, BTW.
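Concretely, the assertion shape is roughly this (a sketch, not the verbatim diff; the fixtures are from the test above, and reading target/sources.json is the point):

```python
import json
import os

from dbt.tests.util import run_dbt


def test_get_relation_last_modified(self, project, set_env_vars, custom_schema):
    # ...seed the source table with project.run_sql, as in the snippet above...
    run_dbt(["source", "freshness"])

    # Assert the recorded freshness statuses, rather than merely probing
    # that the command emitted no warning or error.
    with open(os.path.join(project.project_root, "target", "sources.json")) as f:
        artifact = json.load(f)
    assert all(result["status"] == "pass" for result in artifact["results"])
```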

dataders (Contributor) commented:
@benc-db you da real MVP for posting this. I'll definitely flag this for the adapters eng team to look at once we're back from Coalesce

susodapop previously approved these changes Oct 23, 2023, leaving a comment:
LGTM with a few comments. Agree we need a test of the true output of the freshness check, but the smoke test is a good start. Nice work.

_capabilities = CapabilityDict(
{Capability.TableLastModifiedMetadata: CapabilitySupport(support=Support.Full)}
)

@available.parse(lambda *a, **k: 0)

susodapop commented:
Is this the first time our dialect has needed to use _capabilities? Or had we just inherited the _capabilities from dbt-spark? Not a blocker for this PR, but I wonder if more capabilities should be reflected explicitly here.

benc-db (Collaborator, Author) replied:
This is new in 1.7.
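If it helps future readers: core 1.7 consults the declared capabilities before taking the metadata freshness path. A sketch of the consumer side (my paraphrase of the core 1.7 API, not its exact code):

```python
from dbt.adapters.base import BaseAdapter
from dbt.adapters.capability import Capability


def should_use_metadata_freshness(adapter: BaseAdapter) -> bool:
    # BaseAdapter.supports() reads the class-level _capabilities dict;
    # capabilities left undeclared default to unsupported, so core falls
    # back to the loaded_at_field query for those adapters.
    return adapter.supports(Capability.TableLastModifiedMetadata)
```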

quote_character: str = "`"

def is_hive_metastore(self):
return self.database is None or self.database == "hive_metastore"

susodapop commented:
Just confirming: does dbt's database == Databricks' schema?

benc-db (Collaborator, Author) replied:
database == catalog.
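i.e., the three-level mapping looks roughly like this (a self-contained illustration of the check above; the catalog names are hypothetical):

```python
from typing import Optional

# dbt field    -> Databricks concept
# database     -> catalog ("main", "hive_metastore", ...)
# schema       -> schema
# identifier   -> table or view name


def is_hive_metastore(database: Optional[str]) -> bool:
    # Same check as the method in the diff above: no catalog set, or the
    # legacy catalog name, means we take the hive_metastore path.
    return database is None or database == "hive_metastore"


assert is_hive_metastore(None)
assert is_hive_metastore("hive_metastore")
assert not is_hive_metastore("main")  # hypothetical Unity Catalog name
```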

{%- for relation in relations -%}
select '{{ relation.schema }}' as schema,
'{{ relation.identifier }}' as identifier,
max(timestamp) as last_modified,

susodapop commented:
Just confirming: the output of a Databricks SQL query that includes a max() aggregation is never more than a single row, yes? That's why there is no GROUP BY present here?

benc-db (Collaborator, Author) replied:
In the absence of GROUP BY, it will be a single row.
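This is standard SQL: with only aggregates in the select list and no GROUP BY, the whole input is treated as one group, so exactly one row comes back, even when the table is empty. A quick self-contained illustration (sqlite3 here just to keep it runnable; the semantics are the same in Databricks SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table history (timestamp text)")
conn.executemany("insert into history values (?)", [("2023-10-01",), ("2023-10-13",)])

# One aggregate, no GROUP BY: exactly one row.
assert conn.execute("select max(timestamp) from history").fetchall() == [("2023-10-13",)]

# Even with zero input rows, the aggregate still yields a single row (of NULL).
conn.execute("delete from history")
assert conn.execute("select max(timestamp) from history").fetchall() == [(None,)]
```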

Resolved (outdated) review threads: dbt/include/databricks/macros/metadata.sql, tests/functional/adapter/test_source_freshness.py
benc-db merged commit 9127720 into main on Oct 23, 2023
21 checks passed