-
I'm interested! It looks like this would initially be rolled out for …
-
The example at the top has …

Potentially this could later extend to support column-level grants on supporting data platforms, if you want to rigidly enforce the contract: e.g., another project cannot access a column with PII; only the project working with HR data can. It could also allow sharing a source, with similar oversight, with a specific project.
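For illustration, on a platform that supports column-level grants (Postgres syntax below; all table, column, and role names are invented), that enforcement could look like:

```sql
-- Hypothetical column-level grants: reporting roles may read only the
-- non-PII columns, while the HR role may also read the PII column.
grant select (employee_id, department, hire_date)
  on analytics.dim_employees
  to reporting_role;

grant select (employee_id, department, hire_date, national_id)
  on analytics.dim_employees
  to hr_analytics;
```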
-
Two questions: … B) Is the idea for adding this to all models …
-
Hey, thanks for all your work regarding contracts! Looking forward to it! Wondering whether some SLA on when data should be refreshed should be included in the contract. For example: …

It goes against the nature of a data contract to have an automatic, implicit dependency between the two projects (since that would just result in the same massive DAG as in one big project, but scattered over multiple places). Some thoughts: …
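For reference, dbt's existing source freshness config already expresses this kind of refresh SLA; a contract-level analogue could borrow its shape. The yaml below uses the real source-freshness syntax with invented names; a model-level version of this is purely hypothetical:

```yaml
# Real dbt source-freshness syntax (names invented): warn if data is
# more than 12 hours old, error after 24 hours. A model contract could
# carry a similar refresh SLA.
sources:
  - name: raw_billing
    loaded_at_field: _etl_loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: invoices
```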
-
I'm very new to dbt, but wanted to say this looks like a great step in the right direction. I had a few musings or questions around contracts, and perhaps dbt itself, so thanks for reading my naive questions. My understanding is that contracts are essentially between a given model and its underlying output data, kind of a reverse schema lookup. When I see this, I start to wonder if it can open the door for static type checking in IDEs, or dynamic constraint of SELECTs. Consider the following cases that I would love to see dbt address.

1a) Static analysis / downstream column resolution. If I have a model that generates columns A + B, and then in a separate model I select from the former (SELECT * …, or SELECT A, B, C …), it would be a great dev experience if I could receive an error (even before build): no column C in table. Or the …

1b) Related, so you can see why extra columns are an issue. I have a "source" table from an external BigQuery table. While there are dozens of columns, my dbt universe only cares about 5. I actually want to constrict the aperture of that incoming API, so that no one can possibly expand our dependency on it. Today, this would have to be done by making an intermediate model or view, and hoping everyone uses that (not the source itself). But I wonder if defining a contract on the source, and then using …
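A sketch of what that might look like, assuming (hypothetically) that a contract config could attach to a source and restrict the project to the listed columns. Nothing like this is committed, and all names are invented:

```yaml
# Hypothetical: a contract on a source, listing only the 5 columns this
# project is allowed to depend on. Source and column names are invented,
# and the `enforced` flag mirrors the model-level proposal.
sources:
  - name: raw_billing
    tables:
      - name: invoices
        config:
          contract:
            enforced: true   # hypothetical for sources
        columns:
          - name: invoice_id
            data_type: string
          - name: customer_id
            data_type: string
          - name: amount
            data_type: numeric
          - name: currency
            data_type: string
          - name: issued_at
            data_type: timestamp
```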
-
I am so excited by this discussion & the associated work to come! I have a question around the actual syntax/shape of the YAML file/contract.

At my organization, my team is similarly rolling out data contracts using YAML files as an abstraction layer. We are using dbt to define transformations and want to integrate data contracts into our dbt projects. However, we have other data products sitting outside of dbt that rely on data contracts as well. We would like some flexibility (i.e., the agency) to define the shape of our data contracts; ideally, we'll be using a consistent shape across our various data products. While we don't have anything against the shape you all propose, we would like to be able to deviate if needed.

Have you all thought about providing more flexibility in how users define these YAML files? For example, as opposed to needing to define a schema using the …

Interested to hear thoughts on this! Thanks for your thoughtfulness and hard work 😄.
-
As a user of and contributor to the dbt_constraints package, I'm excited to see `primary_key` and `foreign_key` constraints as part of dbt-core. However, I'm running into an issue with this new feature: defining a `primary_key` constraint requires a contract; requiring a contract then requires `data_type` to be set for each column; and requiring `data_type` to be set for each column makes dbt NOT database agnostic, which is something I absolutely love about dbt. I have multiple customers using the same dbt project with different data warehouses, and I utilize primary keys so the BI tools know how to join. Why does the `primary_key` constraint require a contract, and/or all `data_type`s to then be set?
-
Hi! It would be great to have the possibility to define requirements for one column only. We had a case where a dbt model was referenced by another table, not built in dbt, and that table was expecting a column that had been dropped from the dbt model. I wanted to enforce that my dbt model has at least this column (since the dbt run would obviously not detect any error), but then it prompted me to fill in contract specifications for all the other columns, which is something I did not want to do.
-
Part of the larger initiative for Multi-project collaboration (#6725)
## What is a model "contract"?
A model's "contract" is a way of statically defining the shape of its "API response" when queried. Model developers define the contract as structured data (yaml), and it is enforced while the model is being built. The shape should be strongly typed, leveraging each data platform's data type system and support for constraints. If the contract is not met, the model does not update; the older contract-compliant object will remain in place.
The contract is the set of guarantees about the shape of the returned dataset. That includes, for every column in a model: its name, its data type, and any constraints the data platform can apply and enforce.
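As a sketch, the yaml might look something like this (model and column names are illustrative, and the exact spelling of the config may change):

```yaml
# Sketch of a contracted model. With the contract enforced, every column
# must declare a data_type, and constraints lean on the platform's own
# constraint support.
models:
  - name: dim_customers
    config:
      contract:
        enforced: true
    columns:
      - name: customer_id
        data_type: integer
        constraints:
          - type: not_null
      - name: customer_name
        data_type: varchar
```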
## Goals

…
## Considerations
It takes real work to define and maintain model contracts, which may not be appropriate for every model. Contracts offer a way to demarcate, from the set of all models, those which are mature, built for reuse, and intended for sharing.
Each contracted model declares exactly one contract. Of course, reusable datasets are likely to be used in different ways, by different consumers. It's tempting to define a separate dedicated contract for each consumer; I believe this would add unnecessary complexity to our first foray. For now, we're going to keep our focus on one set of guarantees, per model, made by the producers of that model.
## How does this compare to `dbt test`?

Contracts enforce "data shape." That's distinct from `dbt test`, which is still important and not going away. Tests are a highly flexible mechanism for checking "data quality" after a model is built; they can catch data quality issues in production, and mistaken logic in development or CI. Contracts are limited in what they can verify; a "test" can be any SQL you want.

By the end of this phase of work on "model contracts," I expect to offer guidance for thinking about testing within dbt: …
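To make the contrast concrete, here's a typical generic-test setup: it queries the built model for data quality, rather than enforcing shape at build time (column names are illustrative):

```yaml
# Generic tests run after the model is built, and can encode arbitrary
# data-quality checks that a contract cannot express.
models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'churned']
```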
## [Future] Other types of testing
The breakdown above is not the final say on testing in dbt. I've outlined two patterns below that are of ongoing interest. While they are conceptually related, they will be out of scope for this first attempt at "model contracts."
Tests as "pre-flight" checks
Today, all generic and singular tests run after a model has already been built in the database. This gets us a lot, in terms of simplicity and reproducibility for investigation. The `build` command enables users to stop building downstream models when an upstream model's test fails; the `--store-failures` option saves test failures for later auditability. But it's understandable when dbt users give the feedback that tests ought to run before a model has actually been updated in the database (#5687).

While it's out of scope for this work, we should keep thinking about how to make tests as early and efficient as possible, by combining queries and running them as part of a model's materialization. The options here vary based on how a model is being tested, and how many models. The four built-in generic tests land at different points along this spectrum:

- `not_null` could be reimplemented as a column constraint, on data platforms that support applying & enforcing this constraint at the same time as model creation. (Depending on your platform, this could be achieved with a model contract.)
- `accepted_values` could be reimplemented as row-level "check" constraints, on data platforms which support them. (Depending on your platform, this could be achieved with a model contract.)
- `unique` requires aggregate queries against an entire column, so it cannot be a row-level check. Analytical data platforms don't enforce `primary key` constraints, even if they support them being added as metadata (and may even use them in query optimization!). But we could at least imagine testing uniqueness in a transformed table before it replaces its preexisting counterpart, or in a batch of new data before merging/upserting into an incremental model. (What would that require in practice? Saving model SQL into a "temporary" view or table, running a query to check for duplicates, and only if none are found, swapping that new table with its preexisting counterpart; see the SQL sketch after this list. This will be slightly slower on data platforms that use `create or replace`. On data platforms that support transactions, rather than `create or replace table`, it's closer to the materialization logic already in place.)
- `relationships` depends on multiple models. Our ability to run it as a "pre-flight" check within one model's materialization depends on those models' relationships to one another in the DAG. In cases where an entire model group needs to be tested and deployed together, a "blue/green"-style deployment (build in staging database/schema → test → swap with production) still feels most appropriate.
### Unit testing

Since the earliest days of dbt, we've looked to software engineering for inspiration about what's missing in data tooling. Software APIs should have a contracted structure and well-tested contents. The latter process often uses fixture data, rather than real data, as its input. (Given a realistic-looking user, customer, account, transaction, etc., validate that the appropriate data is returned from every relevant endpoint.) There's a natural thematic extension from this discussion into one about unit testing (#4455).
Similar to defining model contracts, unit tests would check that a model's resulting dataset matches its expected "shape." Unlike model contracts, unit tests also verify (via fixture inputs & outputs) that the model's logical behavior (transformation logic) matches expectations.
Similar to defining model contracts, unit tests require additional work (creating & maintaining fixture data) that can add friction to a rapid development or iteration process. Unlike defining model contracts, a well-designed unit testing framework can also enable test-driven development, with benefits for speed & quality.
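As a sketch of what fixture-based unit tests could look like in yaml — this shape is purely illustrative, not a committed spec, and all names are invented:

```yaml
# Hypothetical unit test: given two duplicate input rows, the model
# should return a single deduplicated row.
unit_tests:
  - name: dim_customers_deduplicates
    model: dim_customers
    given:
      - input: ref('stg_customers')
        rows:
          - {customer_id: 1, customer_name: Ada}
          - {customer_id: 1, customer_name: Ada}
    expect:
      rows:
        - {customer_id: 1, customer_name: Ada}
```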
The topic of unit testing remains of great interest to us. While it's out of scope for this initial foray into model contracts, it could be a powerful extension.