-
I'm interested! It looks like this would initially be rolled out for …
-
The example at the top has …

Potentially this could later extend to support column-level grants on supporting data platforms, if you want to rigidly enforce the contract: e.g., another project cannot access a column with PII; only the project working with HR data can. It could also allow sharing a source, with similar oversight, with a specific project.
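For illustration, on a platform that supports column-level grants (Postgres syntax below; all table, column, and role names are invented), that enforcement could look like:

```sql
-- Hypothetical column-level grants: reporting roles may read only the
-- non-PII columns, while the HR role may also read the PII column.
grant select (employee_id, department, hire_date)
  on analytics.dim_employees
  to reporting_role;

grant select (employee_id, department, hire_date, national_id)
  on analytics.dim_employees
  to hr_analytics;
```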
-
Two questions: … B) Is the idea for adding this to all models …
-
Hey, thanks for all your work regarding contracts! Looking forward to it! Wondering whether some SLA on when data should be refreshed should be included in the contract. For example: …

It goes against the nature of a data contract to have an automatic, implicit dependency between the two projects (since that would just result in the same massive DAG as in one big project, but scattered over multiple places). Some thoughts: …
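For reference, dbt's existing source freshness config already expresses this kind of refresh SLA; a contract-level analogue could borrow its shape. The yaml below uses the real source-freshness syntax with invented names; a model-level version of this is purely hypothetical:

```yaml
# Real dbt source-freshness syntax (names invented): warn if data is
# more than 12 hours old, error after 24 hours. A model contract could
# carry a similar refresh SLA.
sources:
  - name: raw_billing
    loaded_at_field: _etl_loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: invoices
```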
-
I'm very new to dbt, but wanted to say this looks like a great step in the right direction. I had a few musings or questions around contracts, and perhaps dbt itself, so thanks for reading my naive questions. My understanding is that contracts are essentially between a given model and its underlying output data, kind of a reverse schema lookup. When I see this, I start to wonder if it can open the door for static type checking in IDEs, or dynamic constraint of SELECTs. Consider the following cases that I would love to see dbt address.

1a) Static analysis / downstream column resolution. If I have a model that generates columns A + B, and then in a separate model I select from the former (SELECT * …, or SELECT A, B, C …), it would be a great dev experience if I could receive an error (even before build): no column C in table. Or the …

1b) Related, so you can see why extra columns are an issue. I have a "source" table from an external BigQuery table. While there are dozens of columns, my dbt universe only cares about 5. I actually want to constrict the aperture of that incoming API, so that no one can possibly expand our dependency on it. Today, this would have to be done by making an intermediate model or view, and hoping everyone uses that (not the source itself). But I wonder if defining a contract on the source, and then using …
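A sketch of what that might look like, assuming (hypothetically) that a contract config could attach to a source and restrict the project to the listed columns. Nothing like this is committed, and all names are invented:

```yaml
# Hypothetical: a contract on a source, listing only the 5 columns this
# project is allowed to depend on. Source and column names are invented,
# and the `enforced` flag mirrors the model-level proposal.
sources:
  - name: raw_billing
    tables:
      - name: invoices
        config:
          contract:
            enforced: true   # hypothetical for sources
        columns:
          - name: invoice_id
            data_type: string
          - name: customer_id
            data_type: string
          - name: amount
            data_type: numeric
          - name: currency
            data_type: string
          - name: issued_at
            data_type: timestamp
```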
-
I am so excited by this discussion & the associated work to come! I have a question around the actual syntax/shape of the YAML file/contract.

At my organization, my team is similarly rolling out data contracts using YAML files as an abstraction layer. We are using dbt to define transformations and want to integrate data contracts into our dbt projects. However, we have other data products sitting outside of dbt that rely on data contracts as well. We would like some flexibility (i.e., the agency) to define the shape of our data contracts; ideally, we'll be using a consistent shape across our various data products. While we don't have anything against the shape you all propose, we would like to be able to deviate if needed.

Have you all thought about providing more flexibility in how users define these YAML files? For example, as opposed to needing to define a schema using the …

Interested to hear thoughts on this! Thanks for your thoughtfulness and hard work 😄.
-
As a user of and contributor to the dbt_constraints package, I'm excited to see `primary_key` and `foreign_key` constraints as part of dbt-core. However, I'm running into an issue with this new feature: defining a `primary_key` constraint requires a contract; requiring a contract then requires `data_type` to be set for each column; and requiring `data_type` to be set for each column makes dbt NOT database agnostic, which is something I absolutely love about dbt. I have multiple customers using the same dbt project with different data warehouses, and I utilize primary keys so the BI tools know how to join. Why does the `primary_key` constraint require a contract, and/or all `data_type`s to then be set?
-
Hi! It would be great to have the possibility to define requirements for one column only. We had a case where a dbt model was referenced by another table, not built in dbt, and that table was expecting a column that had been dropped from the dbt model. I wanted to enforce that my dbt model has at least this column (since the dbt run would obviously not detect any error), but then it prompted me to fill in contract specifications for all the other columns, which is something I did not want to do.
-
Part of the larger initiative for Multi-project collaboration (#6725)
## What is a model "contract"?
A model's "contract" is a way of statically defining the shape of its "API response" when queried. Model developers define the contract as structured data (yaml), and it is enforced while the model is being built. The shape should be strongly typed, leveraging each data platform's data type system and support for constraints. If the contract is not met, the model does not update; the older contract-compliant object will remain in place.
The contract is the set of guarantees about the shape of the returned dataset. That includes, for every column in a model: its name, its data type, and any constraints the data platform can apply and enforce.
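As a sketch, the yaml might look something like this (model and column names are illustrative, and the exact spelling of the config may change):

```yaml
# Sketch of a contracted model. With the contract enforced, every column
# must declare a data_type, and constraints lean on the platform's own
# constraint support.
models:
  - name: dim_customers
    config:
      contract:
        enforced: true
    columns:
      - name: customer_id
        data_type: integer
        constraints:
          - type: not_null
      - name: customer_name
        data_type: varchar
```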
## Goals

…
## Considerations
It takes real work to define and maintain model contracts, which may not be appropriate for every model. Contracts offer a way to demarcate, from the set of all models, those which are mature, built for reuse, and intended for sharing.
Each contracted model declares exactly one contract. Of course, reusable datasets are likely to be used in different ways, by different consumers. It's tempting to define a separate dedicated contract for each consumer; I believe this would add unnecessary complexity to our first foray. For now, we're going to keep our focus on one set of guarantees, per model, made by the producers of that model.
## How does this compare to `dbt test`?

Contracts enforce "data shape." That's distinct from `dbt test`, which is still important and not going away. Tests are a highly flexible mechanism for checking "data quality" after a model is built; they can catch data quality issues in production, and mistaken logic in development or CI. Contracts are limited in what they can verify; a "test" can be any SQL you want.

By the end of this phase of work on "model contracts," I expect to offer guidance for thinking about testing within dbt: …
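To make the contrast concrete, here's a typical generic-test setup: it queries the built model for data quality, rather than enforcing shape at build time (column names are illustrative):

```yaml
# Generic tests run after the model is built, and can encode arbitrary
# data-quality checks that a contract cannot express.
models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'churned']
```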
## [Future] Other types of testing
The breakdown above is not the final say on testing in dbt. I've outlined two patterns below that are of ongoing interest. While they are conceptually related, they will be out of scope for this first attempt at "model contracts."
Tests as "pre-flight" checks
Today, all generic and singular tests run after a model has already been built in the database. This gets us a lot, in terms of simplicity and reproducibility for investigation. The `build` command enables users to stop building downstream models when an upstream model's test fails; the `--store-failures` option saves test failures for later auditability. But it's understandable when dbt users give the feedback that tests ought to run before a model has actually been updated in the database (#5687).

While it's out of scope for this work, we should keep thinking about how to make tests as early and efficient as possible, by combining queries and running them as part of a model's materialization. The options here vary based on how a model is being tested, and how many models. The four built-in generic tests land at different points along this spectrum:

- `not_null` could be reimplemented as a column constraint, on data platforms that support applying & enforcing this constraint at the same time as model creation. (Depending on your platform, this could be achieved with a model contract.)
- `accepted_values` could be reimplemented as row-level "check" constraints, on data platforms which support them. (Depending on your platform, this could be achieved with a model contract.)
- `unique` requires aggregate queries against an entire column, so it cannot be a row-level check. Analytical data platforms don't enforce `primary key` constraints, even if they support them being added as metadata (and may even use them in query optimization!). But we could at least imagine testing uniqueness in a transformed table before it replaces its preexisting counterpart, or in a batch of new data before merging/upserting into an incremental model. (What would that require in practice? Saving model SQL into a "temporary" view or table, running a query to check for duplicates, and only if none are found, swapping that new table with its preexisting counterpart; see the SQL sketch after this list. This will be slightly slower on data platforms that use `create or replace`. On data platforms that support transactions, rather than `create or replace table`, it's closer to the materialization logic already in place.)
- `relationships` depends on multiple models. Our ability to run it as a "pre-flight" check within one model's materialization depends on those models' relationships to one another in the DAG. In cases where an entire model group needs to be tested and deployed together, a "blue/green"-style deployment (build in staging database/schema → test → swap with production) still feels most appropriate.
### Unit testing

Since the earliest days of dbt, we've looked to software engineering for inspiration about what's missing in data tooling. Software APIs should have a contracted structure and well-tested contents. The latter process often uses fixture data, rather than real data, as its input. (Given a realistic-looking user, customer, account, transaction, etc., validate that the appropriate data is returned from every relevant endpoint.) There's a natural thematic extension from this discussion into one about unit testing (#4455).
Similar to defining model contracts, unit tests would check that a model's resulting dataset matches its expected "shape." Unlike model contracts, unit tests also verify (via fixture inputs & outputs) that the model's logical behavior (transformation logic) matches expectations.
Similar to defining model contracts, unit tests require additional work (creating & maintaining fixture data) that can add friction to a rapid development or iteration process. Unlike defining model contracts, a well-designed unit testing framework can also enable test-driven development, with benefits for speed & quality.
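As a sketch of what fixture-based unit tests could look like in yaml — this shape is purely illustrative, not a committed spec, and all names are invented:

```yaml
# Hypothetical unit test: given two duplicate input rows, the model
# should return a single deduplicated row.
unit_tests:
  - name: dim_customers_deduplicates
    model: dim_customers
    given:
      - input: ref('stg_customers')
        rows:
          - {customer_id: 1, customer_name: Ada}
          - {customer_id: 1, customer_name: Ada}
    expect:
      rows:
        - {customer_id: 1, customer_name: Ada}
```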
The topic of unit testing remains of great interest to us. While it's out of scope for this initial foray into model contracts, it could be a powerful extension.