dbt Constraints / model contracts #574
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-spark contributing guide.
There is a bit of complication with the Spark implementation:
Overall, all those limitations make the solution look quite brittle. Also, our dbt-spark tests are still using the older paradigm with decorators. I got it working now, but it makes it difficult to reuse similar tests across adapters.
The CI tests fail when running
and the error is
The same query runs fine in Databricks on the
I've been talking with the people who maintain
The approach in the databricks adapter is to do a
I can follow a similar approach here, but what it means is that, due to the lack of
@b-per
START TRANSACTION;
-- execute some SQL statements here
COMMIT;
I think that ChatGPT is a bit out of its league here 😄 I tried it and got a
So we'll have to go with approach 2. But again, the lack of cross-table transactions is tricky. We can't:
The only thing I could think of, in order not to have any time where the table doesn't exist, would be to:
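Something along these lines, as a minimal sketch (schema, table, and column names are illustrative, and it assumes Delta Lake, where `create or replace table` is a single atomic statement):

```sql
-- 1. Build the new data under a temporary name.
create or replace table my_schema.my_model__tmp as
select /* model SQL goes here */ 1 as id;

-- 2. Add the check constraint on the temp relation; Delta validates
--    existing rows and fails fast if any violate the expression.
alter table my_schema.my_model__tmp
  add constraint id_not_negative check (id >= 0);

-- 3. Replace the target in one statement, so there is no window
--    in which the table does not exist.
create or replace table my_schema.my_model as
select * from my_schema.my_model__tmp;

drop table if exists my_schema.my_model__tmp;
```

One caveat with this sketch: constraints added on the temp table are not carried over by `create or replace table ... as select`, so they would need to be re-added on the final table; the temp-side constraint only acts as the validation gate.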
@b-per I like your approach of creating the new table as a temp table and then replacing the original table once everything is correct. When it comes to performance, I don't see a way around IO blockers, because we bumped into the same problems for the redshift and postgres implementations when it comes to inserting and copying data over. The tough thing for Spark will be manually building out the rollback logic for specific steps, and we'll have to explore how far Jinja can go in that department. @jtcohen6 do you have pro tips from your Spark experience with DDL strategies?
Sorry I missed this a few weeks ago! Gross. We can verify column names (later: also data types) by running the model SQL query with
To enforce
Those options, as I see them:
My vote would be for option 1. I'd be willing to document this as a known limitation of constraints on Spark/Databricks. Also, I believe this approach matches up most closely with the current implementation in
Let's go with option 1, as it has the most reasonable tradeoffs. The other options have more cost than benefit. It's good to know we have the databricks implementation as a reference! Thanks for thinking this through, Jerco!
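For concreteness, a hedged sketch of what option 1 looks like on Delta (names invented): the constraint is added right after the table is created, existing rows are validated at `add constraint` time, and any later write that violates the expression is rejected.

```sql
create or replace table my_schema.my_model as
select 1 as id;

-- validates existing rows; errors if any row fails the expression
alter table my_schema.my_model
  add constraint id_is_positive check (id > 0);

-- subsequent writes are enforced; this would now fail:
-- insert into my_schema.my_model values (-1);
```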
@Fleid when your team reviews this, keep in mind Spark's limitations in how constraint behaviors work as a whole, compared to Snowflake, Redshift, and Postgres, which are all easier to reason about. We'll need your help troubleshooting the failed databricks test.
{% for column_name in column_dict %}
{% set constraints_check = column_dict[column_name]['constraints_check'] %}
{% for constraint_check in constraints_check %}
{%- set constraint_hash = local_md5(column_name ~ ";" ~ constraint_check) -%} |
For later: Constraint hash is a sensible default, since we need a unique identifier. We may also want to let users define their own custom name for the constraint.
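A hypothetical sketch of that extension (the `constraint_name` key is invented here for illustration; it is not part of the current spec):

```jinja
{# hypothetical: prefer a user-supplied name, fall back to the hash #}
{%- set constraint_name = column_dict[column_name].get(
      'constraint_name',
      local_md5(column_name ~ ";" ~ constraint_check)
    ) -%}
{% call statement() %}
  alter table {{ relation }} add constraint {{ constraint_name }} check ({{ constraint_check }});
{% endcall %}
```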
{% set constraints_check = column_dict[column_name]['constraints_check'] %}
{% for constraint_check in constraints_check %}
{%- set constraint_hash = local_md5(column_name ~ ";" ~ constraint_check) -%}
{% if not is_incremental() %} |
For later: dbt-labs/dbt-core#6755
(Maybe just drop a comment to that issue for now)
{%- set constraint_hash = local_md5(column_name ~ ";" ~ constraint_check) -%}
{% if not is_incremental() %}
{% call statement() %}
alter table {{ relation }} add constraint {{ constraint_hash }} check ({{ column_name }} {{ constraint_check }}); |
Should users define the check including the column name, or not? In the current implementation, it is included, so it would be repeated here.
I am first trying to get the tests passing with it included, but in my mind it doesn't make sense to add the column name. We technically today can put the `check` of a given column under another one, which seems odd to me.
Even though you can add a `check` to another column, we shouldn't limit developers. The `check` inline with the respective column provides a reasonable signal that there should be a 1:1 mapping, even if we don't enforce it.
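To make the tradeoff concrete, an invented example (table and column names are illustrative): with the current contract, the user writes the full expression, so the macro must not prepend the column name, or the emitted DDL breaks.

```sql
-- user supplies the full expression; the macro emits it as-is:
alter table my_model add constraint c1 check (id > 0);

-- if the macro also prepended the column name to that same input,
-- the result would be invalid:
alter table my_model add constraint c1 check (id id > 0);
```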
{% set constraints_check = column_dict[column_name]['constraints_check'] %}
{% for constraint_check in constraints_check %} |
In the current implementation, `constraints_check` is a string, not a list, so we shouldn't loop over it here. That's why we're only seeing the first character (`(`) show up in the CI test!
alter table test16752556745428623819_test_constraints.my_model add constraint 50006a0485dbd5df8255e85e2c79411f check (id ();
We are planning to change this in the future by unifying these into a single `constraints` attribute: dbt-labs/dbt-core#6750
I was just looking at it, yes, and I will update my code. In my mind it could (or should) be a list; I am not sure why we would only allow one value (maybe worth considering for the future).
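A hedged sketch of a defensive fix in the meantime, assuming the attribute may arrive as either a bare string or a list:

```jinja
{# hypothetical: normalize constraints_check to a list before looping #}
{% set constraints_check = column_dict[column_name]['constraints_check'] %}
{% if constraints_check is string %}
  {% set constraints_check = [constraints_check] %}
{% endif %}
{% for constraint_check in constraints_check %}
  {%- set constraint_hash = local_md5(column_name ~ ";" ~ constraint_check) -%}
  {# ... emit alter table ... add constraint ... #}
{% endfor %}
```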
Agreed! It should be a list. TK in 6750
resolves #558
Description
Adds the ability to provide a list of columns for a model and force the model to a specific table schema. This PR also allows users to add a `not null` constraint on columns.
Related Adapter Pull Requests
Checklist
Run `changie new` to create a changelog entry