dbt Constraints / model contracts #574
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-spark contributing guide.
There is a bit of complication with the Spark implementation:
Overall, all those limitations make the solution look quite brittle. Also, our dbt-spark tests are still using the older paradigm with decorators. I got it working now, but it makes it difficult to reuse similar tests across adapters.
The CI tests fail when running
and the error is
The same query runs fine in Databricks on the
I've been talking with the people who maintain
The approach in the databricks adapter is to do a
I can follow a similar approach here, but what it means is that, due to the lack of
@b-per
START TRANSACTION;
-- execute some SQL statements here
COMMIT;
I think that ChatGPT is a bit out of its league here 😄 I tried it and got a
So we'll have to go with approach 2. But again, the lack of cross-table transactions is tricky. We can't:
The only thing I could think of, in order not to have any time where the table doesn't exist, would be to:
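Something along these lines, as a minimal sketch (schema, table, and column names are illustrative, and it assumes Delta Lake, where `create or replace table` is a single atomic statement):

```sql
-- 1. Build the new data under a temporary name.
create or replace table my_schema.my_model__tmp as
select /* model SQL goes here */ 1 as id;

-- 2. Add the check constraint on the temp relation; Delta validates
--    existing rows and fails fast if any violate the expression.
alter table my_schema.my_model__tmp
  add constraint id_not_negative check (id >= 0);

-- 3. Replace the target in one statement, so there is no window
--    in which the table does not exist.
create or replace table my_schema.my_model as
select * from my_schema.my_model__tmp;

drop table if exists my_schema.my_model__tmp;
```

One caveat with this sketch: constraints added on the temp table are not carried over by `create or replace table ... as select`, so they would need to be re-added on the final table; the temp-side constraint only acts as the validation gate.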
@b-per I like your approach of creating the new table as a temp table and then replacing the original table once everything is correct. When it comes to performance, I don't see a way around IO blockers, because we bumped into the same problems for the redshift and postgres implementations when it comes to inserting and copying data over. The tough thing for Spark will be manually building out the rollback logic for specific steps, and we'll have to explore how far Jinja can go in that department. @jtcohen6 do you have pro tips from your Spark experience with DDL strategies?
Sorry I missed this a few weeks ago! Gross. We can verify column names (later: also data types) by running the model SQL query with
To enforce
Those options, as I see them:
My vote would be for option 1. I'd be willing to document this as a known limitation of constraints on Spark/Databricks. Also, I believe this approach matches up most closely with the current implementation in
Let's go with option 1, as it has the most reasonable tradeoffs. The other options have more cost than benefit. It's good to know we have the databricks implementation as a reference! Thanks for thinking this through, Jerco!
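For concreteness, a hedged sketch of what option 1 looks like on Delta (names invented): the constraint is added right after the table is created, existing rows are validated at `add constraint` time, and any later write that violates the expression is rejected.

```sql
create or replace table my_schema.my_model as
select 1 as id;

-- validates existing rows; errors if any row fails the expression
alter table my_schema.my_model
  add constraint id_is_positive check (id > 0);

-- subsequent writes are enforced; this would now fail:
-- insert into my_schema.my_model values (-1);
```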
@Fleid when your team reviews this, keep in mind Spark's limitations in how constraint behaviors work as a whole, compared to Snowflake, Redshift, and Postgres, which are all easier to reason about. We'll need your help troubleshooting the failed databricks test.
{% for column_name in column_dict %}
{% set constraints_check = column_dict[column_name]['constraints_check'] %}
{% for constraint_check in constraints_check %}
{%- set constraint_hash = local_md5(column_name ~ ";" ~ constraint_check) -%} |
For later: Constraint hash is a sensible default, since we need a unique identifier. We may also want to let users define their own custom name for the constraint.
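A hypothetical sketch of that extension (the `constraint_name` key is invented here for illustration; it is not part of the current spec):

```jinja
{# hypothetical: prefer a user-supplied name, fall back to the hash #}
{%- set constraint_name = column_dict[column_name].get(
      'constraint_name',
      local_md5(column_name ~ ";" ~ constraint_check)
    ) -%}
{% call statement() %}
  alter table {{ relation }} add constraint {{ constraint_name }} check ({{ constraint_check }});
{% endcall %}
```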
{% set constraints_check = column_dict[column_name]['constraints_check'] %}
{% for constraint_check in constraints_check %}
{%- set constraint_hash = local_md5(column_name ~ ";" ~ constraint_check) -%}
{% if not is_incremental() %} |
For later: dbt-labs/dbt-core#6755
(Maybe just drop a comment to that issue for now)
{%- set constraint_hash = local_md5(column_name ~ ";" ~ constraint_check) -%}
{% if not is_incremental() %}
{% call statement() %}
alter table {{ relation }} add constraint {{ constraint_hash }} check ({{ column_name }} {{ constraint_check }}); |
Should users define the check including the column name, or not? In the current implementation, it is included, so it would be repeated here.
I am first trying to get the tests passing with it included, but in my mind it doesn't make sense to add the column name. We technically today can put the `check` of a given column under another one, which seems odd to me.
Even though you can add a `check` to another column, we shouldn't limit developers. The `check` inline with the respective column provides a reasonable signal that there should be a 1:1 mapping, even if we don't enforce it.
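To make the tradeoff concrete, an invented example (table and column names are illustrative): with the current contract, the user writes the full expression, so the macro must not prepend the column name, or the emitted DDL breaks.

```sql
-- user supplies the full expression; the macro emits it as-is:
alter table my_model add constraint c1 check (id > 0);

-- if the macro also prepended the column name to that same input,
-- the result would be invalid:
alter table my_model add constraint c1 check (id id > 0);
```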
{% set constraints_check = column_dict[column_name]['constraints_check'] %}
{% for constraint_check in constraints_check %} |
In the current implementation, `constraints_check` is a string, not a list, so we shouldn't loop over it here. That's why we're only seeing the first character (`(`) show up in the CI test!
alter table test16752556745428623819_test_constraints.my_model add constraint 50006a0485dbd5df8255e85e2c79411f check (id ();
We are planning to change this in the future by unifying these into a single `constraints` attribute: dbt-labs/dbt-core#6750
I was just looking at it, yes, and I will update my code. In my mind it could (or should) be a list; I am not sure why we would only allow one value (maybe worth considering for the future).
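A hedged sketch of a defensive fix in the meantime, assuming the attribute may arrive as either a bare string or a list:

```jinja
{# hypothetical: normalize constraints_check to a list before looping #}
{% set constraints_check = column_dict[column_name]['constraints_check'] %}
{% if constraints_check is string %}
  {% set constraints_check = [constraints_check] %}
{% endif %}
{% for constraint_check in constraints_check %}
  {%- set constraint_hash = local_md5(column_name ~ ";" ~ constraint_check) -%}
  {# ... emit alter table ... add constraint ... #}
{% endfor %}
```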
Agreed! It should be a list. TK in 6750
resolves #558
Description
Adds the ability to provide a list of columns for a model and force the model to a specific table schema. This PR also allows users to add a `not null` constraint on columns.
Related Adapter Pull Requests
Checklist
Run `changie new` to create a changelog entry