Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support new constraints spec, and specifying constraints in create table as #288

Open
jtcohen6 opened this issue Mar 7, 2023 · 1 comment
Labels
enhancement New feature or request Stale

Comments

@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 7, 2023

(Happy to split this into multiple issues — keeping it together here for now, to keep the conversation centralized)

Describe the feature

First, we are changing the spec of constraints to be a first-class property/configuration, and should update the implementation in dbt-databricks to match (rather than the current meta-based approach).

Changes to dbt-core:

Associated changes to dbt-spark:

Second: This is a feature request for Databricks, more so than a feature request for dbt-databricks!

create table <table> (<column_name> <data_type> not null) as (
    select <query>
);

https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html

This optional clause populates the table using the data from query. When you specify a query you must not also specify a column_specification. The table schema will be derived form the query.

Context & alternatives

Databricks supports many types of constraints, and it can enforce two: not null constraints on columns in a table, and check constraints of boolean expressions on one or more columns in a table.

However, Databricks does not support the inclusion of a column spec / constraint within CTA (create or replace table <tablename> as <sql>). This puts us in a tricky situation when trying to atomically replace a table, and enforce constraints on the data being added in. These are the options, as we understand them:

  1. create or replace the table with the constraints, then insert the data returned by the query, with the constraints enforced on any data being added in. Because Databricks does not support transactions, while the second statement (insert) is running, the table will appear empty to downstream queriers.
  2. Create a new table with the constraints, in a separate location, then insert the data returned by the query, with the constraints enforced. If all constraints pass, move the new table to the preexisting table’s location, requiring a “deep” clone that can be very slow (effectively copying the full dataset over from one location to another).
  3. Atomically replace the table via create or replace table ... as, ensuring zero downtime and no need to move data, then apply the constraints after the fact via alter table statements. Unfortunately, this means that the constraints aren’t actually enforced until after the fact—no better than existing dbt test—and that the model’s table can therefore include data that violates its contracted expectations.

Is that understanding correct? For now, we have opted for option 3 in dbt-spark (dbt-labs/dbt-spark#574), as it is closest to the existing pattern for atomically replacing models and testing them after-the-fact. (It's also the existing implementation in dbt-databricks, which allows defining constraints in the meta dictionary.) The ideal is gaining the ability to include constraints within a CTA (create or replace table … as).`

Who will this benefit?

Are you interested in contributing this feature?

For the first bit, the code changes should look a lot like the changes that we'll be making in dbt-spark.

For the second, I believe this would require a change to Apache Spark or Databricks, so I don't think I'd know how :)

@jtcohen6 jtcohen6 added the enhancement New feature or request label Mar 7, 2023
@github-actions
Copy link

github-actions bot commented Sep 4, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue.

@github-actions github-actions bot added the Stale label Sep 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Stale
Projects
None yet
Development

No branches or pull requests

1 participant