Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAP-383] [Feature] Support Dynamic Data Masking in CTAS Statements #85

Open
3 tasks done
Tracked by #7979
jdoldis opened this issue Mar 17, 2023 · 24 comments
Open
3 tasks done
Tracked by #7979
Labels
enhancement New feature or request triage

Comments

@jdoldis
Copy link

jdoldis commented Mar 17, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-snowflake functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Currently you cannot specify column masking policies in Snowflake CTAS statements with dbt-snowflake. For example, CREATE TABLE <table_name>(<col_name> <col_type> WITH MASKING POLICY <policy_name>) AS SELECT <query>.

As a workaround masking policies can be applied to columns in a dbt post hook using an ALTER TABLE statement. The issue with doing this is that the CTAS and ALTER TABLE statements cannot be issued in the same transaction, as per the Snowflake documentation - "Each DDL statement executes as a separate transaction". As a result there is there is a small window of time between the CTAS and ALTER TABLE <table_name> MODIFY COLUMN <column_name> SET MASKING POLICY <policy_name> statements where the data is not masked, and if the ALTER TABLE statement fails it would remain that way.

Supporting masking policy specification in the CTAS statement would fix this. As per the Snowflake documentation "Executing a CREATE TABLE … AS SELECT (CTAS) statement applies any masking policies on columns included in the statement before the data is populated in the new table".

It also may not be difficult to support this given the recent work on model contracts which provides the CREATE TABLE <table_name>(<col_name> <col_type>) AS SELECT <query> syntax. All that would need to be added is the WITH MASKING POLICY <policy_name> part of the statement.

One way to provide the config would be something like:

columns:
      - name: id
        description: "Primary key for this model"
        data_type: integer
        masking_policy: <policy_name>
        contraints: <constraint_list>
        ...

The masking policy could then be applied in get_columns_spec_ddl.

Describe alternatives you've considered

Using an ALTER TABLE statement in a post hook to apply the masking policy. As described above due to Snowflake DDL statements always being executed in separate transactions this leaves the possibility of unmasked data.

Who will this benefit?

Anyone that wants to take advantage of dynamic data masking in Snowflake using dbt.

Are you interested in contributing this feature?

Yes

Anything else?

No response

@jdoldis jdoldis added enhancement New feature or request triage labels Mar 17, 2023
@github-actions github-actions bot changed the title [Feature] Support Dynamic Data Masking in CTAS Statements [ADAP-383] [Feature] Support Dynamic Data Masking in CTAS Statements Mar 17, 2023
@jdoldis
Copy link
Author

jdoldis commented Mar 21, 2023

@Fleid
Copy link

Fleid commented Mar 24, 2023

@jtcohen6 I'm sorry if we already discussed this... but is that something on your radar for contracts?

@Fleid Fleid added refinement and removed triage labels Mar 24, 2023
@Fleid Fleid self-assigned this Mar 24, 2023
@Fleid
Copy link

Fleid commented Mar 24, 2023

Thanks @jdoldis for the write-up, you make a great case for it :)

@jdoldis
Copy link
Author

jdoldis commented Mar 25, 2023

Thanks @Fleid , I'm writing a custom materialization to support the syntax as described above. Let me know if you'd like those changes contributed here. Otherwise, I look forward to seeing it at some point down the road 🙂

@ingolevin
Copy link

ingolevin commented Apr 4, 2023

I have the exact same requirement, but for row access policies ! I think this can be implemented in one go, as it share the same CTAS syntax

CTAS:

CREATE TABLE <table_name> ( <col1> <data_type> [ with ] masking policy <policy_name> [ , ... ] )
  ...
  [ WITH ] ROW ACCESS POLICY <policy_name> ON ( <col1> [ , ... ] )

Config with masking and row access policies could like this:

columns:
      - name: id
        description: "Primary key for this model"
        data_type: integer
        masking_policy: <policy_name>
        row_access_policy: <policy_name>
        contraints: <constraint_list>
        ...

Currently the only option is via post_hook ALTER table|view ADD ROW ACCESS POLICY ...

Since one has to check the information_schema first if that relation (view or table) already has said RAP applied, because otherwise the ALTER command fails, this whole process can take up to 10 seconds [Edit: that was mainly due to a cluster_by, but even in the best case there is a 1sec delay] in which the data in the relation is unprotected.

@ingolevin
Copy link

Thanks @Fleid , I'm writing a custom materialization to support the syntax as described above. Let me know if you'd like those changes contributed here. Otherwise, I look forward to seeing it at some point down the road slightly_smiling_face

@jdoldis Would you release your custom materialization as a package, until this gets implemented into dbt-snowflake? I'd be very interested as well!

@jdoldis
Copy link
Author

jdoldis commented Apr 19, 2023

Hey @ingolevin, the materialisation I wrote is essentially a copy paste of the standard v1.5 table materialisation. The difference is I have modified the table_columns_and_constraints macro to support masking policies. In this modified macro I build the ddl by looping through the model columns and outputting the name/datatype, and then adding WITH MASKING POLICY <policy_name> if a masking policy is defined for the column.

@jdoldis
Copy link
Author

jdoldis commented Apr 19, 2023

Since I raised this it seems the ddl logic has moved around a bit, the relevant code that could be modified in the Snowflake adapter to support masking policies would now be this function I think.

@jdoldis
Copy link
Author

jdoldis commented Apr 19, 2023

Regarding creating a package, it would be good to hear back from @Fleid first. Ideally we could implement here, but if that's not possible I would be open to it 🙂

@jtcohen6
Copy link
Contributor

I'm overdue responding here!

It's true that, starting in v1.5, for models with enforced contracts, dbt will be able to template out create table DDL that includes a full column spec.

That's the prerequisite to defining row-level access policies & column-level masking policies while the table is being created, rather than via an alter statement right after! (And avoids a thorny problem where, copy grants would lead to folks having access to a just-replaced table before masking policies have been applied.)

I hadn't had row-level & column-level access/masking policies in scope for model contracts — but it would be very cool to combine the two, to consider the access/masking policy part of the contract, and to consider the removal of one such policy to be a breaking change to the contract (catchable in CI).

For the moment, it would be possible to stand this up via some macro overrides. (Maybe these policies could even be a constraint of type: custom, i.e. "free text input"?) Before we support this as out-of-the-box functionality, I'd want us to first see if there's an opportunity to standardize over the way this works on different data platforms that support related functionality. If we can offer a standard syntax for this, that abstracts over the capabilities on multiple platforms without leaking, we should do it :)

@jdoldis
Copy link
Author

jdoldis commented Apr 20, 2023

Sounds great @jtcohen6 , let me know if I can help!

@jtcohen6
Copy link
Contributor

I'm following up here after a good chat with @graciegoheen @dbeatty10 @dataders, given the prompt to support similar functionality in dbt-redshift (dbt-labs/dbt-redshift#589). There's now support for this on several of our most popular data platforms:

I think the good implementation of this functionality would look like:

  1. Teaching dbt about functions, as a new "materialization"* + a new node type. This would be inclusive of both UDFs (many previous discussions + older issues) and masking/access rules (a special type of function that returns a boolean).
  2. Allowing dbt to reference masking/access rules within model contract definitions

*These map to DWH objects, so they are members of the DAG, and models / other functions could call (= depend on) them. I don't think these are models because they aren't select statements — they can be parametrized, they cannot be previewed to return a dataset (...unless it's a UDTF, which blurs the boundary).

(1) is a bigger lift than (2), and it's not something we have the capacity to prioritize right now — but in the meantime, I've asked @dataders to do a bit more thinking about what a good UX might look like :)

@b-per
Copy link

b-per commented Feb 13, 2024

Databricks also allows data masking in CTA as well!

[CREATE OR] REPLACE TABLE
...

column_properties
  { NOT NULL |
    GENERATED ALWAYS AS ( expr ) |
    GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( [ START WITH start ] [ INCREMENT BY step ] ) ] |
    DEFAULT default_expression |
    COMMENT column_comment |
    column_constraint |
    MASK clause } [ ... ]

@dbeatty10
Copy link
Contributor

Moving this feature request to the dbt-adapters repo for further refinement since the underlying functionality is supported on many cloud data warehouses now (Redshift, Snowflake (Enterprise only), BigQuery, Databricks, Azure, etc.)

Copy link

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 13, 2024
Copy link

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 21, 2024
@jdoldis
Copy link
Author

jdoldis commented Aug 26, 2024

I still think this would be a great security improvement for DBT, would be good to keep the issue open.

@b-per
Copy link

b-per commented Aug 26, 2024

Today with contracts and constraints, we actually support tag-based masking policies with the follwing syntax

models:
  - name: my_model
    config:
      contract: {enforced: true}
      materialized: table
    columns:
      - name: id
        data_type: int
        constraints:
          - type: custom
            expression: "tag (my_tag = 'my_value')"

This requires defining all the columns and their types though.

Would it work for your use case?

@jdoldis
Copy link
Author

jdoldis commented Aug 27, 2024

That's great news, hadn't realised that, thanks @b-per ! I don't currently have all column types documented, but I can move to that 👍

@jellerbee-altr
Copy link

@b-per This seems promising but is there any documentation on custom constraints? In particular docs on the expression "tag (my_tag = 'my_value')"

@jellerbee-altr
Copy link

Okay I figured it out. The expression is simply the SQL to represent the constraint on the column.

@grantith
Copy link

grantith commented Aug 27, 2024

@b-per That's interesting, I didn't know that was possible! I'll have to give it a shot. If only I had a good way to add row access policies to relations in Snowflake, too.

@jdoldis You might find something like dbt-osmosis useful for quickly generating model docs.

@b-per
Copy link

b-per commented Aug 28, 2024

The reason we need a contract with all columns is that it is the only way where we can send a create or replace <cols_info_with_masking> mytable as .... We can't do it with just a partial list of columns.

To generate a YAML with all the columns you could also use generate_model_yaml from dbt-codegen .

As there seems to be some interest from folks about the custom constraint "trick", I will reach out to the Docs team to see if we could add it to the docs.

@b-per
Copy link

b-per commented Aug 28, 2024

Also, I have not tried it, but I think that you could add directly a masking policy instead of using tags

models:
  - name: my_model
    config:
      contract: {enforced: true}
      materialized: table
    columns:
      - name: id
        data_type: int
        constraints:
          - type: custom
            expression: "masking policy my_policy"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage
Projects
None yet
Development

No branches or pull requests

8 participants