docs(): Announcing DataHub Open Assertions Specification #10609

Collaborator

@jjoyce0510 jjoyce0510 commented May 28, 2024

Summary

View rendered announcement.

This PR adds a doc announcing a new initiative: the DataHub Open Source Assertions Specification. It is intended as a universal specification for declaring Data Quality checks and compiling them into artifacts that can be registered with, or directly executed by, 3rd party Data Quality tools such as Great Expectations, dbt tests, and Snowflake (via Data Metric Functions, or DMFs).

The sister PR for this one is located here and declares the foundational data models for each type of assertion we aim to support, along with a reference implementation of various assertion types built on top of Snowflake DMFs.

Please reach out if this project interests you and you'd like to contribute other DQ sinks, such as Great Expectations (GE), dbt tests, or Soda!

Status

Work in Progress. Working to provide updated examples of the Assertion definition specification.

@github-actions github-actions bot added the docs Issues and Improvements to docs label May 28, 2024
@jjoyce0510 jjoyce0510 changed the title docs(): Announcing DataHub Open Assertions Specification docs(): Announcing DataHub Open Assertions Specification (WIP) May 28, 2024
John Joyce added 2 commits May 29, 2024 09:56
John Joyce added 2 commits May 29, 2024 21:35
…ler' into jj--add-docs-for-assertions-compiler
docs/observability/open-assertions-spec.md
Comment on lines 550 to 552
```bash
datahub assertions compile -f examples/library/assertions_configuration.yml -p snowflake
```
Collaborator

Suggested change
```bash
datahub assertions compile -f examples/library/assertions_configuration.yml -p snowflake
```
```bash
datahub assertions compile -f examples/library/assertions_configuration.yml -p snowflake -x DMF_SCHEMA=<db>.<schema>
```

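The suggested command can also be assembled programmatically, for example from a deployment script. The sketch below is only an illustration: `build_compile_command` is a hypothetical helper, not part of the DataHub CLI, and it uses only the `-f`, `-p`, and `-x` flags quoted in this thread. Repeating `-x` for more than one extra is an assumption, and `my_db.my_schema` is a placeholder value.

```python
import shlex
from typing import List, Mapping, Optional


def build_compile_command(
    config_path: str,
    platform: str,
    extras: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Assemble argv for `datahub assertions compile` (hypothetical helper)."""
    argv = ["datahub", "assertions", "compile", "-f", config_path, "-p", platform]
    # Each extra becomes its own `-x KEY=VALUE` pair, e.g. DMF_SCHEMA=<db>.<schema>.
    # Passing several `-x` pairs is an assumption; the thread only shows one.
    for key, value in (extras or {}).items():
        argv += ["-x", f"{key}={value}"]
    return argv


if __name__ == "__main__":
    cmd = build_compile_command(
        "examples/library/assertions_configuration.yml",
        "snowflake",
        {"DMF_SCHEMA": "my_db.my_schema"},  # placeholder database and schema
    )
    # Print a shell-safe version of the command for review before running it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string lets the command be passed to `subprocess.run` without shell quoting concerns.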
John Joyce added 2 commits May 30, 2024 15:13
John Joyce added 2 commits May 30, 2024 15:39
- You must have a Snowflake Enterprise account, where the DMFs feature is enabled.
- You must have the necessary permissions to provision DMFs in your Snowflake environment (see below)
- You must have the necessary permissions to query the DMF results in your Snowflake environment (see below)

Collaborator

@mayurinehate mayurinehate May 31, 2024

Can we keep 3 separate sections for the permissions required in the Snowflake environment, as described below:

Group 1. Permissions required for creating and registering DMFs (running dmf_definitions.sql and dmf_associations.sql)

| Privilege | Object | Notes |
| --- | --- | --- |
| USAGE | Database, schema | Database and schema where Snowflake DMFs will be created. This is configured in the compile command described below. |
| CREATE FUNCTION | Schema | Enables creating new DMFs in the schema configured in the compile command. |
| EXECUTE DATA METRIC FUNCTION | Account | Controls which roles have access to serverless compute resources to call the system DMF. |
| USAGE | Database, schema | The database and schema that contain the referenced table in the query. |
| OWNERSHIP | Table | Enables you to associate a DMF with a referenced table. |
| USAGE | DMF | Enables calling the DMF in the schema configured in the compile command. |

| Database Role | Notes |
| --- | --- |
| SNOWFLAKE.DATA_METRIC_USER | To use system DMFs |

Group 2. Permissions required to view DMF results (Snowflake ingestion)

| Application Role | Notes |
| --- | --- |
| SNOWFLAKE.DATA_QUALITY_MONITORING_VIEWER | Query the DMF results table |

Group 3. Permissions required by the owner of the table (scheduled DMFs run with the table owner's role)

| Privilege | Object | Notes |
| --- | --- | --- |
| USAGE | Database, schema | Database and schema where Snowflake DMFs will be created. This is configured in the compile command described below. |
| USAGE | DMF | Enables calling the DMF in the schema configured in the compile command. |
| EXECUTE DATA METRIC FUNCTION | Account | Controls which roles have access to serverless compute resources to call the system DMF. |

| Database Role | Notes |
| --- | --- |
| SNOWFLAKE.DATA_METRIC_USER | To use system DMFs |

Collaborator

@mayurinehate mayurinehate May 31, 2024

A Snowflake system admin can follow this guide to create a new DataHub-specific role for assertions.

-- grant <table-owner-role> the permissions needed to run DMFs on a schedule
grant usage on database "<dmf-database>" to role "<table-owner-role>";
grant usage on schema "<dmf-database>"."<dmf-schema>" to role "<table-owner-role>";
grant usage on all functions in schema "<dmf-database>"."<dmf-schema>" to role "<table-owner-role>";
grant usage on future functions in schema "<dmf-database>"."<dmf-schema>" to role "<table-owner-role>";
grant database role SNOWFLAKE.DATA_METRIC_USER to role "<table-owner-role>";
grant execute data metric function on account to role "<table-owner-role>";


-- grant <assertion-service-role> the permissions needed to create DMFs and associate them with tables
grant usage on database "<dmf-database>" to role "<assertion-service-role>";
grant usage on schema "<dmf-database>"."<dmf-schema>" to role "<assertion-service-role>";
grant create function on schema "<dmf-database>"."<dmf-schema>" to role "<assertion-service-role>";
-- grant ownership and the remaining permissions to <assertion-service-role>
grant role "<table-owner-role>" to role "<assertion-service-role>";

-- grant <datahub_role> read access to the DMF results
grant application role SNOWFLAKE.DATA_QUALITY_MONITORING_VIEWER to role "<datahub_role>";

Here "<datahub_role>" is the role used for ingestion, and "<assertion-service-role>" is the role used to provision assertions on Snowflake using the SQL artifacts generated in the compile step below.
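Since the same placeholders recur in every statement, the grants can be rendered from a small template. The following is a minimal sketch, not part of DataHub: `quote_ident` and `render_grants` are hypothetical helpers, and the sample names in the usage block are placeholders. It writes the bulk grants with `IN SCHEMA`, quotes the database and schema as two separate identifiers, and terminates each statement with a semicolon, which is how I understand Snowflake's `GRANT` syntax. Please verify the output against the Snowflake documentation before running it.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Double-quote a Snowflake identifier, escaping embedded double quotes.

    Quoted identifiers are case-sensitive, so pass names in their stored case
    (typically upper case).
    """
    return '"' + name.replace('"', '""') + '"'


def render_grants(
    dmf_database: str,
    dmf_schema: str,
    table_owner_role: str,
    assertion_service_role: str,
    datahub_role: str,
) -> List[str]:
    """Render the grant statements from the comment above (hypothetical helper)."""
    db = quote_ident(dmf_database)
    schema = f"{db}.{quote_ident(dmf_schema)}"
    owner = quote_ident(table_owner_role)
    service = quote_ident(assertion_service_role)
    viewer = quote_ident(datahub_role)
    return [
        # Table owner role: needed so scheduled DMFs can run.
        f"GRANT USAGE ON DATABASE {db} TO ROLE {owner};",
        f"GRANT USAGE ON SCHEMA {schema} TO ROLE {owner};",
        f"GRANT USAGE ON ALL FUNCTIONS IN SCHEMA {schema} TO ROLE {owner};",
        f"GRANT USAGE ON FUTURE FUNCTIONS IN SCHEMA {schema} TO ROLE {owner};",
        f"GRANT DATABASE ROLE SNOWFLAKE.DATA_METRIC_USER TO ROLE {owner};",
        f"GRANT EXECUTE DATA METRIC FUNCTION ON ACCOUNT TO ROLE {owner};",
        # Assertion service role: creates DMFs and associates them with tables.
        f"GRANT USAGE ON DATABASE {db} TO ROLE {service};",
        f"GRANT USAGE ON SCHEMA {schema} TO ROLE {service};",
        f"GRANT CREATE FUNCTION ON SCHEMA {schema} TO ROLE {service};",
        f"GRANT ROLE {owner} TO ROLE {service};",
        # Ingestion role: reads the DMF results.
        "GRANT APPLICATION ROLE SNOWFLAKE.DATA_QUALITY_MONITORING_VIEWER "
        f"TO ROLE {viewer};",
    ]


if __name__ == "__main__":
    # Placeholder names; substitute the values for your own account.
    for stmt in render_grants(
        "ANALYTICS", "DMFS", "TABLE_OWNER", "ASSERTION_SVC", "DATAHUB_INGEST"
    ):
        print(stmt)
```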

Collaborator Author

thanks! added this~

either via CLI or the UI visible as normal assertions.

`datahub ingest -c snowflake.yml`

Collaborator

Can we add a caveats section that mentions the following:

  1. Snowflake currently supports at most 1000 DMF-table associations, so you cannot define more than 1000 assertions for Snowflake.
  2. Snowflake does not allow JOIN queries or non-deterministic functions in a DMF definition, so you cannot use these in the SQL for a SQL assertion or in the filters section.
  3. All DMFs scheduled on a table must follow exactly the same schedule, so you cannot set assertions on the same table to run on different schedules.
  4. DMFs are only supported for regular tables, not dynamic or external tables. The same limitation applies to assertions.
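Some of these caveats could be checked before compiling. The sketch below is a rough illustration under stated assumptions: `AssertionDef` and `check_snowflake_caveats` are hypothetical and do not reflect the actual spec's data model, each assertion is assumed to map to exactly one DMF-table association, and the JOIN check is a naive keyword search that would also flag the word inside a string literal or comment. It covers caveats 1 to 3 (JOIN only); non-deterministic functions and table types are not checked.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

# Limit quoted in the comment above; it may change on Snowflake's side.
MAX_DMF_ASSOCIATIONS = 1000


@dataclass(frozen=True)
class AssertionDef:
    """Hypothetical, simplified stand-in for one assertion definition."""

    table: str
    schedule: str  # e.g. a cron expression
    sql: Optional[str] = None  # custom SQL or filter text, if any


def check_snowflake_caveats(assertions: List[AssertionDef]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: List[str] = []

    # Caveat 1: assumes one DMF-table association per assertion.
    if len(assertions) > MAX_DMF_ASSOCIATIONS:
        problems.append(
            f"{len(assertions)} assertions exceed the limit of {MAX_DMF_ASSOCIATIONS}"
        )

    schedules: Dict[str, Set[str]] = defaultdict(set)
    for a in assertions:
        schedules[a.table].add(a.schedule)
        # Caveat 2 (JOIN only): naive, case-insensitive keyword match.
        if a.sql and re.search(r"\bjoin\b", a.sql, re.IGNORECASE):
            problems.append(f"{a.table}: SQL contains JOIN, which DMFs do not allow")

    # Caveat 3: every assertion on a table must share one schedule.
    for table, seen in sorted(schedules.items()):
        if len(seen) > 1:
            problems.append(
                f"{table}: {len(seen)} different schedules; a table's DMFs share one schedule"
            )
    return problems


if __name__ == "__main__":
    sample = [
        AssertionDef("db.s.orders", "0 * * * *"),
        AssertionDef("db.s.orders", "0 0 * * *"),
    ]
    for problem in check_snowflake_caveats(sample):
        print(problem)
```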

Collaborator Author

I'm not sure that makes sense. If we redirect to the Snowflake DMF documentation, that's sufficient; otherwise we'll have to constantly update this file as things change on the Snowflake side.

Collaborator Author

I'll look for opportunities to incorporate.

@jjoyce0510 jjoyce0510 changed the title docs(): Announcing DataHub Open Assertions Specification (WIP) docs(): Announcing DataHub Open Assertions Specification Jun 12, 2024
Contributor

@shirshanka shirshanka left a comment

Super excited for this!

@jjoyce0510 jjoyce0510 merged commit ea7b27b into datahub-project:master Jun 12, 2024
4 of 5 checks passed
sleeperdeep pushed a commit to sleeperdeep/datahub that referenced this pull request Jun 25, 2024
…ject#10609)

Co-authored-by: John Joyce <[email protected]>
yoonhyejin pushed a commit that referenced this pull request Jul 16, 2024
3 participants