From 9e4d1eab1a953d5caae681b2a4d04a81586641da Mon Sep 17 00:00:00 2001 From: John Joyce Date: Wed, 23 Aug 2023 11:38:27 -0700 Subject: [PATCH 1/2] Volume assertions def --- .../observe/freshness-assertions.md | 2 +- .../observe/volume-assertions.md | 335 ++++++++++++++++++ 2 files changed, 336 insertions(+), 1 deletion(-) create mode 100644 docs/managed-datahub/observe/volume-assertions.md diff --git a/docs/managed-datahub/observe/freshness-assertions.md b/docs/managed-datahub/observe/freshness-assertions.md index 54b3134151d3a2..22ee555127c6bb 100644 --- a/docs/managed-datahub/observe/freshness-assertions.md +++ b/docs/managed-datahub/observe/freshness-assertions.md @@ -59,7 +59,7 @@ Tables. For example, imagine that we work for a company with a Snowflake Table that stores user clicks collected from our e-commerce website. This table is updated with new data on a specific cadence: once per hour (In practice, daily or even weekly are also common). In turn, there is a downstream Business Analytics Dashboard in Looker that shows important metrics like -the number of people clicking our "Daily Sale" banners, and this dashboard pulls is generated from data stored in our "clicks" table. +the number of people clicking our "Daily Sale" banners, and this dashboard is generated from data stored in our "clicks" table. It is important that our clicks Table continues to be updated each hour because if it stops being updated, it could mean that our downstream metrics dashboard becomes incorrect. And the risk of this situation is obvious: our organization may make bad decisions based on incomplete information. 
diff --git a/docs/managed-datahub/observe/volume-assertions.md b/docs/managed-datahub/observe/volume-assertions.md new file mode 100644 index 00000000000000..173be9855f92d8 --- /dev/null +++ b/docs/managed-datahub/observe/volume-assertions.md @@ -0,0 +1,335 @@ +--- +description: This page provides an overview of working with DataHub Volume Assertions +--- +import FeatureAvailability from '@site/src/components/FeatureAvailability'; + + +# Volume Assertions + + + + +> ⚠️ The **Volume Assertions** feature is currently in private beta, part of the **Acryl Observe** module, and may only be available to a +> limited set of design partners. +> +> If you are interested in trying it and providing feedback, please reach out to your Acryl Customer Success +> representative. + +## Introduction + +Can you remember a time when the meaning of a Data Warehouse Table that you depended on fundamentally changed, with little or no notice? +If the answer is yes, how did you find out? We'll take a guess - someone looking at an internal reporting dashboard or, worse, a user using your product, sounded an alarm when +a number looked a bit out of the ordinary. Perhaps your table initially tracked purchases made on your company's e-commerce web store, but suddenly began to include purchases made +through your company's new mobile app. + +There are many reasons why an important Table on Snowflake, Redshift, or BigQuery may change in its meaning - application code bugs, new feature rollouts, +changes to key metric definitions, etc. Oftentimes, these changes break important assumptions made about the data used in building key downstream data products +like reporting dashboards or data-driven product features. + +What if you could reduce the time to detect these incidents, so that the people responsible for the data were made aware of data +issues _before_ anyone else? With Acryl DataHub **Volume Assertions**, you can. 
+ +Acryl DataHub allows users to define expectations about the normal volume, or size, of a particular warehouse Table, +and then monitor those expectations over time as the table grows and changes. + +In this article, we'll cover the basics of monitoring Volume Assertions - what they are, how to configure them, and more - so that you and your team can +start building trust in your most important data assets. + +Let's get started! + +## Support + +Volume Assertions are currently supported for: + +1. Snowflake +2. Redshift +3. BigQuery + +Note that an Ingestion Source _must_ be configured with the data platform of your choice in Acryl DataHub's **Ingestion** +tab. + +> Note that Volume Assertions are not yet supported if you are connecting to your warehouse +> using the DataHub CLI or a Remote Ingestion Executor. + +## What is a Volume Assertion? + +A **Volume Assertion** is a configurable Data Quality rule used to monitor a Data Warehouse Table +for unexpected or sudden changes in "volume", or row count. Volume Assertions can be particularly useful when you have frequently-changing +Tables which have a relatively stable pattern of growth or decline. + +For example, imagine that we work for a company with a Snowflake Table that stores user clicks collected from our e-commerce website. +This table is updated with new data on a specific cadence: once per hour (In practice, daily or even weekly are also common). +In turn, there is a downstream Business Analytics Dashboard in Looker that shows important metrics like +the number of people clicking our "Daily Sale" banners, and this dashboard is generated from data stored in our "clicks" table. +It is important that our clicks Table is updated with the correct number of rows each hour; otherwise, +our downstream metrics dashboard could become incorrect. The risk of this situation is obvious: our organization +may make bad decisions based on incomplete information. 
+ +In such cases, we can use a **Volume Assertion** that checks whether the Snowflake "clicks" Table is growing in an expected +way, and that there are no sudden increases or sudden decreases in the rows being added or removed from the table. +If too many rows are added or removed within an hour, we can notify key stakeholders and begin to investigate the root cause before the problem impacts consumers of the data. + +### Anatomy of a Volume Assertion + +At the most basic level, **Volume Assertions** consist of a few important parts: + +1. An **Evaluation Schedule** +2. A **Volume Condition** +3. A **Volume Source** + +In this section, we'll give an overview of each. + +#### 1. Evaluation Schedule + +The **Evaluation Schedule**: This defines how often to check a given warehouse Table for its volume. This should usually +be configured to match the expected change frequency of the Table, although it can also be less frequent depending +on the requirements. You can also specify specific days of the week, hours in the day, or even +minutes in an hour. + + +#### 2. Volume Condition + +The **Volume Condition**: This defines the type of condition that we'd like to monitor, or when the Assertion +should result in failure. + +There are 2 different categories of conditions: **Total** Volume and **Change** Volume. + +_Total_ volume conditions are those which are defined against the point-in-time total row count for a table. They allow you to specify conditions like: + +1. **Table has too many rows**: The table should always have fewer than 1000 rows +2. **Table has too few rows**: The table should always have more than 1000 rows +3. **Table row count is outside a range**: The table should always have between 1000 and 2000 rows. + +_Change_ volume conditions are those which are defined against the growth or decline rate of a table, measured between subsequent checks +of the table volume. They allow you to specify conditions like: + +1. 
**Table growth is too fast**: When the table volume is checked, it should have < 1000 more rows than it had during the previous check. +2. **Table growth is too slow**: When the table volume is checked, it should have > 1000 more rows than it had during the previous check. +3. **Table growth is outside a range**: When the table volume is checked, it should have between 1000 and 2000 more rows than it had during the previous check. + +For change volume conditions, both _absolute_ row count deltas and _relative_ percentage deltas are supported for identifying +tables that are following an abnormal pattern of growth. + + +#### 3. Volume Source + +The **Volume Source**: This is the mechanism that Acryl DataHub should use to determine the table volume (row count). The supported +source types vary by the platform, but generally fall into these categories: + +- **Information Schema**: A system Table that is exposed by the Data Warehouse which contains live information about the Databases + and Tables stored inside the Data Warehouse, including their row count. It is usually efficient to check, but can in some cases be slightly delayed to update + once a change has been made to a table. + +- **Query**: A `COUNT(*)` query is used to retrieve the latest row count for a table, with optional SQL filters applied (depending on platform). + This can be less efficient to check depending on the size of the table. However, because this approach does not rely on + system warehouse tables, it is easily portable across Data Warehouse and Data Lake providers. + +Volume Assertions also have an off switch: they can be started or stopped at any time with the click of a button. + + +## Creating a Volume Assertion + +### Prerequisites + +1. **Permissions**: To create or delete Volume Assertions for a specific entity on DataHub, you'll need to be granted the + `Edit Assertions` and `Edit Monitors` privileges for the entity. These are granted to Entity owners by default. + +2. 
**Data Platform Connection**: In order to create a Volume Assertion, you'll need to have an **Ingestion Source** configured to your + Data Platform: Snowflake, BigQuery, or Redshift under the **Integrations** tab. + +Once these are in place, you're ready to create your Volume Assertions! + +### Steps + +1. Navigate to the Table that you want to monitor for volume +2. Click the **Validations** tab + +

+ +

+ +3. Click **+ Create Assertion** + +

+ +

+ +4. Choose **Volume** + +5. Configure the evaluation **schedule**. This is the frequency at which the assertion will be evaluated to produce a pass or fail result, and the times + when the table volume will be checked. + +6. Configure the evaluation **condition type**. This determines the cases in which the new assertion will fail when it is evaluated. + +

+ +

+ +7. (Optional) Click **Advanced** to customize the volume **source**. This is the mechanism that will be used to obtain the table + row count metric. Each Data Platform supports different options including Information Schema and Query. + +

+ +

+ +- **Information Schema**: Check the Data Platform system metadata tables to determine the table row count. +- **Query**: Issue a `COUNT(*)` query to the table to determine the row count. + +8. Click **Next** +9. Configure actions that should be taken when the Volume Assertion passes or fails + +

+ +

+ +- **Raise incident**: Automatically raise a new DataHub `Volume` Incident for the Table whenever the Volume Assertion fails. This + may indicate that the Table is unfit for consumption. Configure Slack Notifications under **Settings** to be notified when + an incident is created due to an Assertion failure. +- **Resolve incident**: Automatically resolve any incidents that were raised due to failures in this Volume Assertion. Note that + any other incidents will not be impacted. + +10. Click **Save**. + +And that's it! DataHub will now begin to monitor your Volume Assertion for the table. + +To view the time of the next Volume Assertion evaluation, simply click **Volume** and then click on your +new Assertion: + +

+ +

+ +Once your assertion has run, you will begin to see a Success or Failure status for the Table. + +

+ +

+ + +## Stopping a Volume Assertion + +In order to temporarily stop the evaluation of a Volume Assertion: + +1. Navigate to the **Validations** tab of the Table with the assertion +2. Click **Volume** to open the Volume Assertions list +3. Click the three-dot menu on the right side of the assertion you want to disable +4. Click **Stop** + +

+ +

+ +To resume the Volume Assertion, simply click **Turn On**. + +

+ +

+ + +## Smart Assertions ⚡ + +As part of the **Acryl Observe** module, Acryl DataHub also provides **Smart Assertions** out of the box. These are +dynamic, AI-powered Volume Assertions that you can use to monitor the volume of important warehouse Tables, without +requiring any manual setup. + +If Acryl DataHub is able to detect a pattern in the volume of a Snowflake, Redshift, or BigQuery Table, you'll find +a recommended Smart Assertion under the `Validations` tab on the Table profile page: + +

+ +

+ +In order to enable it, simply click **Turn On**. From this point forward, the Smart Assertion will check for changes on a cadence +based on the Table history. + +Don't need it anymore? Smart Assertions can just as easily be turned off by clicking the three-dot "more" button and then **Stop**. + + +## Creating Volume Assertions via API + +Under the hood, Acryl DataHub implements Volume Assertion Monitoring using two "entity" concepts: + +- **Assertion**: The specific expectation for volume, e.g. "The table should always have more than 1000 rows" + or "The table should gain between 1000 and 2000 rows between checks". This is the "what". + +- **Monitor**: The process responsible for evaluating the Assertion on a given evaluation schedule and using specific + mechanisms. This is the "how". + +Note that to create or delete Assertions and Monitors for a specific entity on DataHub, you'll need the +`Edit Assertions` and `Edit Monitors` privileges for it. + +#### GraphQL + +In order to create a Volume Assertion that is being monitored on a specific **Evaluation Schedule**, you'll need to use 2 +GraphQL mutations: one to create a Volume Assertion entity and one to create an Assertion Monitor entity responsible for evaluating it. + +Start by creating the Volume Assertion entity using the `createVolumeAssertion` mutation and hang on to the 'urn' field of the Assertion entity +you get back. Then continue by creating a Monitor entity using the `createAssertionMonitor` mutation. 
+ +##### Examples + +To create a Volume Assertion Entity that checks whether a table has been updated in the past 8 hours: + +```json +mutation createVolumeAssertion { + createVolumeAssertion( + input: { + entityUrn: "" + type: DATASET_CHANGE + schedule: { + type: FIXED_INTERVAL + fixedInterval: { unit: HOUR, multiple: 8 } + } + } + ) { + urn + } +} +``` + +To create an Assertion Monitor Entity that evaluates the volume assertion every 8 hours using the Information Schema: + +```json +mutation createAssertionMonitor { + createAssertionMonitor( + input: { + entityUrn: "", + assertionUrn: "", + schedule: { + cron: "0 */8 * * *", + timezone: "America/Los_Angeles" + }, + parameters: { + type: DATASET_VOLUME, + datasetVolumeParameters: { + sourceType: INFORMATION_SCHEMA, + } + } + } + ) { + urn + } +} +``` + +This entity defines _when_ to run the check (Using CRON format - every 8th hour) and _how_ to run the check (using the Information Schema). + +After creating the monitor, the new assertion will start to be evaluated every 8 hours in your selected timezone. + +You can delete assertions along with their monitors using GraphQL mutations: `deleteAssertion` and `deleteMonitor`. + +### Tips + +:::info +**Authorization** + +Remember to always provide a DataHub Personal Access Token when calling the GraphQL API. 
To do so, just add the 'Authorization' header as follows: + +``` +Authorization: Bearer +``` + +**Exploring GraphQL API** + +Also, remember that you can play with an interactive version of the Acryl GraphQL API at `https://your-account-id.acryl.io/api/graphiql` +::: From e5f746d8667d76590db516bb981ada008d132c2f Mon Sep 17 00:00:00 2001 From: John Joyce Date: Wed, 23 Aug 2023 11:38:51 -0700 Subject: [PATCH 2/2] Adding volume assertions doc --- docs-website/sidebars.js | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs-website/sidebars.js b/docs-website/sidebars.js index 51a57fc41dd364..8745ca5f339a6a 100644 --- a/docs-website/sidebars.js +++ b/docs-website/sidebars.js @@ -418,7 +418,7 @@ module.exports = { }, "docs/act-on-metadata/impact-analysis", { - Observability: ["docs/managed-datahub/observe/freshness-assertions"], + Observability: ["docs/managed-datahub/observe/freshness-assertions", "docs/managed-datahub/observe/volume-assertions"], }, ], },