-
Those who 👍'd this - @amarrella, @jburnich, @zyd14 - do you have particular thoughts on what situations you'd use it in? Would you want your expectations to execute in the same step that materializes your asset, or later on? Would you want to attach expectations to your asset definition, or make them dynamic?
-
Please also include stateful expectations, i.e. the row count of an asset is expected to be within 2 standard deviations of its historical average (e.g. the same day last week), and if not, the job should be flagged/alerted.
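For illustration, here's a minimal sketch of that kind of stateful check in plain Python; the helper, asset, and numbers are made up, and how the historical counts would be fetched and stored is left to the framework:

```python
import statistics

def row_count_within_band(current_count: int, historical_counts: list[int], num_stddevs: float = 2.0) -> bool:
    """Return True if the current row count is within `num_stddevs` standard
    deviations of the historical average (e.g. the same day over past weeks)."""
    if len(historical_counts) < 2:
        # Not enough history to compute a meaningful band; pass by default.
        return True
    mean = statistics.mean(historical_counts)
    stddev = statistics.stdev(historical_counts)
    return abs(current_count - mean) <= num_stddevs * stddev

# Example: flag/alert if today's count falls outside the band.
if not row_count_within_band(10_500, [10_000, 10_200, 9_900, 10_100]):
    print("Row count outside the expected band - flag/alert the job")
```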
-
I'd love to be able to have expectations execute on assets with or without materialization. It'd be nice to be able to retry an expectation without rerunning the materialization. It'd be great if expectations could be dynamic as well, possibly dependent on some upstream output or historical state. It might be cool to have expectation failures prevent the materialization of an asset, but I know there would be a lot of cases where you wouldn't want that behavior, so I'd want it to be optional.

One thing that I really haven't seen much of in the data quality frameworks out there is the idea of data measurements/metrics, which can then be used to generate expectations. I might not necessarily have a specific expectation of a dimension of my data, but I'd like to measure that dimension each time the asset is materialized (and maybe create an expectation of what that metric should be later down the line). I currently use

The long story short here is that being able to define metrics on your data is super helpful for telling a story about your data over time, and is a base component of an expectation anyway, so exposing the ability to define data metrics without expectations, and to add expectations optionally, would be really cool. You could even start using those metrics to trigger other tasks (if the number of files generated by a materialization is > X, run a compaction job).

One concrete example of a metric and corresponding alarm I've implemented in one of our pipelines is this:

another:
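For illustration, a minimal sketch of the "metrics without expectations" idea using today's Dagster APIs: metadata attached to an asset's Output is recorded on every materialization and can be tracked over time. The asset and metric names below are made up; turning the metrics into expectations or triggers would need additional tooling.

```python
import pandas as pd
from dagster import Output, asset

@asset
def daily_orders():
    df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

    # Record measurements/metrics on every materialization, even without a hard
    # expectation attached yet; they appear as metadata on the materialization
    # and can be compared across materializations over time.
    return Output(
        df,
        metadata={
            "row_count": len(df),
            "total_amount": float(df["amount"].sum()),
        },
    )
```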
-
Cross-referencing #13102, as I had a few thoughts on how timing-based expectations could work.
-
Any updates on this?
-
Just to add more to the discussion, I've actually recently implemented an MVP of asset expectations by creating DagsterTypes with expectations embedded in them. Currently these exist as two types:

During materialization, the Dagster type check is run and SQL is run against the object returned (either via Spark SQL or DuckDB in my case). This has the great benefit of even running outside of a job/pipeline, so we can use them in pytests by just running

I also yield metadata values from these that report things such as the observed schema and the results of expectations. Some of the biggest pain points now with this are:
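For readers who want to try the same pattern, here is a minimal sketch of a DagsterType whose type check runs SQL against the returned object via DuckDB and reports results as metadata. The type name, column names, and query are illustrative, not taken from the comment above.

```python
import duckdb
import pandas as pd
from dagster import DagsterType, TypeCheck, check_dagster_type

def _check_non_negative_amounts(_context, value: pd.DataFrame) -> TypeCheck:
    # Run SQL against the returned object (DuckDB here; Spark SQL works the same way).
    con = duckdb.connect()
    con.register("data", value)
    num_bad_rows = con.execute("SELECT count(*) FROM data WHERE amount < 0").fetchone()[0]
    return TypeCheck(
        success=num_bad_rows == 0,
        metadata={
            "observed_schema": str(dict(value.dtypes.astype(str))),
            "num_rows_failing_expectation": int(num_bad_rows),
        },
    )

NonNegativeAmountsDataFrame = DagsterType(
    name="NonNegativeAmountsDataFrame",
    type_check_fn=_check_non_negative_amounts,
)

# Because it is just a type check, it can also run outside a job/pipeline, e.g. in pytest:
result = check_dagster_type(NonNegativeAmountsDataFrame, pd.DataFrame({"amount": [1.0, 2.0]}))
assert result.success
```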
-
Pinging some folks who have upvoted this discussion and haven't commented on it yet: @m-o-leary, @Elliot2718, @oguzhangur96 – would any of you (or anyone else who comes across this comment) be up for a short chat about what you're looking for from an asset expectations system? If so, mind reaching out to me on the Dagster Slack or emailing me at (my first name) at elementl.com?
-
This discussion is now locked, as it is superseded by #15880. In the new discussion, we explore a not-yet-implemented Python API for defining and executing asset checks in Dagster. We would love your feedback on any and all aspects of it!
For the previous history of this discussion, see below.
[RFC] Asset expectations
User story
I’d like to understand whether my assets have the contents I expect them to.
Terminology
Python API
Runtime expectation results
You can record expectation results at the time you’re materializing an asset:
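A minimal sketch of this using today's ExpectationResult and context.log_event APIs; the orders asset and its check are illustrative stand-ins:

```python
import pandas as pd
from dagster import ExpectationResult, OpExecutionContext, asset

@asset
def orders(context: OpExecutionContext) -> pd.DataFrame:
    df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

    # Record an expectation result in the same step that materializes the asset.
    context.log_event(
        ExpectationResult(
            success=bool((df["amount"] >= 0).all()),
            label="non_negative_amounts",
            metadata={"num_rows": len(df)},
        )
    )
    return df
```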
If you want to record expectation results after materializing an asset, you attach ExpectationResults to an AssetObservation:
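Today the closest equivalent is an AssetObservation whose metadata carries the check outcome; under this proposal, ExpectationResults would be attached to the observation directly. A rough sketch of the current workaround, with an illustrative asset key, file path, and check:

```python
import pandas as pd
from dagster import AssetKey, AssetObservation, OpExecutionContext, op

@op
def check_orders_after_materialization(context: OpExecutionContext) -> None:
    # Re-read the already-materialized artifact (illustrative path).
    df = pd.read_parquet("warehouse/orders.parquet")

    # Record the observation against the asset; today the expectation outcome
    # travels as metadata rather than as first-class ExpectationResults.
    context.log_event(
        AssetObservation(
            asset_key=AssetKey("orders"),
            metadata={
                "non_negative_amounts_passed": str(bool((df["amount"] >= 0).all())),
                "row_count": len(df),
            },
        )
    )
```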
Definition-level expectation declarations
With definition-level expectation declarations, your asset definition includes a set of expectations that you can log runtime results for.
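Purely as an illustration of the idea, not a proposed API: one way to approximate this today is to keep the expectation labels next to the asset definition and log runtime results against them.

```python
import pandas as pd
from dagster import ExpectationResult, OpExecutionContext, asset

# Hypothetical convention: the declared expectation labels live alongside the asset
# definition, so tooling (or reviewers) can see what should be checked on every run.
ORDERS_EXPECTATIONS = ["non_negative_amounts", "has_rows"]

@asset
def orders_table(context: OpExecutionContext) -> pd.DataFrame:
    df = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})
    results = {
        "non_negative_amounts": bool((df["amount"] >= 0).all()),
        "has_rows": len(df) > 0,
    }
    # Log a runtime result for each declared expectation.
    for label in ORDERS_EXPECTATIONS:
        context.log_event(ExpectationResult(success=results[label], label=label))
    return df
```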
Advantages of definition-level expectation declarations:
There are a few flavors of this:
UI
Cross-asset view
For a group of assets, you can see a table of expectation results across all assets. Only expectation results since the latest materialization are shown.
You can click a button to hide successful expectation results, so you can focus on what needs fixing.
Single-asset view
Somewhere prominent on the asset details page, you can see current expectation status of the asset, i.e. what expectations have passed and failed since the latest materialization.
Somewhere slightly less prominent (maybe its own tab?), you can see a table of historical expectations. The rows (or maybe columns?) would be materializations, and the columns (or maybe rows?) would be expectations.
E.g.
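A purely illustrative layout, with made-up dates and expectation labels:

| Materialization | non_negative_amounts | row_count_above_zero |
| --- | --- | --- |
| 2023-06-03 run | passed | passed |
| 2023-06-02 run | passed | failed |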
It should be easy to link to this view so that a stakeholder can get a view without wading through a bunch of orchestration stuff they don’t understand.
Topics
Partitions
For partitioned assets, expectations can apply at the partition-level or at the asset-level.
Control flow
By default, runs don't halt on a failed expectation. There should be an option to make them do so.
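As a sketch of what the opt-in halting behavior amounts to, user code can already stop the run by raising Failure when an expectation fails; the events asset and check below are illustrative:

```python
import pandas as pd
from dagster import ExpectationResult, Failure, OpExecutionContext, asset

@asset
def events(context: OpExecutionContext) -> pd.DataFrame:
    df = pd.DataFrame({"event_id": [1, 2, 3]})
    passed = len(df) > 0
    context.log_event(ExpectationResult(success=passed, label="has_rows"))

    # Opt-in halt: fail this step (and block downstream steps) on a failed
    # expectation, rather than the default "record the result and keep going".
    if not passed:
        raise Failure(description="Expectation 'has_rows' failed for events")
    return df
```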
Relationship to SLAs
An SLA, as described here, could be modeled as a kind of expectation. Unlike other expectations, it could be automatically computed by the framework instead of by user code.
Relationship to Dagster types
Dagster type checks run after the op's compute function completes and before the output is stored. They're built to check the value that's returned by the op compute function.
In contrast, asset expectations will often want to check the materialized artifact. E.g. if a table is created, run a select statement against that table. Make sure that the columns in Snowflake have the right types. Thus, they need to run after the output is stored.
This makes me lean towards asset expectations as distinct from dagster types.
Other possible options:
dbt tests
When load_assets_from_dbt_x is called with use_build=True, it can automatically generate asset expectations from the dbt tests.
Source assets
Source assets can have expectations. The only limitation is that they can’t be checked as part of materializing the asset, because source assets aren’t materialized.
Great Expectations
We could beef up the great expectations integration to produce asset expectation results.