-
Those who 👍'd this - @amarrella, @jburnich, @zyd14 - do you have particular thoughts on what situations you'd use it in? Would you want your expectations to execute in the same step that materializes your asset, or later on? Would you want to attach expectations to your asset definition, or make them dynamic?
-
Please also include stateful expectations, i.e. the row count of an asset is expected to be within 2 standard deviations of its historical average (e.g. the same day last week), and if not, the job should be flagged/alerted.
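For illustration, here's a minimal sketch of that kind of stateful check in plain Python; the helper, asset, and numbers are made up, and how the historical counts would be fetched and stored is left to the framework:

```python
import statistics

def row_count_within_band(current_count: int, historical_counts: list[int], num_stddevs: float = 2.0) -> bool:
    """Return True if the current row count is within `num_stddevs` standard
    deviations of the historical average (e.g. the same day over past weeks)."""
    if len(historical_counts) < 2:
        # Not enough history to compute a meaningful band; pass by default.
        return True
    mean = statistics.mean(historical_counts)
    stddev = statistics.stdev(historical_counts)
    return abs(current_count - mean) <= num_stddevs * stddev

# Example: flag/alert if today's count falls outside the band.
if not row_count_within_band(10_500, [10_000, 10_200, 9_900, 10_100]):
    print("Row count outside the expected band - flag/alert the job")
```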
-
I'd love to be able to have expectations execute on assets with or without materialization. It'd be nice to be able to retry an expectation without rerunning the materialization. It'd be great if expectations could be dynamic as well, possibly dependent on some upstream output or historical state. It might be cool to have expectation failures prevent the materialization of an asset, but I know there would be a lot of cases where you wouldn't want that behavior, so I'd want it to be optional.

One thing that I really haven't seen much of in the data quality frameworks out there is the idea of data measurements/metrics, which can then be used to generate expectations. I might not necessarily have a specific expectation of a dimension of my data, but I'd like to measure that dimension each time the asset is materialized (and maybe create an expectation of what that metric should be later down the line). I currently use

The long story short here is that being able to define metrics on your data is super helpful for telling a story about your data over time, and is a base component of an expectation anyway, so exposing the ability to define data metrics without expectations, and to add expectations optionally, would be really cool. You could even start using those metrics to trigger other tasks (if the number of files generated by a materialization is > X, run a compaction job).

One concrete example of a metric and corresponding alarm I've implemented in one of our pipelines is this:

another:
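For illustration, a minimal sketch of the "metrics without expectations" idea using today's Dagster APIs: metadata attached to an asset's Output is recorded on every materialization and can be tracked over time. The asset and metric names below are made up; turning the metrics into expectations or triggers would need additional tooling.

```python
import pandas as pd
from dagster import Output, asset

@asset
def daily_orders():
    df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

    # Record measurements/metrics on every materialization, even without a hard
    # expectation attached yet; they appear as metadata on the materialization
    # and can be compared across materializations over time.
    return Output(
        df,
        metadata={
            "row_count": len(df),
            "total_amount": float(df["amount"].sum()),
        },
    )
```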
-
Cross-referencing #13102, as I had a few thoughts on how timing-based expectations could work.
-
Any updates on this?
-
Just to add more to the discussion, I've actually recently implemented an MVP of asset expectations by creating DagsterTypes with expectations embedded in them. Currently these exist as two types:

During materialization, the Dagster type check is run and SQL is run against the object returned (either via Spark SQL or DuckDB in my case). This has the great benefit of even running outside of a job/pipeline, so we can use them in pytests by just running

I also yield metadata values from these that report things such as the observed schema and the results of expectations. Some of the biggest pain points now with this are:
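For readers who want to try the same pattern, here is a minimal sketch of a DagsterType whose type check runs SQL against the returned object via DuckDB and reports results as metadata. The type name, column names, and query are illustrative, not taken from the comment above.

```python
import duckdb
import pandas as pd
from dagster import DagsterType, TypeCheck, check_dagster_type

def _check_non_negative_amounts(_context, value: pd.DataFrame) -> TypeCheck:
    # Run SQL against the returned object (DuckDB here; Spark SQL works the same way).
    con = duckdb.connect()
    con.register("data", value)
    num_bad_rows = con.execute("SELECT count(*) FROM data WHERE amount < 0").fetchone()[0]
    return TypeCheck(
        success=num_bad_rows == 0,
        metadata={
            "observed_schema": str(dict(value.dtypes.astype(str))),
            "num_rows_failing_expectation": int(num_bad_rows),
        },
    )

NonNegativeAmountsDataFrame = DagsterType(
    name="NonNegativeAmountsDataFrame",
    type_check_fn=_check_non_negative_amounts,
)

# Because it is just a type check, it can also run outside a job/pipeline, e.g. in pytest:
result = check_dagster_type(NonNegativeAmountsDataFrame, pd.DataFrame({"amount": [1.0, 2.0]}))
assert result.success
```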
-
Pinging some folks who have upvoted this discussion and haven't commented on it yet: @m-o-leary, @Elliot2718, @oguzhangur96 – would any of you (or anyone else who comes across this comment) be up for a short chat about what you're looking for from an asset expectations system? If so, mind reaching out to me on the Dagster Slack or emailing me at (my first name) at elementl.com?
-
This discussion is now locked, as it is superseded by #15880. In the new discussion, we explore a not-yet-implemented Python API for defining and executing asset checks in Dagster. We would love your feedback on any and all aspects of it!
For the previous history of this discussion, see below.
[RFC] Asset expectations
User story
I’d like to understand whether my assets have the contents I expect them to.
Terminology
Python API
Runtime expectation results
You can record expectation results at the time you’re materializing an asset:
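A minimal sketch of this using today's ExpectationResult and context.log_event APIs; the orders asset and its check are illustrative stand-ins:

```python
import pandas as pd
from dagster import ExpectationResult, OpExecutionContext, asset

@asset
def orders(context: OpExecutionContext) -> pd.DataFrame:
    df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

    # Record an expectation result in the same step that materializes the asset.
    context.log_event(
        ExpectationResult(
            success=bool((df["amount"] >= 0).all()),
            label="non_negative_amounts",
            metadata={"num_rows": len(df)},
        )
    )
    return df
```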
If you want to record expectation results after materializing an asset, you attach ExpectationResults to an AssetObservation:
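Today the closest equivalent is an AssetObservation whose metadata carries the check outcome; under this proposal, ExpectationResults would be attached to the observation directly. A rough sketch of the current workaround, with an illustrative asset key, file path, and check:

```python
import pandas as pd
from dagster import AssetKey, AssetObservation, OpExecutionContext, op

@op
def check_orders_after_materialization(context: OpExecutionContext) -> None:
    # Re-read the already-materialized artifact (illustrative path).
    df = pd.read_parquet("warehouse/orders.parquet")

    # Record the observation against the asset; today the expectation outcome
    # travels as metadata rather than as first-class ExpectationResults.
    context.log_event(
        AssetObservation(
            asset_key=AssetKey("orders"),
            metadata={
                "non_negative_amounts_passed": str(bool((df["amount"] >= 0).all())),
                "row_count": len(df),
            },
        )
    )
```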
Definition-level expectation declarations
With definition-level expectation declarations, your asset definition includes a set of expectations that you can log runtime results for.
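Purely as an illustration of the idea, not a proposed API: one way to approximate this today is to keep the expectation labels next to the asset definition and log runtime results against them.

```python
import pandas as pd
from dagster import ExpectationResult, OpExecutionContext, asset

# Hypothetical convention: the declared expectation labels live alongside the asset
# definition, so tooling (or reviewers) can see what should be checked on every run.
ORDERS_EXPECTATIONS = ["non_negative_amounts", "has_rows"]

@asset
def orders_table(context: OpExecutionContext) -> pd.DataFrame:
    df = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})
    results = {
        "non_negative_amounts": bool((df["amount"] >= 0).all()),
        "has_rows": len(df) > 0,
    }
    # Log a runtime result for each declared expectation.
    for label in ORDERS_EXPECTATIONS:
        context.log_event(ExpectationResult(success=results[label], label=label))
    return df
```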
Advantages of definition-level expectation declarations:
There are a few flavors of this:
UI
Cross-asset view
For a group of assets, you can see a table of expectation results across all assets. Only expectation results since the latest materialization are shown.
You can click a button to hide successful expectation results, so you can focus on what needs fixing.
Single-asset view
Somewhere prominent on the asset details page, you can see current expectation status of the asset, i.e. what expectations have passed and failed since the latest materialization.
Somewhere slightly less prominent (maybe its own tab?), you can see a table of historical expectations. The rows (or maybe columns?) would be materializations, and the columns (or maybe rows?) would be expectations.
E.g.
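A purely illustrative layout, with made-up dates and expectation labels:

| Materialization | non_negative_amounts | row_count_above_zero |
| --- | --- | --- |
| 2023-06-03 run | passed | passed |
| 2023-06-02 run | passed | failed |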
It should be easy to link to this view so that a stakeholder can get a view without wading through a bunch of orchestration stuff they don’t understand.
Topics
Partitions
For partitioned assets, expectations can apply at the partition-level or at the asset-level.
Control flow
By default, runs don't halt on a failed expectation. There should be an option to make them do so.
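As a sketch of what the opt-in halting behavior amounts to, user code can already stop the run by raising Failure when an expectation fails; the events asset and check below are illustrative:

```python
import pandas as pd
from dagster import ExpectationResult, Failure, OpExecutionContext, asset

@asset
def events(context: OpExecutionContext) -> pd.DataFrame:
    df = pd.DataFrame({"event_id": [1, 2, 3]})
    passed = len(df) > 0
    context.log_event(ExpectationResult(success=passed, label="has_rows"))

    # Opt-in halt: fail this step (and block downstream steps) on a failed
    # expectation, rather than the default "record the result and keep going".
    if not passed:
        raise Failure(description="Expectation 'has_rows' failed for events")
    return df
```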
Relationship to SLAs
An SLA, as described here, could be modeled as a kind of expectation. Unlike other expectations, it could be automatically computed by the framework instead of by user code.
Relationship to Dagster types
Dagster type checks run after the op's compute function completes and before the output is stored. They're built to check the value that's returned by the op compute function.
In contrast, asset expectations will often want to check the materialized artifact. E.g. if a table is created, run a select statement against that table. Make sure that the columns in Snowflake have the right types. Thus, they need to run after the output is stored.
This makes me lean towards asset expectations as distinct from dagster types.
Other possible options:
dbt tests
When load_assets_from_dbt_x is called with use_build=True, it can automatically generate asset expectations from the dbt tests.
Source assets
Source assets can have expectations. The only limitation is that they can’t be checked as part of materializing the asset, because source assets aren’t materialized.
Great Expectations
We could beef up the great expectations integration to produce asset expectation results.