feat: change dataQualityAssertions type to DATASET #2505
Thanks for opening your first pull request in the Marquez project! Please check out our contributing guidelines (https://github.com/MarquezProject/marquez/blob/main/CONTRIBUTING.md).
@@ -40,7 +40,7 @@ enum DatasetFacet {
COLUMN_LINEAGE(Type.DATASET, "columnLineage"),
OWNERSHIP(Type.DATASET, "ownership"),
DATA_QUALITY_METRICS(Type.INPUT, "dataQualityMetrics"),
DATA_QUALITY_ASSERTIONS(Type.INPUT, "dataQualityAssertions"),
dataQualityAssertions are a kind of facet that only makes sense in the context of a particular run. They list the assertions verified within a certain run and do not describe the Dataset in general. That's why I think the type is OK.
The issue you describe may be a result of my change: https://github.com/MarquezProject/marquez/pull/2417/files
We decided there that although inputFacets are contained within OpenLineage events as part of datasets, the Marquez API will return them when asking for run data. So, each run should have the fields inputDatasetVersions and outputDatasetVersions, which should contain the facets you need.
Could you please attach the corresponding run (fetched from the https://marquezproject.ai/openapi.html#tag/Jobs/operation/getRun endpoint) to verify the facets are not there?
Even if they're not there, I still think the issue is related to fetching run facets, and we should not make dataQualityAssertions dataset facets, as they're logically input facets.
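To make the distinction above concrete, here is a minimal sketch of how a facet's type separates run-scoped facets from facets that describe the dataset in general. It is modeled on the enum fragment in the diff only: the class name FacetTypeSketch, the fromName lookup, and the isRunScoped helper are hypothetical and are not taken from the Marquez code base.

```java
import java.util.Arrays;
import java.util.Optional;

public class FacetTypeSketch {
  /** Facet scopes that appear in the diff; the real Marquez enum may define more. */
  enum Type { DATASET, INPUT }

  /** Hypothetical, trimmed-down stand-in for the DatasetFacet enum shown in the diff. */
  enum DatasetFacet {
    COLUMN_LINEAGE(Type.DATASET, "columnLineage"),
    OWNERSHIP(Type.DATASET, "ownership"),
    DATA_QUALITY_METRICS(Type.INPUT, "dataQualityMetrics"),
    DATA_QUALITY_ASSERTIONS(Type.INPUT, "dataQualityAssertions");

    final Type type;
    final String facetName;

    DatasetFacet(Type type, String facetName) {
      this.type = type;
      this.facetName = facetName;
    }

    /** Looks up a facet by its OpenLineage name (hypothetical helper). */
    static Optional<DatasetFacet> fromName(String name) {
      return Arrays.stream(values())
          .filter(f -> f.facetName.equals(name))
          .findFirst();
    }
  }

  /**
   * Returns true when the facet is tied to a particular run (here: Type.INPUT)
   * rather than describing the dataset in general. Unknown names return false.
   */
  static boolean isRunScoped(String facetName) {
    return DatasetFacet.fromName(facetName)
        .map(f -> f.type != Type.DATASET)
        .orElse(false);
  }

  public static void main(String[] args) {
    System.out.println(isRunScoped("dataQualityAssertions")); // true
    System.out.println(isRunScoped("ownership"));             // false
  }
}
```

Under this reading, a client looking for dataQualityAssertions would expect to find it on the run's input dataset versions, not on the dataset itself.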
I agree with @pawel-big-lebowski here that the dataQualityAssertions facet should only be associated with input datasets, as defined by OpenLineage.
That said, 0.34.0 does seem to introduce a bug, as you pointed out, @YLibert, in #2503:
In the food_delivery namespace, the dataset public.delivery_7_days should have a quality assertion (which is indeed in the db with the INPUT type) but it doesn't appear in the Marquez UI.
In metadata.json, the dataset public.delivery_7_days defines the following dataQualityAssertions facet, which is no longer displayed in the UI:
"dataQualityAssertions": {
"_producer": "https://github.com/MarquezProject/marquez/blob/main/docker/metadata.json",
"_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DataQualityAssertionsDatasetFacet.json",
"assertions": [
{
"assertion": "not_null",
"success": false,
"column": "driver_id"
},
{
"assertion": "is_string",
"success": true,
"column": "customer_address"
}
]
}
Apologies for the confusion. I was misled by the fact that the dataQualityAssertions facet is a DatasetFacet in the OpenLineage standard, which is not equivalent to what the facet type means in Marquez.
To answer @pawel-big-lebowski's question, here is the corresponding run fetched from the getRun endpoint:
{
"id": "d5a2a4c4-fc78-428d-ae85-08c942ed8371",
"createdAt": "2020-02-22T22:42:42Z",
"updatedAt": "2020-02-22T22:48:12Z",
"nominalStartTime": "2020-02-22T22:00:00Z",
"nominalEndTime": "2020-02-22T22:00:00Z",
"state": "COMPLETED",
"startedAt": "2020-02-22T22:42:42Z",
"endedAt": "2020-02-22T22:48:12Z",
"durationMs": 330000,
"args": {
"nominal_start_time": "2020-02-22T22:00Z[UTC]",
"nominal_end_time": "2020-02-22T22:00Z[UTC]"
},
"jobVersion": {
"namespace": "food_delivery",
"name": "etl_delivery_7_days",
"version": "c50792dd-7657-31b5-8e33-3ea014a8096b"
},
"inputDatasetVersions": [
{
"datasetVersionId": {
"namespace": "food_delivery",
"name": "public.orders_7_days",
"version": "d09633c4-4412-36de-bce6-8002c662e18a"
},
"facets": {}
},
{
"datasetVersionId": {
"namespace": "food_delivery",
"name": "public.customers",
"version": "68c1e307-f6bb-36f9-8596-14609c7f022b"
},
"facets": {}
},
{
"datasetVersionId": {
"namespace": "food_delivery",
"name": "public.order_status",
"version": "676ac323-c8e3-3cea-b172-b468827afb51"
},
"facets": {}
},
{
"datasetVersionId": {
"namespace": "food_delivery",
"name": "public.drivers",
"version": "93ae26cc-87d8-3eae-9cdd-f9b6fd71f1f7"
},
"facets": {}
},
{
"datasetVersionId": {
"namespace": "food_delivery",
"name": "public.restaurants",
"version": "4db26821-7966-390f-9cf9-ac775fe9182b"
},
"facets": {}
}
],
"outputDatasetVersions": [
{
"datasetVersionId": {
"namespace": "food_delivery",
"name": "public.delivery_7_days",
"version": "6f8f52f5-0230-31ce-a138-08b79e671b33"
},
"facets": {}
}
],
"facets": {
"nominalTime": {
"_producer": "https://github.com/MarquezProject/marquez/blob/main/docker/metadata.json",
"_schemaURL": "https://openlineage.io/spec/facets/1-0-0/NominalTimeRunFacet.json",
"nominalEndTime": "2020-02-22T22:00:00Z",
"nominalStartTime": "2020-02-22T22:00:00Z"
}
}
}
This is taken directly from the ./docker/up.sh --seed command within the repo.
Problem
Closes: #2503
Solution
As a quick fix to display the dataQualityAssertions facet again in the Marquez UI, I propose changing its dataset facet type to DATASET.
One-line summary:
Checklist
CHANGELOG.md (Depending on the change, this may not be necessary).
.sql database schema migration according to Flyway's naming convention (if relevant)