feat: change dataQualityAssertions type to DATASET #2505

@YLibert YLibert commented Jun 8, 2023

Problem

Closes: #2503

Solution

As a quick fix to display the dataQualityAssertions facet in the Marquez UI again, I propose changing its facet type from INPUT to DATASET.

One-line summary:

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've included a one-line summary of your change for the CHANGELOG.md (Depending on the change, this may not be necessary).
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added the api API layer changes label Jun 8, 2023

boring-cyborg bot commented Jun 8, 2023

Thanks for opening your first pull request in the Marquez project! Please check out our contributing guidelines (https://github.com/MarquezProject/marquez/blob/main/CONTRIBUTING.md).

@@ -40,7 +40,7 @@ enum DatasetFacet {
COLUMN_LINEAGE(Type.DATASET, "columnLineage"),
OWNERSHIP(Type.DATASET, "ownership"),
DATA_QUALITY_METRICS(Type.INPUT, "dataQualityMetrics"),
DATA_QUALITY_ASSERTIONS(Type.INPUT, "dataQualityAssertions"),
Collaborator

dataQualityAssertions is a kind of facet that only makes sense in the context of a particular run. It lists the assertions verified within that run and does not describe the dataset in general. That's why I think the type is OK.

The issue you describe may be a result of my change: https://github.com/MarquezProject/marquez/pull/2417/files
We decided there that although input facets are contained within OpenLineage events as part of datasets, the Marquez API will return them when asking for run data. So, each run should have the fields inputDatasetVersions and outputDatasetVersions, which should contain the facets you need.

Could you please attach the corresponding run (fetched from the https://marquezproject.ai/openapi.html#tag/Jobs/operation/getRun endpoint) so we can verify whether the facets are there?

Even if it's not there, I still think the issue is related to fetching run facets and we should not make dataQualityAssertions dataset facets as they're logically input facets.
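
To illustrate the distinction discussed above, here is a minimal sketch of how a facet enum can tag each facet as dataset-scoped or run-scoped. It mirrors the constants visible in the diff, but the enum name FacetSketch and the helpers fromName and isRunScoped are hypothetical and are not taken from the Marquez code base.

```java
import java.util.Arrays;
import java.util.Optional;

// Hypothetical, simplified stand-in for the enum touched by this diff.
enum FacetSketch {
  COLUMN_LINEAGE(Type.DATASET, "columnLineage"),
  OWNERSHIP(Type.DATASET, "ownership"),
  DATA_QUALITY_METRICS(Type.INPUT, "dataQualityMetrics"),
  DATA_QUALITY_ASSERTIONS(Type.INPUT, "dataQualityAssertions");

  // DATASET facets describe the dataset itself; INPUT facets describe
  // a dataset as it was read by one particular run.
  enum Type { DATASET, INPUT }

  private final Type type;
  private final String facetName;

  FacetSketch(Type type, String facetName) {
    this.type = type;
    this.facetName = facetName;
  }

  // Looks up a facet by its JSON key (assumed helper, for illustration only).
  static Optional<FacetSketch> fromName(String name) {
    return Arrays.stream(values())
        .filter(f -> f.facetName.equals(name))
        .findFirst();
  }

  // True when the facet only makes sense in the context of a run.
  boolean isRunScoped() {
    return type == Type.INPUT;
  }
}
```

Under this model, keeping DATA_QUALITY_ASSERTIONS as Type.INPUT means the facet is expected in the run payload rather than on the dataset itself.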

Member

@wslulciuc wslulciuc Jun 10, 2023

I agree with @pawel-big-lebowski here that the dataQualityAssertions facet should only be associated with input datasets, as defined by OpenLineage.

That said, 0.34.0 does seem to introduce a bug, as you pointed out in #2503, @YLibert:

In the food_delivery namespace, the dataset public.delivery_7_days should have a quality assertion (which is indeed in the db with the INPUT type) but it doesn't appear in the Marquez UI.

In metadata.json, the dataset public.delivery_7_days defines the following dataQualityAssertions facet, which is no longer displayed in the UI:

"dataQualityAssertions": {
  "_producer": "https://github.com/MarquezProject/marquez/blob/main/docker/metadata.json",
  "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DataQualityAssertionsDatasetFacet.json",
  "assertions": [
    {
      "assertion": "not_null",
      "success": false,
      "column": "driver_id"
    },
    {
      "assertion": "is_string",
      "success": true,
      "column": "customer_address"
    }
  ]
}

Author

@YLibert YLibert Jun 16, 2023

Apologies for the confusion. I was misled by the fact that dataQualityAssertions is a dataset facet in the OpenLineage standard, which is not equivalent to what the facet type means in Marquez.
To answer @pawel-big-lebowski's question, here is the corresponding run fetched from the getRun endpoint:

{
    "id": "d5a2a4c4-fc78-428d-ae85-08c942ed8371",
    "createdAt": "2020-02-22T22:42:42Z",
    "updatedAt": "2020-02-22T22:48:12Z",
    "nominalStartTime": "2020-02-22T22:00:00Z",
    "nominalEndTime": "2020-02-22T22:00:00Z",
    "state": "COMPLETED",
    "startedAt": "2020-02-22T22:42:42Z",
    "endedAt": "2020-02-22T22:48:12Z",
    "durationMs": 330000,
    "args": {
        "nominal_start_time": "2020-02-22T22:00Z[UTC]",
        "nominal_end_time": "2020-02-22T22:00Z[UTC]"
    },
    "jobVersion": {
        "namespace": "food_delivery",
        "name": "etl_delivery_7_days",
        "version": "c50792dd-7657-31b5-8e33-3ea014a8096b"
    },
    "inputDatasetVersions": [
        {
            "datasetVersionId": {
                "namespace": "food_delivery",
                "name": "public.orders_7_days",
                "version": "d09633c4-4412-36de-bce6-8002c662e18a"
            },
            "facets": {}
        },
        {
            "datasetVersionId": {
                "namespace": "food_delivery",
                "name": "public.customers",
                "version": "68c1e307-f6bb-36f9-8596-14609c7f022b"
            },
            "facets": {}
        },
        {
            "datasetVersionId": {
                "namespace": "food_delivery",
                "name": "public.order_status",
                "version": "676ac323-c8e3-3cea-b172-b468827afb51"
            },
            "facets": {}
        },
        {
            "datasetVersionId": {
                "namespace": "food_delivery",
                "name": "public.drivers",
                "version": "93ae26cc-87d8-3eae-9cdd-f9b6fd71f1f7"
            },
            "facets": {}
        },
        {
            "datasetVersionId": {
                "namespace": "food_delivery",
                "name": "public.restaurants",
                "version": "4db26821-7966-390f-9cf9-ac775fe9182b"
            },
            "facets": {}
        }
    ],
    "outputDatasetVersions": [
        {
            "datasetVersionId": {
                "namespace": "food_delivery",
                "name": "public.delivery_7_days",
                "version": "6f8f52f5-0230-31ce-a138-08b79e671b33"
            },
            "facets": {}
        }
    ],
    "facets": {
        "nominalTime": {
            "_producer": "https://github.com/MarquezProject/marquez/blob/main/docker/metadata.json",
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/NominalTimeRunFacet.json",
            "nominalEndTime": "2020-02-22T22:00:00Z",
            "nominalStartTime": "2020-02-22T22:00:00Z"
        }
    }
}

This was taken directly from an instance started with the ./docker/up.sh --seed command within the repo.
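
The check requested earlier in the thread (is a given facet present on any dataset version of the run?) can be sketched as follows. This is only an illustration: the class RunFacetCheck, the nested DatasetVersion type, and the method datasetsWithFacet are hypothetical names, and the sketch assumes the run JSON has already been parsed into plain Java objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class RunFacetCheck {
  /** Hypothetical, pared-down view of one entry in inputDatasetVersions. */
  static final class DatasetVersion {
    final String namespace;
    final String name;
    final Map<String, Object> facets;

    DatasetVersion(String namespace, String name, Map<String, Object> facets) {
      this.namespace = namespace;
      this.name = name;
      this.facets = facets;
    }
  }

  /** Returns the names of the dataset versions whose facets contain facetName. */
  static List<String> datasetsWithFacet(List<DatasetVersion> versions, String facetName) {
    List<String> names = new ArrayList<>();
    for (DatasetVersion version : versions) {
      if (version.facets.containsKey(facetName)) {
        names.add(version.name);
      }
    }
    return names;
  }
}
```

For the run shown above, every facets object under inputDatasetVersions and outputDatasetVersions is empty, so a check like this would return an empty list for dataQualityAssertions.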

@YLibert YLibert closed this Sep 6, 2023
Labels
api API layer changes
Development

Successfully merging this pull request may close these issues.

[Bug] facet dataQualityAssertions is not displayed since 0.33.0
3 participants