Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OL facets - PR1 - create & write new events to new tables while not reading them #2350

Merged
merged 2 commits into from
Jan 17, 2023

Conversation

pawel-big-lebowski
Copy link
Collaborator

Signed-off-by: Pawel Leszczynski [email protected]

Problem

Facets are stored in lineage_events table in JSONs containing whole events. This PR is the first among three PRs to create and make use of OpenLineage facet tables.

This PR was created as a part of #2152. Content provided within this PR is mostly authored by @wslulciuc.

Solution

  • Create job_facets, run_facets and dataset_facets tables.
  • Insert data to newly created tables for the new events.
  • Reading from those tables should be avoided until data is migrated. Migration procedure will be provided soon.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added api API layer changes docker labels Jan 9, 2023
@codecov
Copy link

codecov bot commented Jan 9, 2023

Codecov Report

Merging #2350 (be0f179) into main (656b2e6) will increase coverage by 0.08%.
The diff coverage is 81.14%.

❗ Current head be0f179 differs from pull request most recent head d3f4829. Consider uploading reports for the commit d3f4829 to get more accurate results

@@             Coverage Diff              @@
##               main    #2350      +/-   ##
============================================
+ Coverage     76.72%   76.81%   +0.08%     
- Complexity     1177     1195      +18     
============================================
  Files           222      226       +4     
  Lines          5354     5473     +119     
  Branches        429      443      +14     
============================================
+ Hits           4108     4204      +96     
- Misses          768      772       +4     
- Partials        478      497      +19     
Impacted Files Coverage Δ
...ain/java/marquez/db/mappers/DatasetDataMapper.java 85.18% <ø> (ø)
...rc/main/java/marquez/db/mappers/JobDataMapper.java 87.50% <ø> (ø)
.../java/marquez/db/models/ColumnLineageNodeData.java 100.00% <ø> (ø)
.../src/main/java/marquez/service/LineageService.java 80.30% <ø> (ø)
.../main/java/marquez/service/models/DatasetData.java 100.00% <ø> (ø)
.../src/main/java/marquez/service/models/JobData.java 66.66% <ø> (ø)
api/src/main/java/marquez/service/models/Node.java 61.53% <ø> (ø)
api/src/main/java/marquez/service/models/Run.java 82.50% <ø> (ø)
api/src/main/java/marquez/db/Columns.java 79.74% <60.00%> (-2.87%) ⬇️
api/src/main/java/marquez/db/FacetUtils.java 66.66% <66.66%> (ø)
... and 4 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@pawel-big-lebowski pawel-big-lebowski force-pushed the ol-facets/PR1-feed-new-tables branch 2 times, most recently from a026783 to 34311f3 Compare January 10, 2023 08:19
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review January 10, 2023 08:25
@pawel-big-lebowski pawel-big-lebowski self-assigned this Jan 10, 2023
@pawel-big-lebowski pawel-big-lebowski changed the title OL facets - PR1 - create & feed new tables while not reading them OL facets - PR1 - create & write new events to new tables while not reading them Jan 10, 2023
@pawel-big-lebowski pawel-big-lebowski mentioned this pull request Jan 10, 2023
7 tasks
Comment on lines 34 to 44
enum DatasetFacet implements Facet {
DOCUMENTATION(Type.DATASET, "documentation") {
Optional<PGobject> toPgObject(LineageEvent.DatasetFacets facets) {
return Optional.ofNullable(facets.getDocumentation()).map(this::toPgObject);
}
},
SCHEMA(Type.DATASET, "schema") {
Optional<PGobject> toPgObject(LineageEvent.DatasetFacets facets) {
return Optional.ofNullable(facets.getSchema()).map(this::toPgObject);
}
},
DATASOURCE(Type.DATASET, "dataSource") {
Optional<PGobject> toPgObject(LineageEvent.DatasetFacets facets) {
return Optional.ofNullable(facets.getDataSource()).map(this::toPgObject);
}
},
DESCRIPTION(Type.DATASET, "description") {
Optional<PGobject> toPgObject(LineageEvent.DatasetFacets facets) {
return Optional.ofNullable(facets.getDescription()).map(this::toPgObject);
}
},
LIFECYCLE_STATE_CHANGE(Type.DATASET, "lifecycleStateChange") {
Optional<PGobject> toPgObject(LineageEvent.DatasetFacets facets) {
return Optional.ofNullable(facets.getLifecycleStateChange()).map(this::toPgObject);
}
},
VERSION(Type.DATASET, "version"),
COLUMN_LINEAGE(Type.DATASET, "columnLineage") {
Optional<PGobject> toPgObject(LineageEvent.DatasetFacets facets) {
return Optional.ofNullable(facets.getColumnLineage()).map(this::toPgObject);
}
},
OWNERSHIP(Type.DATASET, "ownership"),
DATA_QUALITY_METRICS(Type.INPUT, "dataQualityMetrics"),
DATA_QUALITY_ASSERTIONS(Type.INPUT, "dataQualityAssertions"),
OUTPUT_STATISTICS(Type.OUTPUT, "outputStatistics");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed briefly the other day, but these enums should be unnecessary, especially the facets that aren't mapped directly to Java fields - e.g., outputStatistics, dataQualityMetrics, etc.). Your call to datasetFacets.getAdditionalFacets() below should fetch all facets that haven't been mapped to a field in the DatasetFacets class. But since you have to convert everything to JSON anyway, why not convert the whole DatasetFacets class to a generic JsonNode object? You could do something like

    JsonNode jsonNode = Utils.getMapper().valueToTree(facets);
    StreamSupport.stream(Spliterators.spliteratorUnknownSize(jsonNode.fieldNames(), Spliterator.DISTINCT),
    false)
        .forEach(fieldName -> {
          storeFacet(fieldName, toPgObject(jsonNode.get(fieldName)));
        });

You'll be able to support all currently known facets as well as new facets and custom facets without ever having to update the server code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got rid of enums in RunFacetsDao and JobFacetsDao.
We still need it in a reduced form in DatasetFacetsDao to extract facet type (DATASET, INPUT, OUTPUT or UNKNOWN).

@pawel-big-lebowski pawel-big-lebowski force-pushed the ol-facets/PR1-feed-new-tables branch 2 times, most recently from 3b63ae1 to c89504d Compare January 12, 2023 08:49
Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @pawel-big-lebowski 💯 🥇

@wslulciuc wslulciuc merged commit 6436ddc into main Jan 17, 2023
@wslulciuc wslulciuc deleted the ol-facets/PR1-feed-new-tables branch January 17, 2023 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api API layer changes docker
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants