fix column lineage when multiple jobs write to same dataset #2289


pawel-big-lebowski
Collaborator

@pawel-big-lebowski pawel-big-lebowski commented Dec 6, 2022

Signed-off-by: Pawel Leszczynski [email protected]

Problem

The current model of the column-lineage API does not handle the scenario in which multiple different jobs write to the same column of an output dataset. Specifically, a response of the form:

         "transformationDescription": "identical",
         "transformationType": "IDENTITY",
         "inputFields": [
            { "namespace": "DBA", "name": "tableA", "field": "columnA"},
            { "namespace": "DBB", "name": "tableB", "field": "columnB"},
            { "namespace": "DBC", "name": "tableC", "field": "columnC"}
         ]

should be converted into:

         "inputFields": [
            {
               "namespace": "DBA",
               "name": "tableA",
               "field": "columnA",
               "transformationDescription": "identical",
               "transformationType": "IDENTITY"
            },
            ....
         ]

with transformationDescription and transformationType contained per input field.
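
As a rough illustration of the reshaping described above, the following Python sketch moves the two top-level attributes into each entry of inputFields. It is not Marquez code: the function name `to_per_field_shape` is hypothetical, and the only assumption is that a response node uses the JSON keys shown in this description.

```python
import copy


def to_per_field_shape(column):
    """Return a copy of `column` with the transformation attributes
    moved from the top level into every entry of `inputFields`.

    Hypothetical helper for illustration; not part of Marquez.
    """
    result = copy.deepcopy(column)  # leave the caller's dict untouched
    description = result.pop("transformationDescription", None)
    transformation_type = result.pop("transformationType", None)
    for field in result.get("inputFields", []):
        # Keep a per-field value if one is already present.
        field.setdefault("transformationDescription", description)
        field.setdefault("transformationType", transformation_type)
    return result


legacy = {
    "transformationDescription": "identical",
    "transformationType": "IDENTITY",
    "inputFields": [
        {"namespace": "DBA", "name": "tableA", "field": "columnA"},
        {"namespace": "DBB", "name": "tableB", "field": "columnB"},
        {"namespace": "DBC", "name": "tableC", "field": "columnC"},
    ],
}

converted = to_per_field_shape(legacy)
```

In the legacy shape all three input fields are forced to share one transformation, which is exactly the limitation this PR addresses; after conversion each field carries its own pair of attributes.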

Solution

  • Update the API model while still returning the deprecated top-level transformationDescription and transformationType,
  • Write a test in which multiple different jobs write to the same column,
  • Additionally, display the column lineage of a dataset in the Marquez UI (helpful for debugging).
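
The multiple-writers scenario from the list above can be sketched as follows. This is a minimal Python illustration, not the Marquez implementation: `merge_column_lineage`, the job names, and the "MASKED" transformation values are hypothetical, and mirroring the first write in the deprecated top-level fields is purely an assumption made for this example.

```python
def merge_column_lineage(writes):
    """Merge writes from several jobs to one output column into one node.

    Each write is a dict with `transformationDescription`,
    `transformationType` and `inputFields`. Hypothetical helper.
    """
    input_fields = []
    for write in writes:
        for field in write["inputFields"]:
            # Attach the writing job's transformation to each input field.
            input_fields.append({
                **field,
                "transformationDescription": write["transformationDescription"],
                "transformationType": write["transformationType"],
            })
    node = {"inputFields": input_fields}
    if writes:
        # ASSUMPTION: the deprecated top-level fields mirror the first write.
        node["transformationDescription"] = writes[0]["transformationDescription"]
        node["transformationType"] = writes[0]["transformationType"]
    return node


# Two hypothetical jobs writing to the same output column.
job_a_write = {
    "transformationDescription": "identical",
    "transformationType": "IDENTITY",
    "inputFields": [{"namespace": "DBA", "name": "tableA", "field": "columnA"}],
}
job_b_write = {
    "transformationDescription": "masked",
    "transformationType": "MASKED",
    "inputFields": [{"namespace": "DBB", "name": "tableB", "field": "columnB"}],
}

node = merge_column_lineage([job_a_write, job_b_write])
```

With per-field attributes, the two inputs keep their distinct transformations, while older clients still find the deprecated top-level pair.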

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added api API layer changes client/java labels Dec 6, 2022
@pawel-big-lebowski pawel-big-lebowski force-pushed the fix/column-lineage-multiple-jobs-write-to-same-columns branch from 7f7a754 to 8a2bd5a Compare December 6, 2022 09:14
@codecov

codecov bot commented Dec 6, 2022

Codecov Report

Merging #2289 (a7ecf04) into main (c8a38a1) will increase coverage by 0.13%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main    #2289      +/-   ##
============================================
+ Coverage     76.84%   76.97%   +0.13%     
- Complexity     1154     1163       +9     
============================================
  Files           220      222       +2     
  Lines          5268     5298      +30     
  Branches        423      423              
============================================
+ Hits           4048     4078      +30     
  Misses          747      747              
  Partials        473      473              
Impacted Files                                            Coverage Δ
api/src/main/java/marquez/db/ColumnLineageDao.java        100.00% <ø> (ø)
...a/marquez/client/models/ColumnLineageNodeData.java     0.00% <ø> (ø)
...arquez/db/mappers/ColumnLineageNodeDataMapper.java     90.47% <100.00%> (ø)
.../java/marquez/db/models/ColumnLineageNodeData.java     100.00% <100.00%> (ø)
...ain/java/marquez/service/ColumnLineageService.java     97.24% <100.00%> (+0.03%) ⬆️
...ain/java/marquez/service/models/ColumnLineage.java     100.00% <100.00%> (ø)

@pawel-big-lebowski pawel-big-lebowski force-pushed the fix/column-lineage-multiple-jobs-write-to-same-columns branch from 8a2bd5a to 5e3ce84 Compare December 6, 2022 11:45
@wslulciuc
Member

@pawel-big-lebowski, per our discussion offline, we'll want to:

  1. First, add the fields transformationDescription and transformationType under the inputFields object
  2. Then, delete the top-level fields transformationDescription and transformationType

We'll also want to ship the two breaking changes above in separate (minor) releases, and announce the deprecation and the later removal in our changelog / release notes.
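
For API consumers, the two-step rollout above suggests reading the per-field values first and falling back to the top-level ones only when they are absent. A minimal Python sketch of that idea, assuming the JSON keys used in this PR (the helper name `transformation_of` is hypothetical and not part of any Marquez client):

```python
def transformation_of(column, field):
    """Return (type, description) for one input field.

    Prefers the per-field attributes and falls back to the deprecated
    top-level ones, so it tolerates responses from before and after the
    removal. Hypothetical helper for illustration.
    """
    return (
        field.get("transformationType", column.get("transformationType")),
        field.get(
            "transformationDescription",
            column.get("transformationDescription"),
        ),
    )


# Response shape before step 1: attributes only at the top level.
old_column = {
    "transformationDescription": "identical",
    "transformationType": "IDENTITY",
    "inputFields": [{"namespace": "DBA", "name": "tableA", "field": "columnA"}],
}

# Response shape after step 2: attributes only on each input field.
new_column = {
    "inputFields": [
        {
            "namespace": "DBA",
            "name": "tableA",
            "field": "columnA",
            "transformationDescription": "identical",
            "transformationType": "IDENTITY",
        }
    ],
}

old_result = transformation_of(old_column, old_column["inputFields"][0])
new_result = transformation_of(new_column, new_column["inputFields"][0])
```

Both calls yield the same pair, so a client written this way would not need to change between the release that adds the per-field attributes and the one that removes the top-level ones.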

@pawel-big-lebowski pawel-big-lebowski force-pushed the fix/column-lineage-multiple-jobs-write-to-same-columns branch 2 times, most recently from 1b768e4 to ace6fe0 Compare December 6, 2022 13:38
@boring-cyborg boring-cyborg bot added the docs label Dec 6, 2022
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review December 6, 2022 13:40
@pawel-big-lebowski pawel-big-lebowski force-pushed the fix/column-lineage-multiple-jobs-write-to-same-columns branch from ace6fe0 to e988542 Compare December 6, 2022 13:42
@pawel-big-lebowski pawel-big-lebowski force-pushed the fix/column-lineage-multiple-jobs-write-to-same-columns branch from e988542 to 17fb0f7 Compare December 7, 2022 07:27
@pawel-big-lebowski pawel-big-lebowski force-pushed the fix/column-lineage-multiple-jobs-write-to-same-columns branch from 17fb0f7 to a7ecf04 Compare December 7, 2022 10:46
@pawel-big-lebowski pawel-big-lebowski merged commit 11f6cec into main Dec 7, 2022
@pawel-big-lebowski pawel-big-lebowski deleted the fix/column-lineage-multiple-jobs-write-to-same-columns branch December 7, 2022 11:19
Labels
api API layer changes client/java docs
Projects
Status: Done
3 participants