
Support streaming jobs in Marquez #2682

Merged
merged 3 commits into from
Dec 14, 2023

Conversation

pawel-big-lebowski
Collaborator

@pawel-big-lebowski pawel-big-lebowski commented Nov 15, 2023

Problem

Currently, job_version is created once the job finishes. This assumption no longer holds for streaming jobs, which can run for days or weeks while continuously writing output datasets.

In this case, it makes sense to create the job version and identify the input/output datasets at the beginning of the job, and to modify them on the fly if the job changes.

In particular, the lineage endpoint shall include streaming jobs in the lineage graph while they're running, and datasets being written by streaming jobs should be available (and have their dataset version modified) once the job has started.

Solution

  • add extra logic to create the job_version entry and its references before the job completes,
  • make sure the new version is written only once,
  • support JobTypeJobFacet to determine whether a job is batch or streaming,
  • store job characteristics in the jobs table.
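To illustrate the last two points: OpenLineage's JobTypeJobFacet carries the processing type on the job, which is what lets the backend tell batch from streaming. A minimal, hypothetical run-event fragment might look like the following — the namespace and job name are placeholders, and the `_producer`/`_schemaURL` fields that every facet normally carries are omitted for brevity:

```json
{
  "job": {
    "namespace": "example-namespace",
    "name": "example-streaming-job",
    "facets": {
      "jobType": {
        "processingType": "STREAMING",
        "integration": "FLINK",
        "jobType": "JOB"
      }
    }
  }
}
```

A batch job would carry `"processingType": "BATCH"` instead, and the rest of the event would be handled as before.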

Note: All database schema changes require discussion. Please link the issue for context.

One-line summary:

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've included a one-line summary of your change for the CHANGELOG.md (Depending on the change, this may not be necessary).
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added the api API layer changes label Nov 15, 2023
Copy link

codecov bot commented Nov 15, 2023

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison: base (b73fb15) at 84.15% vs. head (a4e6fe7) at 84.24%.

Files Patch % Lines
...pi/src/main/java/marquez/db/mappers/JobMapper.java 93.54% 1 Missing and 1 partial ⚠️
api/src/main/java/marquez/db/JobVersionDao.java 97.77% 0 Missing and 1 partial ⚠️
api/src/main/java/marquez/db/OpenLineageDao.java 95.45% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2682      +/-   ##
============================================
+ Coverage     84.15%   84.24%   +0.09%     
- Complexity     1390     1405      +15     
============================================
  Files           249      249              
  Lines          6322     6371      +49     
  Branches        286      291       +5     
============================================
+ Hits           5320     5367      +47     
- Misses          850      851       +1     
- Partials        152      153       +1     


@pawel-big-lebowski pawel-big-lebowski force-pushed the static/job-version-mapping branch 2 times, most recently from d7166cc to 17810f2 Compare November 16, 2023 07:17
@boring-cyborg boring-cyborg bot added the docs label Nov 16, 2023
@pawel-big-lebowski pawel-big-lebowski force-pushed the stream/support-streaming-jobs branch 6 times, most recently from 0f4c077 to 6008834 Compare November 20, 2023 10:14
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review November 20, 2023 10:30
@pawel-big-lebowski pawel-big-lebowski force-pushed the stream/support-streaming-jobs branch 3 times, most recently from 8efc020 to 2ff9048 Compare December 1, 2023 13:52
@pawel-big-lebowski pawel-big-lebowski force-pushed the stream/support-streaming-jobs branch 3 times, most recently from e90d7ec to 9d2eea8 Compare December 5, 2023 14:57
*
* <p>In this case, a job version is created based on the list of input and output datasets
* referenced by this job. If a job starts with inputs:{A,B} and outputs:{C}, new job version is
* created immediately at job start. If a following event produces inputs:{A}, outputs:{C}, then
Member

@wslulciuc wslulciuc Dec 12, 2023


If a job has {A,B} as initial inputs, then later only {A}, shouldn't that be a new job version? Given that all inputs/outputs are expected when a job run has started, we should create a job version anytime the inputs or outputs change and associate the run with the new job version.

Collaborator Author


The existing algorithm for computing a job's version relies on all the inputs & outputs of the particular run. We should not modify it, as this would cause a new job version for all the existing jobs, but we could create a separate method to evaluate the version of a streaming job if we wanted.

However, when looking into this approach, I found it useful. It's a cumulative approach, where a new job version is created whenever a new input/output dataset is involved in processing. If some dataset was included in past events but is no longer present, the version does not change.

The benefit of this assumption is that we don't require the producer to emit all the datasets all the time. If an event contains only the amount of bytes written to an output dataset, without including the input datasets, that doesn't mean there is a new job version without the inputs.
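The cumulative rule described above can be sketched as follows. This is a hypothetical illustration of the versioning behavior, not the actual Marquez implementation; the class and method names are made up:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the cumulative versioning rule for streaming jobs:
// a new version is created only when a previously unseen input or
// output dataset appears; datasets missing from a later event do
// NOT trigger a new version.
class StreamingJobVersionTracker {
  private final Set<String> knownDatasets = new HashSet<>();
  private int version = 0;

  /** Returns the job version after observing one run event's datasets. */
  int observe(Set<String> inputs, Set<String> outputs) {
    boolean newDatasetSeen = false;
    for (String ds : inputs) {
      if (knownDatasets.add(ds)) newDatasetSeen = true;
    }
    for (String ds : outputs) {
      if (knownDatasets.add(ds)) newDatasetSeen = true;
    }
    if (newDatasetSeen) version++; // bump only on a genuinely new edge
    return version;
  }
}
```

So an event with inputs {A,B} / outputs {C} followed by one with inputs {A} / outputs {C} keeps the same version, while a later event introducing dataset D bumps it.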

Is it OK?

Member


I do feel that if a stream is removed, you'd want to remove that edge from the current lineage graph. But I also understand the limitations here, as you mentioned, with bytes written to an output dataset present with no input datasets specified; that said, and after thinking it through, this may be a no-op and the logic you have is reasonable.

For example, let's say we have a streaming job X with inputs {A,B} and outputs {C}. The job runs and is marked as RUNNING with the run ID 74f117af-eb90-4ffd-98e1-c1bc215934df. To change the inputs from {A,B} to {B} (or similarly for the outputs), the user will have to redeploy the job (possibly with new code) and therefore be associated with a new run ID. So what I think you have is logically correct given how streaming jobs are deployed. For batch jobs, versioning is more straightforward, as we version from run to run.

Member

@wslulciuc wslulciuc left a comment


We finally have support for streaming jobs! 💯 💯 💯

@wslulciuc wslulciuc added this to the 0.43.0 milestone Dec 13, 2023
Base automatically changed from static/job-version-mapping to main December 13, 2023 07:58

netlify bot commented Dec 13, 2023

Deploy Preview for peppy-sprite-186812 canceled.

Latest commit: a4e6fe7
Latest deploy log: https://app.netlify.com/sites/peppy-sprite-186812/deploys/65797fd55a323b00083b748d

Signed-off-by: Pawel Leszczynski <[email protected]>
@pawel-big-lebowski pawel-big-lebowski merged commit 2c8613a into main Dec 14, 2023
16 checks passed
@pawel-big-lebowski pawel-big-lebowski deleted the stream/support-streaming-jobs branch December 14, 2023 13:01