Runless events - refactor job_versions_io_mapping #2654

pawel-big-lebowski · 2023-10-17T15:49:10Z

Problem

This is currently a draft PR which is far from being merged. It is missing a few tests related to schema changes, which are marked with TODO in the code. I've created this PR to have a better discussion on adding job_id to job_versions_io_mapping. This PR should be a follow-up of #2641.

The assumption was that it should help optimise the get-lineage query. I would first like to clarify how we are going to benefit from this extra column.

codecov · 2023-10-17T15:58:46Z

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (a5a0e55) 84.08% compared to head (a8cdbe0) 84.15%.

Files	Patch %	Lines
...rations/V67_2_JobVersionsIOMappingBackfillJob.java	81.81%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2654      +/-   ##
============================================
+ Coverage     84.08%   84.15%   +0.06%     
- Complexity     1379     1390      +11     
============================================
  Files           248      249       +1     
  Lines          6295     6322      +27     
  Branches        286      286              
============================================
+ Hits           5293     5320      +27     
- Misses          849      850       +1     
+ Partials        153      152       -1

wslulciuc · 2023-10-18T05:01:43Z

api/src/main/java/marquez/db/DatasetFacetsDao.java

      @NonNull Instant lineageEventTime,
-      @NonNull String lineageEventType,
+      String lineageEventType,


Can we use DATASET or JOB as the lineageEventType? I think it would be helpful to know the event type.

Currently, this column holds run states, and its name is somewhat misleading. The name fits the static lineage scenario well. However, I don't think we should store both run state and event type in a single column. Renaming this column would require a significant amount of work, with changes all over the project, including the spec.

Assuming this column contains run state updates, keeping it null for DatasetEvent makes sense to me.

I agree it's misleading, but Marquez doesn't necessarily have to adhere exactly to OpenLineage concepts and can redefine them. We can set lineage_event_type to null, but I would prefer that we (eventually) remap eventType to runState (just not within this PR).

pawel-big-lebowski · 2023-10-25T12:07:45Z

@wslulciuc I've added another commit to the PR, keeping in mind the upcoming streaming job support and our offline discussion.

Based on that, I think we should rename job_versions_io_mapping to job_io_mapping and make job_version_uuid nullable, so that it can be set at the end of the job.

julienledem · 2023-10-26T22:38:52Z

That sounds fine to me. I'm curious to hear what Willy thinks.

wslulciuc · 2023-12-05T09:55:25Z

api/src/main/java/marquez/db/LineageDao.java

+                           ARRAY_AGG(DISTINCT io.dataset_uuid) FILTER (WHERE io.io_type='INPUT') AS inputs,
+                           ARRAY_AGG(DISTINCT io.dataset_uuid) FILTER (WHERE io.io_type='OUTPUT') AS outputs
+                    FROM job_io_mapping io
+                    WHERE io.is_job_version_current = TRUE


I like the idea of using a current-version check. I'd be interested to see the query plan and how the query may be optimized by removing the join on the jobs table. Do we have any numbers on this?

The getLineage method is commonly used, as it is the entry point to Marquez for a user. The method is recursive, and the purpose of this refactor is to let each recursion step be computed within a single table, with no joins required.

After this change, the whole lineage graph can be computed from the job_versions_io_mapping table alone. jobs_view is used only to enrich the returned graph nodes. Before this change, a join to jobs_view was needed in each recursion step to check whether a row in job_versions_io_mapping represents the current job version.
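
To make the single-table idea concrete, here is a minimal in-memory Java sketch. It is not the actual LineageDao query: the class LineageSketch, the IoRow type, the connectedJobs method, and the maxDepth parameter are all hypothetical names introduced for illustration, and the traversal rule (two jobs are linked when their current versions touch the same dataset) is a simplifying assumption. The point it illustrates is that each step reads only mapping rows and their current-version flag, with no lookup against the jobs table.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class LineageSketch {

  /** Hypothetical in-memory stand-in for one row of the IO mapping table. */
  static final class IoRow {
    final String jobUuid;
    final String datasetUuid;
    final String ioType; // "INPUT" or "OUTPUT"; not needed for reachability itself
    final boolean isCurrentJobVersion;

    IoRow(String jobUuid, String datasetUuid, String ioType, boolean isCurrentJobVersion) {
      this.jobUuid = jobUuid;
      this.datasetUuid = datasetUuid;
      this.ioType = ioType;
      this.isCurrentJobVersion = isCurrentJobVersion;
    }
  }

  /**
   * Returns the start job plus every job reachable through shared datasets within
   * maxDepth steps. Only rows flagged as belonging to a current job version are
   * considered, so no second table is consulted (assumed simplification).
   */
  static Set<String> connectedJobs(List<IoRow> rows, String startJobUuid, int maxDepth) {
    Set<String> visited = new LinkedHashSet<>();
    visited.add(startJobUuid);
    List<String> frontier = List.of(startJobUuid);

    for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
      // Datasets read or written by the current versions of the frontier jobs.
      Set<String> datasets = new HashSet<>();
      for (IoRow row : rows) {
        if (row.isCurrentJobVersion && frontier.contains(row.jobUuid)) {
          datasets.add(row.datasetUuid);
        }
      }
      // Jobs not seen before whose current version touches one of those datasets.
      List<String> next = new ArrayList<>();
      for (IoRow row : rows) {
        if (row.isCurrentJobVersion
            && datasets.contains(row.datasetUuid)
            && visited.add(row.jobUuid)) {
          next.add(row.jobUuid);
        }
      }
      frontier = next;
    }
    return visited;
  }
}
```

In this sketch, a row that belongs to an outdated job version is skipped purely because of its flag, which is the role the join to jobs_view played before the change.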

wslulciuc · 2023-12-05T10:04:29Z

...src/main/resources/marquez/db/migration/V67.1__job_versions_io_mapping_add_job_reference.sql

@@ -0,0 +1,13 @@
+ALTER TABLE job_versions_io_mapping ADD COLUMN job_uuid uuid REFERENCES jobs(uuid) ON DELETE CASCADE;
+ALTER TABLE job_versions_io_mapping ADD COLUMN symlink_target_job_uuid uuid REFERENCES jobs(uuid) ON DELETE CASCADE;


Minor: symlink_target_job_uuid -> job_symlink_target_uuid since it's defined in the jobs table as jobs.symlink_target_uuid

wslulciuc · 2023-12-05T10:16:10Z

api/src/main/java/marquez/db/JobVersionDao.java

+    INSERT INTO job_io_mapping (
+      job_version_uuid, dataset_uuid, io_type, job_uuid, symlink_target_job_uuid, is_job_version_current)
+    VALUES (:jobVersionUuid, :datasetUuid, :ioType, :jobUuid, :symlinkTargetJobUuid, TRUE)
+    ON CONFLICT (job_version_uuid, dataset_uuid, io_type, job_uuid) DO UPDATE SET is_job_version_current = TRUE


We set is_job_version_current = TRUE as a noop? i.e. just to fulfill the ON CONFLICT? Should we just use DO NOTHING instead?

Yes, you're right, we can go with DO NOTHING.
markVersionIOMappingObsolete marks as obsolete the rows whose job version differs from the given one, so we don't need to handle the conflict scenario here.

wslulciuc · 2023-12-12T23:28:20Z

...src/main/resources/marquez/db/migration/V67.1__job_versions_io_mapping_add_job_reference.sql

@@ -0,0 +1,11 @@
+ALTER TABLE job_versions_io_mapping ADD COLUMN job_uuid uuid REFERENCES jobs(uuid) ON DELETE CASCADE;
+ALTER TABLE job_versions_io_mapping ADD COLUMN job_symlink_target_uuid uuid REFERENCES jobs(uuid) ON DELETE CASCADE;
+ALTER TABLE job_versions_io_mapping ADD COLUMN is_current_job_version boolean DEFAULT FALSE;


Can we add a made_current_at column?

wslulciuc

Looking forward to the lineage query perf improvements and follow up analysis!

Signed-off-by: Pawel Leszczynski <[email protected]>

boring-cyborg bot added the api (API layer changes) and docs labels Oct 17, 2023

wslulciuc reviewed Oct 18, 2023

pawel-big-lebowski changed the title ~~Runless events - consume dataset event~~ Runless events - refactor job_versions_io_mapping Oct 18, 2023

pawel-big-lebowski changed the base branch from main to static/dataset-event October 18, 2023 09:57

pawel-big-lebowski force-pushed the static/job-version-mapping branch 5 times, most recently from 1607104 to 9f1dedd Compare October 23, 2023 10:57

pawel-big-lebowski force-pushed the static/dataset-event branch 3 times, most recently from 2213b35 to 40bfe6b Compare October 24, 2023 12:24

pawel-big-lebowski force-pushed the static/job-version-mapping branch 2 times, most recently from 46e26da to 7592059 Compare October 24, 2023 13:06

Base automatically changed from static/dataset-event to main November 6, 2023 07:16

pawel-big-lebowski force-pushed the static/job-version-mapping branch from ca5f6cd to 79acd2d Compare November 8, 2023 07:42

pawel-big-lebowski changed the base branch from main to static/job-event November 8, 2023 07:42

pawel-big-lebowski force-pushed the static/job-version-mapping branch 3 times, most recently from 2110951 to 4aa0900 Compare November 8, 2023 13:15

pawel-big-lebowski force-pushed the static/job-version-mapping branch 4 times, most recently from b6c0213 to 554f90c Compare November 10, 2023 07:39

pawel-big-lebowski marked this pull request as ready for review November 10, 2023 07:40

pawel-big-lebowski force-pushed the static/job-event branch from b7e40f8 to 0a3f98a Compare November 15, 2023 10:25

Base automatically changed from static/job-event to main November 16, 2023 06:57

pawel-big-lebowski force-pushed the static/job-version-mapping branch 2 times, most recently from d7166cc to 17810f2 Compare November 16, 2023 07:17