optimizing current runs query for lineage api #2211

prachim-collab
Contributor

@prachim-collab prachim-collab commented Oct 24, 2022

Problem

Introduce a simpler, alternate getCurrentRuns query that fetches only the basic run rows from the DB, without the additional data from tables such as run_args, job_context, facets, and input/output versions, which required the extra table joins in the old getCurrentRuns query. This new getCurrentRuns DAO method is NOT used in Marquez as of now.

Closes: #4425

Solution

  • The old getCurrentRuns DAO is renamed to getCurrentRunsWithFacets without any change to the SQL query.
  • The new lightweight query is called getCurrentRuns and is NOT called from the /lineage API as of now, so NO change is required to the /lineage API response spec.
  • A withRunFacets flag is also introduced as a parameter to the lineage API. It is always set to true so that getCurrentRunsWithFacets is called, which means the /lineage API, and hence all the tests, still use the old DAO.
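
The flag-based dispatch described in these bullets could look roughly like the sketch below. Only the names getCurrentRuns, getCurrentRunsWithFacets, and withRunFacets come from this PR; the RunDao interface, the FakeDao class, the String return type, and the Set<UUID> parameter are hypothetical stand-ins so the example runs without a database.

```java
import java.util.List;
import java.util.Set;
import java.util.UUID;

final class WithRunFacetsSketch {

  /** Hypothetical stand-in for the DAO; only the two method names come from this PR. */
  interface RunDao {
    // Lightweight query: run rows only, no joins for args, context, facets, or versions.
    List<String> getCurrentRuns(Set<UUID> jobUuids);

    // Original query under its new name: same SQL as before, including the extra joins.
    List<String> getCurrentRunsWithFacets(Set<UUID> jobUuids);
  }

  /** Picks the DAO method from the flag; passing true keeps the pre-existing behaviour. */
  static List<String> currentRuns(RunDao dao, Set<UUID> jobUuids, boolean withRunFacets) {
    return withRunFacets
        ? dao.getCurrentRunsWithFacets(jobUuids)
        : dao.getCurrentRuns(jobUuids);
  }

  /** In-memory fake so the sketch can be exercised without a database. */
  static final class FakeDao implements RunDao {
    @Override
    public List<String> getCurrentRuns(Set<UUID> jobUuids) {
      return List.of("run");
    }

    @Override
    public List<String> getCurrentRunsWithFacets(Set<UUID> jobUuids) {
      return List.of("run+facets");
    }
  }

  public static void main(String[] args) {
    Set<UUID> jobs = Set.of(UUID.randomUUID());
    RunDao dao = new FakeDao();
    System.out.println(currentRuns(dao, jobs, true));  // path the /lineage API takes today
    System.out.println(currentRuns(dao, jobs, false)); // lightweight path, not yet wired in
  }
}
```

Because every existing caller passes true, the response payload stays the same while the lightweight path remains available for later use.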

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added the api API layer changes label Oct 24, 2022

boring-cyborg bot commented Oct 24, 2022

Thanks for opening your first pull request in the Marquez project! Please check out our contributing guidelines (https://github.com/MarquezProject/marquez/blob/main/CONTRIBUTING.md).

codecov bot commented Oct 24, 2022

Codecov Report

Merging #2211 (a330beb) into main (4b86615) will decrease coverage by 9.44%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##               main    #2211      +/-   ##
============================================
- Coverage     76.67%   67.23%   -9.45%     
+ Complexity     1144      227     -917     
============================================
  Files           219       40     -179     
  Lines          5312      940    -4372     
  Branches        421      102     -319     
============================================
- Hits           4073      632    -3441     
+ Misses          764      159     -605     
+ Partials        475      149     -326     
Impacted Files Coverage Δ
...src/main/java/marquez/api/OpenLineageResource.java
...pi/src/main/java/marquez/db/mappers/RunMapper.java
.../src/main/java/marquez/service/LineageService.java
...ain/java/marquez/service/ColumnLineageService.java
...i/src/main/java/marquez/service/SourceService.java
...c/main/java/marquez/api/ColumnLineageResource.java
...c/main/java/marquez/db/mappers/OwnerRowMapper.java
...i/src/main/java/marquez/graphql/mapper/RowMap.java
...pi/src/main/java/marquez/db/models/DatasetRow.java
...va/marquez/db/mappers/DatasetVersionRowMapper.java
... and 169 more

@prachim-collab prachim-collab force-pushed the prachim/perf/4425_optimize_get_current_run_query branch from 6ca59e1 to bc2349d Compare October 24, 2022 21:23
@prachim-collab prachim-collab marked this pull request as ready for review October 24, 2022 21:25
@prachim-collab prachim-collab force-pushed the prachim/perf/4425_optimize_get_current_run_query branch 2 times, most recently from fea1168 to b39fcc5 Compare October 24, 2022 21:31
@@ -268,7 +268,7 @@ public void testLineageWithDeletedDataset() {
  .hasSize(0);
  runAssert
  .extracting(Run::getOutputVersions, InstanceOfAssertFactories.list(DatasetVersionId.class))
- .hasSize(1);
+ .hasSize(0);
Member

I think this is a little misleading since we're not asking the database for this information anymore. Should we just remove this assertion or change the type definition?

Contributor Author

I agree. I had initially removed this completely but then added it back, because getInputVersions also had an assertion. I guess I should remove both of them now.

Comment on lines 161 to 163
runAssert
.extracting(Run::getOutputVersions, InstanceOfAssertFactories.list(DatasetVersionId.class))
.hasSize(1);
Collaborator

We are asserting on the Lineage object returned from the lineageService.lineage call, which is the response payload of the GET lineage API. Does this mean that the API response is being changed?

Collaborator

Yeah, I don't like all the information being returned here, but breaking API compatibility is not good. If we want a lighter-weight version of the lineage API, I think it's better to either include an optional parameter to exclude the superfluous data or to create a new API and deprecate the old one

Contributor Author

I have updated the PR and added a withRunFacets flag to get runs with all the superfluous data. This flag is always true in the Marquez APIs, so there is no effect on API compatibility.

@prachim-collab prachim-collab marked this pull request as draft October 26, 2022 20:51
@prachim-collab prachim-collab force-pushed the prachim/perf/4425_optimize_get_current_run_query branch 2 times, most recently from 601278b to 52ebaa3 Compare October 31, 2022 22:12
@prachim-collab prachim-collab marked this pull request as ready for review October 31, 2022 22:49
@@ -46,7 +46,7 @@ public LineageService(LineageDao delegate, JobDao jobDao) {
  this.jobDao = jobDao;
  }

- public Lineage lineage(NodeId nodeId, int depth) {
+ public Lineage lineage(NodeId nodeId, int depth, boolean withRunFacets) {
Collaborator

This is a good change, but I worry that we'll want to add more options to this method (e.g., include job facets? dataset facets? exclude runs altogether?). I don't think we should take this on now, but let's add a TODO to make the input parameters here more easily extendable so that we can add those other options later on.

Contributor Author

Couldn't agree more. I had a small aversion to adding a flag to make this work, but there was no better option. I also thought that if more changes like this come along in the future and alter the API significantly, we could add them as options in the API query parameters, or create more granular (broken-down) APIs to get specific data.

Contributor Author

Added a TODO.
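
One way the TODO could eventually be addressed is a small immutable options object in place of individual boolean parameters. This is only a sketch under assumptions: LineageOptions, its fields other than withRunFacets, the default values, and the String-returning lineage stand-in are all hypothetical and are not part of this PR.

```java
final class LineageOptionsSketch {

  /** Hypothetical immutable options holder; only withRunFacets exists in this PR. */
  static final class LineageOptions {
    final boolean withRunFacets;
    final boolean withJobFacets; // assumed future option
    final boolean includeRuns;   // assumed future option

    private LineageOptions(boolean withRunFacets, boolean withJobFacets, boolean includeRuns) {
      this.withRunFacets = withRunFacets;
      this.withJobFacets = withJobFacets;
      this.includeRuns = includeRuns;
    }

    /** withRunFacets=true mirrors current behaviour; the other defaults are assumptions. */
    static LineageOptions defaults() {
      return new LineageOptions(true, true, true);
    }

    // Each "wither" returns a modified copy, so shared instances are never mutated.
    LineageOptions withRunFacets(boolean value) {
      return new LineageOptions(value, withJobFacets, includeRuns);
    }

    LineageOptions withJobFacets(boolean value) {
      return new LineageOptions(withRunFacets, value, includeRuns);
    }

    LineageOptions includeRuns(boolean value) {
      return new LineageOptions(withRunFacets, withJobFacets, value);
    }
  }

  /** Stand-in for the service method: its signature stays stable as options are added. */
  static String lineage(String nodeId, int depth, LineageOptions options) {
    return nodeId
        + " depth=" + depth
        + " runFacets=" + options.withRunFacets
        + " jobFacets=" + options.withJobFacets
        + " runs=" + options.includeRuns;
  }

  public static void main(String[] args) {
    LineageOptions light = LineageOptions.defaults().withRunFacets(false);
    System.out.println(lineage("job:ns:name", 2, light));
  }
}
```

With this shape, adding another option means adding one field and one copy method, and existing call sites that use defaults() keep compiling unchanged.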

Comment on lines 134 to 140
@SqlQuery(
"SELECT DISTINCT on(r.job_name, r.namespace_name) r.*, jv.version as job_version\n"
+ " FROM runs_view r\n"
+ " INNER JOIN job_versions jv ON jv.uuid=r.job_version_uuid\n"
+ " INNER JOIN jobs_view j ON j.uuid=jv.job_uuid\n"
+ " WHERE j.uuid in (<jobUuid>) OR j.symlink_target_uuid IN (<jobUuid>)\n"
+ " ORDER BY r.job_name, r.namespace_name, created_at DESC\n")
Member

I think we should use the more readable syntax variant as we update our queries.

"""
SELECT DISTINCT on(r.job_name, r.namespace_name) r.*, jv.version as job_version
          FROM runs_view r
          INNER JOIN job_versions jv ON jv.uuid=r.job_version_uuid
          INNER JOIN jobs_view j ON j.uuid=jv.job_uuid
          WHERE j.uuid in (<jobUuid>) OR j.symlink_target_uuid IN (<jobUuid>)
          ORDER BY r.job_name, r.namespace_name, created_at DESC
"""
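
As a rough illustration of why the text block variant is a drop-in replacement, the sketch below holds the same query in both styles and checks that the resulting strings are identical. It assumes Java 15 or newer (text blocks) and uses a hypothetical holder class. The per-line leading spaces from the concatenated original are omitted here, which does not matter to SQL.

```java
final class TextBlockSketch {

  // Concatenated style, as in the existing annotation.
  static final String CONCATENATED =
      "SELECT DISTINCT ON (r.job_name, r.namespace_name) r.*, jv.version AS job_version\n"
          + "FROM runs_view r\n"
          + "INNER JOIN job_versions jv ON jv.uuid=r.job_version_uuid\n"
          + "INNER JOIN jobs_view j ON j.uuid=jv.job_uuid\n"
          + "WHERE j.uuid IN (<jobUuid>) OR j.symlink_target_uuid IN (<jobUuid>)\n"
          + "ORDER BY r.job_name, r.namespace_name, created_at DESC\n";

  // Text block style: indentation shared with the closing delimiter is stripped,
  // and the line break before the closing delimiter yields the trailing newline.
  static final String TEXT_BLOCK =
      """
      SELECT DISTINCT ON (r.job_name, r.namespace_name) r.*, jv.version AS job_version
      FROM runs_view r
      INNER JOIN job_versions jv ON jv.uuid=r.job_version_uuid
      INNER JOIN jobs_view j ON j.uuid=jv.job_uuid
      WHERE j.uuid IN (<jobUuid>) OR j.symlink_target_uuid IN (<jobUuid>)
      ORDER BY r.job_name, r.namespace_name, created_at DESC
      """;

  /** True when both styles produce exactly the same SQL string. */
  static boolean sameQuery() {
    return CONCATENATED.equals(TEXT_BLOCK);
  }

  public static void main(String[] args) {
    System.out.println(sameQuery());
  }
}
```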

Contributor Author

Updated the syntax as you asked.

Member

@phixMe phixMe left a comment

Thanks so much!

@prachim-collab prachim-collab force-pushed the prachim/perf/4425_optimize_get_current_run_query branch from 4662e51 to a330beb Compare November 2, 2022 21:43
@collado-mike collado-mike merged commit 6dad6aa into MarquezProject:main Nov 2, 2022

boring-cyborg bot commented Nov 2, 2022

Great job! Congrats on your first merged pull request in the Marquez project!
