Full execution provenance resolution #5639

Draft: wants to merge 45 commits into master

Conversation

pditommaso
Member

This PR implements the ability to trace the full provenance of a Nextflow pipeline: once a task execution completes, it reports the set of direct upstream tasks that produced one or more of its inputs.

How it works

Each output value emitted by a task or an operator is wrapped in an object instance. This gives each emitted value a unique identity based on the underlying Java object identity.

Each object is associated with the corresponding task or operator run (i.e. TaskRun and OperatorRun).

Once the output value is received as an input by a task, the upstream task is determined by inspecting the output-run association table.
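As a rough illustration of the steps above, here is a minimal Java sketch of an identity-keyed association table. It is not the code in this PR: `Msg`, `ProvenanceTable`, `emit`, and `upstreamOf` are hypothetical names, run identifiers are plain strings rather than `TaskRun`/`OperatorRun` instances, and thread safety is ignored.

```java
import java.util.IdentityHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical wrapper: its only job is to give each emitted value its own object identity.
final class Msg {
    final Object value;
    Msg(Object value) { this.value = value; }
}

// Hypothetical output-run association table (not thread-safe; illustration only).
final class ProvenanceTable {
    // IdentityHashMap compares keys with ==, so wrappers holding equal values stay distinct.
    private final Map<Msg, String> producerOf = new IdentityHashMap<>();

    // Wrap a value and remember which run emitted it.
    Msg emit(String runId, Object value) {
        Msg msg = new Msg(value);
        producerOf.put(msg, runId);
        return msg;
    }

    // Resolve the direct upstream runs for the inputs received by a task.
    Set<String> upstreamOf(List<Msg> inputs) {
        Set<String> result = new LinkedHashSet<>();
        for (Msg in : inputs) {
            String run = producerOf.get(in);
            if (run != null) result.add(run);
        }
        return result;
    }
}

class ProvenanceSketch {
    public static void main(String[] args) {
        ProvenanceTable table = new ProvenanceTable();
        Msg a = table.emit("task-A", "hello");
        Msg b = table.emit("task-B", "hello"); // equal value, different identity
        System.out.println(table.upstreamOf(List.of(a)));    // prints [task-A]
        System.out.println(table.upstreamOf(List.of(a, b))); // prints [task-A, task-B]
    }
}
```

The two `"hello"` values resolve to different producers because the lookup uses the wrapper's identity, not the wrapped value's `equals`.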

Required changes

This approach requires wrapping each output value in a wrapper object and unwrapping it once it is received by the downstream task or operator, so that the corresponding operation is not altered.

The input unwrapping can be automated easily both for tasks and operators because they have a common message receive interface.

However, the output wrapping requires modifying all Nextflow operators, because each of them has custom logic to produce its outputs.
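The split between a shared unwrap step and an operator-specific wrap step might look roughly like the following Java sketch for a map-like operator. All names here (`Wrapped`, `MapOperatorSketch`, `unwrap`, `apply`) are assumptions made for illustration and do not come from the Nextflow code base.

```java
import java.util.function.Function;

// Hypothetical wrapper that carries the value together with the run that produced it.
final class Wrapped {
    final Object value;
    final String producerRun;
    Wrapped(Object value, String producerRun) {
        this.value = value;
        this.producerRun = producerRun;
    }
}

final class MapOperatorSketch {
    // Shared receive step: strip the wrapper so user code sees the plain value.
    static Object unwrap(Object received) {
        return received instanceof Wrapped ? ((Wrapped) received).value : received;
    }

    // Operator-specific emit step: a map-like operator yields one output per input,
    // so it wraps exactly one result and tags it with its own run id.
    static Wrapped apply(String runId, Object received, Function<Object, Object> fn) {
        Object plain = unwrap(received);
        Object result = fn.apply(plain);
        return new Wrapped(result, runId);
    }
}
```

The user-supplied function only ever sees unwrapped values, which is what keeps the operation itself unaltered. Operators with a different output shape (several outputs per input, or one output for many inputs) would each need their own emit step, which is why every operator has to be touched.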

Possible problems

The impact on the Java heap of creating an object instance for each output value generated by the workflow execution should be assessed.

Similarly, keeping a heap reference for each task and operator run may cause memory pressure on large workflow graphs.

Current state and next steps

The current implementation demonstrates that this approach is viable. The solution already supports all tasks and the following operators: branch, map, flatMap, collectFile.

Tests are available for these cases.

The remaining operators should be added to fully support existing workflow applications.

Alternative solution

A simpler solution is possible: use the output file paths as the identity value to track task provenance, with logic very similar to the above proposal.

However, the path approach is limited to the case in which all workflow tasks and operators produce file values. Provenance cannot be tracked for tasks having one or more non-file input/output values.
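To make the limitation concrete, here is a minimal Java sketch of the path-based alternative. The class and method names (`PathProvenance`, `recordOutput`, `upstreamOf`) and the example paths are hypothetical, and task identifiers are simplified to strings.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical path-keyed table: the output file path acts as the identity value.
final class PathProvenance {
    private final Map<Path, String> producerOf = new HashMap<>();

    // Normalize so that the same file is always stored under the same key.
    private static Path key(Path p) {
        return p.toAbsolutePath().normalize();
    }

    void recordOutput(String taskId, Path output) {
        producerOf.put(key(output), taskId);
    }

    Set<String> upstreamOf(List<Object> inputs) {
        Set<String> result = new LinkedHashSet<>();
        for (Object in : inputs) {
            if (in instanceof Path) {
                String task = producerOf.get(key((Path) in));
                if (task != null) result.add(task);
            }
            // Non-file values (strings, numbers, ...) have no path to look up,
            // so their upstream task stays unknown in this scheme.
        }
        return result;
    }
}

class PathProvenanceSketch {
    public static void main(String[] args) {
        PathProvenance prov = new PathProvenance();
        prov.recordOutput("task-A", Paths.get("/work/ab/12/out.txt"));
        // The file input is resolved; the integer input is silently skipped.
        System.out.println(prov.upstreamOf(List.of(Paths.get("/work/ab/12/out.txt"), 42)));
        // A value-only input yields no upstream task at all.
        System.out.println(prov.upstreamOf(List.of("sample-1")));
    }
}
```

No wrapper objects are needed here, which is what makes this variant simpler, but any edge of the graph that carries only a plain value is lost.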

Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso pditommaso requested review from jorgee and bentsherman and removed request for jorgee January 5, 2025 13:06

netlify bot commented Jan 5, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 9e5bc10
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/678830b94204bb000805404b

@pditommaso pditommaso marked this pull request as draft January 5, 2025 13:22
@pditommaso
Member Author

All green!

@bentsherman
Member

Great, I will try to review this week.

@pditommaso
Member Author

pditommaso commented Jan 15, 2025

An update on the status of this PR. The following operators are fully supported:

  • branch
  • buffer
  • concat
  • collect
  • collectFile
  • combine
  • count
  • distinct
  • filter
  • first
  • flatMap
  • flatten
  • join
  • last
  • map
  • max
  • min
  • mix
  • mean
  • multiMap
  • reduce
  • take
  • toList
  • toSortedList
  • unique
  • sum

The most complex operators that remain to be supported are likely the splitter ones.

@bentsherman
Member

This is why we need fewer operators 😆

The splitter operators should work similarly to flatMap
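To illustrate the flatMap analogy, here is a small Java sketch in which one received value fans out into several outputs, each wrapped separately and tagged with the same operator run. It assumes a line-based split, and `SplitSketch`, `Out`, and `splitLines` are made-up names, not Nextflow classes.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    // Hypothetical wrapper pairing an emitted chunk with the run that produced it.
    static final class Out {
        final Object value;
        final String run;
        Out(Object value, String run) {
            this.value = value;
            this.run = run;
        }
    }

    // One input, many outputs: every chunk gets its own wrapper (its own identity),
    // and all of them point back to the same operator run.
    static List<Out> splitLines(String runId, String text) {
        List<Out> outputs = new ArrayList<>();
        for (String line : text.split("\n")) {
            outputs.add(new Out(line, runId));
        }
        return outputs;
    }
}
```

For example, `SplitSketch.splitLines("split-1", "a\nb\nc")` returns three wrappers that all report `split-1` as their producer, so a downstream task receiving any one chunk can still be linked back to the splitter run.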

@pditommaso
Member Author

I know, I know, but they exist
