Proposal: Improved `ProvenanceProfile` definition #2082

fmigneault · 2024-12-07T01:10:52Z

I would like to propose the following changes to the PROV utilities.

The motivation is that I am implementing PROV in crim-ca/weaver#778 so that OGC API - Processes can be extended to demonstrate PROV capabilities (see https://docs.ogc.org/DRAFTS/24-051.html#_requirements_class_provenance), building upon the great work from the CWL community. This would allow the geospatial community to improve the traceability and understanding of their processing pipelines.

However, while I'm able to enable the PROV features and get the resulting metadata, the generated tool and execution PROV files do not reflect what actually happened during the CWL workflow run. Nothing is reported about the remote server where the run takes place, the actual users employed by the worker instances crunching the data, weaver dispatching the workflow sequencing/resolution to cwltool, or any intermediate transformations from "geo data sources" to CWL-compatible inputs.

In the current state of the code, the ProvenanceProfile class is the one that modifies the PROV document (and the one I would need to extend with entities/agents/relationships). This class generates the resulting metadata entirely within the job execution, and is not easily accessible from "outside" the cwltool steps. The only interfaces through which I can access part of the references are LoadingContext.research_obj and RuntimeContext.research_obj (along with some other arguments like orcid).

Therefore, this PR delegates the creation of the ProvenanceProfile instance to LoadingContext, such that I can create a derived LoadingContext that extends the profile with definitions that are more aligned with weaver and cwltool working together. From the point of view of cwltool, the operations resolve exactly the same way as before.
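As a rough illustration of the delegation described above, the pattern could look like the sketch below. This is not the actual cwltool code: `SketchProfile`, `SketchLoadingContext`, `ServerLoadingContext`, `create_profile` and `run_job` are all hypothetical stand-ins chosen for the example, and the real class and method names in the PR may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchProfile:
    """Hypothetical stand-in for a provenance profile (not cwltool's class)."""

    agents: List[str] = field(default_factory=list)

    def add_agent(self, name: str) -> None:
        self.agents.append(name)


class SketchLoadingContext:
    """Hypothetical stand-in for a loading context that owns profile creation."""

    def create_profile(self) -> SketchProfile:
        # Default behaviour: only the workflow engine registers itself.
        profile = SketchProfile()
        profile.add_agent("cwltool")
        return profile


class ServerLoadingContext(SketchLoadingContext):
    """Derived context that extends the profile with hosting-service details."""

    def create_profile(self) -> SketchProfile:
        profile = super().create_profile()
        profile.add_agent("weaver")  # illustrative extra agent
        return profile


def run_job(context: SketchLoadingContext) -> SketchProfile:
    # The runner asks the context for the profile instead of building it itself,
    # so a derived context can change what ends up in the PROV document.
    return context.create_profile()
```

With the default context the runner behaves as before, while passing the derived context adds the extra agent without touching the runner itself.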

Let me know if you have any questions or if anything should be adjusted.


codecov · 2024-12-09T17:22:19Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.71%. Comparing base (d3c7bd5) to head (1d9c1a1).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2082      +/-   ##
==========================================
+ Coverage   84.20%   84.71%   +0.51%     
==========================================
  Files          46       46              
  Lines        8320     8323       +3     
  Branches     1961     1960       -1     
==========================================
+ Hits         7006     7051      +45     
+ Misses        838      804      -34     
+ Partials      476      468       -8


mr-c

@fmigneault I am very happy to see that you want to implement CWLProv on your engine, and help with refactoring this code is very welcome!

mr-c · 2024-12-11T15:35:51Z

@fmigneault Can you add some more unit tests to increase the test code coverage?


fmigneault · 2024-12-11T22:02:49Z

@mr-c
I've added some tests to validate that the provided provenance user is employed.
I'm not sure how to specifically test the override of the new functions without involving complicated mocks, but at least adding the provenance options should ensure the calls go through the conditions leading to their insertion in the provenance files.


fmigneault · 2024-12-12T03:51:08Z

Digging deeper into how the user provenance details were set with/without the --orcid option, I encountered a situation that I believe is a bug. Running a CommandLineTool directly without the --orcid flag but with --enable-user-provenance/--enable-host-provenance caused the user/host details to be omitted. The SingleJobExecutor explicitly sets host_provenance=False, user_provenance=False (to avoid duplicating them as a workflow step), but when the process is not a Workflow, the operation that sets them in the first place never occurred.
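To make the omission easier to follow, here is a minimal, self-contained sketch of the control flow as I understand it from the description above. It does not reproduce cwltool's real code: `requested_details`, `run` and the `fixed` switch are hypothetical names used only to contrast the behaviour before and after the fix.

```python
from typing import List


def requested_details(user: bool, host: bool) -> List[str]:
    """Return the provenance details enabled by the given flags."""
    details: List[str] = []
    if user:
        details.append("user")
    if host:
        details.append("host")
    return details


def run(is_workflow: bool, user: bool, host: bool, fixed: bool) -> List[str]:
    """Simulate which details end up recorded for one run (hypothetical flow)."""
    recorded: List[str] = []
    # Before the fix, only a Workflow recorded the requested details up front.
    # After the fix, a directly executed tool records them as well.
    if is_workflow or fixed:
        recorded += requested_details(user, host)
    # The per-job step always runs with both flags forced off to avoid
    # duplicates, so it never contributes user/host details on its own.
    recorded += requested_details(False, False)
    return recorded
```

In this sketch, a direct tool run with both flags enabled records nothing when `fixed` is false, and records both details when it is true, while the workflow case is unchanged.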

I took the opportunity to slightly refactor the user-provenance strategy, since the code was partially duplicated between the ProvenanceProfile and ResearchObject classes.

mr-c

I love it when additional testing leads to fixes! Thank you @fmigneault.

Shall I merge as 1 big commit, or do you want to rebase this as several clean commits?

fmigneault · 2024-12-12T19:37:35Z

A rebase is better IMO. I prefer to keep the edit history of the attempts performed.
However, GitHub reports conflicts that make the rebase impossible.
Any idea what to do?

extend CWLProv utilities

cfa534c

fmigneault added a commit to crim-ca/weaver that referenced this pull request Dec 7, 2024

adjust PROV for potential metadata updates - works for workflow, but not single CommandLineTool (relates to common-workflow-language/cwltool#2082)

28af8a5

Merge branch 'main' into cwl-prov-profile - fix merge conflict

5776956

fmigneault mentioned this pull request Dec 7, 2024

add CWL PROV support crim-ca/weaver#778 (open)

fix circular import

a57b2a4

fmigneault added 3 commits December 9, 2024 14:00

fix mypy typing

fe6b706

fix docs linting

f615403

more linting fixes

525c1cc

mr-c approved these changes Dec 11, 2024


Merge branch 'main' into cwl-prov-profile

acc89c7

mr-c enabled auto-merge (squash) December 11, 2024 15:35

mr-c disabled auto-merge December 11, 2024 15:35

fmigneault added 3 commits December 11, 2024 16:55

test extra provenance options and validate resolved agent/user association

a5f603e

Merge branch 'cwl-prov-profile' of github-perso:fmigneault/cwltool into cwl-prov-profile

5280368

fix linting

ea2a0b9

fmigneault added 2 commits December 11, 2024 22:37

test with/without orcid prov + fix missing user prov when running CommandLineTool directly

78ef52d

fix prov graph literal type value property

365b4f2

fmigneault requested a review from mr-c December 12, 2024 05:08

fmigneault mentioned this pull request Dec 12, 2024

add more W3C PROV details about process I/O crim-ca/weaver#780 (open, 2 tasks)

mr-c reviewed Dec 12, 2024


Merge branch 'main' into cwl-prov-profile

1d9c1a1
