Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Improved ProvenanceProfile definition #2082

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

fmigneault
Copy link
Contributor

@mr-c

I would like to propose the following changes to the PROV utilities.

The reasoning behind the requested changes is that I am in the process of implementing PROV in crim-ca/weaver#778 such that OGC API - Processes can be extended to demonstrate PROV capabilities (i.e.: https://docs.ogc.org/DRAFTS/24-051.html#_requirements_class_provenance), which can build upon the great work from the CWL community. This would allow the Geospatial community to improve traceability and understanding of their processing pipelines.

However, while I'm able to enable the PROV features and get the resulting metadata, I end up in a situation where the resulting tool and execution PROV files generated do not reflect the reality of what happened with the CWL workflow run, since all the details about the remote server where they are running, the actual users employed by worker instances crunching the data, weaver dispatching the workflow sequencing/resolution to cwltool, or any intermediate transformations from "geo data sources" to CWL-compatible inputs is not reported anywhere.

In the current code state, the ProvenanceProfile class is the one that modifies the PROV document (and with which I would need to extended entities/agents/relationships). This class generates the resulting metadata all within the job execution, and is not easily accessible from "outside" cwltool steps. The only interface that I can access part of the references is are the LoadingContext.research_obj and RuntimeContext.research_obj (along some other arguments like orcid).

Therefore, this PR delegates the creation of the ProvenanceProfile instance to LoadingContext, such that I can create a derived LoadingContext that extends the profile with definitions that are more aligned with weaver and cwltool working together. From the point of view of cwltool, the operations resolve exactly the same way as before.

Let me know if you have any question or if anything should be adjusted.

fmigneault added a commit to crim-ca/weaver that referenced this pull request Dec 7, 2024
Copy link

codecov bot commented Dec 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.71%. Comparing base (d3c7bd5) to head (1d9c1a1).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2082      +/-   ##
==========================================
+ Coverage   84.20%   84.71%   +0.51%     
==========================================
  Files          46       46              
  Lines        8320     8323       +3     
  Branches     1961     1960       -1     
==========================================
+ Hits         7006     7051      +45     
+ Misses        838      804      -34     
+ Partials      476      468       -8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@mr-c mr-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fmigneault I am very happy to see that you want to implement CWLProv on your engine, and help refactoring this code is very welcome!

@mr-c mr-c enabled auto-merge (squash) December 11, 2024 15:35
@mr-c
Copy link
Member

mr-c commented Dec 11, 2024

@fmigneault Can you add some more unit tests to increase the test code coverage?

@mr-c mr-c disabled auto-merge December 11, 2024 15:35
@fmigneault
Copy link
Contributor Author

@mr-c
I've added some tests to validate that provided provenance user is employed.
I'm not sure how to test specifically the override of the new functions without involving complicated mocks, but at least adding the provenance options should ensure the calls go through the conditions leading to their insertion in the provenance files.

@fmigneault
Copy link
Contributor Author

Digging deeper into how the user provenance details were set with/without the --orcid option, I actually encountered a situation that I believe is a bug. Running a CommandLineTool directly without the --orcid flag but with --enable-user-provenance/--enable-host-provenance caused the user/host details to actually be omitted, since host_provenance=False, user_provenance=False were explicitly set in the SingleJobExecutor (to avoid duplicating them as workflow step), but the operation to set them in the first place when it is not a Workflow never occurred.

I took the opportunity to refactor slightly the user-provenance strategy, since the code was partially duplicated between the ProvenanceProfile and the ResearchObject classes.

Copy link
Member

@mr-c mr-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love it when additional testing leads to fixes! Thank you @fmigneault .

Shall I merge as 1 big commit, or do you want to rebase this as several clean commits?

@fmigneault
Copy link
Contributor Author

fmigneault commented Dec 12, 2024

A rebase is better IMO. I prefer to keep the edit history of the attempts performed.
However, GitHub seems to say there are conflicts making the rebase not possible.
Any idea about what to do?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants