Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add job controller, job monitor and tools to support compute4punch backend #430

Merged
merged 1 commit into from
Jul 5, 2024

Conversation

giffels
Copy link

@giffels giffels commented Feb 12, 2024

This pull request contains a job controller, job monitor and tools to support the compute4punch backend.

  • Add support for token based ssh access motley_cue the Compute4PUNCH login node. The HELMHOLTZ_TOP token needs to be added as secret using the reana-client.
  • Add Compute4PUNCH job monitor.
  • Add configurations for the Compute4PUNCH backend.
  • Add diverse utilities to simply parsing of HTCondor´s condor_q and condor_history commands.
  • Add Compute4PUNCH job controller based on ssh calls to the Compute4PUNCH login node.

@giffels
Copy link
Author

giffels commented Feb 12, 2024

I opened this as a draft pull request in order to start a discussion with the maintainers.
In particular, I am interested in early feedback and how to specify specific resource needs of the job?

At the moment I have added two new arguments to the compute4punch job controller knowing that they need to be added to the reana-commons and reana-workflow-engine-* modules as well. Before doing this, I would like to check if this is the way to go or if there is already a mechanism about that?

I have found #32, but it seems that only htcondor_max_runtime and htcondor_accounting_group have been implemented so far?

@giffels giffels force-pushed the add/compute4punch-support branch 2 times, most recently from 54768b0 to dc0fc33 Compare February 16, 2024 10:40
@giffels giffels marked this pull request as ready for review February 21, 2024 12:29
@giffels
Copy link
Author

giffels commented Feb 21, 2024

I would to share with you, that the first successful REANA workflow completed on Compute4PUNCH.

Bildschirmfoto 2024-02-21 um 13 30 42

Copy link

codecov bot commented Feb 26, 2024

Codecov Report

Attention: Patch coverage is 7.56303% with 220 lines in your changes missing coverage. Please review.

Project coverage is 39.72%. Comparing base (32ce567) to head (5ad471b).

Current head 5ad471b differs from pull request most recent head 6e62daa

Please upload reports for the commit 6e62daa to get more accurate results.

❗ There is a different number of reports uploaded between BASE (32ce567) and HEAD (5ad471b). Click for more details.

HEAD has 4 uploads less than BASE
Flag BASE (32ce567) HEAD (5ad471b)
5 1
Additional details and impacted files

Impacted file tree graph

@@              Coverage Diff              @@
##           maint-0.9     #430      +/-   ##
=============================================
- Coverage      45.99%   39.72%   -6.27%     
=============================================
  Files             17       18       +1     
  Lines           1235     1470     +235     
=============================================
+ Hits             568      584      +16     
- Misses           667      886     +219     
Files Coverage Δ
reana_job_controller/config.py 100.00% <100.00%> (ø)
reana_job_controller/utils.py 28.69% <10.63%> (-12.49%) ⬇️
reana_job_controller/job_monitor.py 40.27% <11.66%> (-7.76%) ⬇️
reana_job_controller/compute4punch_job_manager.py 0.00% <0.00%> (ø)

... and 2 files with indirect coverage changes

Copy link
Member

@mdonadoni mdonadoni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @giffels, sorry for the delay!

I have left some comments after my first look at this PR, most of them are just due to my small knowledge of HTCondor and of Compute4PUNCH ;)

I still haven't looked at the token authentication part, though, and I haven't run the code yet.

I have found #32, but it seems that only htcondor_max_runtime and htcondor_accounting_group have been implemented so far?

Yes, those are the only ones as far as I can tell.
Regarding these new values:

  • c4p_cpu_cores
  • c4p_memory_limit
  • c4p_additional_requirements

Even though the first two could be harmonised with the ones for Kubernetes, I guess that for now these will work. We will also need to add them to the specification of reana.yaml.

@tiborsimko what do you think? Is it ok to keep these new values separate from the ones used for k8s jobs? See also this related comment: #32 (comment)

reana_job_controller/compute4punch_job_manager.py Outdated Show resolved Hide resolved
reana_job_controller/compute4punch_job_manager.py Outdated Show resolved Hide resolved
reana_job_controller/compute4punch_job_manager.py Outdated Show resolved Hide resolved
reana_job_controller/compute4punch_job_manager.py Outdated Show resolved Hide resolved
reana_job_controller/compute4punch_job_manager.py Outdated Show resolved Hide resolved
reana_job_controller/compute4punch_job_manager.py Outdated Show resolved Hide resolved
reana_job_controller/compute4punch_job_manager.py Outdated Show resolved Hide resolved
reana_job_controller/job_monitor.py Outdated Show resolved Hide resolved
@mdonadoni
Copy link
Member

mdonadoni commented Feb 26, 2024

One last thing: how hard would it be to have at least some small tests for these changes? Those are the only way for us to make sure that the code keeps working also in the future, as we do not have access on Compute4PUNCH and we will not be able to test the integration with REANA there.

@giffels
Copy link
Author

giffels commented Feb 27, 2024

One last thing: how hard would it be to have at least some small tests for these changes? Those are the only way for us to make sure that the code keeps working also in the future, as we do not have access on Compute4PUNCH and we will not be able to test the integration with REANA there.

Sure, it would be really useful to have test coverage for this. I will work on this.

@giffels giffels force-pushed the add/compute4punch-support branch 2 times, most recently from 515c974 to a244df3 Compare May 10, 2024 10:01
@giffels giffels changed the base branch from master to maint-0.9 May 10, 2024 10:02
@giffels giffels force-pushed the add/compute4punch-support branch from a244df3 to fa3ecab Compare May 10, 2024 15:16
setup.py Show resolved Hide resolved
reana_job_controller/utils.py Outdated Show resolved Hide resolved
reana_job_controller/utils.py Show resolved Hide resolved
@mdonadoni
Copy link
Member

Except for some comments that are left, I have not found issue with my local deployment of REANA, even when using HTCondor@CERN and Slurm@CERN, so these changes should not affect other REANA instances that do not use the C4P backend.

@giffels giffels force-pushed the add/compute4punch-support branch 4 times, most recently from 9b856fa to 52b903c Compare June 21, 2024 16:52
@tiborsimko
Copy link
Member

I have not found issue with my local deployment of REANA, even when using HTCondor@CERN and Slurm@CERN, so these changes should not affect other REANA instances that do not use the C4P backend.

Confirmed, just tried HTCondorCERN, and the paramiko version number change (that is coming with this PR) works perfectly well.

@giffels giffels force-pushed the add/compute4punch-support branch from 52b903c to 6e62daa Compare July 1, 2024 14:15
mdonadoni pushed a commit to giffels/reana-job-controller that referenced this pull request Jul 4, 2024
mdonadoni pushed a commit to giffels/reana-job-controller that referenced this pull request Jul 4, 2024
@mdonadoni mdonadoni merged commit 4243252 into reanahub:maint-0.9 Jul 5, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants