Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Have lsf_driver specify SIGKILL when bkilling #7433

Merged
merged 1 commit into from
Mar 21, 2024

Conversation

jonathan-eq
Copy link
Contributor

@jonathan-eq jonathan-eq commented Mar 12, 2024

This commit makes the lsf_driver use bkill with the SIGKILL signal instead of the default SIGINT.

Issue
Resolves #7443

Approach
Short description of the approach

(Screenshot of new behavior in GUI if applicable)

  • PR title captures the intent of the changes, and is fitting for release notes.
  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • Make sure tests pass locally (after every commit!)

When applicable

  • When there are user facing changes: Updated documentation
  • New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
  • Large PR: Prepare changes in small commits for more convenient review
  • Bug fix: Add regression test for the bug
  • Bug fix: Create Backport PR to latest release

@jonathan-eq jonathan-eq added the release-notes:skip If there should be no mention of this in release notes label Mar 12, 2024
@jonathan-eq jonathan-eq self-assigned this Mar 12, 2024
@berland
Copy link
Contributor

berland commented Mar 12, 2024

If the mother ERT process is terminated (say an eager user hits ctrl-c), this signal will be propagated and kill this subprocess, we might not want that, we want this script with bkills live happily independent of mother Ert.

I think you want to do as here:

preexec_fn=os.setpgrp,

in order to avoid this.

@jonathan-eq jonathan-eq force-pushed the improve-lsf-bkill branch 3 times, most recently from 2a2c1a7 to 57f9419 Compare March 13, 2024 14:58
@jonathan-eq jonathan-eq changed the title Have lsf_driver specify posix signal when bkilling Have lsf_driver specify SIGTERM when bkilling Mar 13, 2024
@codecov-commenter
Copy link

codecov-commenter commented Mar 13, 2024

Codecov Report

Attention: Patch coverage is 0% with 11 lines in your changes are missing coverage. Please review.

Project coverage is 84.52%. Comparing base (e4d25a1) to head (9d23320).
Report is 2 commits behind head on main.

Files Patch % Lines
src/clib/lib/job_queue/lsf_driver.cpp 0.00% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7433      +/-   ##
==========================================
- Coverage   85.25%   84.52%   -0.74%     
==========================================
  Files         387      387              
  Lines       23055    23069      +14     
  Branches      881      887       +6     
==========================================
- Hits        19655    19498     -157     
- Misses       3290     3464     +174     
+ Partials      110      107       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jonathan-eq
Copy link
Contributor Author

Should I change this in the job queue driver too?

@berland
Copy link
Contributor

berland commented Mar 14, 2024

Should I change this in the job queue driver too?

Yes I think that makes sense.

@berland berland removed the release-notes:skip If there should be no mention of this in release notes label Mar 14, 2024
@jonathan-eq jonathan-eq force-pushed the improve-lsf-bkill branch 3 times, most recently from 26a5ede to 0c4c73a Compare March 14, 2024 09:21
@jonathan-eq jonathan-eq added the release-notes:bug-fix Automatically categorise as bug fix in release notes label Mar 14, 2024
@jonathan-eq
Copy link
Contributor Author

I marked it as a bug fix because bkill uses different signals than qdel, and the users got confused because it showed as KeyboardInterrupt in the logs.

@jonathan-eq jonathan-eq requested a review from berland March 14, 2024 10:42
@jonathan-eq jonathan-eq force-pushed the improve-lsf-bkill branch 2 times, most recently from 00a0d67 to d3c6027 Compare March 14, 2024 12:57
@pinkwah
Copy link
Contributor

pinkwah commented Mar 14, 2024

This is not the correct behaviour. There needs to be a SIGKILL after a while, otherwise a hanging job will hang forever taking up resources.

@berland
Copy link
Contributor

berland commented Mar 14, 2024

This is not the correct behaviour. There needs to be a SIGKILL after a while, otherwise a hanging job will hang forever taking up resources.

Do we expect the job to react differently to SIGTERM than SIGINT? The job can choose to ignore both. I am thinking that changing to SIGTERM solves the confusion for the users, and that sending an additional SIGKILL later is a new feature.

@jonathan-eq
Copy link
Contributor Author

I updated this to do SIGKILL instead of SIGTERM.

@jonathan-eq jonathan-eq force-pushed the improve-lsf-bkill branch 2 times, most recently from 03806fb to bd1fea1 Compare March 18, 2024 08:04
@jonathan-eq jonathan-eq changed the title Have lsf_driver specify SIGTERM when bkilling Have lsf_driver specify SIGKILL when bkilling Mar 18, 2024
Copy link
Contributor

@xjules xjules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks good now! Good job @jonathan-eq ! 🚀

@berland
Copy link
Contributor

berland commented Mar 18, 2024

Have you performed a manual test of the "kill all realizations" button in the GUI with this change? (or some other real-world test of the changed C-code?)

@jonathan-eq
Copy link
Contributor Author

Have you performed a manual test of the "kill all realizations" button in the GUI with this change? (or some other real-world test of the changed C-code?)

It does indeed NOT work.

 SIGKILL 785865: Illegal signal value
 SIGKILL 785866: Illegal signal value
 SIGKILL 785867: Illegal signal value
 SIGKILL 785868: Illegal signal value
 SIGKILL 785869: Illegal signal value
 SIGKILL 785870: Illegal signal value
 SIGKILL 785871: Illegal signal value
 SIGKILL 785872: Illegal signal value
 SIGKILL 785873: Illegal signal value

@jonathan-eq jonathan-eq force-pushed the improve-lsf-bkill branch 6 times, most recently from dd507da to 9d23320 Compare March 19, 2024 13:17
@jonathan-eq
Copy link
Contributor Author

Have you performed a manual test of the "kill all realizations" button in the GUI with this change? (or some other real-world test of the changed C-code?)

It does indeed NOT work.

 SIGKILL 785865: Illegal signal value
 SIGKILL 785866: Illegal signal value
 SIGKILL 785867: Illegal signal value
 SIGKILL 785868: Illegal signal value
 SIGKILL 785869: Illegal signal value
 SIGKILL 785870: Illegal signal value
 SIGKILL 785871: Illegal signal value
 SIGKILL 785872: Illegal signal value
 SIGKILL 785873: Illegal signal value

Fixed it!!

Copy link
Contributor

@xjules xjules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good! 🚀

@jonathan-eq jonathan-eq merged commit a71a6df into equinor:main Mar 21, 2024
38 checks passed
@jonathan-eq jonathan-eq deleted the improve-lsf-bkill branch March 21, 2024 06:52
eivindjahren pushed a commit to eivindjahren/ert that referenced this pull request Apr 3, 2024
Have lsf_driver specify SIGKILL signal when using bkill
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes:bug-fix Automatically categorise as bug fix in release notes
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

bkill uses SIGINT as first signal
5 participants