Kill descendant processes in core.direct schedulers plugin #6572
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6572      +/-   ##
==========================================
+ Coverage   77.51%   77.86%   +0.35%
==========================================
  Files         560      566       +6
  Lines       41444    42094     +650
==========================================
+ Hits        32120    32771     +651
+ Misses       9324     9323       -1
Thanks @agoscinski, really fast in hunting bugs :)
I've left a minor comment.
In any case, it would be nice to add some regression tests.
process_ids.extend([str(child.pid) for child in children])
process_ids_str = ' '.join(process_ids)

submit_command = f'kill {process_ids_str}'
Just a side note:
I've encountered cases where kill PID silently returns without actually killing the job.
I would suggest handling this scenario: if the PID still exists after sending the command kill PID, report it with a log message.
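The check suggested above could look roughly like the following. This is a minimal sketch, not code from this PR: the helper names `pid_exists` and `warn_if_still_alive`, the grace period and the log message are all hypothetical, and it assumes a POSIX system, where sending signal 0 with `os.kill` only tests whether the process exists.

```python
import logging
import os
import time

LOGGER = logging.getLogger(__name__)


def pid_exists(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def warn_if_still_alive(pid, grace_period=1.0, interval=0.05):
    """Poll for up to ``grace_period`` seconds; log a warning if the PID survives.

    Returns True if the process is still alive afterwards, False otherwise.
    """
    deadline = time.monotonic() + grace_period
    while time.monotonic() < deadline:
        if not pid_exists(pid):
            return False
        time.sleep(interval)
    if pid_exists(pid):
        LOGGER.warning('process %s still exists after kill was sent', pid)
        return True
    return False
```

Note that a killed child that has not yet been reaped by its parent (a zombie) still counts as existing for this check, so a caller that spawned the process itself should wait on it first.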
In this case kill should return a nonzero exit code, which should then be caught here:
def _parse_kill_output(self, retval, stdout, stderr):
As you can see in the usage of this function in the bash scheduler (which is the base class of the direct one):
aiida-core/src/aiida/schedulers/plugins/bash.py
Lines 73 to 74 in f575916
retval, stdout, stderr = self.transport.exec_command_wait(self._get_kill_command(jobid))
return self._parse_kill_output(retval, stdout, stderr)
Thanks @agoscinski. The tests seem to be hanging, so those need to be fixed. I also have a few comments.
def _get_kill_command(self, jobid):
"""Return the command to kill the job with specified jobid."""
submit_command = f'kill {jobid}'
def _get_kill_command(self, process_id):
By changing jobid to process_id you broke the log line on line 370. Either keep it as jobid or adapt the other lines that reference it accordingly. This would be a breaking change, but since it is an internal method it is OK to change.
Thanks, I think I overdid it by changing the name. The rest of the code refers to this as the job id. I reverted to job id and added documentation to make it clearer.
# get a list of the process id of all descendants
process = Process(int(process_id))
children = process.children(recursive=True)
process_ids = [process_id]
I think you should cast to str here explicitly to be safe. Before, it was used in an f-string, which casts automatically, but now you are passing it to ' '.join(), which will fail if the elements are not all strings.
process_ids = [process_id]
process_ids = [str(process_id)]
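To illustrate the point about `' '.join()`: an f-string formats any object, whereas `str.join` raises a `TypeError` as soon as one element is not a string. The snippet below is a small standalone sketch; the helper name `join_pids` and the example PIDs are made up for illustration.

```python
def join_pids(pids):
    """Join PIDs into one space-separated string, casting each to str first."""
    return ' '.join(str(pid) for pid in pids)


# Without the cast, a single int in the list makes join() fail.
try:
    ' '.join([1234, '1235'])
except TypeError as exc:
    error = str(exc)
else:
    error = None

# With the cast, mixed int/str input is handled uniformly.
joined = join_pids([1234, '1235'])
```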
right thanks, added
process_ids.extend([str(child.pid) for child in children])
process_ids_str = ' '.join(process_ids)

submit_command = f'kill {process_ids_str}'
Might as well take the opportunity to fix the variable name
submit_command = f'kill {process_ids_str}'
kill_command = f'kill {process_ids_str}'
added
Seems this PR solves issue #3776 (cannot choose from the left panel, idk why)
Not sure about that. The issue is about killing AiiDA child processes, whereas this PR deals with system subprocesses of a
75dafe4 to d846146
Rename jobid to process id Update src/aiida/schedulers/plugins/direct.py Update src/aiida/schedulers/plugins/direct.py
d846146 to 731ad7c
@pytest.mark.timeout(timeout=10)
def test_kill_job(scheduler):
The test I added increases the testing time on my machine by roughly 1.7 s (the direct scheduler tests previously took 2.3 s). Spawning and killing processes takes some time, hence the while loops. I have already set the start method to spawn so that fewer resources are used. I am not sure how else to decrease the run time of the test.
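For reference, the kind of bounded while loop mentioned here can be factored into a small polling helper, so that the test waits only as long as needed and still stops at a deadline. This is a generic sketch, not the code in the PR's test; the name `wait_until` and the default timings are assumptions.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll ``predicate`` until it returns True or ``timeout`` seconds pass.

    Returns True as soon as the predicate holds, False if the deadline is hit.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    # One last check so a condition met right at the deadline is not missed.
    return bool(predicate())
```

A short polling interval keeps the added test time close to the actual spawn and kill latency, while the timeout guards against the test hanging.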
Proposal to solve #6571
In the direct scheduler we use psutil to obtain a list of descendant processes so that we can kill all of them. This issue does not occur with the other schedulers, as the job scheduler takes care of this. Here we have to manage the killing of the descendants ourselves.
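The approach described above can be sketched as two small standalone functions. This is an illustration assembled from the snippets discussed in the review, not the merged implementation: the function names `descendant_pids` and `build_kill_command` are hypothetical, and returning an empty list when the process no longer exists is an assumption of this sketch. It uses `psutil.Process(...).children(recursive=True)`, as in the diff.

```python
def descendant_pids(jobid):
    """Return the PIDs of all descendants of the process ``jobid``.

    Requires the third-party ``psutil`` package, imported lazily so that
    ``build_kill_command`` below can be used without it.
    """
    import psutil

    try:
        process = psutil.Process(int(jobid))
    except psutil.NoSuchProcess:
        # Assumption of this sketch: a vanished process has no descendants.
        return []
    return [child.pid for child in process.children(recursive=True)]


def build_kill_command(jobid, child_pids=()):
    """Build one ``kill`` command for the job process and its descendants."""
    process_ids = [str(jobid)]
    process_ids.extend(str(pid) for pid in child_pids)
    return f"kill {' '.join(process_ids)}"
```

A caller would combine the two as `build_kill_command(jobid, descendant_pids(jobid))`, so that a single `kill` invocation targets the parent and every child.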