-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Let LSF driver fall back to bhist for not-found jobs in bjobs #7299
Conversation
d29b6b8
to
cf1ab74
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #7299 +/- ##
==========================================
+ Coverage 85.20% 85.23% +0.03%
==========================================
Files 387 387
Lines 23070 23121 +51
Branches 885 892 +7
==========================================
+ Hits 19656 19707 +51
- Misses 3302 3307 +5
+ Partials 112 107 -5 ☔ View full report in Codecov by Sentry. |
f81887b
to
a9af502
Compare
f4aba23
to
d7acfc7
Compare
src/ert/scheduler/lsf_driver.py
Outdated
if missing_in_bjobs_output: | ||
logger.warning( | ||
f"bjobs did not give status for job_ids {missing_in_bjobs_output}" | ||
logger.debug(f"bhist is needed for job ids: {missing_in_bjobs_output}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"bhist is used for missing job ids" ?
src/ert/scheduler/lsf_driver.py
Outdated
f"bjobs did not give status for job_ids {missing_in_bjobs_output}" | ||
logger.debug(f"bhist is needed for job ids: {missing_in_bjobs_output}") | ||
|
||
bhist_states = await self._poll_once_by_bhist(missing_in_bjobs_output) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, can the if statement be moved out from the function?:
if not missing_job_ids:
Since the function title is misleading then. Even combine it with the if
above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a commit now that tries this, it might look slightly better.
Path("mock_jobs/pendingtimemillis").write_text("100") | ||
driver._poll_period = 0.01 | ||
await driver.submit(0, "sh", "-c", "sleep 1") | ||
await poll(driver, {0}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given my comment above we would be able to assert that _poll_once_by_bhist
was called once or ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think I want to keep that as an implementation detail, this test should not care as long as await poll
is working fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @berland ! Just had a small comment about the way you call _poll_once_by_bhist
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good now @berland ! 🚀
Whenever jobs for which we need running states for are missing in the output from bjobs, it can mean that the information has fallen out of the bjobs cache on the LSF server. A scenario where this can occur is that if Ert for some reason is hanging for a long time without polling, or maybe for other reasons on the LSF side. Any time bjobs misses information, one call to `bhist` is done, and its output is stored in the object. The first time bhist is called, the driver is not able to determine any states, but given that the next bjobs call also will miss a certain job id, the subsequent bhist call will be able to deduce the running state by comparing the timing values in the first and the second bhist call. This method is not able to catch job that has failed, it will be marked as done.
Issue
Resolves #7208
Approach
Replicate the functionality of the C-driver, but with a different approach. We rely that the polling will
happen repeatedly and will only provide JobStates from bhist polling when bhist has been polled earlier.
When applicable