Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let LSF driver fall back to bhist for not-found jobs in bjobs #7299

Merged
merged 1 commit into from
Apr 2, 2024

Conversation

berland
Copy link
Contributor

@berland berland commented Feb 29, 2024

Issue
Resolves #7208

Approach
Replicate the functionality of the C-driver, but with a different approach. We rely that the polling will
happen repeatedly and will only provide JobStates from bhist polling when bhist has been polled earlier.

  • PR title captures the intent of the changes, and is fitting for release notes.
  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • Make sure tests pass locally (after every commit!)

When applicable

  • When there are user facing changes: Updated documentation
  • New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
  • Large PR: Prepare changes in small commits for more convenient review
  • Bug fix: Add regression test for the bug
  • Bug fix: Create Backport PR to latest release

@berland berland self-assigned this Feb 29, 2024
@berland berland force-pushed the bhist branch 3 times, most recently from d29b6b8 to cf1ab74 Compare March 15, 2024 14:58
@codecov-commenter
Copy link

codecov-commenter commented Mar 18, 2024

Codecov Report

Attention: Patch coverage is 88.33333% with 7 lines in your changes are missing coverage. Please review.

Project coverage is 85.23%. Comparing base (ef492f0) to head (4581480).
Report is 4 commits behind head on main.

Files Patch % Lines
src/ert/scheduler/lsf_driver.py 88.33% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7299      +/-   ##
==========================================
+ Coverage   85.20%   85.23%   +0.03%     
==========================================
  Files         387      387              
  Lines       23070    23121      +51     
  Branches      885      892       +7     
==========================================
+ Hits        19656    19707      +51     
- Misses       3302     3307       +5     
+ Partials      112      107       -5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@berland berland force-pushed the bhist branch 3 times, most recently from f81887b to a9af502 Compare March 22, 2024 13:14
@berland berland force-pushed the bhist branch 3 times, most recently from f4aba23 to d7acfc7 Compare March 25, 2024 20:25
@berland berland marked this pull request as ready for review March 25, 2024 20:30
@berland berland added scheduler release-notes:skip If there should be no mention of this in release notes labels Mar 25, 2024
if missing_in_bjobs_output:
logger.warning(
f"bjobs did not give status for job_ids {missing_in_bjobs_output}"
logger.debug(f"bhist is needed for job ids: {missing_in_bjobs_output}")
Copy link
Contributor

@xjules xjules Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"bhist is used for missing job ids" ?

f"bjobs did not give status for job_ids {missing_in_bjobs_output}"
logger.debug(f"bhist is needed for job ids: {missing_in_bjobs_output}")

bhist_states = await self._poll_once_by_bhist(missing_in_bjobs_output)
Copy link
Contributor

@xjules xjules Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, can the if statement be moved out from the function?:

if not missing_job_ids:

Since the function title is misleading then. Even combine it with the if above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a commit now that tries this, it might look slightly better.

Path("mock_jobs/pendingtimemillis").write_text("100")
driver._poll_period = 0.01
await driver.submit(0, "sh", "-c", "sleep 1")
await poll(driver, {0})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given my comment above we would be able to assert that _poll_once_by_bhist was called once or ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think I want to keep that as an implementation detail, this test should not care as long as await poll is working fine.

Copy link
Contributor

@xjules xjules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @berland ! Just had a small comment about the way you call _poll_once_by_bhist.

Copy link
Contributor

@xjules xjules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good now @berland ! 🚀

Whenever jobs for which we need running states for are
missing in the output from bjobs, it can mean that the information
has fallen out of the bjobs cache on the LSF server. A scenario where
this can occur is that if Ert for some reason is hanging for a long time
without polling, or maybe for other reasons on the LSF side.

Any time bjobs misses information, one call to `bhist` is done, and its
output is stored in the object. The first time bhist is called, the
driver is not able to determine any states, but given that the next
bjobs call also will miss a certain job id, the subsequent bhist call
will be able to deduce the running state by comparing the timing values
in the first and the second bhist call.

This method is not able to catch job that has failed, it will be marked
as done.
@berland berland enabled auto-merge (rebase) April 2, 2024 13:37
@berland berland merged commit 1c1badd into equinor:main Apr 2, 2024
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes:skip If there should be no mention of this in release notes
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Support bhist when bjobs fails in LSF
4 participants