-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Integrate Fluxion with flux-core's resource module #675
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Avoid any name collision with the new resource module within flux-core.
Add resource_interface_t in preparation for integration with the resource.acquire interface. Manage a flux future object corresponding to the resource.acquire RPC.
Perform the initial resource graph data store population when the first response of resource.acquire arrives. Update the resource graph whenever subsequence responses arrive. Introduce a structure to ultimately map the keys from the responses of resource.acquire to "grow", "shrink", "up" and "down" -- to reduce churns in the future. The target update semantics is the following. - grow: grow the resource graph by attaching the acquired subgraphs to the overall graph and the resulting resource graph becomes the basis for satisfiability - shrink: shrink the resource graph by detaching the subgraphs from the graph and the resulting resource graph becomes the basis for satisfiability - up/down: mark the resource subgraph up or down. The resources that are marked down will not be used for further scheduling. However, the state changes due to the marking do not affect satisfiability. Currently there are a few deficiencies that prevent this commit from implementing this semantics fully: - resource.acquire currently reports "administrative node exclusion" as "down". Since Fluxion can't tell the difference between an exclusion event and other types of real down events that shouldn't affect satisfiability, satisfiability semantics will not be completely correct when a node is administratively excluded at runtime via resource config file reloading. - vertex/edge removal required for shrink requires more investigations due to Boost Graph Library limitation. - mark() method within traverser has not yet been implemented so marking is currently NOOP.
Introduce a simple class that wraps and manages a flux_msg_t object in preparation for future synchronization with sched-fluxion-qmanager. Add a std::map of message wrapper objects to keep track of the clients who want to get notified of resource state changing events.
Introduce sched-fluxion-resource.notify to provide one or more clients resource state changing events including elastic and up/down events.
Codecov Report
@@ Coverage Diff @@
## master #675 +/- ##
==========================================
+ Coverage 75.09% 75.21% +0.12%
==========================================
Files 78 78
Lines 8166 8412 +246
==========================================
+ Hits 6132 6327 +195
- Misses 2034 2085 +51
Continue to review full report at Codecov.
|
Send sched-fluxion-resource.notify streaming RPC to get synchronized with sched-fluxion-resource for resource acquision and state change events. Delay its handshaking protocol with job-manager until receiving the first response of sched-fluxion-resource.notify. Upon receiving each response, run schedule loop as it represents a scheduleable resource event. Stop the reactor and module exit when receiving the ECANCELED response of sched-fluxion-resource.notify, which occurs when sched-fluxion-resource is unloaded.
Add flux module remove sched-simple for fluxion resource rc1 script. fluxion resource requires the resource.acquire RPC which conflicts with sched-simple's use of it.
Update test cases as the resource integration created more complex dependencies between resource, sched-simple, sched-fluxion-resource and sched-fluxion-qmanager. Add proper unloading order for sched-fluxion-resource and sched-fluxion-qmanager. Now that qmanager requires a response from sched-fluxion-resource.notify, unloading qmanager fails if sched-fluxion-resource is unloaded first. Rename test rc scripts (just symlinks to the number prefixed fluxion rc scripts) Remove custom, on-the-fly-generated rc support for t1005-qmanager-conf.t. Fluxion depends on flux-core's resource module which has a complicated dependencies on other flux-core modules, it didn't seem to make a whole lot of sense to create test-only rc scripts like what was embedded in t1005-qmanager-conf.t. Instead we now use the default rc scripts which should have all of the module dependencies set correctly.
Add tests to verify the new system meets the requirements w/ respect to specific load ordering and synchronization between the core resource, sched-simple, sched-fluxion-resource and sched-fluxion-qmanager.
Merged
The state of this PR was somewhat messed up so I closed this. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR is experimental and only for early reviews: It needs PR #674, #665 and #667 to be merged first before making further progress.
Integrate Fluxion with flux-core's resource module.
sched-fluxion-resource
: perform the initial resource graph data store population when the first response of resource.acquire arrives. Update the resource graph whenever subsequence responses arrive. Its semantics and limitations are described in commit 1be7d51sched-fluxion-qmanager
: Send sched-fluxion-resource.notify streaming RPC to get synchronized withsched-fluxion-resource
for resource acquision and state change events. Delay its handshaking protocol with job-manager until receiving the first response ofsched-fluxion-resource.notify
. Upon receiving each response, run schedule loop as it represents a scheduleable resource event. Stop the reactor and module exit when receiving the ECANCELED response ofsched-fluxion-resource.notify
, which occurs when sched-fluxion-resource is unloaded.Adjust test cases for newly created dependencies among flux-core's
resource
,sched-simple
and fluxion's two service modules.The high-level interaction diagram is available at #662 (comment).