qmanager to integrate with the new exec system #481
Codecov Report
```diff
@@           Coverage Diff            @@
##           master     #481    +/-   ##
=========================================
- Coverage   76.31%   74.89%    -1.43%
=========================================
  Files          45       60      +15
  Lines        5535     6074     +539
=========================================
+ Hits         4224     4549     +325
- Misses       1311     1525     +214
=========================================
```
@SteVwonder, @garlick, and maybe @grondo: this PR has reached a reasonable stopping point for review. The test coverage went down a bit because this is intermediate work ahead of our end-of-July milestone; as such, it contains a fair amount of placeholder code.
Just some general comments, initially:
Thank you for the quick review. Note that this is preliminary work, so some of the logic is only a placeholder. More will come as part of the next sprint.
I can certainly do this. I was initially a bit ambivalent because resource can be used as a standalone service independent of qmanager. I will create an issue to have a bit more discussion. But it doesn't have to be part of this PR, does it?
Yes. This will be done as part of the end-of-July milestone, and it will be further utilized for resilience even later, as logged in a scheduler-resiliency ticket.
Same as above.
Ok. Good feedback. I will change some of the messages to LOG_DEBUG as part of this PR.
Yes. I had some manual testing. And the test case demonstrates this.
I will double check. I generally don't want to raise an exception unless necessary. And I think qmanager also needs a top-level exception catch clause. (Will add that.)
That sounds like the right thing! All good on filling in stuff and considering the name later as far as I'm concerned.
Thanks @garlick. I should say that the new RPC-based scheduler interface and infrastructure made my job far easier! Great work.
I am using the STL `std::map` and a few of its methods. For now, I just used the default comparator (`std::less<Key>`), and the calls are guarded with a top-level `catch (std::exception &e) { }` clause.
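For illustration, here is a minimal sketch (hypothetical, not this PR's code) of such a top-level catch clause in a Flux module entry point, so that exceptions thrown by STL containers never unwind across the C module boundary:

```cpp
#include <stdexcept>
#include <flux/core.h>

// Hypothetical sketch: a top-level catch clause in a module entry point
// keeps C++ exceptions from STL calls from crossing the C boundary.
extern "C" int mod_main (flux_t *h, int argc, char **argv)
{
    try {
        // ... register message handlers, then run the reactor ...
        if (flux_reactor_run (flux_get_reactor (h), 0) < 0)
            return -1;
        return 0;
    } catch (std::exception &e) {
        flux_log_error (h, "%s", e.what ());
        return -1;
    }
}
```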
Add initial support for the high-level resource API. Two use cases:

1. The future queue manager support will need to interact with both match RPCs (when used in a service module) and CLIs (when used in a command-line-based tester). We plan to hide this software complexity by making the queue manager class templated and instantiating it with different resource API types (module vs. CLI).
2. @cmisale's project needs to layer our resource infrastructure with Go, as required by Kubernetes. The current low-level C++ API set doesn't do a good job with this case; high-level APIs with C bindings will significantly help with that effort.

The hlapi/bindings/c++ directory contains the main code. So that it can be used with templated classes, our C++ API is a header-file-only solution. The hlapi/bindings/c directory contains the C APIs, which are simply wrappers around the C++ APIs; while the APIs are C, the implementation is C++. This adds the necessary structure and API definitions for both module APIs (with RPCs) and CLIs, but we only have the module implementation. IOW, we only have placeholders for the CLI APIs in both the C and C++ bindings.
Implemented by @garlick.
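To illustrate the "C API, C++ implementation" layering described in the commit message above, here is a hedged sketch; all names are illustrative, not the actual hlapi symbols:

```cpp
/* reapi.h -- hypothetical C interface around a C++ resource API */
#ifdef __cplusplus
extern "C" {
#endif

typedef struct reapi_ctx reapi_ctx_t;   /* opaque handle hides the C++ type */

reapi_ctx_t *reapi_ctx_create (void);
int reapi_match_allocate (reapi_ctx_t *ctx, const char *jobspec);
void reapi_ctx_destroy (reapi_ctx_t *ctx);

#ifdef __cplusplus
}
#endif

/* reapi.cpp -- the API is C, but the implementation is C++ */
#include <new>
#include <string>

struct reapi_ctx {
    std::string last_jobspec;   /* stand-in for the real C++ state */
};

extern "C" reapi_ctx_t *reapi_ctx_create (void)
{
    return new (std::nothrow) reapi_ctx ();
}

extern "C" int reapi_match_allocate (reapi_ctx_t *ctx, const char *jobspec)
{
    if (!ctx || !jobspec)
        return -1;
    ctx->last_jobspec = jobspec;   /* real code would call into the C++ API */
    return 0;
}

extern "C" void reapi_ctx_destroy (reapi_ctx_t *ctx)
{
    delete ctx;
}
```

The opaque-handle pattern is what lets a Go (cgo) or Kubernetes-side consumer link against a plain C ABI without ever seeing C++ types.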
OK. Rebased onto upstream/master and then pushed.
```diff
+std::map<uint64_t, flux_jobid_t>::iterator queue_policy_base_impl_t::
```
Can you add a comment that the return value is the next element in the queue? It makes sense after looking up the return semantics of `std::map::erase`, but being mostly unfamiliar with the C++ STL, this behavior was initially surprising to me.

Good point. I will add a comment there. Incorrectly removing an element while iterating over an STL container is a common programming mistake, and these semantics come in handy for implementing it correctly.
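For readers unfamiliar with the idiom, here is a small self-contained sketch (not the PR's code) of the erase-while-iterating pattern that `std::map::erase`'s return value enables:

```cpp
#include <cstdint>
#include <map>

// Erase every entry matching "target" while iterating: std::map::erase
// returns an iterator to the element following the erased one, so the
// loop never dereferences an invalidated iterator.
void remove_all (std::map<uint64_t, uint64_t> &queue, uint64_t target)
{
    for (auto it = queue.begin (); it != queue.end (); ) {
        if (it->second == target)
            it = queue.erase (it);  // advance via erase's return value
        else
            ++it;
    }
}
```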
Thanks @dongahn. Generally LGTM! A few in-line comments below.
```cpp
std::map<uint64_t, flux_jobid_t>::iterator queue_policy_base_impl_t::
```
Can you add a comment that the return value is the next element in the queue on success (and the current element when unsuccessful)? It makes sense after looking up the return semantics of `std::map::erase`, but being mostly unfamiliar with the C++ STL, this behavior was initially surprising to me.
Great point! I somehow responded to this via an email. Will do.
> I somehow responded to this via an email.

That was my fault. I accidentally made it a top-level comment, hit delete, and then re-did it as a review comment. I forgot that comments spawn emails. Sorry about that.
```cpp
 * Boolean indicating if you want to use the
 * allocated job queue or not. This affects the
 * alloced_pop method.
 * \return 0 on success; -1 on error.
```
In one of the implementations, `run_sched_loop` returns `rc1 + rc2`. Maybe update this documentation to say `< 0` on error?
Yes, I will do this. But like I said before, this is a loose end to tie up in the next step. Sorry for the somewhat WIP nature of this, but I favor dealing with two-week sprints as opposed to month-long sprints :-/
Oh, one other question I had while reading through the code: it wasn't clear to me why all of the interfaces are duplicated for the CLI support. It'll probably be clear once an implementation is filled in there, but in the meantime, can you briefly explain your plans for the CLI piece? Will it still use RPCs etc. to interface with the resource module?
@SteVwonder: I pushed some more commits to address all of your review comments. This revision has changes for both your comments and @garlick's. So, if Travis turns green and you are okay with the latest changes, I will squash the latest commits, and then this PR should be good to go. Thanks!
Add the base queue-policy interface in qmanager/policies/base (Flux::queue_manager namespace). Add two policy source files (FCFS and EASY) into qmanager/policies in the Flux::queue_manager::detail namespace, and provide an implementation of the FCFS queuing policy. These are header-file-only solutions because some of the core classes are templated with respect to high-level resource API types.

Also make the value type of rank within the rlite writer a string to match the RV1 spec.

Also add qmanager support to the sched sharness script.

Add support for the upcoming RFC 14 & 24 revisions, and adjust resource's traverser accordingly.
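As a rough illustration of the design described in these commits (names simplified and hypothetical, not the actual class definitions), a header-only policy class templated on the resource API type might be shaped like this:

```cpp
#include <cstdint>
#include <map>

namespace Flux {
namespace queue_manager {
namespace detail {

// Hypothetical sketch: the policy is templated on a resource API type so
// the same header-only code can be instantiated with a module (RPC)
// backend or a CLI backend.
template<class reapi_t>
class queue_policy_fcfs_t {
public:
    void insert (uint64_t t_queued, uint64_t jobid)
    {
        m_pending.insert ({t_queued, jobid});
    }
    int run_sched_loop (void *h)
    {
        // FCFS: walk jobs in queuing-time order and stop at the first job
        // that cannot be allocated. reapi_t::match_allocate is an assumed
        // static method provided by each resource API binding.
        for (auto &kv : m_pending) {
            if (reapi_t::match_allocate (h, kv.second) < 0)
                break;
        }
        return 0;
    }
private:
    std::map<uint64_t, uint64_t> m_pending; // keyed by queuing time
};

} // namespace detail
} // namespace queue_manager
} // namespace Flux
```

The module build would then instantiate something like `queue_policy_fcfs_t<reapi_module_t>`, and a command-line tester `queue_policy_fcfs_t<reapi_cli_t>` (again, assumed names).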
Thanks @dongahn! LGTM.
OK. I squashed those later commits. Once Travis turns green, this should be good to merge, @SteVwonder. Thanks.
@dongahn, I didn't have any real comments on quick perusal besides a couple log messages that probably aren't needed and might clog up the logs.
```cpp
if (flux_msg_get_userid (msg, &userid) < 0)
    return;

flux_log (h, LOG_INFO, "alloc requested by user (%u).", userid);
```
I'd probably remove this informational message before merging. Since all alloc requests will come from the instance owner, this message will be the same for every job request.
```cpp
if (flux_msg_get_userid (msg, &userid) < 0)
    return;

flux_log (h, LOG_INFO, "free requested by user (%u).", userid);
```
Similar to above, this information message will likely not add much to the logs.
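(For reference, the alternative to removing these messages, demoting them to debug level as discussed earlier in the thread, is a one-argument change:)

```cpp
// Demote the per-job message from LOG_INFO to LOG_DEBUG so it only
// appears at debug verbosity instead of cluttering the default logs.
flux_log (h, LOG_DEBUG, "free requested by user (%u).", userid);
```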
@grondo: Thanks. Actually, I was concerned about these per-job messages myself, and I plan to go over the logging messages in a later PR across both qmanager and resource. Given that this is an intermediate PR, can it go in as is, with a later PR addressing this issue for all?
Fine with me, I didn't consider them mandatory changes.
Thanks @grondo and @SteVwonder!
Yeah! Thanks!
This PR has the following:

- A `run_sched_loop` interface. Basic queuing operations are already supported by the base classes using C++ STL containers. To make certain queuing operations efficient, I decided not to use `std::list` but `std::map` keyed by monotonically increasing queuing time (see the sketch below).

Resolves issues #480, #468, #471, #483, and #477.
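A small sketch of that trade-off (assumed types, not the PR's exact code): keying a `std::map` by a monotonically increasing queuing time preserves FCFS order under iteration while allowing O(log n) removal by key, which a `std::list` cannot offer:

```cpp
#include <cstdint>
#include <map>

// Pending jobs keyed by monotonically increasing queuing time: iteration
// yields FCFS order, and removal by key is O(log n) instead of the O(n)
// scan a std::list would require.
std::map<uint64_t, uint64_t> pending;   // t_queued -> jobid

void enqueue (uint64_t t_queued, uint64_t jobid)
{
    pending.insert ({t_queued, jobid});
}

bool dequeue (uint64_t t_queued)
{
    return pending.erase (t_queued) == 1;   // O(log n) lookup + erase
}
```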