
qmanager to integrate with the new exec system #481

Merged: 15 commits merged into flux-framework:master from the qmanager branch on Jul 10, 2019

Conversation

@dongahn (Member) commented Jun 28, 2019

This PR does the following:

  • Define the high-level resource API so that our resource infrastructure can be better used by other module users and CLI users. This will also facilitate the creation of bindings for other languages (such as Go).
  • Implement this API for module users.
  • Incorporate schedutil from flux-core into flux-sched.
  • Introduce the new queuing policy interface. I expect that many classical queuing policies can be implemented by deriving from this class and overriding its run_sched_loop interface. Basic queuing operations are already supported by the base classes using C++ STL containers. To make certain queuing operations efficient, I decided to use std::map keyed by monotonically increasing queuing time instead of std::list (see the sketch after this list).
  • Implement the FCFS policy derived class and add a skeleton EASY policy class with no implementation as a placeholder. Note that the queuing policy interface is designed to be used by future CLI users as well as module users -- useful for testing.
  • Use these primitives to implement the first baseline version of qmanager, which provides the scheduling-loop service integrating both the new execution system within flux-core and the resource match service within flux-sched.
  • Fix a few RV1 compatibility issues, including a libjobspec fix.
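A minimal sketch of the std::map-based queuing idea described above (illustrative only; pending_queue_t and its members are hypothetical names, not the actual flux-sched classes):

#include <cstdint>
#include <map>

// Keying the pending queue by a monotonically increasing "queuing time"
// keeps jobs in arrival (FCFS) order while still allowing O(log n)
// removal of an arbitrary job by key -- something std::list cannot offer.
class pending_queue_t {
public:
    uint64_t insert (uint64_t jobid) {
        uint64_t t = m_tick++;          // key encodes arrival order
        m_pending[t] = jobid;
        return t;
    }
    int remove (uint64_t t) {           // e.g., on job cancellation
        return m_pending.erase (t) ? 0 : -1;
    }
    bool pop_front (uint64_t &jobid) {  // next job to schedule
        if (m_pending.empty ())
            return false;
        jobid = m_pending.begin ()->second;
        m_pending.erase (m_pending.begin ());
        return true;
    }
private:
    uint64_t m_tick = 0;
    std::map<uint64_t, uint64_t> m_pending;  // queuing time -> jobid
};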

Resolves issues #480, #468, #471, #483, and #477.

@dongahn changed the title from "WIP: skeleton qmanager to integrate the scheduler with resource" to "WIP: skeleton qmanager to integrate the scheduler with new exec system" on Jun 28, 2019
@codecov-io commented Jun 28, 2019

Codecov Report

Merging #481 into master will decrease coverage by 1.42%.
The diff coverage is 61.79%.


@@            Coverage Diff             @@
##           master     #481      +/-   ##
==========================================
- Coverage   76.31%   74.89%   -1.43%     
==========================================
  Files          45       60      +15     
  Lines        5535     6074     +539     
==========================================
+ Hits         4224     4549     +325     
- Misses       1311     1525     +214
Impacted Files Coverage Δ
resource/hlapi/bindings/c++/reapi_cli_impl.hpp 0% <0%> (ø)
qmanager/policies/queue_policy_easy_impl.hpp 0% <0%> (ø)
qmanager/policies/queue_policy_easy.hpp 0% <0%> (ø)
qmanager/policies/queue_policy_fcfs.hpp 100% <100%> (ø)
resource/libjobspec/jobspec.hpp 100% <100%> (ø) ⬆️
qmanager/policies/queue_policy_fcfs_impl.hpp 100% <100%> (ø)
qmanager/policies/base/queue_policy_base.hpp 100% <100%> (ø)
resource/writers/match_writers.cpp 93.71% <100%> (+0.11%) ⬆️
resource/traversers/dfu_impl.hpp 100% <100%> (ø) ⬆️
src/common/libschedutil/free.c 100% <100%> (ø)
... and 26 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3ed3da7...a906913.

@dongahn force-pushed the qmanager branch 2 times, most recently from 888b3fe to d35cf64, on July 4, 2019 09:24
@dongahn changed the title from "WIP: skeleton qmanager to integrate the scheduler with new exec system" to "qmanager to integrate with the new exec system" on Jul 4, 2019
@dongahn force-pushed the qmanager branch 5 times, most recently from 5a74cb9 to a805ed2, on July 4, 2019 18:58
@dongahn requested review from SteVwonder, garlick and grondo on July 4, 2019 19:13
@dongahn (Member Author) commented Jul 4, 2019

@SteVwonder, @garlick and maybe @grondo: this PR has reached a reasonable stopping point for review. The test coverage went down a bit because this is intermediate work ahead of our end-of-July milestone and, as such, contains a fair amount of placeholder code.

@garlick (Member) commented Jul 4, 2019

Just some general comments, initially:

  • You have two modules named qmanager and resource. Should they be named with a common prefix to indicate that they are part of a single scheduler implementation, like sched-full-qmanager and sched-full-resource? (substitute something more creative and cool sounding for "full" :-)
  • Your qmanager jobmanager_hello_cb needs to be filled in so that resources that are already allocated when the module loads can be marked allocated.
  • Your qmanager jobmanager_exception_cb needs to be filled in so that flux job cancel works for jobs that have an outstanding alloc request.
  • Functions in qmanager.cpp look a bit chatty, with LOG_INFO messages for every message handled. Perhaps LOG_DEBUG?
  • Have you tried running this end to end with flux-core?
  • I see a ctx->queue->insert (job) call that isn't checked for failure. What happens to the broker if this hits a C++ exception, or am I missing how exceptions are handled?

@dongahn (Member Author) commented Jul 4, 2019

Thank you for the quick review. Note that this is preliminary work, so some of the logic is only a placeholder. More will come as part of the next sprint.

> You have two modules named qmanager and resource. Should they be named with a common prefix to indicate that they are part of a single scheduler implementation, like sched-full-qmanager and sched-full-resource? (substitute something more creative and cool sounding for "full" :-)

I can certainly do this. I was initially a bit ambivalent because resource can be used as a standalone service independent of qmanager. I will create an issue to have a bit more discussion. But it doesn't have to be a part of this PR, does it?

> Your qmanager jobmanager_hello_cb needs to be filled in so that resources that are already allocated when the module loads can be marked allocated.

Yes. This will be done as part of the end-of-July milestone, and it will be further utilized for resilience, as logged in a scheduler resiliency ticket, later on.

> Your qmanager jobmanager_exception_cb needs to be filled in so that flux job cancel works for jobs that have an outstanding alloc request.

Same as above.

> Functions in qmanager.cpp look a bit chatty, with LOG_INFO messages for every message handled. Perhaps LOG_DEBUG?

Ok. Good feedback. I will change some of the messages to LOG_DEBUG as part of this PR.

> Have you tried running this end to end with flux-core?

Yes, I did some manual testing, and the test case demonstrates this.

> I see a ctx->queue->insert (job) call that isn't checked for failure. What happens to the broker if this hits a C++ exception, or am I missing how exceptions are handled?

I will double-check. I generally don't want to raise an exception unless necessary. And I think qmanager also needs a top-level exception catch clause. (Will add that.)

@garlick (Member) commented Jul 4, 2019

> I think qmanager also needs a top-level exception catch clause. (Will add that.)

That sounds like the right thing!

All good on filling in stuff and considering the name later as far as I'm concerned.

@dongahn (Member Author) commented Jul 4, 2019

Thanks @garlick. I should say that the new RPC-based scheduler interface and infrastructure made my job far easier! Great work.

@dongahn (Member Author) commented Jul 5, 2019

> I will double-check. I generally don't want to raise an exception unless necessary. And I think qmanager also needs a top-level exception catch clause. (Will add that.)

I am using an STL map and a few of its methods: erase, insert, find and empty. The only methods that can throw an exception are erase and find when called with a key (rather than an iterator) as the argument, and even then they merely propagate an exception thrown by the comparator.

Now, I just use the default comparator, std::less<Key>, which doesn't throw exceptions. So all of these should be pretty much exception-free. But there could be std::bad_alloc exceptions and the like, which can be thrown at object-creation time, so I will wrap the entire mod_main with

try { /* ... mod_main body ... */ }
catch (std::exception &e) {
    // log and fail gracefully rather than crash the broker
}
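For concreteness, a minimal sketch of such a top-level catch, assuming the standard flux-core module entry point (the body shown is only a placeholder):

#include <exception>
#include <flux/core.h>

extern "C" int mod_main (flux_t *h, int argc, char **argv)
{
    try {
        // create the qmanager context, register the callbacks,
        // and run the reactor ...
        return 0;
    }
    catch (std::exception &e) {
        // keep a stray exception (e.g., std::bad_alloc) from
        // unwinding into the broker
        flux_log_error (h, "%s", e.what ());
        return -1;
    }
}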

dongahn added 4 commits July 8, 2019 14:20
Add initial support for the high-level resource API.
Two use cases:

1. The future queue manager support will need to
interact with both match RPCs (when used in a service module)
and CLIs (when used in a command-line-based tester).
We plan to hide this software complexity by
making the queue manager class templated and
instantiating it with different resource API
types (module vs. CLI).

2. @cmisale's project needs to layer our resource
infrastructure with Go, as required by Kubernetes.
The current low-level C++ API set doesn't serve
this case well. Instead, high-level APIs
with C bindings will significantly help with
that effort.

The hlapi/bindings/c++ directory contains the main
code. So that it can be used with templated classes,
our C++ API is a header-file-only solution.

The hlapi/bindings/c directory contains the C APIs.
They are simply wrappers around the C++ APIs: while
the APIs themselves are C, their implementation is
C++ (a sketch of this wrapping pattern follows below).

While this adds the necessary structure and API
definitions for module APIs (with RPC) and CLIs,
we only have the module implementation. In other
words, we only have placeholders for the CLI APIs
in both the C and C++ bindings.
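As an illustration of the wrapping pattern described above (all names here are hypothetical, not the actual hlapi symbols): the C-visible API hands out an opaque handle whose implementation is a C++ object.

#include <new>

class reapi_module_t {                 // C++ implementation class
public:
    int match_allocate (const char *jobspec) {
        // would issue a match RPC to the resource module ...
        return jobspec ? 0 : -1;
    }
};

struct reapi_handle_t {                // opaque struct for C callers
    reapi_module_t impl;
};

extern "C" reapi_handle_t *reapi_create (void)
{
    return new (std::nothrow) reapi_handle_t ();
}

extern "C" int reapi_match_allocate (reapi_handle_t *h, const char *jobspec)
{
    if (!h)
        return -1;
    return h->impl.match_allocate (jobspec);
}

extern "C" void reapi_destroy (reapi_handle_t *h)
{
    delete h;
}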
@dongahn (Member Author) commented Jul 8, 2019

OK. Rebased onto upstream/master and then pushed.

@dongahn (Member Author) commented Jul 8, 2019 via email

@SteVwonder (Member) left a comment

Thanks @dongahn. Generally LGTM! A few in-line comments below.

}


std::map<uint64_t, flux_jobid_t>::iterator queue_policy_base_impl_t::
SteVwonder (Member):
Can you add a comment noting that the return value is the next element in the queue on success (and the current element when unsuccessful)? It makes sense after looking up the return semantics of std::map::erase, but being mostly unfamiliar with the C++ STL, this behavior was initially surprising to me.
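For reference, a minimal illustration of the std::map::erase return semantics in question (C++11 and later; a standalone example, not the flux-sched code):

#include <cstdint>
#include <map>

int main ()
{
    std::map<uint64_t, uint64_t> q { {0, 100}, {1, 101}, {2, 102} };
    // erase (iterator) returns an iterator to the element following
    // the erased one, so a caller can keep walking the queue safely.
    auto next = q.erase (q.find (1));
    return (next->first == 2) ? 0 : 1;  // next now points at key 2
}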

dongahn (Member Author):
Great point! I somehow responded to this via an email. Will do.

@SteVwonder (Member) commented Jul 9, 2019:
> I somehow responded to this via an email.

That was my fault. I accidentally posted it as a top-level comment, hit delete, and then re-did it as a review comment. I forgot that comments spawn emails. Sorry about that.

* Boolean indicating if you want to use the
* allocated job queue or not. This affects the
* alloced_pop method.
* \return 0 on success; -1 on error.
SteVwonder (Member):
In one of the implementations, run_sched_loop returns rc1 + rc2. Maybe update this documentation to say < 0 on error?

dongahn (Member Author):
Yes, I will do this. But as I said before, this is a loose end to tighten in the next step. Sorry for the somewhat WIP nature of this, but I favor two-week sprints over month-long sprints :-//
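For illustration, the return-value documentation above might be revised along these lines (suggested wording only):

 * \return 0 on success; < 0 on error. Implementations may
 *         accumulate per-queue return codes (e.g., rc1 + rc2),
 *         so callers should test for rc < 0 rather than rc == -1.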

@SteVwonder (Member) commented:

Oh, one other question I had while reading through the code: the need for duplicating all of the interfaces for CLI support wasn't clear to me. It'll probably become clear once there is an implementation filled out, but in the meantime, can you briefly explain your plans for the CLI piece? Will it still use RPCs etc. to interface with the resource module, or are you planning on running the resource-query (or similar) tool(s) as subprocesses of the qmanager?

@dongahn (Member Author) commented Jul 9, 2019

@SteVwonder: I pushed some more commits to address all of your review comments. This revision has changes addressing both your comments and @garlick's. So, if Travis turns green and you are okay with the latest changes, I will squash the latest commits, and then this PR should be good to go. Thanks!

dongahn added 11 commits July 9, 2019 17:05
Add the base queue-policy interface in qmanager/policies/base.
(Flux::queue_manager namespace).

Add two policy source files (FCFS and EASY)
into qmanager/policies in the Flux::queue_manager::detail
namespace. Provide an implementation for the FCFS queuing policy.

These are header-file-only solutions because some
of the core classes are templated with respect to the
high-level resource API types (see the sketch after
these commit notes).
Also make the value type of rank within the rlite writer
a string, to match the RV1 spec.
Also add qmanager support to the sched sharness script.
Add support for the upcoming RFC 14 & 24 revisions.

Adjust resource's traverser accordingly.
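A minimal sketch of the templating pattern that forces the header-file-only layout mentioned above (hypothetical names; the real classes live in qmanager/policies):

// Because the policy class is templated on the high-level resource
// API type, its full definition must live in a header so it can be
// instantiated for each API flavor (module RPCs vs. CLI).
template<class reapi_t>
class queue_policy_fcfs_t {
public:
    int run_sched_loop (void *h) {
        // walk the pending queue in FCFS order, requesting matches
        // through whichever resource API this class was built with ...
        return reapi_t::match_allocate (h, "jobspec placeholder");
    }
};

// One possible instantiation target (illustrative):
struct reapi_module_t {
    static int match_allocate (void *h, const char *jobspec) {
        return (h && jobspec) ? 0 : -1;  // would send an RPC in reality
    }
};

using module_fcfs_t = queue_policy_fcfs_t<reapi_module_t>;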
@SteVwonder (Member) left a comment

Thanks @dongahn! LGTM.

@dongahn (Member Author) commented Jul 10, 2019

OK. I squashed those later commits. Once Travis turns green, this is good to merge, @SteVwonder. Thanks.

@grondo (Contributor) left a comment

@dongahn, I didn't have any real comments on a quick perusal, besides a couple of log messages that probably aren't needed and might clog up the logs.

if (flux_msg_get_userid (msg, &userid) < 0)
return;

flux_log (h, LOG_INFO, "alloc requested by user (%u).", userid);
grondo (Contributor):

I'd probably remove this informational message before merging. Since all alloc requests will come from the instance owner, this message will be the same for every job request.

if (flux_msg_get_userid (msg, &userid) < 0)
return;

flux_log (h, LOG_INFO, "free requested by user (%u).", userid);
grondo (Contributor):

Similar to the above, this informational message will likely not add much to the logs.

@dongahn (Member Author) commented Jul 10, 2019

@grondo: Thanks. Actually, I was concerned about these per-job messages myself. I plan to go over the logging messages across both qmanager and resource in a later PR. Given that this is an intermediate PR, can this go in as is, with a later PR addressing the issue for all of them?

@grondo (Contributor) commented Jul 10, 2019

Fine with me, I didn't consider them mandatory changes.

@dongahn (Member Author) commented Jul 10, 2019

Thanks @grondo and @SteVwonder!

@SteVwonder SteVwonder merged commit ef69e63 into flux-framework:master Jul 10, 2019
@dongahn (Member Author) commented Jul 10, 2019

Yeah! Thanks!
