Integrate resource-match with the new scheduler interface #468

Closed

dongahn opened this issue Jun 11, 2019 · 17 comments
Comments

dongahn (Member) commented Jun 11, 2019

Had a discussion with @grondo yesterday. Though the scheduler interface isn't where he wants it to be, we thought it would be a good idea to start doing the integration work sooner rather than later. Once #467 is merged, I plan to take a crack at this.

dongahn (Member, Author) commented Jun 11, 2019

@grondo: could you post some pointers as to which files I should start taking a look at? Thanks.

grondo (Contributor) commented Jun 11, 2019

The helper functions for scheduler integration were developed by @garlick and can be found in src/common/libschedutil.

For example uses, check out src/modules/sched-simple and t/job-manager/sched-dummy.c.

We had planned to abstract libschedutil into a scheduler-specific module interface to simplify scheduler development; however, that work has been pushed off due to other priorities. For now it might be easiest to just copy libschedutil into flux-sched. Once we transition to a more polished scheduler interface, we could remove that convenience library from flux-sched.
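If it helps, here is a rough, untested sketch of the underlying handshake those helpers wrap: a module that registers handlers for the job-manager's alloc and free requests. The topic strings and the empty response payloads below are assumptions for illustration only; check sched-simple and sched-dummy.c for the authoritative usage.

    /* Rough sketch only -- not sched-simple's actual code.  The topic
     * strings and empty response payloads are assumptions; libschedutil
     * wraps this request/response handshake for you.
     */
    #include <flux/core.h>

    static void alloc_cb (flux_t *h, flux_msg_handler_t *mh,
                          const flux_msg_t *msg, void *arg)
    {
        /* Decode the request, run the match/allocate logic, then respond.
         * (The real response carries an allocation payload, omitted here.)
         */
        if (flux_respond (h, msg, NULL) < 0)
            flux_log_error (h, "alloc: flux_respond");
    }

    static void free_cb (flux_t *h, flux_msg_handler_t *mh,
                         const flux_msg_t *msg, void *arg)
    {
        /* Release the job's resources, then respond. */
        if (flux_respond (h, msg, NULL) < 0)
            flux_log_error (h, "free: flux_respond");
    }

    static const struct flux_msg_handler_spec htab[] = {
        { FLUX_MSGTYPE_REQUEST, "sched.alloc", alloc_cb, 0 },
        { FLUX_MSGTYPE_REQUEST, "sched.free",  free_cb,  0 },
        FLUX_MSGHANDLER_TABLE_END,
    };

    int mod_main (flux_t *h, int argc, char **argv)
    {
        flux_msg_handler_t **handlers = NULL;
        int rc = -1;

        if (flux_msg_handler_addvec (h, htab, NULL, &handlers) < 0)
            return -1;
        rc = flux_reactor_run (flux_get_reactor (h), 0);
        flux_msg_handler_delvec (handlers);
        return rc;
    }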

dongahn (Member, Author) commented Jun 14, 2019

Thanks @grondo. This will be my next priority.

dongahn (Member, Author) commented Jun 16, 2019

I looked at some of the suggested files. They look great. My feeling, though, is that the scheduler loop service in flux-sched will have to be much more complex, and we will need to manage this complexity very carefully. Note that the original sched was pretty complex, with various scheduler-parameter and queueing-policy variations (plus an embedded emulator, which isn't an issue for this round), and we should use that experience to design this better. My proposal is to make a top-level subdirectory called sched, in which we use a strategy similar to resource:

  • Build up abstractions using C++ classes such as class sched_t (see the sketch after this list);
  • Build a command-line utility in sched/utilities (e.g., % sched>) that uses these abstractions so that we can test and debug more comprehensively from the CLI (as resource-query does for resource matching);
  • Build a sched-loop module as a thin layer atop the sched classes in sched/modules.
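To sketch what I have in mind (hypothetical code only; none of these class or method names exist in flux-sched yet, sched_t included), the abstraction layer could look roughly like this:

    // Hypothetical sketch only -- class and method names are placeholders,
    // not existing flux-sched interfaces.
    #include <cstdint>
    #include <memory>
    #include <string>
    #include <utility>

    struct job_t {
        uint64_t id;
        std::string jobspec;
    };

    // A queueing policy (e.g., fcfs, backfill) would be a pluggable strategy.
    class queue_policy_t {
    public:
        virtual ~queue_policy_t () = default;
        virtual int insert (std::shared_ptr<job_t> job) = 0;
        virtual int run_sched_loop () = 0;  // invoke resource matching per policy
    };

    // sched_t ties a queue policy to the resource-match infrastructure; it
    // would be wrapped both by a CLI utility (sched/utilities) and by the
    // thin sched-loop broker module (sched/modules).
    class sched_t {
    public:
        explicit sched_t (std::unique_ptr<queue_policy_t> policy)
            : m_policy (std::move (policy)) {}
        int alloc (std::shared_ptr<job_t> job) { return m_policy->insert (job); }
        int schedule () { return m_policy->run_sched_loop (); }
    private:
        std::unique_ptr<queue_policy_t> m_policy;
    };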

I also talked with @garlick and @grondo and I will copy libschedutil from flux-core to flux-sched in the corresponding location.

dongahn (Member, Author) commented Jun 16, 2019

@garlick and @grondo: I assume you won't have queueing policies other than fcfs at the job-manager level, correct? So essentially alloc requests will be issued in fcfs order from the job-manager (though priority and such can change this order).

This is fine, but I thought I should double-check.

When users want any out-of-order policy, like backfilling, at the flux-sched level, I assume sched will have to use "unlimited" to replicate the entire job queue.

Also, initially, the scheduler loop trigger events will be "alloc" and "free" only, since we have no resource events yet (e.g., additional resources joining, or some resources detected to be down and/or excluded).

garlick (Member) commented Jun 16, 2019

Correct on both counts. If this turns out to be too simplistic, let's talk.

dongahn (Member, Author) commented Jun 17, 2019

Functionality-wise, this seems okay.

So I will start to design based on these assumptions. If this turns out to be a problem, I will call for a discussion.

BTW, there are things that the current interface solves pretty nicely for me: no need for finite state machines, no need to deal with individual events, and an easy-to-implement resilience scheme.

But I realize I will probably still have to implement performance-optimization techniques like queue depth and delay scheduling at the scheduler level. This is fine.

But I will see if there are opportunities to implement those at the core level, where they could benefit all schedulers.
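To illustrate the queue-depth idea (again, a hypothetical sketch, not existing flux-sched code), the scheduler loop would only consider the first N pending jobs per pass instead of walking the whole queue:

    // Hypothetical sketch of queue-depth limiting -- not existing flux-sched code.
    #include <cstddef>
    #include <deque>

    struct pending_job_t { unsigned long long id; };

    // Stand-in for a call into resource-match; always fails here for illustration.
    static bool try_match (const pending_job_t &)
    {
        return false;
    }

    // Consider at most 'queue_depth' jobs per scheduling loop so that one
    // pass over a huge queue does not dominate scheduling latency.
    void run_sched_loop (std::deque<pending_job_t> &pending, std::size_t queue_depth)
    {
        std::size_t considered = 0;
        for (auto it = pending.begin ();
             it != pending.end () && considered < queue_depth; ++considered) {
            if (try_match (*it))
                it = pending.erase (it);  // allocated: remove from pending queue
            else
                ++it;                     // skip; a backfill policy may keep going
        }
    }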

SteVwonder (Member) commented

But I realize I will probably still have to implement performance optimization techniques like queue depth

Yeah. It would be awesome if the sched / job-manager handshake could be extended beyond just 1 and unlimited queue-depth to also support an arbitrary N.

dongahn (Member, Author) commented Jun 26, 2019

@garlick or @SteVwonder: I see from sched-dummy.c that the module load option is now --opt=ABC using optparse, as opposed to opt=ABC, which I have been using.

Did we decide to require this style of option passing for modules at this point across the board? I am implementing this part of the new qmanager service and couldn't remember which format is our requirement.

dongahn (Member, Author) commented Jun 26, 2019

@garlick: My rc1 script for qmanager currently fails because flux-core loads sched-simple by default. I can get around that by unloading sched-simple, if present, before loading qmanager in its rc1 script. Does that sound like a reasonable short-term solution?

For the long haul, though, it seems we would need a way to query whether a conflicting module has been loaded so that, if so, it can be unloaded first.
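Concretely, something like this in qmanager's rc1 is what I have in mind (the grep check is just one way to detect whether the module is loaded):

    # rc1 sketch: make way for qmanager if sched-simple is already loaded
    if flux module list | grep -q sched-simple; then
        flux module remove sched-simple
    fi
    flux module load qmanager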

garlick (Member) commented Jun 27, 2019 via email

dongahn (Member, Author) commented Jun 28, 2019

@garlick: A quick question. When you submit a jobspec with a 1h duration at this point, like

flux job submit test.t60.json

and the scheduler responds to the alloc request, does the job-manager issue the free request right away? I am seeing my free callback being called, but I wasn't sure whether this is because I'm doing something wrong or it is just expected.

dongahn (Member, Author) commented Jun 28, 2019

@garlick: also I unload sched-simple in my rc1 script for qmanager, but I'm getting the following error when I exit out of my flux instance.

2019-06-28T06:41:10.936714Z broker.err[0]: rc3: flux-module: cmb.rmmod[0] sched-simple: No such file or directory
flux-broker: module 'qmanager' was not cleanly shutdown

Any insight?

garlick (Member) commented Jun 28, 2019

Did we decide to require this style of option passing for modules at this point across the board? I am implementing this part of the new qmanager service and couldn't remember which format is our requirement.

As discussed in the meeting, it's not required, but it is easier. If using optparse, just watch out: in modules, argv[0] is the first argument, not argv[1] as in a normal program (so you need to pass argv - 1, argc + 1).
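Something along these lines (the --policy option is made up for illustration; the argc/argv shift is the part to note):

    /* Sketch of option parsing in a module's mod_main using flux-core's
     * liboptparse.  The "policy" option is hypothetical; the key detail is
     * shifting argc/argv so optparse sees the "program name" slot it expects.
     */
    #include <flux/core.h>
    #include <flux/optparse.h>

    static struct optparse_option opts[] = {
        { .name = "policy", .key = 'p', .has_arg = 1, .arginfo = "NAME",
          .usage = "Select queueing policy (illustrative option)" },
        OPTPARSE_TABLE_END,
    };

    int mod_main (flux_t *h, int argc, char **argv)
    {
        optparse_t *p;
        const char *policy;

        if (!(p = optparse_create ("qmanager")))
            return -1;
        if (optparse_add_option_table (p, opts) != OPTPARSE_SUCCESS)
            goto error;
        /* Module argv[0] is the first argument (there is no program-name
         * slot), hence argc + 1 / argv - 1 as noted above.
         */
        if (optparse_parse_args (p, argc + 1, argv - 1) < 0)
            goto error;
        policy = optparse_get_str (p, "policy", "fcfs");
        flux_log (h, LOG_INFO, "queue-policy=%s", policy);
        optparse_destroy (p);
        return 0;
    error:
        optparse_destroy (p);
        return -1;
    }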

Does the job-manager issue the free request right away?

After execution completes, and execution always completes quickly because the actual launch isn't implemented yet. There is a way to simulate execution of the full duration (with a sleep in the exec system); see t2400-job-exec-test.t.

I unload sched-simple in my rc1 script for qmanager, but I'm getting the following error when I exit out of my flux instance.

Modules are normally loaded in rc1 and unloaded in rc3, so maybe you need to provide an rc3 script also? They are not automatically unloaded.
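In other words, something like this pairing (module name per your qmanager work; illustrative only):

    # rc1: load the scheduler module
    flux module load qmanager

    # rc3: unload it again at shutdown
    flux module remove qmanager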

dongahn (Member, Author) commented Jun 28, 2019

Modules are normally loaded in rc1 and unloaded in rc3, so maybe you need to provide an rc3 script also? They are not automatically unloaded.

I do have one. I will take a look at it again, though.

BTW, if the module (sched-simple) loaded by its rc1 was unloaded by others (as in this case), presumably doing another unload by its rc3 script wouldn't lead to this error, would it?

dongahn (Member, Author) commented Jun 28, 2019

After execution completes, and execution always completes quickly because the actual launch isn't implemented yet. There is a way to simulate execution of the full duration (with a sleep in the exec system); see t2400-job-exec-test.t.

This should be very useful!

dongahn (Member, Author) commented Jul 11, 2019

PR #481 resolved this.

dongahn closed this as completed Jul 11, 2019