
Revise Sched's find() and select() operations to leverage visitor/matcher design patterns #193

Closed
lipari opened this issue Sep 13, 2016 · 18 comments
lipari commented Sep 13, 2016

Spinning off the issue from the discussions in PR 189.

@dongahn said:

I believe providing matcher and visitor patterns from resrc and friends can make the resrc APIs easier to use and more robust, and make them more adaptable when @trws's job spec design comes along, as mentioned in this comment.

ROSE's AstPrePostProcessing class is one example. It is an abstract class that defines preOrderVisit and postOrderVisit interfaces for an Abstract Syntax Tree (AST). One can override (C++ lingo -- meaning, implement) these interfaces in an AST analysis class to plug the analysis into the system. This class then takes care of the AST traversal and lets the user-defined analysis be invoked on pre-visit and post-visit during a depth-first traversal.

I actually used this interface for the STAT project. Example code can be found in STAT's TemporalOrderAnalysisAPI in our internal bitbucket repo (here). I used this pattern to implement a way to identify all of the lexical scopes in a C/C++ source file that contain a specific program point or points. This static information, along with the runtime stack trace, then allowed me to temporally order points of execution among MPI processes (i.e., determining relative progress).

Our case can be viewed in a similar way. Our scheduling plugins traverse each resrc hierarchy and act on the vertices they want to operate on, but plugins shouldn't need to implement raw traversal logic themselves. Since our language is C, we can't use OOP concepts directly, but a callback that gets invoked on each node visit can bring similar benefits.

If we abstract out data structures and visitor/matcher patterns from resource details (e.g., node type, PDU type, switch type, etc.), a scheduler plugin writer can focus on scheduling operations like matched/unmatched and allocate/deallocate, while leaving the traversal across potentially many different resource hierarchies to the visitor pattern provider. Good separation of concerns.

If we were to do this, I would see if we can turn some of the tree traversing logic in our current scheduler plugin to use such visitor/matcher patterns.

BTW, Clang also has Matcher patterns here, though I don't have hands-on experience with them.

@trws said:

Clang matchers are, for the record, one of the most awesome features ever written for manipulating C/C++.

Anyway, actual on-topic comment: the classic twalk-style pattern could work well for this, but I'd probably want to simplify it to a DFS-only walk where the function pointer's return value tells the walker whether to process children, skip children and continue, or abort completely. I say DFS specifically because a DFS from the root of the hardware tree, restricted to the hardware tree, can most rapidly identify branches to skip in that ordering, IIRC.

lipari mentioned this issue Sep 13, 2016

dongahn commented Sep 16, 2016

[Rough Idea]

Wondering if we can use a one-pass algorithm based on matcher/visitor patterns to merge find() and select() (also related to Issue #184)... Looking at find and select through a scalability lens, roughly speaking we have to traverse the entire tree once for find and then again for select. There are cutoff tricks, but the complexity looks that way to me. If we can design visitor patterns as well as transactional aggregate updates, I am guessing an efficient one-pass algorithm is possible, with much unnecessary traversal pruned.

Rough sketch:

  • A scheduler plugin implements a function that can perform as a matcher and scorer on a resrc object on each visit:

    int match_and_score (resrc *in)
    
  • There is a generic priority queue that the scheduler can use within this match function

  • The scheduler calls a traversal API with the matcher passed in

    int (*match_and_score_f) (resrc *) = match_and_score;
    postorder_walk (root, match_and_score_f);
    
  • postorder_walk will apply the matcher to each resrc object to be evaluated.

    • On a visit, the matcher first looks at the aggregates and uses them to decide whether to descend further, telling the walker via a return code
    • On a visit, the target resrc object is scored according to the scheduler's goal. Here, a flow-aware matcher can also evaluate all flow objects if the resrc object has a link to a flow.
    • If the object matches, the matcher pushes it into a priority queue with its evaluation score as the priority.
  • Once this pruned walk is done, the resulting priority queue will contain all of the qualified resources ordered by match score

  • The scheduler can either allocate or reserve the top resrc and update the aggregates accordingly.

  • If the objects in the priority queue are not enough, the backfill scheduler will apply this algorithm to all of the resources used by the job that will complete earliest. Repeat until all resources are reserved...


trws commented Sep 16, 2016

That sounds good, Dong, and is rather like what I was getting at, though much more fleshed out. The only nit I have is that it would be more efficient to use a preorder rather than a postorder walk because of the way exclusivity works in our resource model. If it's preorder, you can look at the exclusive flag on a resource to determine whether it's worth looking at any of its children, while a postorder walk would have already visited the children before seeing the parent. This is also quite easy to parallelize across threads without races, except on the matches construct, which would be pretty easy to protect.


dongahn commented Sep 16, 2016

Yeah. It should have been preorder. That order has the pruning benefit from the perspective of both exclusivity and aggregate evaluation.


dongahn commented Sep 16, 2016

Also, a get_all_ancestors ("flow name") call exposed to the matcher would be helpful. In particular with flows, the matcher can use it to quickly check the flow constraints of a node, etc.


trws commented Sep 16, 2016

I agree that's useful, but I would rather be able to traverse it by vtx->hw_parent or similar than use a function like that. Mainly this is to avoid having to allocate a list or array to store the ancestors, unless there's a reason to want that list at all times, in which case perhaps it should just be stored in each vertex.



dongahn commented Sep 16, 2016

I see your point about memory. This came from wanting to manage the potential complexities of checking ancestor vertices across potentially many hierarchies. Perhaps a matcher can create a sub-matcher to walk up various hierarchies and check the sub-constraints.


dongahn commented Sep 16, 2016

Like

int power_match_and_score (resrc *in);
int io_match_and_score (resrc *in);

int match_and_score (resrc *in) {
    up_walk (in, "power", power_match_and_score);
    up_walk (in, "io-hier1", io_match_and_score);
    /* ... */
}

dongahn commented Sep 16, 2016

I have to think about this some more, but if the scheduler limits its interaction with the new Resource API to the matcher and sub-matcher level, which uses the walker API set, then perhaps this is coarse-grained enough to merit a message interface... Of course, the scheduler needs to push the matcher code into the Resource service, but this is already being demonstrated by @grondo's reduction work...


dongahn commented Sep 17, 2016

I thought about this some more. I am pretty positive that we can do this as a message interface, and we can also take care of reservations with a one-pass algorithm.

Essentially, matchers can evaluate not only the current state but also the next available windows, up to k windows (this should be a function of the reservation depth). The resulting score of each match should reflect how early the resource will be available: the score should be higher for any match whose available time window is earlier. Precedence of the score: available time first, then match quality -- when a resource is idle, the time will be 0, so it must be selected before choosing a resource that becomes available 10 minutes from now.

This way, a one-pass query will always produce enough resources sorted by the match criteria, so the scheduler only needs to dequeue N items from the priority queue.

SteVwonder commented

For IO and Power, I think pre-order is pretty close to optimal. As you descend the tree, you can determine the maximum number of processes you could spawn in each subtree given the amount of available BW/power. You can bail out early from a subtree traversal if the number of processes you can spawn is 0 (or less than the total in the case of the root).

My only question is how a locality-aware scheduler would work under this scheme. For example, if you wanted a scheduler that places jobs under as few switches as possible (to limit the number of hops for MPI comms), how would that work under the one-pass scheme? I can see how to write it under the two-pass scheme: (1) find all the available nodes, then (2) select the nodes under the switches with the most available nodes.


dongahn commented Sep 19, 2016

@SteVwonder: For locality-aware scheduling, I can think of two schemes:

  • Allow the matcher, on a visit to an upper-level resource (e.g., a switch), to pass relevant information down to the visits of its children. E.g., the matcher can pass the aggregates (e.g., an edge switch has k available nodes) back into the walker as opaque data upon exit, which then gets passed into the invocation of the matcher on each of its child nodes. This upper-level information can be used to compute the score (e.g., a node hung below an edge switch with a higher number of available nodes gets a higher score).
  • Use up_walk to determine upper-level information on the visit of each child. In this case, the visit of each child will do an up_walk to figure out the upper-level info.

The first is a top-down scheme and the second bottom-up. Depending on what you do, either could work better. But the important thing seems to be to identify representative operations and package them as a set of abstractions.

In any case, matcher's signature can be something like:

int match_and_score (walker_t *walker, resrc *in, void *data);


dongahn commented Sep 19, 2016

For IO and Power, I think pre-order is pretty close to optimal. As you descend the tree, you can determine the maximum number of processes you could spawn in each subtree given the amount of available BW/power. You can bail out early from a subtree traversal if the number of processes you can spawn is 0 (or less than the total in the case of the root).

An interesting thing to think about is the main hierarchy vs. auxiliary hierarchies. I can see a scheme where the scheduler's main tree is the default hierarchy and up_walk is used to check other constraints. But like you said, some schedulers may not care about the physical tree, in which case the main hierarchy can be the IO hierarchy or the power hierarchy. We want to be able to support all of these operations with common abstractions.


trws commented Sep 19, 2016

A locality-aware scheduler would probably want to compute the densest clique that matches all other constraints. It could be done as a function of some pretty common graph algorithms, but now we're talking about a graph-walking interface rather than just a tree walker. We'll definitely want that, but it's a much more complicated animal than what we're talking about here initially. This is part of why I think it's a good idea to build on a library or framework that has graph traversal functions built in: the trivial tree cases remain trivial, but the complicated stuff becomes much more tractable.


dongahn commented Sep 19, 2016

This is part of why I think it's a good idea to build on a library or framework that has graph traversal functions built in, the trivial tree cases remain trivial, but the complicated stuff becomes much more tractable.

Did you happen to look at any of the already available frameworks we could leverage?


trws commented Sep 19, 2016

Many; this is why I was having trouble with C as the language. There's only one that offers a C interface, and igraph is not particularly suitable. If we had to, we could make it work, but adding a node or edge is an O(n) operation in the size of the graph as a whole. Our local graph analysis experts suggested Boost::Graph for its flexibility and support for completely custom data structures; there are various other good C++ libraries too, and there's the Cayley graph library/database in Go. Obviously none of these is a perfect fit for the project... That said, having an abstract interface over one of them to get started quickly seems less painful than trying to build an entire toolkit from scratch without a graph analysis expert on hand...


dongahn commented Sep 20, 2016

FYI -- @lipari, @trws, @SteVwonder and I talked about this at yesterday's meeting. To make a long story short, we agreed that our short-term effort will be to add some walker and priority-queue support around resrc and friends and to rewrite the scheduler plugins as matchers. The cleaner separation between resrc and the scheduler will then be a basis for performing later surgery when resrc becomes part of the job and resource service via message interfaces. @lipari and I will work closely together on this one -- it can be viewed as scalability work as well. And we will get constant feedback from @trws and @SteVwonder to make the later surgery more successful.


dongahn commented Sep 27, 2016

FYI -- To do a minimal amount of work for this issue, I am first trying to abstract out the various resource generator methods, like these lines, into a common factory/generator class.

The idea is then for this class to generate a resrc "database" given a generation method, and to make all of the resource hierarchies available (through accessors on the opaque data type resrcs_ctx_t) to the scheduler to walk and match.

  • resrc_gen.h:
enum {
    GEN_METHOD_RDL_FILE = 0,
    GEN_METHOD_RDL_BUF,
    GEN_METHOD_HWLOC_FILE,
    GEN_METHOD_HWLOC_BUF,
    GEN_METHOD_FOR_RENT
};

typedef struct gen_method gen_method_t;
typedef struct resrcs_ctx resrcs_ctx_t;

gen_method_t *gen_method_new (int type, const char *create_data);
void gen_method_destroy (gen_method_t *meth);

resrcs_ctx_t *resrcs_gen_new (gen_method_t *meth);
void resrcs_gen_destroy (resrcs_ctx_t *ctx);

Now, I find it a bit awkward to have to specialize the code depending on whether the target resrc is physical or flow, rather than writing fully generic code: e.g.,

  • resrc_gen.c:
<CUT>
static zhash_t *create_w_rdl_file (gen_method_t *meth)
{
    int i = 0;
    zhash_t *rhs = NULL;
    struct rdl *rdl = NULL;
    struct rdllib *l = NULL;
    resrc_tree_t *troot = NULL;
    resrc_flow_t *froot = NULL;
    const char *file = NULL;
    const char *h = NULL;

    <CUT>
    /*
     * Load the default hierarchy first here 
     */
    if (!(troot = rdl_load_hw_hierarchy (rdl, "default")))
        goto ret;

    zhash_insert (rhs, "default", (void *)troot);
    zhash_freefn (rhs, "default", resrc_tree_destroy_full);

    while ((h = rdl_next_hierarchy (rdl, h))) {
        if (strncmp (h, "default", sizeof ("default")) != 0) {
            if (!(froot = rdl_load_flow_hierarchy (rdl, h)))
                goto ret;
            zhash_insert (rhs, h, (void *)froot);
            zhash_freefn (rhs, h, resrc_flow_destroy);
        }
    }

And the matcher and walker API:

  • resrc_walker.h:
enum {
    WALKER_CONT,
    WALKER_STOP,
    WALKER_ACTION_FOR_RENT
};

typedef struct resrc_walker resrc_walker_t;

typedef int (*resrc_matcher_f) (resrc_tree_t *v, const resrc_walker_t *w, void *pq);
typedef int (*flow_matcher_f) (resrc_flow_t *v, const resrc_walker_t *w, void *pq);
typedef int (*resrc_walker_f) (resrc_tree_t *v, resrc_matcher_f m, const resrc_walker_t *w, void *pq);
typedef int (*flow_walker_f) (resrc_flow_t *v, flow_matcher_f m, const resrc_walker_t *w, void *pq);

struct resrc_walker {
    resrc_walker_f preorder;
    resrc_walker_f postorder;
    resrc_walker_f preup;
    resrc_walker_f postup;
    flow_walker_f preup_flow;
    flow_walker_f postup_flow;
};

resrc_walker_t *resrc_walker_new (void);
void resrc_walker_destroy (resrc_walker_t *w);

I will keep working on this to demo the new approach in a small but easily modifiable way, but it seems clear to me that we want to represent any resource hierarchy as a generic tree and also have a common resource abstract data type covering both resrc_t and flow resources.


dongahn commented Mar 20, 2018

This has been addressed by resource-query PR.

dongahn closed this as completed Mar 20, 2018