
A linear job search performs poorly for HTC load #182

Closed
dongahn opened this issue Aug 22, 2016 · 3 comments

Comments

dongahn (Member) commented Aug 22, 2016

We need an optimization for the linear job search scheme, as captured in flux-distribution#14: specifically, this code. Once the optimization is done, please rerun the tests.

dongahn (Member, Author) commented Aug 22, 2016

The scheduler currently has three job queues: pending, running, and complete. It performs only very basic operations on them: finding a job across all three queues, appending a job to a queue, and moving a job from one queue to another (e.g., from the pending queue to the running queue).
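As a simplified sketch of the current scheme (plain C stand-ins for the czmq zlist_t lists the real code uses; the job_t and queue_t types here are hypothetical, though q_find_job() is named after the real function):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the scheduler's job record. */
typedef struct job {
    int64_t id;
    struct job *next;   /* singly-linked queue link */
} job_t;

/* A queue modeled as a linked list (zlist_t in the real code). */
typedef struct {
    job_t *head;
    job_t *tail;
} queue_t;

/* Append a job to the tail of a queue: O(1). */
static void q_append (queue_t *q, job_t *j)
{
    j->next = NULL;
    if (q->tail)
        q->tail->next = j;
    else
        q->head = j;
    q->tail = j;
}

/* Linear scan of one queue. */
static job_t *q_scan (queue_t *q, int64_t id)
{
    job_t *j;
    for (j = q->head; j; j = j->next)
        if (j->id == id)
            return j;
    return NULL;
}

/* Find a job across pending, running, and complete: O(N) overall,
 * where N is the total number of jobs across all three queues. */
static job_t *q_find_job (queue_t *pending, queue_t *running,
                          queue_t *complete, int64_t id)
{
    job_t *j;
    if ((j = q_scan (pending, id)) || (j = q_scan (running, id))
        || (j = q_scan (complete, id)))
        return j;
    return NULL;
}
```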

An obvious optimization is to add indexing based on job id.

Given that the scheduler needs only these limited operations on our queues, an easy solution is to add a zhash_t object and index each job in it on the first enqueue event (i.e., into the pending queue). The remaining changes are to remove a job from this hash table when it is reaped from the complete queue -- which isn't implemented yet, so I will mark it as TODO in the code -- and to make sure the table is initialized and finalized properly.

This adds a small runtime overhead on each job append, and of course a slight increase in memory, but it turns q_find_job() from O(N) into O(1), where N is the number of jobs in the queues.
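A minimal sketch of the proposed index, using a hand-rolled chained hash table in place of zhash_t (keying by job id and the insert/remove points follow the proposal above; the table size and helper names are hypothetical):

```c
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 1024   /* hypothetical fixed table size */

typedef struct jent {
    int64_t id;
    void *job;            /* the job object kept in the queues */
    struct jent *next;    /* chain link within a bucket */
} jent_t;

typedef struct {
    jent_t *bucket[NBUCKETS];
} jobhash_t;

static size_t jh_slot (int64_t id)
{
    return (size_t)id % NBUCKETS;
}

/* Index a job on its first enqueue (into the pending queue): O(1). */
static void jh_insert (jobhash_t *h, int64_t id, void *job)
{
    jent_t *e = calloc (1, sizeof (*e));
    e->id = id;
    e->job = job;
    e->next = h->bucket[jh_slot (id)];
    h->bucket[jh_slot (id)] = e;
}

/* Expected constant-time replacement for the O(N) q_find_job(). */
static void *jh_find (jobhash_t *h, int64_t id)
{
    jent_t *e;
    for (e = h->bucket[jh_slot (id)]; e; e = e->next)
        if (e->id == id)
            return e->job;
    return NULL;
}

/* TODO in the proposal: delete the entry when the job is reaped
 * from the complete queue. */
static void jh_delete (jobhash_t *h, int64_t id)
{
    jent_t **ep = &h->bucket[jh_slot (id)];
    while (*ep) {
        if ((*ep)->id == id) {
            jent_t *e = *ep;
            *ep = e->next;
            free (e);
            return;
        }
        ep = &(*ep)->next;
    }
}
```

Because jobs move between queues by pointer, a single table keyed on job id stays valid across moves; only the first enqueue and the final reap touch it.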

Alternatively, one could keep a lookup table per queue, but that seems overkill at this point and adds higher overhead, since it incurs one additional hash delete and insert on each job move.

Also, one could introduce a composite data structure that abstracts out these operations (traditional enqueue plus constant-time find). But I propose we do that only if and when more complex operations are needed on our job queues.

dongahn (Member, Author) commented Aug 23, 2016

This optimization alone doesn't seem to help much:

  • Jobs Executed Per Second (JEPS) at <2 nodes, 1 broker per node, unit sizing policy, non-mpi sleep, FCFS>: 1.77

I realized that q_move_to_rqueue() and q_move_to_cqueue() can also perform poorly, because they use zlist_remove(), which also does a linear search, as in here. This needs to be optimized too. Now that I think about it, though, these may not be that bad with FCFS scheduling. Perhaps #183 is a bigger fish to fry.
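For reference, zlist_remove() must scan the list to locate the item before unlinking it, so each queue move is O(N) in the queue length. One possible fix (a sketch only, not what the code does today) is an intrusive doubly-linked list: the job carries its own links, so a move becomes a constant-time unlink plus append. The djob_t/dqueue_t names here are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical job with intrusive links: the job knows its own
 * position, so no search is needed to remove it. */
typedef struct djob {
    int64_t id;
    struct djob *prev, *next;
} djob_t;

typedef struct {
    djob_t *head, *tail;
} dqueue_t;

static void dq_append (dqueue_t *q, djob_t *j)
{
    j->prev = q->tail;
    j->next = NULL;
    if (q->tail)
        q->tail->next = j;
    else
        q->head = j;
    q->tail = j;
}

/* O(1) unlink: no linear search, unlike zlist_remove(). */
static void dq_unlink (dqueue_t *q, djob_t *j)
{
    if (j->prev)
        j->prev->next = j->next;
    else
        q->head = j->next;
    if (j->next)
        j->next->prev = j->prev;
    else
        q->tail = j->prev;
    j->prev = j->next = NULL;
}

/* e.g., the pending -> running move done by q_move_to_rqueue(),
 * now in O(1). */
static void dq_move (dqueue_t *from, dqueue_t *to, djob_t *j)
{
    dq_unlink (from, j);
    dq_append (to, j);
}
```

Under FCFS the moved job is usually at the head of the pending queue, so the existing linear removal finds it immediately, which is why this may matter less than it first appears.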

lipari (Contributor) commented Sep 13, 2016

Closed via #190.
