
allocator: Less aggressive retry #2021

Merged: 1 commit into master from allocator-aggressive-retry on Mar 9, 2017

Conversation

aaronlehmann
Collaborator

Instead of retrying unallocated tasks, services, and networks every time data changes in the store, limit these retries to every 5 minutes.

When a repeated attempt to allocate one of these objects fails, log it at the debug log level, to reduce noise in the logs.

cc @alexmavr @yongtang @aboch
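
A minimal, self-contained sketch of the gating idea described above (hypothetical names such as retryInterval, lastRetry, and onStoreCommit; not the PR's actual code):

package main

import (
	"fmt"
	"time"
)

// Hypothetical illustration of interval-gated retries; the real allocator code differs.
const retryInterval = 5 * time.Minute

type allocator struct {
	lastRetry time.Time
}

// onStoreCommit: newly pending objects are always processed, but objects that
// already failed to allocate are retried at most once per retryInterval, and
// their repeated failures are logged at debug level only.
func (a *allocator) onStoreCommit() {
	a.process(false) // pending objects: always attempted

	if time.Since(a.lastRetry) >= retryInterval {
		a.process(true) // previously unallocated objects: quiet, rate-limited retry
		a.lastRetry = time.Now()
	}
}

func (a *allocator) process(onRetryInterval bool) {
	fmt.Println("allocation pass, retry:", onRetryInterval)
}

func main() {
	(&allocator{}).onStoreCommit()
}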

@codecov

codecov bot commented Mar 8, 2017

Codecov Report

Merging #2021 into master will increase coverage by 0.05%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master    #2021      +/-   ##
==========================================
+ Coverage   53.66%   53.71%   +0.05%     
==========================================
  Files         109      109              
  Lines       18991    19008      +17     
==========================================
+ Hits        10191    10210      +19     
+ Misses       7578     7564      -14     
- Partials     1222     1234      +12


@@ -401,12 +416,22 @@ func (a *Allocator) doNetworkAlloc(ctx context.Context, ev events.Event) {
case state.EventCreateNode, state.EventUpdateNode, state.EventDeleteNode:
a.doNodeAlloc(ctx, ev)
case state.EventCreateTask, state.EventUpdateTask, state.EventDeleteTask:
-	a.doTaskAlloc(ctx, ev)
+	a.doTaskAlloc(ctx, ev, nc.pendingTasks)

Couldn't doTaskAlloc(ctx, ev) retrieve pendingTasks on its own via ctx.nc.pendingTasks?

-func (a *Allocator) procUnallocatedTasksNetwork(ctx context.Context) {
-	nc := a.netCtx
-	allocatedTasks := make([]*api.Task, 0, len(nc.unallocatedTasks))
+func (a *Allocator) procTasksNetwork(ctx context.Context, toAllocate map[string]*api.Task, quiet bool) {

If working on the nc retrieved from the context is equivalent, would it make sense to write this method as

func (a *Allocator) procTasksNetwork(ctx context.Context, onRetryInterval bool) {
    nc := a.netCtx
    quiet := false
    toAllocate := nc.pendingTasks
    if onRetryInterval {
        toAllocate = nc.unallocatedTasks
        quiet = true
    }
...
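
(Illustration only: under that suggested signature, the two call sites would presumably collapse to the single flag-driven form, e.g. a.procTasksNetwork(ctx, false) on normal store events and a.procTasksNetwork(ctx, true) when the retry interval elapses.)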

@aboch

aboch commented Mar 8, 2017

Logic looks good to me.
I just have a couple of comments about the function prototypes.

@aaronlehmann force-pushed the allocator-aggressive-retry branch from 6e78fc2 to 456c2ec on March 9, 2017, 00:23
@aaronlehmann
Collaborator Author

Updated, thanks

-	allocatedTasks := make([]*api.Task, 0, len(nc.unallocatedTasks))
+	quiet := false
+	toAllocate := nc.pendingTasks
+	allocatedTasks := make([]*api.Task, 0, len(toAllocate))

This line should go below the if block, after which we know what toAllocate points to
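
(A sketch of the suggested reordering, using the onRetryInterval variant proposed earlier in this review; fragment only, not the exact merged code:)

quiet := false
toAllocate := nc.pendingTasks
if onRetryInterval {
	toAllocate = nc.unallocatedTasks
	quiet = true
}
// Only now is it known which map toAllocate points to, so size the slice here.
allocatedTasks := make([]*api.Task, 0, len(toAllocate))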

@aaronlehmann
Collaborator Author

Updated

Instead of retrying unallocated tasks, services, and networks every time
data changes in the store, limit these retries to every 5 minutes.

When a repeated attempt to allocate one of these objects fails, log it
at the debug log level, to reduce noise in the logs.

Signed-off-by: Aaron Lehmann <[email protected]>
@aaronlehmann force-pushed the allocator-aggressive-retry branch from 456c2ec to 513d028 on March 9, 2017, 01:33
@aboch

aboch commented Mar 9, 2017

Looks good to me

@aluzzardi
Member

Do we want to handle the potentially impossible case in which we don't get a commit?

e.g. we receive a commit (and it turns out that it freed up an IP address), but the 5-minute retry interval hasn't elapsed yet so we don't retry, and no other commit comes afterward, so we never allocate the task.

@aaronlehmann
Collaborator Author

I think that's a very good point. I had considered this but didn't want to add too much complexity, especially because I think this should be backported. Do you think it's a good idea to add a timer that triggers after 5 minutes if no commits happen during that interval?

@aluzzardi
Member

I think it's such a rare case that we may not need to bother ...

I guess it depends on whether the fix would be extremely tiny?

Can this simply be another case in the select with a time.After?

@aluzzardi
Member

Or a timer that we reset every time we receive a commit

@aluzzardi
Member

Or maybe we shouldn't bother :)

This is going to be so rare that the code to handle this case may itself be buggy and we'd never notice.

@aaronlehmann
Collaborator Author

Yeah, let's not bother. I liked the suggestion of adding a time.After in the select, but then I realized this would start a timer every time we receive an event, and that timer wouldn't be reaped until it fired 5 minutes later. It could be a big waste of resources. The right way to do it would be to use an actual time.Timer and reset it whenever there's activity, but that's easier to mess up, and I think you're right that it's a very unusual case. Even without addressing that case, this PR is still making things a lot better without adding too much risk.
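
(For reference only, since this was deliberately not added to the PR: a self-contained sketch of the single-timer, reset-on-activity pattern described above; the channel names and short intervals are made up for the demo.)

package main

import (
	"fmt"
	"time"
)

// Sketch of the "one timer, reset on every commit" pattern discussed above.
// A single time.Timer is reused, so no per-event time.After timer is left
// waiting to be reaped.
func main() {
	const retryInterval = 2 * time.Second // stands in for the 5 minutes in the discussion

	commits := make(chan struct{})
	go func() {
		// Simulated store commits arriving for a while, then silence.
		for i := 0; i < 3; i++ {
			time.Sleep(500 * time.Millisecond)
			commits <- struct{}{}
		}
	}()

	retryTimer := time.NewTimer(retryInterval)
	defer retryTimer.Stop()
	deadline := time.After(6 * time.Second) // end the demo

	for {
		select {
		case <-commits:
			fmt.Println("commit: process pending objects, reset the retry timer")
			if !retryTimer.Stop() {
				select { // drain a value that may already be buffered
				case <-retryTimer.C:
				default:
				}
			}
			retryTimer.Reset(retryInterval)
		case <-retryTimer.C:
			fmt.Println("no commits for a full interval: run the retry pass anyway")
			retryTimer.Reset(retryInterval)
		case <-deadline:
			return
		}
	}
}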

@aluzzardi
Member

LGTM
