balance: slow down interval increase speed. #585
		if op := s.Scheduler.Schedule(cluster); op != nil {
			s.interval = minScheduleInterval
			return op
		}
	}

	// If we have no schedule, increase the interval exponentially.
Please update the comment here.
Multiplying by 1.3 each time is still exponential growth.
	maxScheduleRetries     = 10
	maxScheduleInterval    = time.Minute
	minScheduleInterval    = time.Millisecond * 10
	scheduleIntervalFactor = 1.3
Any reason to use 1.3 here?
The value was chosen arbitrarily; we just need slower growth here.
With a factor of 2, the interval reaches its maximum of 1 minute after about 13 retries, which takes less than 1.5 minutes in total.
Failing to schedule an operator for 1.5 minutes does not necessarily mean the cluster is balanced; it may instead be caused by slow heartbeats or slow snapshots.
Can we construct a test to verify that the change is OK?
I'll see what I can do.
LGTM
Rest LGTM
PTAL @siddontang
LGTM
As we have observed, the schedule interval increases too fast.
We should not increase the interval after each failed retry (this appears to be a bug). The increase factor is also changed from 2 to 1.3.
/cc @nolouch @siddontang @andelf