Added simplified proposal for priority and fairness in apiservers #933
Conversation
…ver requests
No sharing between priority levels. No distinction between mutating and read-only requests.
CC @yue9944882
Since this proposal has hardly any relations between priority levels, characterizing each by a number qualifies as random precision (and we would not want to wear out our welcome).
It might be useful to add a section regarding how API Model configuration is read/updated for the request manager. My open questions are:
hash value is used to shuffle the queue indices and deal a hand of
size H, as follows. We use a hash function that produces at least 64
bits, and 64 of those bits are taken as an unsigned integer we will
call V. The next step is finding the unique set of integers A[0] in
For the shuffle sharding implementation, the method Daniel referenced in his BucketClasses document seems, in my opinion, a bit easier to understand and possibly to implement. Does that method seem equally sufficient, and if so, would it make sense to reference it here?
That technique is a bit simpler but suffers a consequence: there can be duplicate values in a hand. Continuing the example from that doc, where a hand consists of eight 6-bit indices: if the hash value is 0x041041041041 then the hand contains eight copies of index 1.
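For illustration, a tiny runnable sketch of that duplicate hand, assuming indices are dealt from the low-order bits:

```go
package main

import "fmt"

func main() {
	// Deal eight 6-bit indices from the example hash value; every
	// 6-bit chunk of 0x041041041041 is 000001, i.e. index 1.
	v := uint64(0x041041041041)
	for i := 0; i < 8; i++ {
		fmt.Print(v&0x3f, " ") // prints: 1 1 1 1 1 1 1 1
		v >>= 6
	}
	fmt.Println()
}
```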
The technique outlined here has a simple recursive implementation. The shuffle sharding can be accomplished by the function call

```go
queueIndex := shuffleDealAndPick(hashValue, numQueues, func(a int) int { return a }, handSize, maxInt, -1)
```

where

```go
type mapper func(int) int

// shuffleDealAndPick deals a hand of nRem indices from the hash value v
// and returns the index of the shortest queue in the hand. nq is the
// number of queue indices not yet dealt; mr maps abstract positions
// [0, nq) onto the actual queue indices not yet dealt; minLen and
// bestIdx track the shortest queue seen so far. lengthOfQueue is
// assumed to return the current length of the queue at a given index.
func shuffleDealAndPick(v, nq uint64, mr mapper, nRem, minLen, bestIdx int) int {
	if nRem < 1 {
		return bestIdx
	}
	vNext := v / nq
	ai := int(v - nq*vNext) // ai = v mod nq: the position dealt this step
	ii := mr(ai)            // the actual queue index at that position
	mrNext := func(a int) int { // remove position ai from the remaining deck
		if a < ai {
			return mr(a)
		}
		return mr(a + 1)
	}
	lenI := lengthOfQueue(ii)
	if lenI < minLen {
		minLen = lenI
		bestIdx = ii
	}
	return shuffleDealAndPick(vNext, nq-1, mrNext, nRem-1, minLen, bestIdx)
}
```
https://play.golang.org/p/Ejb56vVnV_W
Clever :) 64 bits is probably not quite enough, though.
It's easy to get more bits: we can just XOR the next N bits with the top N bits of v, where N is something like log2(nq). That's easier than doing multi-word division :)
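A hypothetical sketch of that folding; the names and the source of the extra hash bits are assumptions, not from the proposal:

```go
package main

import "math/bits"

// foldExtraBits XORs the next n bits of the hash stream into the top n
// bits of v, where n is roughly log2(nq). extra is assumed to carry
// further bits from the same hash function.
func foldExtraBits(v, extra, nq uint64) uint64 {
	n := uint(bits.Len64(nq - 1)) // n ≈ ceil(log2(nq)); assumes nq > 1
	mask := (uint64(1) << n) - 1
	return v ^ ((extra & mask) << (64 - n))
}
```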
Yes, we could pull in more bits of v. But I doubt we actually will need a large hand size with a large number of queues (i.e., large falling factorial). My suggestion is to start simple, and implement more bits if we find a need.
BTW, rand.Uint64() is a thing.
For a slightly more elaborate version of that recursion, with the domain and range of the mapping function commented and with debug printing, see https://play.golang.org/p/rwtKLj3yGoF
I think it would be good in a followup to capture the code samples as e.g. footnotes or appendices. I think they will be easier for many to read than the textual description.
priority level:

```
ACV(l) = ceil( SCL * ACS(l) / ( sum[priority levels k] ACS(k) ) )
```
My understanding of this equation with some numbers:

SCL = 400
sum[priority levels k] ACS(k) = 800 (eight priority levels, assuredConcurrencyShares: 100 for all)
ACS(l) = 100 (priority level l, assuredConcurrencyShares: 100)
ACV(l) = ceil(400 * 100/800) = 50

Is this understanding correct? That as sum[priority levels k] ACS(k) increases, the ACV for all priority levels decreases? Also, is a priority level each individual RequestPriority?
I agree with the values you calculated in that example. As for your general observation, note that sum[priority levels k] ACS(k) can increase only if some individual ACS(k) values increase. If they all increase in the same proportion, then there is no change in the result. If they change in different proportions, then the distribution shifts around.
Yes, modulo defaulting, there is a one-to-one correspondence between the intuitive notion of request level and RequestPriority API objects.
Note that because there is no actual precedence between priorities in this simple proposal, we do not bother giving a numeric "level" to each priority. Rather, the exempt and catchAll bits capture the only remaining consequences of the numeric levels in the more sophisticated proposal (#930). Further, note also that there is a fairly easy way to evolve to the more sophisticated proposal: we can add a numeric level, requiring/making the exempt and catchAll bits consistent with the numeric level, and later retire the exempt and catchAll bits.
Hmm, I see now that the current text does not clearly say so up front, but yes: the section on defaulting is working in that setting.
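(For reference, the arithmetic from the example above as a runnable sketch; the function name is illustrative.)

```go
package main

import (
	"fmt"
	"math"
)

// acv evaluates ACV(l) = ceil( SCL * ACS(l) / ( sum[priority levels k] ACS(k) ) ).
func acv(scl, acs, sumACS float64) int {
	return int(math.Ceil(scl * acs / sumACS))
}

func main() {
	// SCL = 400, eight priority levels with assuredConcurrencyShares: 100 each.
	fmt.Println(acv(400, 100, 800)) // 50
}
```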
version this corresponds to adding a service time (i.e., duration) to
virtual time.

The third difference is handled by modifying the algorithm to dispatch
I have begun a POC for a fair queuing algorithm with the modifications specified in this repo: aaron-prindle/fq.
From the suggested changes, I have made the following pseudocode (and this code):

```go
// called when a packet is queued
func (q *fqscheduler) updateTimeQueued(packet *Packet, queue *Queue) {
	if len(queue.Packets) == 0 && !queue.requestsexecuting {
		// queue.virstart is ignored
		// queue.lastvirfinish is in the virtual past
		queue.virstart = uint64(time.Now().UnixNano())
	}
	if len(queue.Packets) == 0 && queue.requestsexecuting {
		queue.virstart = queue.lastvirfinish
	}
	packet.virfinish = uint64(len(queue.Packets))*packet.estservicetime + queue.virstart
	if len(queue.Packets) > 0 {
		// the last virtual finish time of the queue is the virtual
		// finish time of the last request in the queue
		queue.lastvirfinish = packet.virfinish // this packet is last, as it was just queued and the queue is locked
	}
}

// called when a packet is dequeued
func (q *fqscheduler) updateTimeDequeued(packet *Packet, queue *Queue) {
	queue.virstart += packet.estservicetime
}

// callback used once the packet's request has completed and the actual
// service time has been set
func (q *fqscheduler) updateTimeFinished(packet *Packet, queue *Queue) {
	// assuming packet.actservicetime is set
	S := packet.actservicetime
	G := packet.estservicetime
	queue.virstart -= G - S
}
```
There is also a psuedo-impl branch in that repo which reflects a direct implementation of the pseudocode from the fair queuing wiki page. This was used to validate the tests.
- Does this pseudocode understanding agree with the logic in the document?
- Does it make sense to put something early like this into the feature branch for visibility and move this discussion there? What I am hoping to get from the POC is some consensus on the tests used for the fair queuing implementation.
- Regarding test cases for fair queuing, I adapted tests from tagdlines/wfq, which calculate the variance in the packets dequeued amongst different input flows. Does this method of checking the variance seem sufficient for verifying the fairness of fair queuing? Are there any suggested references to look at for testing the fair queuing implementation?
There is a major misunderstanding. In the implementation outline from Wikipedia, now() is reading the virtual clock, not the real clock. I pushed an update to the proposal to emphasize this.
I see that the Wikipedia article was misleading on this point. I updated it. Where it says now(), it is referring to what the Demers et al paper notated as "R(t)".
Note also that the proposal discusses making a change in the data structures used, so that each packet does not have a field holding its virtual finish time and a queue does not have a field holding its last virtual finish time; instead these values are computed when and where needed.
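To make the virtual-clock point concrete, here is a minimal sketch, with hypothetical field names rather than the proposal's exact data structures, of reading R(t) by advancing it between scheduler events; for simplicity it ignores the server's concurrency limit, which would scale the rate:

```go
package main

import "time"

type fqscheduler struct {
	virtualTime  float64   // R(t), in virtual nanoseconds
	lastRealTime time.Time // real time of the previous update
	numNonEmpty  int       // NEQ(t), constant between updates
}

// advanceVirtualTime is called at each scheduler event. While any queue
// is non-empty, the virtual clock advances at 1/NEQ of the real clock,
// because each fantasy round gives one nanosecond to every non-empty queue.
func (q *fqscheduler) advanceVirtualTime(now time.Time) {
	if q.numNonEmpty > 0 {
		elapsed := float64(now.Sub(q.lastRealTime).Nanoseconds())
		q.virtualTime += elapsed / float64(q.numNonEmpty)
	}
	q.lastRealTime = now
}
```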
Note also that there is no need to give each queue a "key". The queues are simply identified by their position in the slice of queues.
- inSet:
    field: resource
    set: [ "tokenreviews", "subjectaccessreviews" ]
- superSet:
My understanding of all the possible 'Match' directives is here: https://gist.github.com/aaron-prindle/b53d3d3687ef442fe113325db5fd4dff
Is my understanding of SuperSet correct in that the value needs to be in the set of the values in the field? I'm not familiar with how Groups are represented in k8s; would this be analogous to the value being a substring of the field?
Regarding superset: note that authentication produces a set of strings, each of which names one of the groups that the user is authenticated to belong to. Confusingly, this set is often called simply group.
The general idea of the superset match is that when the authenticated request attribute identified by field is a set --- let's call it S1 --- and the match specifies another set --- let's call it S2 --- then we have a match if and only if every member of S2 is also a member of S1. This is the reverse of subset, which would be true if and only if every member of S1 is also a member of S2.
So, for example, if the set of authenticated group strings is {"g1", "g2", "g4"} and the match specifies field: groups, set: ["g1", "g2"], then this is a successful match because both g1 and g2 are in the authenticated set. OTOH, if the match said field: groups, set: ["g2", "g3"], then we would NOT have a match.
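A minimal sketch of that rule; the function name is illustrative:

```go
package main

import "fmt"

// supersetMatch reports whether every member of the match's set (S2)
// is also a member of the authenticated attribute's set (S1).
func supersetMatch(s1, s2 []string) bool {
	have := make(map[string]bool, len(s1))
	for _, g := range s1 {
		have[g] = true
	}
	for _, g := range s2 {
		if !have[g] {
			return false
		}
	}
	return true
}

func main() {
	groups := []string{"g1", "g2", "g4"}
	fmt.Println(supersetMatch(groups, []string{"g1", "g2"})) // true
	fmt.Println(supersetMatch(groups, []string{"g2", "g3"})) // false
}
```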
ah makes sense, thanks!
Removed last vestige of treating mutating and read-only requests differently. Also added some emphasis to the fact that `now()` reads the virtual clock rather than the real clock.
packet that is really being transmitted at the moment); in this
conception, a packet is considered to be "in" its queue until the
packet’s transmission is finished. For our problem, we can define a
round to be giving one nanosecond of CPU to every non-empty queue in
In a single round, is only one packet dequeued/sent? The Demers paper refers to round-robin rounds, but my understanding is that you always dequeue the minimum virtual finish time regardless of what queue it is in.
Overall, what steps happen in a 'round'? Does the virtual clock "tick" and then one or more dequeues take place?
Remember that the rounds describe a fantasy, and that fantasy is used to guide what really happens. In the Demers paper a round consists of transmitting one bit from each queue; the actual transmission does not go that way. The actual packet transmissions are ordered according to completion times of packets in these fantasy rounds.
executing requests from that queue), and define R(t) = (server start
time) + (1 ns) * (number of rounds since server start). Let us write
NEQ(t) for that number of non-empty queues in the apiserver at time t.
For a given queue "q", let us also write "reqs(q, t)" for the number
To clarify, is reqs(q,t) the number of requests-queued + requests-executing or only requests-executing?
It is that sum: reqs(q, t) counts both the requests from q waiting in the queue and those currently executing.
Regarding the question of what is the probability of a mouse being squashed in the presence of a certain number of elephants, given a certain choice for number of queues and hand size, see https://play.golang.org/p/sWvC8nOkPPp . This shows, for
I want to lock in progress on this. I left some comments that can be addressed in followups.
At the first stage of development, an apiserver’s concurrency limit
will be derived from the existing configuration options for
max-mutating-in-flight and max-readonly-in-flight, by taking their
sum. Later we may migrate to a single direct configuration option.
In a followup: even later, we intend to compute (autotune) this value based on the resources available to apiserver.
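(In sketch form, with hypothetical names, the first-stage derivation is just the sum:)

```go
// serverConcurrencyLimit derives the first-stage concurrency limit from
// the two existing in-flight options, per the quoted text.
func serverConcurrencyLimit(maxMutatingInFlight, maxReadOnlyInFlight int) int {
	return maxMutatingInFlight + maxReadOnlyInFlight
}
```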
An apiserver is also configured with a limit on the amount of time
that a request may wait in its queue. If this time passes while a
request is still waiting for service then the request will be
rejected.
This may mean we need to revisit the scalability tests: this protection could keep us from violating latency SLOs even though we are dropping many requests.
I think that we will need to revisit the SLI and SLO themselves.
What has been on my mind for some time is that we should count toward the latency SLI only requests that were actually processed, and have a second SLI/SLO for the percentage of rejected requests.
I will try to get to it soon, but it won't happen before KubeCon for sure.
There is also the question of latency of rejected requests. It is somewhat hostile to hang onto a request for a long time and then reject it.
For implementation simplicity (see below), let us use the same initial
service time guess for every request; call that duration G. A good
choice might be the service time limit (1 minute). Different guesses
in a followup: the vast majority of our requests are much shorter than that, even 1s is probably an overestimate. I actually suggest 1s.
This probably doesn't matter much at all until/unless we start making different guesses about different requests.
Based on hand simulation, using the same guess for every request is safe and works well when the guess is never low. If the guess is low in some cases then the dispatching pattern can show more burstiness. That is why I recommend setting the guess at the maximum.
I am inclined to avoid saying more in the text until we have more shared evidence.
matchingPriority: (doesn’t really matter)
requestPriority:
  name: (name of an effectively existing RequestPriority, whether
    that is real or backstop, with catchAll==true)
in a followup: tabs and yaml don't mix :)
- apiserver_request_latencies_summary

This KEP adds the following metrics.
- apiserver_rejected_requests (count, broken down by priority, FlowSchema, when (arrival vs timeout))
Did we discuss whether including flowschema is too high-cardinality? I think it should be OK if we don't multiply it by too many other dimensions. It might be good to state the expected cardinality here in a followup; ending up with too many distinct label values is a common failure mode in metrics.
Note that what is stipulated here is the schema, not the individual flow. There is a statement elsewhere about the expected cardinality of schemas.
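For illustration, a sketch of how such a counter might be declared with prometheus/client_golang; the label names are illustrative, not final:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// rejectedRequests has cardinality (priorities x schemas x 2), which
// stays modest as long as the number of schemas is bounded as discussed.
var rejectedRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "apiserver_rejected_requests",
		Help: "Count of rejected requests, by priority, flow schema, and rejection reason.",
	},
	[]string{"priority", "flow_schema", "when"},
)

func init() {
	prometheus.MustRegister(rejectedRequests)
}
```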
- apiserver_current_executing_requests (gauge, broken down by priority, FlowSchema)
- apiserver_dispatched_requests (count, broken down by priority, FlowSchema)
- apiserver_wait_duration (histogram, broken down by priority, FlowSchema)
- apiserver_service_duration (histogram, broken down by priority, FlowSchema)
I think it would be good to give some quantification of the queues in each priority, things like:
- how long is the longest queue?
- how long is the shortest queue?
- how many queues are e.g. 90% as long as the longest queue?
Basically the goal is to help administrators figure out if an elephant is properly getting confined to a subset of queues, or if they need to revisit some flowschema distinguisher method.
So how about quantiles of queue length, at 0%, 50%, 90%, and 100%?
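Something like this sketch, where lengths would be sampled from the priority level's slice of queues (nearest-rank quantiles; names are illustrative):

```go
package main

import "sort"

// queueLengthQuantiles reports the 0%, 50%, 90%, and 100% quantiles of
// the current queue lengths for one priority level.
func queueLengthQuantiles(lengths []int) (q0, q50, q90, q100 int) {
	if len(lengths) == 0 {
		return
	}
	sorted := append([]int(nil), lengths...)
	sort.Ints(sorted)
	at := func(p float64) int { return sorted[int(p*float64(len(sorted)-1))] }
	return at(0), at(0.5), at(0.9), at(1)
}
```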
flow hashing to the same set of 6 queues as the heavy flow is about
one in 5.4 billion.

It is the queues that compete fairly.
We should probably explicitly point out the corollary, which is that if the distinguisher method doesn't match the commonalities in the elephant's traffic, then the elephant can stomp on all queues simultaneously. E.g., if the distinguisher method is by username, but the elephant is actually e.g. a sidecar which makes requests using its host pod's service account, then this evades the fairness.
I have a somewhat harebrained scheme to address this; I don't have time to clean it up right now, and I think we can live with this in the first pass anyway. Sysadmins, use correct distinguisher methods!
This KEP adds the following metrics.
- apiserver_rejected_requests (count, broken down by priority, FlowSchema, when (arrival vs timeout))
- apiserver_current_inqueue_requests (gauge, broken down by priority, FlowSchema)
Can the same flowschema map to different priorities? Do we need to include priority here?
A flow schema is associated with exactly one priority (at a time :-).
- apiserver_dispatched_requests (count, broken down by priority, FlowSchema)
- apiserver_wait_duration (histogram, broken down by priority, FlowSchema)
- apiserver_service_duration (histogram, broken down by priority, FlowSchema)
Logging: We should log the matching flowschema along with the current log entry for the request. We should probably also include it in the audit logs.
What code structure would accomplish that?
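One hypothetical structure (none of these names come from the actual apiserver code): the filter that classifies the request stashes the matched FlowSchema's name in the request context, and whatever writes the request's log entry or audit annotations reads it back:

```go
package main

import "context"

type flowSchemaKey struct{}

// WithFlowSchemaName records the matched FlowSchema's name in the
// request context, in the filter that classifies the request.
func WithFlowSchemaName(ctx context.Context, name string) context.Context {
	return context.WithValue(ctx, flowSchemaKey{}, name)
}

// FlowSchemaNameFrom retrieves it where the log entry or audit event
// is being built.
func FlowSchemaNameFrom(ctx context.Context) (string, bool) {
	name, ok := ctx.Value(flowSchemaKey{}).(string)
	return name, ok
}
```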
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: lavalamp, MikeSpreitzer
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
No sharing between priority levels.
No distinction between mutating and read-only requests.
I think this is what we agreed in the meeting today.