Added simplified proposal for priority and fairness in apiservers #933
Conversation
…ver requests
No sharing between priority levels. No distinction between mutating and read-only requests.
CC @yue9944882
Since this proposal has hardly any relations between priority levels, characterizing each by a number qualifies as random precision (and we would not want to wear out our welcome).
It might be useful to add a section regarding how API Model configuration is read/updated for the request manager. My open questions are:
hash value is used to shuffle the queue indices and deal a hand of
size H, as follows. We use a hash function that produces at least 64
bits, and 64 of those bits are taken as an unsigned integer we will
call V. The next step is finding the unique set of integers A[0] in
For the shuffle sharding implementation, the method Daniel referenced in his BucketClasses document seems, in my opinion, a bit easier to understand and possibly to implement. Does that method seem equally sufficient, and if so, would it make sense to reference it here?
That technique is a bit simpler but suffers a consequence: there can be duplicate values in a hand. Continuing the example from that doc, where a hand consists of eight 6-bit indices: if the hash value is 0x041041041041 then the hand contains eight copies of index 1.
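For illustration, a tiny runnable sketch of that duplicate hand, assuming indices are dealt from the low-order bits:

```go
package main

import "fmt"

func main() {
	// Deal eight 6-bit indices from the example hash value; every
	// 6-bit chunk of 0x041041041041 is 000001, i.e. index 1.
	v := uint64(0x041041041041)
	for i := 0; i < 8; i++ {
		fmt.Print(v&0x3f, " ") // prints: 1 1 1 1 1 1 1 1
		v >>= 6
	}
	fmt.Println()
}
```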
The technique outlined here has a simple recursive implementation. The shuffle sharding can be accomplished by the function call

```go
queueIndex := shuffleDealAndPick(hashValue, numQueues, func(a int) int { return a }, handSize, maxInt, -1)
```

where

```go
type mapper func(int) int

// shuffleDealAndPick deals a hand of nRem indices from the hash value v
// and returns the index of the shortest queue in the hand. nq is the
// number of queue indices not yet dealt; mr maps abstract positions
// [0, nq) onto the actual queue indices not yet dealt; minLen and
// bestIdx track the shortest queue seen so far. lengthOfQueue is
// assumed to return the current length of the queue at a given index.
func shuffleDealAndPick(v, nq uint64, mr mapper, nRem, minLen, bestIdx int) int {
	if nRem < 1 {
		return bestIdx
	}
	vNext := v / nq
	ai := int(v - nq*vNext) // ai = v mod nq: the position dealt this step
	ii := mr(ai)            // the actual queue index at that position
	mrNext := func(a int) int { // remove position ai from the remaining deck
		if a < ai {
			return mr(a)
		}
		return mr(a + 1)
	}
	lenI := lengthOfQueue(ii)
	if lenI < minLen {
		minLen = lenI
		bestIdx = ii
	}
	return shuffleDealAndPick(vNext, nq-1, mrNext, nRem-1, minLen, bestIdx)
}
```
https://play.golang.org/p/Ejb56vVnV_W
Clever :) 64 bits is probably not quite enough, though.
It's easy to get more bits: we can just XOR the next N bits with the top N bits of v, where N is something like log2(nq). That's easier than doing multi-word division :)
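A hypothetical sketch of that folding; the names and the source of the extra hash bits are assumptions, not from the proposal:

```go
package main

import "math/bits"

// foldExtraBits XORs the next n bits of the hash stream into the top n
// bits of v, where n is roughly log2(nq). extra is assumed to carry
// further bits from the same hash function.
func foldExtraBits(v, extra, nq uint64) uint64 {
	n := uint(bits.Len64(nq - 1)) // n ≈ ceil(log2(nq)); assumes nq > 1
	mask := (uint64(1) << n) - 1
	return v ^ ((extra & mask) << (64 - n))
}
```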
Yes, we could pull in more bits of v. But I doubt we actually will need a large hand size with a large number of queues (i.e., large falling factorial). My suggestion is to start simple, and implement more bits if we find a need.
BTW, rand.Uint64() is a thing.
For a slightly more elaborate version of that recursion, with the domain and range of the mapping function commented and with debug printing, see https://play.golang.org/p/rwtKLj3yGoF
I think it would be good in a followup to capture the code samples as e.g. footnotes or appendices. I think they will be easier for many to read than the textual description.
priority level:

```
ACV(l) = ceil( SCL * ACS(l) / ( sum[priority levels k] ACS(k) ) )
```
My understanding of this equation with some numbers:

SCL = 400
sum[priority levels k] ACS(k) = 800 (eight priority levels, assuredConcurrencyShares: 100 for all)
ACS(l) = 100 (priority level l, assuredConcurrencyShares: 100)
ACV(l) = ceil(400 * 100/800) = 50

Is this understanding correct? That as sum[priority levels k] ACS(k) increases, the ACV for all priority levels decreases? Also, is a priority level each individual RequestPriority?
I agree with the values you calculated in that example. As for your general observation, note that sum[priority levels k] ACS(k) can increase only if some individual ACS(k) values increase. If they all increase in the same proportion, then there is no change in the result. If they change in different proportions, then the distribution shifts around.
Yes, modulo defaulting, there is a one-to-one correspondence between the intuitive notion of request level and RequestPriority API objects.
Note that because there is no actual precedence between priorities in this simple proposal, we do not bother giving a numeric "level" to each priority. Rather, the exempt and catchAll bits capture the only remaining consequences of the numeric levels in the more sophisticated proposal (#930). Further, note also that there is a fairly easy way to evolve to the more sophisticated proposal: we can add a numeric level, requiring/making the exempt and catchAll bits consistent with the numeric level, and later retire the exempt and catchAll bits.
Hmm, I see now that the current text does not clearly say so up front, but yes: the section on defaulting is working in that setting.
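(For reference, the arithmetic from the example above as a runnable sketch; the function name is illustrative.)

```go
package main

import (
	"fmt"
	"math"
)

// acv evaluates ACV(l) = ceil( SCL * ACS(l) / ( sum[priority levels k] ACS(k) ) ).
func acv(scl, acs, sumACS float64) int {
	return int(math.Ceil(scl * acs / sumACS))
}

func main() {
	// SCL = 400, eight priority levels with assuredConcurrencyShares: 100 each.
	fmt.Println(acv(400, 100, 800)) // 50
}
```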
version this corresponds to adding a service time (i.e., duration) to
virtual time.

The third difference is handled by modifying the algorithm to dispatch
I have begun a POC for a fair queuing algorithm with the modifications specified in this repo: aaron-prindle/fq.
From the suggested changes, I have made the following pseudocode (and this code):

```go
// called when a packet is queued
func (q *fqscheduler) updateTimeQueued(packet *Packet, queue *Queue) {
	if len(queue.Packets) == 0 && !queue.requestsexecuting {
		// queue.virstart is ignored
		// queue.lastvirfinish is in the virtual past
		queue.virstart = uint64(time.Now().UnixNano())
	}
	if len(queue.Packets) == 0 && queue.requestsexecuting {
		queue.virstart = queue.lastvirfinish
	}
	packet.virfinish = uint64(len(queue.Packets))*packet.estservicetime + queue.virstart
	if len(queue.Packets) > 0 {
		// the last virtual finish time of the queue is the virtual
		// finish time of the last request in the queue
		queue.lastvirfinish = packet.virfinish // this packet is last, as it was just queued and the queue is locked
	}
}

// called when a packet is dequeued
func (q *fqscheduler) updateTimeDequeued(packet *Packet, queue *Queue) {
	queue.virstart += packet.estservicetime
}

// callback used once the packet's request has completed and the actual
// service time has been set
func (q *fqscheduler) updateTimeFinished(packet *Packet, queue *Queue) {
	// assuming packet.actservicetime is set
	S := packet.actservicetime
	G := packet.estservicetime
	queue.virstart -= G - S
}
```
There is also a psuedo-impl branch in that repo which reflects a direct implementation of the pseudocode from the fair queuing wiki page. This was used to validate the tests.
- Does this pseudocode understanding agree with the logic in the document?
- Does it make sense to put something early like this into the feature branch for visibility and move this discussion there? What I am hoping to get from the POC is some consensus on the tests used for the fair queuing implementation.
- Regarding test cases for fair queuing, I adapted tests from tagdlines/wfq, which calculate the variance in the packets dequeued amongst different input flows. Does this method of checking the variance seem sufficient for verifying the fairness of fair queuing? Are there any suggested references to look at for testing the fair queuing implementation?
There is a major misunderstanding. In the implementation outline from Wikipedia, now() is reading the virtual clock, not the real clock. I pushed an update to the proposal to emphasize this.
I see that the Wikipedia article was misleading on this point. I updated it. Where it says now(), it is referring to what the Demers et al paper notated as "R(t)".
Note also that the proposal discusses making a change in the data structures used, so that each packet does not have a field holding its virtual finish time and a queue does not have a field holding its last virtual finish time; instead these values are computed when and where needed.
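To make the virtual-clock point concrete, here is a minimal sketch, with hypothetical field names rather than the proposal's exact data structures, of reading R(t) by advancing it between scheduler events; for simplicity it ignores the server's concurrency limit, which would scale the rate:

```go
package main

import "time"

type fqscheduler struct {
	virtualTime  float64   // R(t), in virtual nanoseconds
	lastRealTime time.Time // real time of the previous update
	numNonEmpty  int       // NEQ(t), constant between updates
}

// advanceVirtualTime is called at each scheduler event. While any queue
// is non-empty, the virtual clock advances at 1/NEQ of the real clock,
// because each fantasy round gives one nanosecond to every non-empty queue.
func (q *fqscheduler) advanceVirtualTime(now time.Time) {
	if q.numNonEmpty > 0 {
		elapsed := float64(now.Sub(q.lastRealTime).Nanoseconds())
		q.virtualTime += elapsed / float64(q.numNonEmpty)
	}
	q.lastRealTime = now
}
```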
Note also that there is no need to give each queue a "key". The queues are simply identified by their position in the slice of queues.
- inSet:
    field: resource
    set: [ "tokenreviews", "subjectaccessreviews" ]
- superSet:
My understanding of all the possible 'Match' directives is here: https://gist.github.com/aaron-prindle/b53d3d3687ef442fe113325db5fd4dff
Is my understanding of SuperSet correct in that the value needs to be in the set of the values in the field? I'm not familiar with how Groups are represented in k8s; would this be analogous to the value being a substring of the field?
Regarding superset: note that authentication produces a set of strings, each of which names one of the groups that the user is authenticated to belong to. Confusingly, this set is often called simply group.
The general idea of the superset match is that when the authenticated request attribute identified by field is a set --- let's call it S1 --- and the match specifies another set --- let's call it S2 --- then we have a match if and only if every member of S2 is also a member of S1. This is the reverse of subset, which would be true if and only if every member of S1 is also a member of S2.
So, for example, if the set of authenticated group strings is {"g1", "g2", "g4"} and the match specifies field: groups, set: ["g1", "g2"], then this is a successful match because both g1 and g2 are in the authenticated set. OTOH, if the match said field: groups, set: ["g2", "g3"], then we would NOT have a match.
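A minimal sketch of that rule; the function name is illustrative:

```go
package main

import "fmt"

// supersetMatch reports whether every member of the match's set (S2)
// is also a member of the authenticated attribute's set (S1).
func supersetMatch(s1, s2 []string) bool {
	have := make(map[string]bool, len(s1))
	for _, g := range s1 {
		have[g] = true
	}
	for _, g := range s2 {
		if !have[g] {
			return false
		}
	}
	return true
}

func main() {
	groups := []string{"g1", "g2", "g4"}
	fmt.Println(supersetMatch(groups, []string{"g1", "g2"})) // true
	fmt.Println(supersetMatch(groups, []string{"g2", "g3"})) // false
}
```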
ah makes sense, thanks!
Removed last vestige of treating mutating and read-only requests differently. Also added some emphasis to the fact that `now()` reads the virtual clock rather than the real clock.
packet that is really being transmitted at the moment); in this
conception, a packet is considered to be "in" its queue until the
packet’s transmission is finished. For our problem, we can define a
round to be giving one nanosecond of CPU to every non-empty queue in
In a single round, is only one packet dequeued/sent? The Demers paper refers to round-robin rounds, but my understanding is that you always dequeue the minimum virtual finish time regardless of what queue it is in.
Overall, what steps happen in a 'round'? Does the virtual clock "tick" and then one or more dequeues take place?
Remember that the rounds describe a fantasy, and that fantasy is used to guide what really happens. In the Demers paper a round consists of transmitting one bit from each queue; the actual transmission does not go that way. The actual packet transmissions are ordered according to completion times of packets in these fantasy rounds.
executing requests from that queue), and define R(t) = (server start
time) + (1 ns) * (number of rounds since server start). Let us write
NEQ(t) for that number of non-empty queues in the apiserver at time t.
For a given queue "q", let us also write "reqs(q, t)" for the number
To clarify, is reqs(q,t) the number of requests-queued + requests-executing or only requests-executing?
It is that sum: reqs(q, t) counts both the requests from q waiting in the queue and those currently executing.
Regarding the question of what is the probability of a mouse being squashed in the presence of a certain number of elephants, given a certain choice for number of queues and hand size, see https://play.golang.org/p/sWvC8nOkPPp . This shows, for
I want to lock in progress on this. I left some comments that can be addressed in followups.
At the first stage of development, an apiserver’s concurrency limit
will be derived from the existing configuration options for
max-mutating-in-flight and max-readonly-in-flight, by taking their
sum. Later we may migrate to a single direct configuration option.
In a followup: even later, we intend to compute (autotune) this value based on the resources available to apiserver.
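(In sketch form, with hypothetical names, the first-stage derivation is just the sum:)

```go
// serverConcurrencyLimit derives the first-stage concurrency limit from
// the two existing in-flight options, per the quoted text.
func serverConcurrencyLimit(maxMutatingInFlight, maxReadOnlyInFlight int) int {
	return maxMutatingInFlight + maxReadOnlyInFlight
}
```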
An apiserver is also configured with a limit on the amount of time
that a request may wait in its queue. If this time passes while a
request is still waiting for service then the request will be
rejected.
This may mean we need to revisit the scalability tests: this protection could keep us from violating latency SLOs even though we are dropping many requests.
I think that we will need to revisit the SLI and SLO themselves.
What has been on my mind for some time is that we should count toward the latency SLI only requests that were actually processed, and have a second SLI/SLO for the percentage of rejected requests.
I will try to get to it soon, but it won't happen before KubeCon for sure.
There is also the question of latency of rejected requests. It is somewhat hostile to hang onto a request for a long time and then reject it.
For implementation simplicity (see below), let us use the same initial
service time guess for every request; call that duration G. A good
choice might be the service time limit (1 minute). Different guesses
in a followup: the vast majority of our requests are much shorter than that, even 1s is probably an overestimate. I actually suggest 1s.
This probably doesn't matter much at all until/unless we start making different guesses about different requests.
Based on hand simulation, using the same guess for every request is safe and works well when the guess is never low. If the guess is low in some cases then the dispatching pattern can show more burstiness. That is why I recommend setting the guess at the maximum.
I am inclined to avoid saying more in the text until we have more shared evidence.
matchingPriority: (doesn’t really matter)
requestPriority:
  name: (name of an effectively existing RequestPriority, whether
    that is real or backstop, with catchAll==true)
in a followup: tabs and yaml don't mix :)
- apiserver_request_latencies_summary

This KEP adds the following metrics.
- apiserver_rejected_requests (count, broken down by priority, FlowSchema, when (arrival vs timeout))
Did we discuss whether including flowschema is too high-cardinality? I think it should be OK if we don't multiply it by too many other dimensions. It might be good to state the expected cardinality here in a followup; ending up with too many distinct label values is a common failure mode in metrics.
Note that what is stipulated here is the schema, not the individual flow. There is a statement elsewhere about the expected cardinality of schemas.
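For illustration, a sketch of how such a counter might be declared with prometheus/client_golang; the label names are illustrative, not final:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// rejectedRequests has cardinality (priorities x schemas x 2), which
// stays modest as long as the number of schemas is bounded as discussed.
var rejectedRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "apiserver_rejected_requests",
		Help: "Count of rejected requests, by priority, flow schema, and rejection reason.",
	},
	[]string{"priority", "flow_schema", "when"},
)

func init() {
	prometheus.MustRegister(rejectedRequests)
}
```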
- apiserver_current_executing_requests (gauge, broken down by priority, FlowSchema)
- apiserver_dispatched_requests (count, broken down by priority, FlowSchema)
- apiserver_wait_duration (histogram, broken down by priority, FlowSchema)
- apiserver_service_duration (histogram, broken down by priority, FlowSchema)
I think it would be good to give some quantification of the queues in each priority, things like:
- how long is the longest queue?
- how long is the shortest queue?
- how many queues are e.g. 90% as long as the longest queue?
Basically the goal is to help administrators figure out if an elephant is properly getting confined to a subset of queues, or if they need to revisit some flowschema distinguisher method.
So how about quantiles of queue length, at 0%, 50%, 90%, and 100%?
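Something like this sketch, where lengths would be sampled from the priority level's slice of queues (nearest-rank quantiles; names are illustrative):

```go
package main

import "sort"

// queueLengthQuantiles reports the 0%, 50%, 90%, and 100% quantiles of
// the current queue lengths for one priority level.
func queueLengthQuantiles(lengths []int) (q0, q50, q90, q100 int) {
	if len(lengths) == 0 {
		return
	}
	sorted := append([]int(nil), lengths...)
	sort.Ints(sorted)
	at := func(p float64) int { return sorted[int(p*float64(len(sorted)-1))] }
	return at(0), at(0.5), at(0.9), at(1)
}
```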
flow hashing to the same set of 6 queues as the heavy flow is about
one in 5.4 billion.

It is the queues that compete fairly.
We should probably explicitly point out the corollary, which is that if the distinguisher method doesn't match the commonalities in the elephant's traffic, then the elephant can stomp on all queues simultaneously. E.g., if the distinguisher method is by username, but the elephant is actually e.g. a sidecar which makes requests using its host pod's service account, then this evades the fairness.
I have a somewhat harebrained scheme to address this; I don't have time to clean it up right now, and I think we can live with this in the first pass anyway. Sysadmins, use correct distinguisher methods!
This KEP adds the following metrics.
- apiserver_rejected_requests (count, broken down by priority, FlowSchema, when (arrival vs timeout))
- apiserver_current_inqueue_requests (gauge, broken down by priority, FlowSchema)
Can the same flowschema map to different priorities? Do we need to include priority here?
A flow schema is associated with exactly one priority (at a time :-).
- apiserver_dispatched_requests (count, broken down by priority, FlowSchema)
- apiserver_wait_duration (histogram, broken down by priority, FlowSchema)
- apiserver_service_duration (histogram, broken down by priority, FlowSchema)
Logging: We should log the matching flowschema along with the current log entry for the request. We should probably also include it in the audit logs.
What code structure would accomplish that?
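One hypothetical structure (none of these names come from the actual apiserver code): the filter that classifies the request stashes the matched FlowSchema's name in the request context, and whatever writes the request's log entry or audit annotations reads it back:

```go
package main

import "context"

type flowSchemaKey struct{}

// WithFlowSchemaName records the matched FlowSchema's name in the
// request context, in the filter that classifies the request.
func WithFlowSchemaName(ctx context.Context, name string) context.Context {
	return context.WithValue(ctx, flowSchemaKey{}, name)
}

// FlowSchemaNameFrom retrieves it where the log entry or audit event
// is being built.
func FlowSchemaNameFrom(ctx context.Context) (string, bool) {
	name, ok := ctx.Value(flowSchemaKey{}).(string)
	return name, ok
}
```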
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: lavalamp, MikeSpreitzer
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
No sharing between priority levels.
No distinction between mutating and read-only requests.
I think this is what we agreed in the meeting today.