
KEP: Consistent Reads from Cache #1404

Merged: 12 commits merged into kubernetes:master on Feb 20, 2020

Conversation

@jpbetz (Contributor) commented Dec 11, 2019

This idea has been around for a while and there is interest in working on it for 1.18, so I've put it in KEP form.

cc @wojtek-t @jingyih @liggitt

@k8s-ci-robot added labels cncf-cla: yes, kind/kep, sig/api-machinery, size/L on Dec 11, 2019
@jpbetz force-pushed the consistent-reads-from-cache-kep branch 2 times, most recently from 7f3d10b to 6104f8a, on December 11, 2019 01:34
@jpbetz force-pushed the consistent-reads-from-cache-kep branch from 6104f8a to 535968b on December 11, 2019 01:46
@mm4tt commented Dec 17, 2019

Interesting idea!

I took a cursory glance and a question came to mind: how would this work for paginated list calls? I couldn't find anything about that in the proposal.

@wojtek-t (Member) replied:

I took a cursory glance and a question came to mind: how would this work for paginated list calls? I couldn't find anything about that in the proposal.

It's not solved. That said, I don't think we necessarily need to address that from day one.
In my opinion, the biggest gain comes from node-originating requests (e.g. a kubelet listing the pods scheduled on its node). For those requests, the size of the response is small (it fits in a single page, assuming you don't make pages extremely small), whereas the number of objects to process is proportional to cluster size (so fairly large).

As an example, in a 5k-node cluster with 30 pods/node, listing "pods on node X" returns 30 items, but serving it from etcd means listing 150k pods, deserializing them, filtering them, and returning those 30. Serving from the cache means just filtering and returning those 30, and once kubernetes/kubernetes#85445 merges (and one more small PR on top of it), the work will be proportional to the size of the result.

So yes - this doesn't immediately solve every problem, but it may visibly help with many of them.

I can imagine an algorithm like the following (a rough sketch in Go follows the list):

  • if a limit is set, check whether selectors are also set - if not, fall back to etcd; if so, assume a high probability that the result will not be too large
  • ensure the cache is up to date
  • try to serve the list from the cache
  • if the result is too large, fall back to listing from etcd
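
For concreteness, here is a minimal Go sketch of the fallback flow in the list above. Every type, threshold, and helper in it (ListOptions, maxCachedResultSize, listFromEtcd, listFromCache, waitForCacheFresh) is invented for illustration; this is not the actual apiserver storage code.

```go
package sketch

import "context"

// Hypothetical types and helpers standing in for the real storage layer.
type ListOptions struct {
	Limit         int64
	LabelSelector string
	FieldSelector string
}

type List struct{ Items []interface{} }

const maxCachedResultSize = 10000 // assumed "too large for one page" threshold

func selectorsSet(o ListOptions) bool {
	return o.LabelSelector != "" || o.FieldSelector != ""
}

func listFromEtcd(ctx context.Context, o ListOptions) (*List, error)  { return &List{}, nil }
func listFromCache(ctx context.Context, o ListOptions) (*List, error) { return &List{}, nil }
func waitForCacheFresh(ctx context.Context) error                     { return nil }

// serveList sketches the fallback flow described in the list above.
func serveList(ctx context.Context, opts ListOptions) (*List, error) {
	// A limit with no selectors suggests a potentially huge result: fall back
	// to etcd, which already knows how to paginate.
	if opts.Limit > 0 && !selectorsSet(opts) {
		return listFromEtcd(ctx, opts)
	}
	// Ensure the watch cache is up to date, then try to serve from it.
	if err := waitForCacheFresh(ctx); err != nil {
		return listFromEtcd(ctx, opts)
	}
	result, err := listFromCache(ctx, opts)
	if err != nil {
		return listFromEtcd(ctx, opts)
	}
	// If the filtered result is still too large for a single page, fall back.
	if int64(len(result.Items)) > maxCachedResultSize {
		return listFromEtcd(ctx, opts)
	}
	return result, nil
}
```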

@jpbetz - while the above algorithm is probably not something we want to put in the KEP (do we?), the explanation above is probably worth including (parts of it should go into the motivation section, I think).

@jpbetz force-pushed the consistent-reads-from-cache-kep branch 2 times, most recently from 454a0d2 to a5df1d7, on December 31, 2019 00:46
@jpbetz force-pushed the consistent-reads-from-cache-kep branch from a5df1d7 to 7ef2bbd on December 31, 2019 00:52
@jpbetz (Contributor, Author) commented Jan 7, 2020

re #1404 (comment)

@jpbetz - while the above algorithm is probably not something we want to put in the KEP (do we?), the explanation above is probably worth including (parts of it should go into the motivation section, I think).

This is very useful. I've added it to the motivation section.

For pagination, I've added a small section about it and also listed it in the graduation criteria as something we'd need to address fully before going to beta.

@jpbetz (Contributor, Author) commented Jan 7, 2020

To clarify - I'm not worried about the fallback case - there the additional latency sounds reasonable to me (because it means the apiserver has some problems with its watch cache).
What I am worried about is requesting this progress notify explicitly.

I'm a bit worried about that too.

I've restructured the doc to propose two alternatives: 1) use WithProgressNotify to enable automatic watch updates, and 2) use WatchProgressRequest to request watch updates when needed. If we can demonstrate that alternative 1 is sufficient, I don't think we need to explicitly request progress notifies at all. Let's figure that out experimentally, trying alternative 1 first as you suggested.
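
For readers following along, here is a rough, standalone sketch of what the two alternatives look like against etcd's Go client (clientv3, import path as of etcd 3.4; WithProgressNotify is a watch option, and RequestProgress sends an explicit progress request on the watch stream). The key prefix and the handling loop are illustrative only, not what the KEP specifies.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Alternative 1: etcd periodically sends progress notifications on this
	// watch; each one tells us the revision the watch (and thus the cache)
	// is current through, even when nothing under the prefix changed.
	wch := cli.Watch(ctx, "/registry/pods/",
		clientv3.WithPrefix(), clientv3.WithProgressNotify())

	// Alternative 2: explicitly ask for a progress notification on the watch
	// stream when a consistent read needs the cache to catch up (etcd >= 3.4).
	if err := cli.RequestProgress(ctx); err != nil {
		log.Printf("progress request failed: %v", err)
	}

	for resp := range wch {
		if resp.IsProgressNotify() {
			log.Printf("watch is current through revision %d", resp.Header.Revision)
			continue
		}
		for _, ev := range resp.Events {
			log.Printf("event: %s %s", ev.Type, ev.Kv.Key)
		}
	}
}
```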

@jpbetz (Contributor, Author) commented Feb 18, 2020

@lavalamp, @deads2k Can we merge this PR as provisional? @jfbai has proposed an improvement to this KEP in kubernetes/kubernetes#88264 and I'd like to get it merged so we can more easily collaborate. All outstanding comments on this PR have been addressed.

@jpbetz force-pushed the consistent-reads-from-cache-kep branch from 71057a1 to 5234cc7 on February 18, 2020 22:25
@lavalamp (Member) commented:

I think I'm OK with provisional. @deads2k?

@jfbai commented Feb 19, 2020

@lavalamp, @deads2k Can we merge this PR as provisional? @jfbai has proposed an improvement to this KEP in kubernetes/kubernetes#88264 and I'd like to get it merged so we can more easily collaborate. All outstanding comments on this PR have been addressed.

Sure, I will update the idea using WithLastRev after this KEP is merged.

@deads2k (Contributor) commented Feb 19, 2020

I don't see a section talking about how I can opt out of this behavior and still get a "normal" quorum read. Given the difficulty we've had with nearly every watch cache feature, I think we'll need that ability for our own debugging if nothing else. When these problems struck in the past, we were able to find an affected cluster and use a mix of cached and live requests to figure out what was happening. I don't want to give up that ability entirely.

I don't think this is a case of "then everyone will use it", because the normal flow should be good enough to convince people to use it. If it's not, and we truly want to force them, it could become an ACL check, but hopefully the feature is good enough that everyone wants to use it.

@deads2k (Contributor) commented Feb 19, 2020

Add the "UNRESOLVED" markers on the sections I commented on (https://github.com/kubernetes/enhancements/pull/1545/files#diff-bb6317ae71eb96981343e559d9598d80R38) and I can go for provisional


- Resolve the "stale read" problem (https://github.com/kubernetes/kubernetes/issues/59848)
- Improve the scalability and performance of Kubernetes for Get and List requests, when the watch cache is enabled

Contributor review comment on the excerpt above:

If you aren't going to try to avoid allowing true quorum reads as we have them today, please list that as an explicit non-goal. Also, per my overall comment, I think this is a shortcoming and should be addressed. It doesn't need to be a default, but it should be possible.

Reply from @jpbetz (Contributor, Author):

I think this needs more discussion. I've added an UNRESOLVED marker in non-goals.

@jpbetz (Contributor, Author) commented Feb 19, 2020

Thanks @deads2k. Feedback is applied and <<[UNRESOLVED]>>...<<[/UNRESOLVED]>> sections have been added.

@wojtek-t (Member) commented:

I would like to make one more, deeper pass on it before we move it to implementable, and I won't have time for that in the next 2.5 weeks, but I'm definitely fine with provisional.

@jpbetz (Contributor, Author) commented Feb 19, 2020

I would like to make one more, deeper pass on it before we move it to implementable, and I won't have time for that in the next 2.5 weeks, but I'm definitely fine with provisional.

Sounds good. I've listed @wojtek-t, @lavalamp and @deads2k as the approvers for this to make sure we get your feedback.

@jpbetz force-pushed the consistent-reads-from-cache-kep branch from 9d890f9 to 4fd8777 on February 19, 2020 22:15
@jpbetz (Contributor, Author) commented Feb 20, 2020

@deads2k, @lavalamp Would one of you tag this?


Create etcd watches with `WithProgressNotify` enabled (available in all etcd 3.x versions).

When `WithProgressNotify` is enabled on an etcd watch, etcd sends progress
@xiang90 commented on the excerpt above, Feb 20, 2020:

The proposal is to make the duration configurable on a per-watch-stream basis.


When a consistent LIST request is received and the watch cache is enabled:

- Get the current revision from etcd for the resource type being served. The returned revision is strongly consistent (guaranteed to be the latest revision via a quorum read).
Review comment on the excerpt above:

What does "current revision" mean exactly? We can get the largest modified revision for the listed resources (vs current revision of the entire key space). This can reduce the block time (for less frequently changed objects, there should be just no blocking).

Alternatively, we can add a new watch request type "urgent" or something to let etcd simply deliver all pending watch events or an empty result (if there is no new changes).
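
Tying the quoted flow and this comment together, below is a rough Go sketch of a consistent LIST served from the cache: take a cheap quorum read against etcd only to learn the current (whole-key-space) revision from the response header, wait for the watch cache to observe at least that revision, then serve from the cache. The cache type and its waitUntilFresh/list methods are invented names for illustration, not the real apiserver API.

```go
package cacher

import (
	"context"

	clientv3 "go.etcd.io/etcd/clientv3"
)

// Hypothetical stand-in for the apiserver watch cache.
type watchCache struct{}

func (c *watchCache) waitUntilFresh(ctx context.Context, rev int64) error { return nil }
func (c *watchCache) list(prefix string) []interface{}                    { return nil }

func consistentList(ctx context.Context, cli *clientv3.Client, cache *watchCache, prefix string) ([]interface{}, error) {
	// A count-only range is a cheap linearizable (quorum) read; its response
	// header carries the current revision of the whole etcd key space.
	resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix(), clientv3.WithCountOnly())
	if err != nil {
		return nil, err
	}
	currentRev := resp.Header.Revision

	// Block until the cache has seen at least that revision; progress
	// notifications (discussed above) bound how long this can take even when
	// there are no writes under the prefix.
	if err := cache.waitUntilFresh(ctx, currentRev); err != nil {
		return nil, err
	}

	// The cached data is now at least as fresh as the quorum read was.
	return cache.list(prefix), nil
}
```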

### Non-Goals

<<[UNRESOLVED @deads]>>
- Avoid allowing true quorum reads. We should think carefully about this, see: https://github.com/kubernetes/enhancements/pull/1404#discussion_r381528406
@xiang90 commented on the non-goal above, Feb 20, 2020:

i feel the "right and traditional" solution for this is to enable v system leasing in the apiserver etcd pkg. here is the description on how it works: http://web.stanford.edu/class/cs240/readings/89-leases.pdf. we already implemented it here: https://github.com/etcd-io/etcd/pull/8341/files. we can port the code into the apiserver side if we want. but this is pretty heavy :P

@deads2k (Contributor) commented Feb 20, 2020

/lgtm

I think this is a good point to start iterating from. The unresolved sections will make sure we don't accidentally skip agreeing on what's in and out.

@k8s-ci-robot added the lgtm label on Feb 20, 2020
@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, jpbetz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Feb 20, 2020
@k8s-ci-robot merged commit 2a27144 into kubernetes:master on Feb 20, 2020
@k8s-ci-robot added this to the v1.18 milestone on Feb 20, 2020
### Non-Goals

<<[UNRESOLVED @deads]>>
- Avoid allowing true quorum reads. We should think carefully about this, see: https://github.com/kubernetes/enhancements/pull/1404#discussion_r381528406
Member review comment on the non-goal above:

Why would we stop permitting true quorum reads? I didn't get that at all from this doc when I read it.
