Put operation with lease takes linear apply time depending on number of keys already attached to lease #15993
Comments
I just realized that this performance issue was actually a reason why we saw such a huge improvement in Kubernetes when we started limiting the maximum number of events attached to a single lease (10x reduction in etcd CPU usage): kubernetes/kubernetes#98257 (comment)
Thanks for the detailed report and repro @marseel :) I quickly ran this through a profiler. Going by the result, that should be relatively simple to optimize: most of the time goes into creating a slice from a map, whose entries are then checked individually per key. See etcd/server/etcdserver/apply/apply_auth.go, lines 125 to 136 in 004195b.
This TODO also looks like a nice candidate to optimize: Lines 847 to 850 in 004195b
Mitigates etcd-io#15993 by not checking each key individually for permission when auth is entirely disabled. Signed-off-by: Thomas Jungblut <[email protected]>
Thanks for looking into this. I think it explains why I've seen different results in this repro and our production cluster. Line 873 in fbed8cb
So maybe something like, etcd/server/etcdserver/apply/apply_auth.go Line 128 in 41ff237
I spoke a little too soon about the simplicity of the fix. This is actually a much bigger story, for the following reasons.
As you can see most of the time we're spending on converting between string and byte slices here. Map keys can't be Because etcd is only Obviously, the
It's a cool feature, but it seems we have to continuously check for auth. Otherwise we could just switch out the I wrote a little mitigation that would (at least) fix it for the cases where auth is not enabled: which I assume would be most relevant for k8s anyway. Let me know if this is too pragmatic 🧷 |
ah gotcha, how are you configuring etcd in that case? |
Similar, but with auth enabled, and only the root user is writing keys. As we can see here (Line 873 in fbed8cb), the code checks permissions for the root user without depending on the key value, so we could add a short path for that case too.
Mitigates etcd-io#15993 by not checking each key individually for permission when auth is entirely disabled or admin user is calling the method. Signed-off-by: Thomas Jungblut <[email protected]>
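To make the mitigation concrete, here is a minimal sketch of the short path discussed above. It is not the code from the etcd patch: `authChecker`, `isAdmin`, `canPutKey`, and `checkLeasePuts` are hypothetical names invented for this illustration, and the per-key check is a stub that only counts how often it is called.

```go
package main

import "fmt"

// authChecker is a hypothetical stand-in for an auth store.
// It is NOT the real etcd API; it only models what the sketch needs.
type authChecker struct {
	enabled bool
	admins  map[string]bool
	checks  int // number of per-key permission checks performed
}

// isAdmin reports whether user is registered as an admin in this sketch.
func (a *authChecker) isAdmin(user string) bool { return a.admins[user] }

// canPutKey simulates a per-key permission check and counts the call.
func (a *authChecker) canPutKey(user, key string) bool {
	a.checks++
	return true
}

// checkLeasePuts reports whether user may write every key attached to a lease.
func checkLeasePuts(a *authChecker, user string, leaseKeys map[string]struct{}) bool {
	// Short path: the outcome does not depend on the keys when auth is
	// disabled or the caller is an admin, so the O(n) loop is skipped.
	if !a.enabled || a.isAdmin(user) {
		return true
	}
	// Slow path: one permission check per key attached to the lease.
	for key := range leaseKeys {
		if !a.canPutKey(user, key) {
			return false
		}
	}
	return true
}

func main() {
	keys := make(map[string]struct{})
	for i := 0; i < 50000; i++ {
		keys[fmt.Sprintf("key-%d", i)] = struct{}{}
	}

	a := &authChecker{enabled: true, admins: map[string]bool{"root": true}}
	checkLeasePuts(a, "root", keys)
	fmt.Println("per-key checks for root:", a.checks)

	checkLeasePuts(a, "alice", keys)
	fmt.Println("per-key checks for alice:", a.checks)
}
```

With the short path, the cost for an admin caller or a cluster without auth no longer depends on how many keys are attached to the lease. Non-admin users with auth enabled still pay the per-key cost.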
Luckily there's a check for that already: tjungblu@014b17d#diff-85ead21002d52058142cc354a8367cfe808d7a5d7cdc65ff315aa84edb947c48R128-R132. But yeah, let's hear the thoughts of the others. I think this would be a much bigger refactoring otherwise.
lgtm from my point of view. Thanks! |
Thanks, opened it as a PR then, let's see what the reviewers think. |
Mitigates etcd-io#15993 by not checking each key individually for permission when auth is entirely disabled or admin user is calling the method. Backport of etcd-io#16005 Signed-off-by: Thomas Jungblut <[email protected]>
Currently, each etcd client is associated with a single session, and the corresponding lease is attached to all upserted keys (if the lease parameter is set). This approach, though, suffers from performance issues, because put requests in etcd take linear apply time depending on the number of keys already attached to the lease (etcd-io/etcd#15993). This performance penalty is planned to be fixed in etcd (at least for the common case in which the user performing the request has the root role). In the meantime, let's make sure that we attach a limited number of keys to a single lease. In particular, this commit introduces the etcd lease manager, which is responsible for managing lease acquisitions and tracking the keys attached to each lease. Once the number of keys per lease exceeds the configured threshold, a new lease gets automatically acquired. The lease usage counter is decremented when a given key gets deleted. Finally, in case one of the leases fails to be renewed, the manager allows emitting a notification event for all the keys that were attached to it. Signed-off-by: Marco Iorio <[email protected]>
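The lease manager described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: `leaseManager`, `leaseFor`, `release`, and `expired` are hypothetical names, the `acquire` callback stands in for whatever call grants a new lease, and the sketch switches to a new lease as soon as the current one holds `maxKeys` keys.

```go
package main

import (
	"fmt"
	"sort"
)

// leaseID is a local type for this sketch, not the etcd client type.
type leaseID int64

// leaseManager caps the number of keys attached to any single lease.
type leaseManager struct {
	maxKeys  int
	acquire  func() leaseID // hypothetical hook that grants a new lease
	current  leaseID
	hasLease bool
	count    map[leaseID]int    // keys currently attached to each lease
	keyLease map[string]leaseID // lease each key is attached to
}

func newLeaseManager(maxKeys int, acquire func() leaseID) *leaseManager {
	return &leaseManager{
		maxKeys:  maxKeys,
		acquire:  acquire,
		count:    make(map[leaseID]int),
		keyLease: make(map[string]leaseID),
	}
}

// leaseFor returns the lease to attach to key, acquiring a new lease
// when there is none yet or the current one is full.
func (m *leaseManager) leaseFor(key string) leaseID {
	if id, ok := m.keyLease[key]; ok {
		return id // key is already tracked; reuse its lease
	}
	if !m.hasLease || m.count[m.current] >= m.maxKeys {
		m.current = m.acquire()
		m.hasLease = true
	}
	m.keyLease[key] = m.current
	m.count[m.current]++
	return m.current
}

// release decrements the usage counter when a key is deleted.
func (m *leaseManager) release(key string) {
	id, ok := m.keyLease[key]
	if !ok {
		return
	}
	delete(m.keyLease, key)
	m.count[id]--
	if m.count[id] == 0 && id != m.current {
		delete(m.count, id) // drop bookkeeping for an unused old lease
	}
}

// expired forgets a lease that failed to renew and returns, sorted,
// the keys that were attached to it so the caller can notify about them.
func (m *leaseManager) expired(id leaseID) []string {
	var keys []string
	for k, l := range m.keyLease {
		if l == id {
			keys = append(keys, k)
			delete(m.keyLease, k)
		}
	}
	delete(m.count, id)
	if id == m.current {
		m.hasLease = false
	}
	sort.Strings(keys)
	return keys
}

func main() {
	next := leaseID(0)
	m := newLeaseManager(2, func() leaseID { next++; return next })
	for _, k := range []string{"a", "b", "c"} {
		fmt.Println(k, "->", m.leaseFor(k))
	}
	fmt.Println("keys lost with lease 1:", m.expired(1))
}
```

Bounding the keys per lease bounds the per-put apply cost reported in this issue, at the price of keeping more leases alive.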
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
This was fixed via #16005
What happened?
A put request with a lease takes apply time linear in the number of keys already attached to that lease. As the number of keys grows at constant QPS, the apply time increases to the point where etcd spends almost 100% of its time applying put requests, which makes lease renewals fail.
What did you expect to happen?
I would expect the apply time of a put with a lease not to grow linearly with the number of keys already attached to the lease.
How can we reproduce it (as minimally and precisely as possible)?
Run etcd with:
Test client:
Apply duration increases linearly:
Anything else we need to know?
The above repro uses a fairly high QPS of 1k, but I've observed a similar issue with ~50k entries attached to the lease and only 20 QPS of put requests:
As you can see above, with sum(rate(etcd_server_apply_duration_seconds_sum[10s])) approaching 1.0, leases start failing to renew (the red line is lease renewal returning an unavailable error).
Etcd version (please run commands below)
I've tested both 3.5.9 (repro) and 3.5.4 (production setup with 20 QPS)
Etcd configuration (command line flags or environment variables)
Provided in reproduction
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
No response
Relevant log output
No response