etcd is easily overloaded #195

Open
u-kyou opened this issue Sep 26, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@u-kyou
u-kyou commented Sep 26, 2023

Description

Kelemetry meets our needs perfectly, and I have been running it for a few days.
One thing that confuses me is that the etcd db size keeps increasing. This puts I/O pressure on the etcd server and sometimes causes pending proposals. It could be due to a high number of k8s events (we have already moved the etcd data disk to SSDs).

If Kelemetry could support event filtering, it would be a great help, for example to filter out periodic events that I don't care about.
(Is redactPattern a similar filtering feature? I tried it, but it did not work well.)

User story

No response

@u-kyou u-kyou added the enhancement New feature or request label Sep 26, 2023
@SOF3
Member

SOF3 commented Sep 26, 2023

  • diff-controller-redact-pattern tells the diff controller not to record the contents of an object. This is primarily used to prevent exposing secret contents to Kelemetry viewers directly.
  • filter-exclude-types tells all components (audit, diff, k8s-event) to ignore all events related to certain object types. This is primarily used to suppress noisy objects like leases (which update several times for each node every minute due to leader lease renewal) and events (we skip audit and diff for events because k8s-event will track them anyway).
  • filter-exclude-user-agent tells the audit consumer to ignore events from certain user agents, such as those that are explicitly for leader election.
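
As a rough illustration of what a type-based exclusion such as filter-exclude-types does conceptually, here is a minimal Python sketch. It is not Kelemetry's code: the `Event` class, the `group/resource` string form, and the function names are all assumptions made for this example, and the real flag syntax may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for an audit/diff/k8s-event record."""
    group: str
    resource: str
    name: str


def type_key(event: Event) -> str:
    # Assumed "group/resource" form; the real flag syntax may differ.
    return f"{event.group}/{event.resource}"


def drop_excluded(events, excluded_types):
    """Keep only events whose object type is not in the exclusion set."""
    excluded = set(excluded_types)
    return [e for e in events if type_key(e) not in excluded]


# Made-up sample events.
events = [
    Event("coordination.k8s.io", "leases", "node-a"),
    Event("apps", "deployments", "web"),
    Event("apisix.apache.org", "apisixupstreams", "upstream-1"),
]
kept = drop_excluded(
    events,
    ["coordination.k8s.io/leases", "apisix.apache.org/apisixupstreams"],
)
print([e.name for e in kept])
```

The point of the sketch is only that excluded types are dropped before anything is recorded, so they never reach the database.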

Regarding "db size keeps increasing", there are also some TTL values that you may want to tune:

  • diff-cache-patch-ttl, diff-cache-snapshot-ttl: this is the time that a cached object diff remains in the database after its create/update/delete event so that the audit consumer can read it. This can be set to a smaller value if your audit webhook/producer delivers events to the consumer quickly enough.

Would you provide more details on the "periodic events" that are causing problems for you? Could you try checking which types of keys have the highest traffic in your cluster? This information would help us in deciding if adding support for more robust database backends (e.g. redis, tikv, etc) is necessary.
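
One way to get a first impression of which kinds of keys dominate is to group a key listing by prefix. The sketch below is an assumption-laden example, not an official tool: it assumes keys are slash-separated, and the sample keys are invented (the actual key layout used by Kelemetry is not shown in this thread). Key counts are only a rough proxy for traffic.

```python
from collections import Counter


def count_by_prefix(keys, depth=2):
    """Count keys grouped by their first `depth` slash-separated segments."""
    counts = Counter()
    for key in keys:
        key = key.strip()
        if not key:
            continue  # skip blank lines in the listing
        segments = [s for s in key.split("/") if s]
        counts["/" + "/".join(segments[:depth])] += 1
    return counts


# Made-up sample keys; replace with a real key listing.
sample = """\
/diff/patch/a

/diff/patch/b

/diff/snapshot/a

/other/x
"""
counts = count_by_prefix(sample.splitlines())
for prefix, n in counts.most_common():
    print(n, prefix)
```

The input lines could come from `etcdctl get "" --prefix --keys-only`, which lists every key without its value.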

@u-kyou
Author

u-kyou commented Sep 27, 2023

Thank you very much for your reply! @SOF3

The "periodic events" I mentioned above:
We use many CRDs that produce many events periodically, for example:

  • apisixupstreams produces too many ResourcesSynced events, which I don't want to show in Kelemetry

[Screenshot: apisixupstreams-event]

I tried filter-exclude-types to exclude apisix.apache.org/apisixupstreams, and it works well. That's exactly what I need.

I also tried setting smaller values for diff-cache-patch-ttl and diff-cache-snapshot-ttl, and the total number of etcd keys decreased significantly. After I ran a defrag on the etcd server, the db size shrank to a reasonable value. But according to our etcd monitoring, the write IOPS did not decrease and there are still a few pending proposals (maybe adding support for redis is a good idea?!).

[Screenshot: etcd-proposals_pending]
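
One possible reading of these observations, sketched under the assumption that each cached diff is written once and then expires after its TTL: by Little's law, the number of live keys is roughly the write rate times the TTL. A shorter TTL therefore shrinks the key count, while the write rate itself (and so the write IOPS) stays the same. The numbers below are invented for illustration.

```python
def steady_state_keys(writes_per_second: float, ttl_seconds: float) -> float:
    """Little's law: live keys ~= arrival rate * time each key stays alive."""
    return writes_per_second * ttl_seconds


rate = 200.0  # hypothetical writes per second, unchanged by the TTL
before = steady_state_keys(rate, ttl_seconds=600)
after = steady_state_keys(rate, ttl_seconds=60)
print(before, after)
```

Under this simplified model, only reducing the number of writes (for example by excluding noisy types) would lower write pressure.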

@SOF3 SOF3 changed the title from "Can Kelemetry support event filtering" to "etcd is easily overloaded" Sep 27, 2023