etcd is easily overloaded #195

u-kyou · 2023-09-26T09:53:57Z

Description

Kelemetry perfectly meets our needs, and I have been running it for a few days.
But one thing that confuses me is that the etcd db size keeps increasing. This puts I/O pressure on the etcd server and sometimes causes pending proposals. It could be due to a high number of k8s events (we have already moved the etcd data disk to SSDs).

If Kelemetry could support event filtering, it would be a great help, for example to filter out periodic events that I don't care about.
(Does redactPattern provide similar filtering? I tried it, but it did not work well.)

User story

No response

SOF3 · 2023-09-26T11:21:44Z

diff-controller-redact-pattern tells the diff controller not to record the contents of an object. This is primarily used to prevent exposing secret contents to Kelemetry viewers directly.
filter-exclude-types tells all components (audit, diff, k8s-event) to ignore all events related to certain object types. This is primarily used to suppress noisy objects like leases (which update several times for each node every minute due to leader lease renewal) and events (we skip audit and diff for events because k8s-event will track them anyway).
filter-exclude-user-agent tells the audit consumer to ignore events from certain user agents, such as those that are explicitly for leader election.

Regarding "db size keeps increasing", there are some TTL values that you may want to tune as well:

diff-cache-patch-ttl, diff-cache-snapshot-ttl: this is the time that a cached object diff remains in the database after its create/update/delete event so that the audit consumer can read it. This can be set to a smaller value if your audit webhook/producer delivers events to the consumer quickly enough.

Would you provide more details on the "periodic events" that are causing problems for you? Could you try checking which types of keys have the highest traffic in your cluster? This information would help us in deciding if adding support for more robust database backends (e.g. redis, tikv, etc) is necessary.
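As a rough starting point for that check, the sketch below groups a dump of etcd keys by their leading path segments and counts them. It assumes the key list was exported beforehand (for example with `etcdctl get "" --prefix --keys-only`, whose output may contain blank lines). The sample key names and the `count_keys_by_prefix` helper are hypothetical and are not Kelemetry's actual key layout. Note that a key count is only a proxy for traffic: it shows which prefixes hold the most data, not which ones are written most often.

```python
from collections import Counter


def count_keys_by_prefix(keys, depth=2, sep="/"):
    """Count keys by their first `depth` path segments, most common first."""
    counts = Counter()
    for key in keys:
        key = key.strip()
        if not key:
            continue  # skip blank lines in the exported key list
        segments = [s for s in key.split(sep) if s]
        # Normalize to a leading separator so "/a/b/c" and "a/b/c" group together.
        prefix = sep + sep.join(segments[:depth])
        counts[prefix] += 1
    return counts.most_common()


# Hypothetical key names, for illustration only.
sample_keys = [
    "/diff/patch/a",
    "/diff/patch/b",
    "",
    "/diff/snapshot/a",
    "/other/x",
]

for prefix, count in count_keys_by_prefix(sample_keys):
    print(f"{count:6d}  {prefix}")
```

To use it on a real export, read the file line by line and pass the lines in place of `sample_keys`; increasing `depth` gives a finer breakdown.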

u-kyou · 2023-09-27T08:20:51Z

Thank you very much for your reply! @SOF3

The "periodic events" I mentioned above:
We use many CRDs that produce a lot of events periodically, for example:

apisixupstreams generates too many ResourcesSynced events, which I don't want to show in Kelemetry.

I tried filter-exclude-types to exclude apisix.apache.org/apisixupstreams, and it works well. That's exactly what I need.

I also set smaller values for diff-cache-patch-ttl and diff-cache-snapshot-ttl, and the total number of etcd keys decreased significantly. After I ran a defrag on the etcd server, the db size shrank to a reasonable value. But according to our etcd monitoring, the write IOPS did not decrease and there are still a few pending proposals (maybe adding support for redis is a good idea?!).
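One possible reading of this observation, as a back-of-the-envelope sketch rather than a statement about Kelemetry's internals: if every write creates a key that expires after a fixed TTL, the number of live keys is roughly the write rate multiplied by the TTL. A shorter TTL therefore shrinks the key count, but the write rate itself is unchanged. The rate and TTL values below are hypothetical, and the model assumes each write creates a distinct key.

```python
def steady_state_keys(writes_per_second, ttl_seconds):
    """Approximate live key count when each write creates a key that lives for ttl_seconds."""
    return writes_per_second * ttl_seconds


rate = 200  # hypothetical writes per second, unaffected by the TTL setting

for ttl in (600, 60):  # hypothetical TTLs in seconds
    print(f"ttl={ttl}s  writes/s={rate}  live keys ~ {steady_state_keys(rate, ttl)}")
```

Under these assumptions, reducing write IOPS requires writing fewer objects (for example by excluding more types), not shortening the TTLs.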

u-kyou added the enhancement (New feature or request) label Sep 26, 2023

SOF3 changed the title ~~Can Kelemetry support event filtering~~ etcd is easily overloaded Sep 27, 2023

SOF3 mentioned this issue Jun 7, 2024

support for redis #306

Open
