feat(kuma-cp) Read only cache manager #634
Merged
Summary
When Envoy connects to the CP through the xDS API, the CP starts a goroutine with a reconciliation process that pulls the Dataplane and Mesh definitions as well as the list of policies (TrafficLog, TrafficPermission, etc.). It then computes which policies are relevant and builds the Envoy config to push.
This process is executed every X seconds (1s by default) by every goroutine. With, say, 1000 goroutines each executing every second and 10 requests that are common across goroutines (the lists of TrafficLogs, TrafficPermissions, etc.), that is 10,000 rps against the store; serving those requests from a shared cache can save us roughly 9000 rps.
The cache not only saves us the time spent accessing the DB, it also spares us the conversion from the native model to the Kuma model on every request.
The cache is turned on by default with an expiration time of 1s.
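As a rough illustration (not the actual PR code), here is a minimal sketch of the idea in Go: a TTL cache wrapped around a read-only store. The `Store` interface, `ResourceList` type, and `slowStore` below are hypothetical stand-ins for Kuma's real resource-store types.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// ResourceList stands in for Kuma's typed resource lists (hypothetical type).
type ResourceList []string

// Store is a minimal stand-in for the underlying resource store.
type Store interface {
	List(ctx context.Context, resourceType string) (ResourceList, error)
}

type cacheEntry struct {
	list      ResourceList
	fetchedAt time.Time
}

// cachedStore serves List results from memory for up to ttl, so N
// reconciliation goroutines issuing the same request share one upstream
// fetch per ttl window instead of N fetches per second.
type cachedStore struct {
	delegate Store
	ttl      time.Duration
	mu       sync.Mutex
	entries  map[string]cacheEntry
}

func NewCachedStore(delegate Store, ttl time.Duration) Store {
	return &cachedStore{delegate: delegate, ttl: ttl, entries: map[string]cacheEntry{}}
}

func (c *cachedStore) List(ctx context.Context, resourceType string) (ResourceList, error) {
	c.mu.Lock()
	e, ok := c.entries[resourceType]
	c.mu.Unlock()
	if ok && time.Since(e.fetchedAt) < c.ttl {
		// Cache hit: no DB round-trip and no native-to-Kuma model conversion.
		return e.list, nil
	}
	// Miss or expired entry: fetch from the underlying store and refresh.
	list, err := c.delegate.List(ctx, resourceType)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.entries[resourceType] = cacheEntry{list: list, fetchedAt: time.Now()}
	c.mu.Unlock()
	return list, nil
}

// slowStore simulates a Postgres-backed store.
type slowStore struct{}

func (slowStore) List(ctx context.Context, resourceType string) (ResourceList, error) {
	time.Sleep(10 * time.Millisecond) // pretend this is a DB query
	return ResourceList{resourceType + "-1", resourceType + "-2"}, nil
}

func main() {
	store := NewCachedStore(slowStore{}, time.Second)
	for i := 0; i < 3; i++ {
		start := time.Now()
		list, _ := store.List(context.Background(), "TrafficPermission")
		fmt.Printf("got %d resources in %v\n", len(list), time.Since(start))
	}
}
```

A production version would also deduplicate concurrent misses (e.g. with golang.org/x/sync/singleflight), so that many goroutines whose entries expire at the same moment trigger a single upstream fetch rather than a stampede.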
Performance tests
I ran performance tests on my 2019 MacBook with a 6-core i7 and 16 GB of RAM.
I applied 50 TrafficPermissions and TrafficLogs, and used the test client available here:
https://github.com/Kong/kuma/blob/master/pkg/test/xds/client/app/main.go
Before change:
- 200 dataplanes - Kuma CP ~350% CPU, Postgres ~4% CPU
- 250 dataplanes - Kuma CP ~800% CPU, Postgres ~8% CPU
- 300 dataplanes - Kuma CP ~1000% CPU, Postgres ~1% CPU

After change:
- 300 dataplanes - Kuma CP ~50% CPU, Postgres ~0.1% CPU
- 1000 dataplanes - Kuma CP ~700% CPU, Postgres ~4% CPU
- 1500 dataplanes - Kuma CP ~900% CPU, Postgres ~1% CPU
In the last case, I can only guess that the CP spent so much time generating config that it didn't hit the DB as often. To find the exact bottlenecks we need to spend more time on profiling and performance tuning, but it's already clear that the gain is around 5-10x in Kuma CP and Postgres CPU usage, depending on the case.
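For that follow-up profiling, Go's built-in pprof support is the obvious starting point. A minimal sketch of exposing it in a long-running process follows; the port and address here are illustrative, not what kuma-cp actually exposes:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// With this listener running, a 30-second CPU profile can be captured via:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```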