Growing CPU usage on server agents #687
Comments
An update.
It's still visible that not everything is deleted from the DB: CPU and RAM usage are still growing, but much more slowly.
Thanks @andreygolev, I'm closely following this issue, and thanks for the investigation. I need to check what you are showing here. I'll update here.
I also spent some time investigating what really happens behind the scenes. I'm fairly confident that execution cleanup after MaxExecutions is reached works fine, and that it.Seek() through BadgerDB iterates over no more than MaxExecutions elements. So there are no problems with the logic in the Dkron code, but there are definitely problems with BadgerDB itself. I also saw similar issues in the BadgerDB repo: users reported degrading performance with deleted keys and growing storage. The Badger maintainers suggested running ValueLogGC more often, but that didn't help at all for me. So, after some trial and error with no luck, I decided to try BoltDB instead of BadgerDB, and it's rock solid now. Please take a look at the attached screenshots. Performance is slower than with Badger, but I'm pretty sure that's easy to solve by going with an in-memory DB. In my case I just moved the BoltDB file to tmpfs. This is how I implemented it: andreygolev@c098107
Yes and probably :)
I'm following them too, this is clearly related to dgraph-io/badger#718
I have some test branches using different engines. @andreygolev, given your special use case, may I ask you to try with this? d235916#diff-58f7996e41c5c348c30c7362170178d1L465 That would be helpful. I initially wanted to go with Badger because I liked the interface and the features. As you said, you need to use tmpfs with BoltDB, which I don't like, but BuntDB has an almost identical interface to Badger, which makes it very easy to switch between them. Thanks
Totally agree, tmpfs is not the way :) By the way, the issue was easily reproducible by running 50 jobs with an @every 3s schedule and MaxExecutions = 5.
Thanks for the test @andreygolev, yes go ahead with the PR
Fixed in #702
I encountered a similar situation. I frequently delete and insert key-value pairs, and CPU usage kept increasing, mainly because of iterators. I happened to find this Badger warning in the log: "Block cache might be too small", so I increased that parameter. It seems normal now, but I need to keep observing for a while.
Hi @xtbwqtq, and thanks for the deep investigation. This issue is about Badger, which was removed from Dkron a long time ago because of this kind of issue 🫤. As you can see in the issue that fixed this, we moved to BuntDB, which has proven much smoother and more efficient. Would it be possible for you to upgrade Dkron?
Describe the bug
Hi,
We're trying Dkron with production workloads and currently have 30 jobs. Most of them run every 1-3 seconds.
During normal operation, CPU usage on the server agents grows over time; after roughly 24-48h it exceeds 100% of CPU.
I compiled Dkron with CPU profiling enabled and ran it for some time under the production workload, and it looks like the issue is in
dkron/store.go listTxnFunc
Please take a look at the attached PDF with profiling info.
profile001.pdf
To mitigate and investigate the growing CPU usage, I tried creating the iterator in listTxnFunc with different options.
It looks like that helps a bit: CPU usage now grows much more slowly, but it still doesn't stop growing. At around 15:00 you can see the growth slow down because of the PrefetchSize changes. PrefetchSize = 5 and PrefetchValues = true have the same positive effect on CPU usage.
I suspected that CPU usage would stop increasing once the MaxExecutions value was reached, since old executions are then cleaned up according to the code. I changed the default MaxExecutions value from 100 to 5 and noticed that the number of executions shown in the Dkron WebUI really did stay at 5 or fewer, and RAM usage became much smaller. CPU usage, however, still grows the same way.
More info
It is also visible in file cache memory usage, which always grows along with CPU usage; that makes me think it's all related.
I have no ideas so far on how to fix it, though.
Has anyone encountered the same behaviour?
Expected behavior
CPU usage does not grow over time when no jobs are being added to Dkron.
Specifications
Additional context
It all runs in Kubernetes.