Extremely high CPU usage in workers #8642
Comments
Appears related to #8629 |
Currently assessing on our side if we can manage to reproduce (hence the status change) |
Maybe related to a missing queue in RabbitMQ. Do you remember doing any operation/cleanup/maintenance on RabbitMQ? |
I saw another similar thread in which you had suggested the potential solution below:
This resolved the issue initially but ingestion slowed right down within hours, yet resource usage remained high across all workers. Other than that, I didn't do anything else with rabbitmq. |
Do you have an RSS/TAXII/CSV feed configured? |
I have 6 CSV feeds configured. I did check rabbitmq queues after the above and there was nothing there. Should I have done something about the CSV feeds as well, as the delete option was greyed out? |
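For reference, a minimal sketch of how the RabbitMQ queue backlog can be inspected, assuming the default Docker deployment where the broker container is named `rabbitmq` (the container name is an assumption):

```bash
# List every queue with its current message backlog and consumer count.
# "rabbitmq" is the assumed container name; adjust to your deployment.
docker exec rabbitmq rabbitmqctl list_queues name messages consumers
```

Queues with a growing message count and zero consumers would suggest the workers are not draining the broker, rather than the broker being empty.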
Oddly, only 1 of 6 workers is currently connected according to the UI, yet CPU is maxed out due to worker.py (x6) |
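A quick sketch of confirming on the host which processes are responsible for the load (this assumes the workers run as `python3 worker.py`, which may differ per deployment):

```bash
# Heaviest CPU consumers on the host; worker.py should show near the
# top if the workers themselves are driving the load.
ps aux --sort=-%cpu | head -n 15

# Thread count per Python process (NLWP column); very high counts hint
# at a reconnect or thread leak rather than genuine ingestion work.
ps -o pid,nlwp,%cpu,args -C python3
```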
@richard-julien @nino-filigran @MaxwellDPS just wanted to update you: I thought 6.3.6 had resolved my issue, but it has returned. Massive CPU usage and, according to the UI, the workers are not even connected.
|
Can you check the logs of the worker that consumes a lot of CPU? |
I have lots of logs for the worker(s). Let me know if you want me to dig out any in particular:
|
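For anyone following along, a sketch of pulling those logs, assuming a Docker Compose deployment with the worker service named `worker` (the service name is an assumption):

```bash
# Most recent output from all worker replicas.
docker compose logs --tail=200 worker

# Follow live output while the CPU spike is happening, keeping only
# error lines.
docker compose logs -f worker 2>&1 | grep -i error
```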
Looks like you have a lot of problems connecting the worker to RabbitMQ, like:
It's really difficult to know what's going on here, as it looks like a connectivity issue with RabbitMQ. |
Thanks Julien. A small development: I disabled TLS for the ingestion platforms, which has meant that the workers have stayed connected and CPU usage has been acceptable. That said, ingestion did eventually become very slow. Now,
|
Can you try to reduce the number of workers to 1 and check the CPU usage of the node? |
You're probably right about clustering. I have so far only done this with the platform (1x frontend, 2x ingestion). Like I say though, CPU usage is currently at an acceptable level. I will try with 1 or 2 fewer workers. |
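A minimal sketch of scaling the workers down and watching the node, assuming a Docker Compose deployment with a service named `worker` (the service name and compose setup are assumptions):

```bash
# Run a single worker replica instead of six.
docker compose up -d --scale worker=1

# Compare the host load average against the core count (16 here).
uptime
```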
Description
Recently, we stopped all connectors/workers, then cleared them in the UI, and restarted them to re-register. I noticed I wasn't fully utilising the available resources on my system, so I created a 2nd ingestion platform (I also have a frontend platform) with 3 extra workers, totalling 6. This took me close to fully utilising the CPU (16 cores!), and bundles were being processed quickly. That was yesterday.

This morning I noticed upwards of 1 million bundles, workers failing, connectors going inactive, and a load average in the 100s-1000s (it was <16 yesterday). I was discussing the urlhaus payloads connector with someone who was seeing HUGE messages, so I disabled that connector. I am not seeing messages like that anymore, but CPU is still spiking, and I can't track down what's causing it. I understand the rules engine, and other background tasks, do not use the workers, so it shouldn't be that?

Is there a way I can better understand what's suddenly causing the workers to work SO hard? Or why ingestion is initially fast (between 15-30 bundles/sec) and then eventually grinds to a halt as CPU usage on the system spikes massively?

When checking `docker stats` for the workers, I am seeing 100s of PIDs for each (processes/threads created).

Worker log sample:
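For reference, a sketch of the `docker stats` check described above (nothing assumed beyond a Docker deployment):

```bash
# One-shot snapshot of CPU usage and PID (process/thread) count per
# container; hundreds of PIDs per worker matches the behaviour above.
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.PIDs}}"
```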
Environment