100% CPU usage after restarting one of the cluster nodes #1134
Comments
Some questions:
The fact that all 3 nodes were experiencing high CPU usage makes me believe that it may have been due to the number of clients reconnecting at the same time. If not, and if the stopped node had been down for a while and required a lot of catching up on restart, that work would be with the leader, so I would expect only 2 nodes to, possibly, be very active. |
|
CPU jumps on all nodes after restarting one of them. |
To make sure I have correctly understood the description of the issue: the issue was triggered by simply restarting one NATS Streaming server of a cluster that probably had between 12 and 15,000 clients. After the restart, all nodes were at 100% CPU for 6 to 8 hours. Is that about right? |
Correct.
Maybe more; I haven't tested.
Clients are located on the 10.0.0.0/8 network. I block all incoming packets for about 20 minutes, after which the CPU load drops. If I unblock incoming packets all at once, the load increases again. To prevent this, I am forced to unblock incoming packets gradually, one /16 every 30 seconds. |
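The staged unblocking described above could be sketched roughly as follows. This is only an illustration, not the script actually used in this deployment: `slash16Subnets`, `unblockGradually`, and the `allow` callback are hypothetical names, and the firewall command that `allow` would run is deliberately left out because the thread does not show it.

```go
package main

import (
	"fmt"
	"time"
)

// slash16Subnets lists the 256 /16 networks contained in 10.0.0.0/8.
func slash16Subnets() []string {
	subnets := make([]string, 0, 256)
	for i := 0; i < 256; i++ {
		subnets = append(subnets, fmt.Sprintf("10.%d.0.0/16", i))
	}
	return subnets
}

// unblockGradually calls allow for each subnet in order and sleeps for
// pause between calls. allow is a hypothetical hook that would remove
// the firewall rule blocking that subnet (assumption: not shown in the
// thread). It stops at the first error.
func unblockGradually(subnets []string, pause time.Duration, allow func(cidr string) error) error {
	for i, cidr := range subnets {
		if err := allow(cidr); err != nil {
			return fmt.Errorf("unblocking %s: %w", cidr, err)
		}
		if i < len(subnets)-1 {
			time.Sleep(pause)
		}
	}
	return nil
}
```

With a 30 second pause, walking all 256 subnets takes about 128 minutes, so the ramp-up is slow by design.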
12,000 - 15,000. Sorry. |
Yes, I understood that, no worries.
So do you get to a point where all client traffic has resumed and the servers behave normally (no high CPU usage on any node)? |
Correct. Right now the cluster is in this state and the unlock script is running on all nodes :) |
Oh, so you mean that you are still filtering clients? So if you were to "open the gate" there would still be the issue? How many channels do you have, and what is the normal message rate? How many subscriptions exist, and are they mainly 1 to 1 or does 1 message go to many different subscriptions? |
The problem has been observed for several months and every time it appears I have to solve it by blocking/unblocking clients.
Sometimes the load increases without restarting a node. The cluster works normally, then the load suddenly jumps. The problem is solved by blocking/unblocking clients.
~42,000 channels.
|
Looks like you may have 1 channel per client, since you mentioned about 40,000 clients and 40,000 channels. I see that there is no limit on these channels except for max_age or max_inactivity. Are there cases where those channels are removed due to inactivity, or is it mainly messages being removed after 6 days? When the event occurs (either on server restart or on its own), have you correlated it with an increase in message rate in some channels? Have you noticed some channels unexpectedly filling up with messages, maybe? |
I need the help of my teammate. He'll be able to respond in a few hours. |
We have one channel with 40,000 subscribers and 40,000 1-to-1 channels.
Messages removed after 6 days.
We need to check it out.
Noticed - where?
Both of them. |
Ouch, so 1 message will be sent to 40,000 subscribers... this could become a bottleneck if the message rate to this channel is high.
Are you using any kind of metric or the monitoring endpoints to see if some of the channels unexpectedly increase in size? In case you don't know, you can use monitoring endpoints to check on some of the streaming resources: https://docs.nats.io/nats-streaming-concepts/monitoring and https://docs.nats.io/nats-streaming-concepts/monitoring/endpoints
I have a nagging suspicion that it is either a sudden influx of streaming messages or NATS related (or some subscriptions being constantly created/destroyed). The reason I say that is that all 3 nodes are affected. That can be the case if streaming messages are added to the system, since they get replicated to all 3 nodes. That being said, it is the same for any other activity: connection create/close, subscription create/close, messages delivered to subscriptions/ack'ed: all of that gets replicated. So an increase in activity would cause each node to have to work to process the replicated information. If you have any way of measuring normal activity in the cluster, you should check whether you notice any increased activity. You could even have a simple group of NATS (not streaming) consumers on ">" that simply measure the message rate/size and report every second. It does not have to be too sophisticated, just a way to see a jump. Or, when the situation occurs (100% CPU), start a NATS consumer on ">", capture traffic for a few seconds, and try to see what the bulk of the traffic is. As you can see, I am a bit at a loss to explain what is happening... |
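The rate-measuring consumer suggested above could keep its counters along these lines. It is a minimal sketch using only the standard library; `trafficMeter`, `subjectPrefix`, and the grouping depth are hypothetical names and choices, not part of NATS. With the nats.go client, the meter would presumably be fed from a plain subscription on ">", as noted in the comment on `record`.

```go
package main

import (
	"sort"
	"strings"
	"sync"
)

// subjectPrefix keeps the first depth dot-separated tokens of a subject,
// so that traffic can be grouped by subject family.
func subjectPrefix(subject string, depth int) string {
	tokens := strings.SplitN(subject, ".", depth+1)
	if len(tokens) > depth {
		tokens = tokens[:depth]
	}
	return strings.Join(tokens, ".")
}

// prefixStat is the traffic seen for one subject prefix.
type prefixStat struct {
	Prefix string
	Msgs   int
	Bytes  int
}

// trafficMeter counts messages and payload bytes per subject prefix.
type trafficMeter struct {
	mu    sync.Mutex
	depth int
	stats map[string]*prefixStat
}

func newTrafficMeter(depth int) *trafficMeter {
	return &trafficMeter{depth: depth, stats: make(map[string]*prefixStat)}
}

// record accounts for one message. Assumed wiring with nats.go:
//   nc.Subscribe(">", func(m *nats.Msg) { meter.record(m.Subject, len(m.Data)) })
func (t *trafficMeter) record(subject string, size int) {
	p := subjectPrefix(subject, t.depth)
	t.mu.Lock()
	defer t.mu.Unlock()
	s, ok := t.stats[p]
	if !ok {
		s = &prefixStat{Prefix: p}
		t.stats[p] = s
	}
	s.Msgs++
	s.Bytes += size
}

// snapshot returns the counters sorted by message count (highest first,
// ties broken by prefix) and resets them, so calling it once per second
// yields per-second rates.
func (t *trafficMeter) snapshot() []prefixStat {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := make([]prefixStat, 0, len(t.stats))
	for _, s := range t.stats {
		out = append(out, *s)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Msgs != out[j].Msgs {
			return out[i].Msgs > out[j].Msgs
		}
		return out[i].Prefix < out[j].Prefix
	})
	t.stats = make(map[string]*prefixStat)
	return out
}
```

Printing the top few entries of `snapshot()` every second should be enough to see a jump and which subject family dominates it.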
1-2 messages per month.
We use prometheus-metrics-exporter, but we haven't set up triggers yet. Two questions:
What is it? CPU load is 2900/3200.
They appear in dozens per second. Maybe this is the case? |
The Streaming server is not a server per se (https://docs.nats.io/nats-streaming-concepts/intro and https://docs.nats.io/nats-streaming-concepts/relation-to-nats). It creates NATS connections to the NATS Server and uses subscriptions to receive messages from clients. This subscription is for "subscription requests" coming from clients. By default, NATS subscriptions have a limit on pending messages (waiting to be dispatched/processed), after which the connection does not enqueue them but drops them and notifies the async error handler.
Unless this is a new issue and/or a side effect of something else, it could indeed be the reason for the issue. Could it be that you have an application that tries to create a durable subscription and for some reason loops trying to do so? It could be that initially it got a timeout waiting for the response, but the durable was actually created. In that case, your app may want to check for the error returned and in the case of "Duplicate durable subscription" it would have to close the connection otherwise any other attempt will fail. |
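The client-side check described above might look something like this sketch. It does not use the stan.go API directly: `subscribe` and `closeConn` are hypothetical callbacks standing in for the application's own subscribe and connection-close calls, and matching the error by its message text is an assumption based on the "Duplicate durable subscription" wording quoted in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// stringError is a small constant-friendly error type.
type stringError string

func (e stringError) Error() string { return string(e) }

// errDurableExists is returned when retrying is pointless because the
// server already holds the durable for this connection.
const errDurableExists = stringError("durable already registered; connection closed")

// isDuplicateDurable reports whether err looks like the "duplicate
// durable subscription" rejection. Matching on text is an assumption.
func isDuplicateDurable(err error) bool {
	return err != nil && strings.Contains(strings.ToLower(err.Error()), "duplicate durable")
}

// subscribeDurable calls subscribe up to maxAttempts times. On a
// duplicate-durable error it closes the connection via closeConn and
// stops, instead of looping on a request that can no longer succeed.
func subscribeDurable(maxAttempts int, subscribe func() error, closeConn func()) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = subscribe(); err == nil {
			return nil
		}
		if isDuplicateDurable(err) {
			closeConn()
			return fmt.Errorf("%w: %v", errDurableExists, err)
		}
	}
	return err
}
```

After `errDurableExists`, the application would need to open a new connection before trying the durable subscription again.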
@yadvlz I was actually wondering if the issue was from a colleague of yours. If not, there have been 2 issues reported in the streaming server repo and 2 in the stan.go for the same "duplicate durable" issue in a short amount of time, so this is why I was wondering if they come from the same deployment or are actually different users. |
I am closing this since PR #1136 has been merged. Thank you for your patience! |
I deployed the new version of nats-streaming-server (with "replace_durable: true"), but my problem still exists.
|
I am re-opening this issue. |
These messages appear cyclically on the node after restart. |
If I block clients, the log looks like this:
|
It looks to me like the server, as soon as routes are created, is overwhelmed with traffic to the point that the internal connection requests to the NATS server fail! The fact that the server is embedding its own NATS server makes it even more worrisome, because this is just loopback. The default connect timeout is 2 seconds, which may explain why it fails and why you are seeing those warnings. nats-streaming-server/server/server.go Line 1581 in 814b51a and you could add:
before the |
So how can I deal with it? |
I described some of the ways we could try to debug this: capturing some of the traffic and seeing whether this traffic is "expected". From your earlier description, it sounded to me like there are times when all is well, but then suddenly an event (server restart or other) causes this issue, and the only way to recover is to basically stop client traffic for a while. If we determine that the traffic is normal, just higher volume from time to time, then maybe you need to have several clusters and separate connections/traffic. |
Tried this, didn't help.
I think the problem occurs in cases where multiple clients are trying to connect to the server (establish new connection) at the same time. |
If this follows a disconnect/reconnect, then depending on which library (and version) you use, we have some ways to add jitter to the reconnect logic. This may help to prevent too many clients from reconnecting at the same time. |
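To illustrate the idea of reconnect jitter, here is a minimal standard-library sketch. `reconnectDelay` and `newJitterSource` are hypothetical helpers, not the client library's implementation; recent nats.go versions expose a `ReconnectJitter` connection option for this purpose, so check your client's documentation before rolling your own.

```go
package main

import (
	"math/rand"
	"time"
)

// newJitterSource returns a random source for jitter. Seeding from
// something unique per client (time, PID, host) keeps clients from
// picking identical delays.
func newJitterSource(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// reconnectDelay returns base plus a random extra wait in [0, maxJitter).
// Spreading 40,000 clients over the jitter window turns one burst of
// reconnects into a trickle.
func reconnectDelay(base, maxJitter time.Duration, rng *rand.Rand) time.Duration {
	if maxJitter <= 0 {
		return base
	}
	return base + time.Duration(rng.Int63n(int64(maxJitter)))
}
```

For example, with a 2 second base and a 60 second jitter window, 40,000 clients would reconnect at an average of roughly 670 per second instead of all at once.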
Our cluster consists of three nodes:
Configuration file:
Number of clients: ~ 40 000.
The average consumption of CPU resources: 400/3200 (from "top").
If I restart one of the nodes, the utilization on all nodes jumps to 3200.
Log: