Silent connection breakage #1283
Comments
Current suspicion is that the executor might have been blocked on a thread that async-nats's task was sharing; will test this theory.
Hey! First of all, keep in mind that it's totally normal for the client to not know that the connection is down, especially if there is not much traffic coming through. That's why there is a ping interval. In most cases, the client will figure out that the TCP connection is down (or rather, the TCPStream will), but not always. If you want to make the client more sensitive, you can specify a lower ping interval when creating the client. When it comes to publishing: in Core NATS, the client has a publish buffer that aggregates messages to optimize throughput. If the connection is stale, the client will buffer those messages and try to send them, but until the buffer is full, it will not error. You can tweak this using the client capacity. For getting events on the connection, use events_callback.
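For concreteness, a minimal sketch of those knobs, assuming the `ConnectOptions` builder of async-nats 0.35.x; the exact option names (`ping_interval`, `client_capacity`, `event_callback`) and all values here are assumptions for illustration, not a prescription:

```rust
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = async_nats::ConnectOptions::new()
        // A lower ping interval makes a dead TCP connection noticeable sooner.
        .ping_interval(Duration::from_secs(10))
        // The "client capacity" mentioned above: size of the internal buffer
        // for outgoing messages.
        .client_capacity(256)
        // Connection events (connect, disconnect, errors) are delivered here.
        .event_callback(|event| async move {
            eprintln!("connection event: {event}");
        })
        .connect("nats://127.0.0.1:4222")
        .await?;

    client.publish("greet.joe", "hello".into()).await?;
    // Flush pushes anything still sitting in the publish buffer onto the wire.
    client.flush().await?;
    Ok(())
}
```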
I understand, but I don't believe there was a network issue that would cause the connection to drop in the first place, since other clients running on the exact same machine are fine, which means the TCP connection would not go down for no reason. I checked whether the executor was blocked and I don't believe that was the case. According to the latest investigation details, even the sending channel inside of the NATS client was clogged, so somehow the NATS client was neither sending nor receiving messages for quite a few seconds. Here is a delay on
And the client did not receive messages from the server in the middle of that:
I asked the user to run an app with
async-nats logging is not super consistent (
Looks like it doesn't send pings unless there is a lack of other activity, which makes sense, but why would it disconnect so frequently with
After reading through various tokio issues I found tokio-rs/tokio#4941 and decided to try tokio-rs/tokio#4936 to work around it, but the client still stops receiving messages the same way from time to time. What else can I do to find out why the connection breaks? I see no reason for this to happen, yet it does.
I wonder: isn't this a case of a Slow Consumer? Please check the server logs. Also: can you provide a reproduction, so I can run it on my machine?
The server logs a slow consumer (check the messages above), but the client doesn't print a slow consumer, so that side is fine. The Rust client simply stops sending and receiving messages occasionally and later wakes up all of a sudden. I'm quite confident the client is able to keep up with orders of magnitude higher load. There is no reproduction unfortunately, and it is not 100% deterministic; it happens anywhere from once or twice an hour to once every few hours. The app is quite large (though 100% open source) and extremely heavy on CPU and RAM, so I would not necessarily recommend trying to run it. The NATS server and client hardware are mentioned in the very first message for context.
The server will immediately shut down a client that is causing a Slow Consumer.
Yes, I can see that from the server logs:
This is why I'm wondering why the client would stop sending and receiving messages all of a sudden if the app is running fine and the wired network is up all the time.
Ah, you think the slow consumer is the outcome of the client not processing anything, not the other way around. I can run some workloads, but I need at least a rough scenario of what needs to happen to reproduce it.
Yes, that is my only plausible conclusion so far.
There are two services (which often happen to be on the same machine) communicating through NATS (always on a different machine). The pattern is a stream response: one app sends a request and the other streams responses to the requester's subject, while the requester sends async acks back for backpressure. The responder is the app that disconnects periodically, never the requester. A streaming response has ~1G of data chunked into ~1.8M chunks (the server is configured with a 2M message size limit). The responder sends two messages and waits for one ack before sending each next message. The responder is sometimes streaming back multiple responses to different clients, interleaved with requests to a third service (a tiny few-byte request and a single ~1M response for each request).
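A hedged sketch of the responder side of that pattern, using async-nats core `subscribe`/`publish`/`flush`; the function and subject names are made up, and only the "at most two unacknowledged chunks in flight" behavior comes from the description above:

```rust
use async_nats::Client;
use bytes::Bytes;
use futures::StreamExt;

/// Stream `chunks` to the requester's `response_subject`, keeping at most two
/// unacknowledged chunks in flight; acks arrive on `ack_subject`.
async fn stream_response(
    client: &Client,
    response_subject: String,
    ack_subject: String,
    chunks: Vec<Bytes>,
) -> Result<(), Box<dyn std::error::Error>> {
    let mut acks = client.subscribe(ack_subject).await?;
    let mut in_flight = 0usize;

    for chunk in chunks {
        // Once two chunks are unacknowledged, wait for one ack before sending
        // the next chunk (the backpressure window described above).
        if in_flight == 2 {
            acks.next().await.ok_or("ack subscription closed")?;
            in_flight -= 1;
        }
        client.publish(response_subject.clone(), chunk).await?;
        in_flight += 1;
    }

    // Make sure buffered publishes actually hit the wire.
    client.flush().await?;
    Ok(())
}
```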
This is just Core NATS, without JetStream, and I assume you're using
Just Core NATS, regular. Here is the exact code for the stream request, and the stream response handling below it (I know it is a lot, just in case it is helpful): https://github.com/subspace/subspace/blob/fabdd292ec43c492d975eab1830bd391c8ad6aa6/crates/subspace-farmer/src/cluster/nats_client.rs#L658-L693
Are you by any chance running blocking tasks on the async runtime?
Generally: yes, but it is done carefully in a very few strategic places. Since the NATS client runs in its own task created by the library, it should have been able to at least fill channels with incoming messages and print a slow consumer client-side, but it didn't do anything at all. I have triple-checked the code and have not found any violations that would cause such behavior.
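For reference, a generic sketch of the kind of careful offloading being described, using tokio's blocking pool so the async worker threads (and with them the async-nats connection task) keep running; `expensive_cpu_work` is a placeholder, not code from the app in question:

```rust
use std::time::Duration;

fn expensive_cpu_work(iterations: u64) -> u64 {
    // Placeholder for real CPU-heavy computation.
    (0..iterations).fold(0, |acc, x| acc.wrapping_add(x))
}

#[tokio::main]
async fn main() {
    // Offload to the dedicated blocking thread pool instead of blocking one of
    // the async worker threads.
    let handle = tokio::task::spawn_blocking(|| expensive_cpu_work(10_000_000));

    // The async runtime stays responsive while the blocking work runs.
    tokio::time::sleep(Duration::from_millis(10)).await;

    let result = handle.await.expect("blocking task panicked");
    println!("result: {result}");
}
```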
A client-side slow consumer is something different from a server-side slow consumer: the client-side one is triggered when the client pushes received messages into its internal buffer faster than the user is consuming them, causing the buffer to reach its capacity.
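An illustrative model of that trigger using a plain bounded tokio channel; this is not async-nats internals, just the shape of the mechanism described above:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // The connection task pushes decoded messages into a bounded
    // per-subscription buffer...
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(1024);

    match tx.try_send(b"message".to_vec()) {
        Ok(()) => {}
        // ...and when the consumer lags far enough that the buffer is full,
        // the push fails: that is the client-side slow consumer condition.
        Err(mpsc::error::TrySendError::Full(_)) => {
            eprintln!("client-side slow consumer: subscription buffer is full");
        }
        Err(mpsc::error::TrySendError::Closed(_)) => {}
    }

    // The user side drains the buffer by consuming the subscription.
    if let Some(message) = rx.recv().await {
        println!("processed {} bytes", message.len());
    }
}
```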
Yes, I found that in the code already. Right now I see no reason for the application to be unable to read a message from the socket and push it into the subscription-specific channel for 10 seconds straight.
Hey @nazar-pc, I tried to reproduce it locally and, so far, no luck.
Thanks for checking it! |
After tracking down a few very confusing bugs, I no longer believe this is a bug in nats.rs; sorry for wasting your time.
No worries. Good to hear you were able to fix the problem!
Observed behavior
Under heavy CPU load the connection sometimes breaks in a way that is not observable, and the root cause is not clear.
There are many applications connecting to the NATS server from this machine, but only one of them broke (the one that is most heavily loaded). The application sends a notification and waits for another app to acknowledge it; the acknowledgement never arrived:
In fact the outgoing message (sent with `client.publish`) silently didn't make it out at all; the other application that was supposed to receive this message worked fine all this time and only missed messages from this broken sender. Just a few seconds prior to this it was working fine and making successful requests; there was also a subscription started after some of the messages were published that may or may not be related here:
I can see from the NATS server logs that the server wasn't able to deliver messages to the client for some reason (even though other apps on the same machine were working just fine, so the network link is not an issue here):
Only much later did the client detect that it was apparently disconnected (there were other async_nats messages with the default log level before this, all the way back to the initial connection):
And things started working again for some time after that.
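One hedged way to make this failure mode visible in application code: since `publish` only hands the message to the client's write buffer, follow it with a `flush` bounded by a timeout so a stuck connection surfaces as an error. The helper name and the 5 second bound below are arbitrary:

```rust
use std::time::Duration;
use tokio::time::timeout;

async fn publish_checked(
    client: &async_nats::Client,
    subject: String,
    payload: bytes::Bytes,
) -> Result<(), Box<dyn std::error::Error>> {
    // Publish only enqueues the message into the client's internal buffer.
    client.publish(subject, payload).await?;
    // A bounded flush turns "silently going nowhere" into a visible error.
    timeout(Duration::from_secs(5), client.flush())
        .await
        .map_err(|_| "flush did not complete within 5s; connection looks stuck")??;
    Ok(())
}
```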
While the connection was broken, requests were all failing with `request timed out: deadline has elapsed`. The only connection option customized is the request timeout, which was set to 5 minutes (due to some async responders that may take a while to respond).
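For reference, a sketch of that single customization, assuming `ConnectOptions::request_timeout` takes an `Option<Duration>` in this version of async-nats; the server address and subject are placeholders:

```rust
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = async_nats::ConnectOptions::new()
        // 5 minute request timeout for slow async responders.
        .request_timeout(Some(Duration::from_secs(300)))
        .connect("nats://127.0.0.1:4222")
        .await?;

    // On a silently broken connection, this is the only error that shows up.
    match client.request("some.subject", "payload".into()).await {
        Ok(response) => println!("got {} bytes back", response.payload.len()),
        Err(error) => eprintln!("request failed: {error}"),
    }
    Ok(())
}
```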
I'll try to collect logs with `async_nats=debug` or `async_nats=trace` for better insight into what was happening there.

Expected behavior
I expect to see errors, or to have errors returned from the library, when messages are not making it out, not for things to break silently.
Server and client version
nats-server: v2.10.17
async-nats 0.35.1
Host environment
NATS server is running on a dual-socket 2680v3, 24 cores (48 threads), with a 20 Gbps network.
The client (alongside other apps) is running on a dual-socket 7742, 128 cores (256 threads), with a 100 Gbps link.
This is a user-reported setup; I do not have access to it myself and am not sure what kind of switch they're using to connect those machines together, but I don't think it matters much in this case.
Steps to reproduce
No response