Resetting WAL state errors on workqueues #4033
We have made many improvements in this area in the upcoming 2.9.16 release, which should be out next week. In the meantime, you have two general approaches to repairing the Raft layer.

---
Thanks so much for the information @derekcollison !! I hope the new release helps mitigate these issues 🤞🏼 A couple of follow-up questions:

---
Feel free to test the RC.7 candidate for 2.9.16, it is under synadia's Docker Hub for

In general, for rolling updates we suggest lame ducking a server; once it shuts down, restart it and wait for

---
Thanks @derekcollison ! I'll give the new version a try. One odd thing I noticed is that, sometimes, a consumer gets "stuck" when one of the NATS servers is restarted (usually when it's the leader that gets restarted). I mean, no messages are delivered to its subscribers even though there are millions of messages pending in the stream. Running

Do you have any explanation for this behaviour? Is there anything I can do to detect that a consumer is "stuck"?

---
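One way to approximate "stuck" detection from the outside is to poll the consumer's JSON state (e.g. `nats consumer info --json`) and apply a heuristic over the `num_pending` / `num_ack_pending` fields. This is a minimal illustrative sketch, not an official check; the thresholds and the idea of sampling it periodically are assumptions:

```python
import json

def looks_stuck(info: dict, max_ack_pending: int) -> bool:
    """Heuristic over JetStream consumer-info fields. Flags a consumer that
    has a backlog but either a saturated ack window or nothing in flight.
    Intended to be sampled repeatedly; a single True is only a hint."""
    num_pending = info.get("num_pending", 0)      # not yet delivered
    ack_pending = info.get("num_ack_pending", 0)  # delivered but unacked
    if num_pending > 0 and ack_pending >= max_ack_pending:
        return True   # ack window full: delivery is throttled until acks arrive
    if num_pending > 0 and ack_pending == 0:
        return True   # backlog but zero in flight: nothing is being delivered
    return False

# Shaped like a fragment of `nats consumer info --json` output:
sample = json.loads('{"num_pending": 2000000, "num_ack_pending": 100}')
print(looks_stuck(sample, max_ack_pending=100))  # True
```

A healthy consumer can momentarily report zero in flight between deliveries, so in practice you would alert only if the condition holds across several polls.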
Related to the behaviour I mentioned above, I noticed in the Grafana dashboard I have for monitoring JetStream (based on this dashboard) that, when this happens, the number of messages with ACK pending soars for that consumer. How can this happen if I'm setting Max Ack Pending to 100?

```
$ nats consumer info FOO-queue FOO-BAR-workers

Configuration:

                Name: FOO-BAR-workers
    Delivery Subject: _INBOX.5PwawLU8od7pwjLEwcfTPW
      Filter Subject: FOO.BAR
      Deliver Policy: All
 Deliver Queue Group: FOO-BAR-workers
          Ack Policy: Explicit
            Ack Wait: 10s
       Replay Policy: Instant
  Maximum Deliveries: 10
     Max Ack Pending: 100
        Flow Control: false

Cluster Information:

                Name: nats
              Leader: nats-0
             Replica: nats-1, current, seen 0.00s ago
             Replica: nats-2, current, seen 0.00s ago
```

---
We would have to triage your specific setup to gain more insight. You do have the ability to set max ack pending, which we highly recommend doing for systems like this.

---
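Max ack pending also bounds throughput: the server keeps at most that many messages unacked, so (by Little's law) sustained delivery cannot exceed the window size divided by the average time a worker takes to ack. A back-of-envelope sketch, with the worker latency being an assumed illustrative figure:

```python
def delivery_ceiling(max_ack_pending: int, avg_ack_seconds: float) -> float:
    """Upper bound on sustained delivery rate (msgs/sec): at most
    max_ack_pending messages are in flight, and each occupies a slot
    for roughly the time the consumer takes to ack it."""
    return max_ack_pending / avg_ack_seconds

# With the consumer above (Max Ack Pending: 100) and workers taking
# ~0.5 s per message, delivery tops out around 200 msg/s regardless
# of how many millions of messages are pending in the stream.
print(delivery_ceiling(100, 0.5))  # 200.0
```

This is why a saturated ack-pending gauge on a dashboard usually means slow or absent ackers rather than a server-side stall.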
Hi @derekcollison , we experienced the same issue on our side, but in our case we use 5 replicas for the workqueue stream, and we are on the latest version, 2.9.19. In addition, we got one message "lost" in the queue during the period when our cluster was unavailable.

---
So you had log messages about resetting WAL? For which servers? What did stream info report for the stream?

---

Meaning messages that were in the work queue were lost on a leader transfer?

---

How can I make sure of that? I can only tell that the message was successfully sent by our NATS client, and after the leader restart the message was gone from every available replica.

---

OK, and the message had not been properly delivered yet, yes?

---
Could it be that the problem is the consumer running out of deliveries because it is a push consumer and there is no one listening on the delivery subject at the time (while one of the servers is being restarted, some client applications get disconnected and need to reconnect to other servers)? And the warning messages you are seeing on the server are related to the restart, but are just warnings? I am saying this because of your consumer info:
Is it a push consumer with a queue group, with your delivery subject being an _INBOX subject? I would recommend using pull consumers rather than push consumers on those work queue streams.

---
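The pull-consumer pattern recommended above is a fetch/ack loop driven by the worker. Below is a minimal, hedged sketch of that control flow; the `fetch` callable stands in for a real client call (for example nats-py's `psub.fetch(batch)`), and `FakeMsg`/`fake_fetch` are stubs invented here so the loop can run without a server:

```python
from typing import Callable, Iterable

def drain(fetch: Callable[[int], Iterable], handle: Callable, batch: int = 10) -> int:
    """Fetch/ack loop for a pull consumer. `fetch(batch)` models a client
    fetch call; each message is expected to expose `.data` and `.ack()`.
    Acking only after the work is done gives at-least-once delivery.
    Returns the number of messages processed."""
    processed = 0
    while True:
        msgs = list(fetch(batch))
        if not msgs:          # nothing pending; a real worker would keep polling
            return processed
        for msg in msgs:
            handle(msg.data)  # do the work first
            msg.ack()         # explicit ack frees an ack-pending slot
            processed += 1

# Stub objects so the flow can be demonstrated without a server:
class FakeMsg:
    def __init__(self, data):
        self.data = data
        self.acked = False
    def ack(self):
        self.acked = True

queue = [FakeMsg(b"a"), FakeMsg(b"b"), FakeMsg(b"c")]
def fake_fetch(n):
    batch = queue[:n]
    del queue[:n]
    return batch

seen = []
print(drain(fake_fetch, seen.append, batch=2))  # 3
```

Because the worker asks for messages instead of having them pushed to an `_INBOX`, a restart of the server or a reconnecting client simply pauses fetching rather than burning delivery attempts.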
We are already using pull consumers, since we identified issues with push consumers in the past, as you mentioned:

---
We would need help being able to reproduce this, because with an R3 consumer on an R3 stream you should be able to have one server down and still operate. We would need more details about the failure scenario: for example, is it just one of the servers going down and then coming back up later, or more than one server being down? Can it access its

The resetting WAL state warning can happen for many reasons when you have servers going up and down and trying to recover their state through Raft votes, but it should not result in lost messages.

For deployment over k8s we strongly recommend using the Helm chart (https://github.com/nats-io/k8s/releases/tag/nats-1.0.0-rc.1).

---
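The "R3 survives one server down" claim above is plain Raft quorum arithmetic; a quick sketch to make the tolerance explicit for the R3 and R5 setups mentioned in this thread:

```python
def tolerated_failures(replicas: int) -> int:
    """A Raft group of R replicas needs a quorum of floor(R/2) + 1 voters,
    so it stays available while at most R - quorum servers are down."""
    quorum = replicas // 2 + 1
    return replicas - quorum

# R3 stream/consumer: loses availability only with 2+ servers down.
print(tolerated_failures(3))  # 1
# R5 (the 5-replica workqueue mentioned above): tolerates 2 servers down.
print(tolerated_failures(5))  # 2
```

So a single restarting server should never make an R3 asset unavailable; if it does, something other than quorum loss (e.g. state recovery on restart) is involved.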
We experience the same using the official Helm chart; in k8s we don't see any recent restarts or useful logs either, so I'm not sure how to debug it. I'm betting, though, that it's a memory/resource limit issue of some sort.

---
Think this was likely fixed via #5506, released as part of v2.10.17. Closing, but feel free to reopen if you can reproduce on v2.10.19.

---
Defect
NATS server logs are filled with warnings such as the one below, and I'm unable to find any documentation about the cause:
Versions of `nats-server` and affected client libraries used:

2.9.15

OS/Container environment:

Linux containers running on a K8s cluster using a 3-replica StatefulSet (pods named `nats-0`, `nats-1` & `nats-2`). Each replica has its own PVC.

NATS servers running in clustered mode with JetStream enabled. The configuration can be found below:
There are 4 different streams configured in the cluster with ~50 subjects on each stream. Streams configuration:
Steps or code to reproduce the issue:
The issue seems to start after one of the NATS servers gets restarted and, once it starts, it doesn't stop (I can see ~20K logs like this one in the last 12 hours, for instance).
Expected result:
The system should tolerate the loss of one NATS server, according to the JetStream documentation, given we're using a replication factor of 3.
Actual result:
Some streams are totally unusable when this happens (publishers can't add new messages & subscribers don't receive new messages), while other streams seem to be working as expected.