Resetting WAL state errors on workqueues #4033

Closed
juan131 opened this issue Apr 7, 2023 · 17 comments
Labels
defect (Suspected defect such as a bug or regression) · stale (This issue has had no activity in a while)

Comments

@juan131

juan131 commented Apr 7, 2023

Defect

The NATS server logs are filled with warnings such as the one below, and I'm unable to find any documentation about what causes them:

[37] [WRN] RAFT [cnrtt3eg - C-R3F-7al3bC6S] Resetting WAL state

Versions of nats-server and affected client libraries used:

  • NATS server version: 2.9.15
  • NATS client (Golang library: github.com/nats-io/nats.go v1.19.0)

OS/Container environment:

Linux containers running on a K8s cluster using a 3-replica StatefulSet (pods named nats-0, nats-1 & nats-2). Each replica has its own PVC.

NATS servers running in clustered mode with JetStream enabled. The configuration can be found below:

server_name: $NATS_SERVER_NAME
listen: 0.0.0.0:4222
http: 0.0.0.0:8222

# Logging options
debug: false
trace: false
logtime: false

# Some system overrides
max_connections: 10000
max_payload: 65536

# Clustering definition
cluster {
  name: "nats"
  listen: 0.0.0.0:6222
  routes = [
    nats://nats:6222
  ]
}

# JetStream configuration
jetstream: enabled
jetstream {
  store_dir: /data/jetstream
  max_memory_store: 3G
  max_file_store: 20G
}

There are 4 different streams configured in the cluster with ~50 subjects on each stream. Streams configuration:

             Replicas: 3
              Storage: File

Options:

            Retention: WorkQueue
     Acknowledgements: true
       Discard Policy: New
     Duplicate Window: 12h0m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: unlimited
        Maximum Bytes: unlimited
          Maximum Age: 12h0m0s
 Maximum Message Size: 1.0 KiB
    Maximum Consumers: unlimited

Steps or code to reproduce the issue:

The issue seems to start after one of the NATS servers gets restarted and, once it happens, it doesn't stop (for instance, I can see 20K log lines like this one in the last 12 hours).

Expected result:

According to the JetStream documentation, the system should tolerate the loss of one NATS server given we're using a replication factor of 3.

Actual result:

Some streams are totally unusable when this happens (publishers can't add new messages & subscribers don't receive new messages), while other streams seem to be working as expected.

@derekcollison
Member

We have made many improvements in this area in the upcoming 2.9.16 release. It should be released next week.

In the meantime, you have two general approaches to repairing the raft layer:

  1. Have the leader of the asset in question step down. You can track down the mappings from the raft-layer name to the named asset using /jsz?acc=YOUR_ACCOUNT&consumers=true&raft=true. Then issue nats consumer cluster step-down <stream> <consumer>
  2. Scale the owning stream down to 1 replica and back up to the previous replica count: nats stream update --replicas=1 <stream>. A sketch of this approach follows below.
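
For the second approach, a rough equivalent using the nats.go JetStream management API could look like the sketch below (the stream name, replica count, and server URL are illustrative assumptions, not taken from this issue):

package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://nats:4222") // illustrative URL
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Fetch the current stream config and scale it down to a single replica.
    info, err := js.StreamInfo("FOO-queue")
    if err != nil {
        log.Fatal(err)
    }
    cfg := info.Config
    cfg.Replicas = 1
    if _, err := js.UpdateStream(&cfg); err != nil {
        log.Fatal(err)
    }

    // In practice, wait for the update to settle before scaling back up
    // to the previous replica count.
    cfg.Replicas = 3
    if _, err := js.UpdateStream(&cfg); err != nil {
        log.Fatal(err)
    }
}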

@juan131
Author

juan131 commented Apr 10, 2023

Thanks so much for the information @derekcollison !! I hope the new release helps mitigate these issues 🤞🏼

A couple of follow-up questions:

  • Would it be possible to perform a rolling update on the NATS pods (setting maxUnavailable=1) with my current configuration & trust the system to continue working?
  • Is there anything I can do to improve the stability of the system?

@derekcollison
Member

Feel free to test the RC.7 release candidate for 2.9.16; it is available under Synadia's Docker Hub for nats-server.

In general, for rolling updates we suggest lame-ducking one server at a time: once it has shut down, restart it and wait for /healthz to return ok, then move on to the next server.
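
As an illustration only, here is a minimal Go sketch of the "wait for /healthz" step, assuming the http monitoring port 8222 from the config above and the hypothetical pod host names nats-0/1/2:

package main

import (
    "fmt"
    "net/http"
    "time"
)

// waitHealthy polls a server's monitoring endpoint until /healthz returns HTTP 200.
func waitHealthy(url string) error {
    for i := 0; i < 60; i++ {
        resp, err := http.Get(url)
        if err == nil {
            resp.Body.Close()
            if resp.StatusCode == http.StatusOK {
                return nil
            }
        }
        time.Sleep(2 * time.Second)
    }
    return fmt.Errorf("server at %s did not become healthy in time", url)
}

func main() {
    for _, host := range []string{"nats-0", "nats-1", "nats-2"} {
        // 1. Put this server into lame duck mode (e.g. `nats-server --signal ldm=<pid>`
        //    or by deleting the pod) and let it shut down and restart.
        // 2. Wait until its /healthz reports ok again before touching the next one.
        if err := waitHealthy(fmt.Sprintf("http://%s:8222/healthz", host)); err != nil {
            panic(err)
        }
    }
}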

@juan131
Author

juan131 commented Apr 11, 2023

Thanks @derekcollison ! I'll give the new version a try.

One odd thing I noticed is that, sometimes, a consumer gets "stuck" when one of the NATS servers is restarted (usually when the restarted server is the leader). By that I mean no messages are delivered to its subscribers even though there are millions of messages pending in the stream.

Running nats consumer cluster step-down <stream> <consumer> doesn't help. However, removing the consumer (nats consumer cluster rm <stream> <consumer>), creating it again & recreating the subscriptions does help.

Note: I use explicit-ack push consumers with a queue group of subscribers. The subscriptions are "channel-based", created using the ChanQueueSubscribe function.
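
For context, such a subscription is created roughly like the sketch below (a minimal, illustrative example assuming the nats.go JetStream API; the subject, queue group, and option values mirror the consumer info shown further down and are not the application's exact code):

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

// startWorkers creates the channel-based queue subscription; js comes from nc.JetStream().
func startWorkers(js nats.JetStreamContext) (*nats.Subscription, error) {
    ch := make(chan *nats.Msg, 1024)
    sub, err := js.ChanQueueSubscribe("FOO.BAR", "FOO-BAR-workers", ch,
        nats.Durable("FOO-BAR-workers"), // durable push consumer shared by the queue group
        nats.AckExplicit(),
        nats.AckWait(10*time.Second),
        nats.MaxDeliver(10),
        nats.MaxAckPending(100),
    )
    if err != nil {
        return nil, err
    }
    go func() {
        for msg := range ch {
            // process msg, then acknowledge it explicitly
            if err := msg.Ack(); err != nil {
                log.Printf("ack failed: %v", err)
            }
        }
    }()
    return sub, nil
}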

Do you have any explanation for this behaviour? Is there anything I can do to detect a consumer is "stuck"?

@juan131
Author

juan131 commented Apr 11, 2023

This relates to the behaviour I mentioned above: in the Grafana dashboard I use for monitoring JetStream (based on this dashboard), I noticed that, when this happens, the number of messages with ACK pending soars for that consumer; see:

[Screenshot: 2023-04-11 at 09:16:49]

How can this happen if I'm setting Max Ack Pending: 100 on the consumer?

$ nats consumer info FOO-queue FOO-BAR-workers
Configuration:

                Name: FOO-BAR-workers
    Delivery Subject: _INBOX.5PwawLU8od7pwjLEwcfTPW
      Filter Subject: FOO.BAR
      Deliver Policy: All
 Deliver Queue Group: FOO-BAR-workers
          Ack Policy: Explicit
            Ack Wait: 10s
       Replay Policy: Instant
  Maximum Deliveries: 10
     Max Ack Pending: 100
        Flow Control: false

Cluster Information:

                Name: nats
              Leader: nats-0
             Replica: nats-1, current, seen 0.00s ago
             Replica: nats-2, current, seen 0.00s ago

@derekcollison
Member

We would have to triage your specific setup to gain more insight.

You do have the ability to set max ack pending, which we highly recommend doing for production systems.
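
For example, max ack pending can be pinned down explicitly when the consumer is defined through the management API; a hedged sketch with illustrative names (assuming nats.go), mirroring the consumer shown above:

import (
    "time"

    "github.com/nats-io/nats.go"
)

// ensureConsumer (re)defines the durable push consumer with an explicit
// MaxAckPending cap; js comes from nc.JetStream(), names are illustrative.
func ensureConsumer(js nats.JetStreamContext) error {
    _, err := js.AddConsumer("FOO-queue", &nats.ConsumerConfig{
        Durable:        "FOO-BAR-workers",
        DeliverSubject: nats.NewInbox(), // push consumer delivery subject
        DeliverGroup:   "FOO-BAR-workers",
        FilterSubject:  "FOO.BAR",
        AckPolicy:      nats.AckExplicitPolicy,
        AckWait:        10 * time.Second,
        MaxDeliver:     10,
        MaxAckPending:  100, // cap on outstanding unacknowledged messages
    })
    return err
}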

@derekcollison derekcollison added the stale (This issue has had no activity in a while) label on Jun 11, 2023
@ggarri

ggarri commented Jul 10, 2023

Hi @derekcollison , we experienced the same issue on our side, but in our case we use 5 replicas for the work queue stream, and we are on the latest version, 2.9.19.

In addition, we had one message "lost" from the queue during the period when our cluster was unavailable.

@derekcollison
Member

So you had log messages about resetting WAL? For which servers? What did stream info report for the stream?

@ggarri

ggarri commented Jul 11, 2023

[image attachment]

Our work queue stream's leader was moved to a different instance right after that, but messages were lost in the transition.

@derekcollison
Member

Meaning messages that were in the work queue were lost on a leader transfer?

@ggarri

ggarri commented Jul 11, 2023

How can I make sure of that? I can only tell that the message was successfully sent by our NATS client, and after the leader restart the message was gone from every available replica.

@derekcollison
Member

OK, and the message had not been properly delivered (and acknowledged), yes?

@jnmoyne
Contributor

jnmoyne commented Jul 25, 2023

Could it be that the problem is the consumer running out of deliveries (hitting its maximum delivery count) because it is a push consumer and there is no one listening on the delivery subject at the time (while one of the servers is being restarted, some client applications get disconnected and need to reconnect to other servers)? And could the warning messages you are seeing on the server be related to the server restarting, but be just warnings?

I am saying this because of your consumer info:

                Name: FOO-BAR-workers
    Delivery Subject: _INBOX.5PwawLU8od7pwjLEwcfTPW
      Filter Subject: FOO.BAR
      Deliver Policy: All
 Deliver Queue Group: FOO-BAR-workers

Is it a push consumer with a queue group whose delivery subject is an _INBOX subject? That would not be the right thing to do.

I would recommend using pull consumers rather than push consumers on those work queue streams.
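
For reference, a minimal pull-consumer sketch along those lines (illustrative subject and durable names, assuming the nats.go API):

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

// consumeWorkQueue is a minimal pull-consumer loop; js comes from nc.JetStream().
func consumeWorkQueue(js nats.JetStreamContext) error {
    sub, err := js.PullSubscribe("FOO.BAR", "FOO-BAR-workers")
    if err != nil {
        return err
    }
    for {
        // Fetch up to 10 messages, waiting at most 5 seconds for them.
        msgs, err := sub.Fetch(10, nats.MaxWait(5*time.Second))
        if err != nil && err != nats.ErrTimeout {
            return err
        }
        for _, msg := range msgs {
            // Process the message, then ack it so it is removed from the work queue.
            if err := msg.Ack(); err != nil {
                log.Printf("ack failed: %v", err)
            }
        }
    }
}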

@ggarri

ggarri commented Jul 25, 2023

We are already using pull consumers, since we identified issues with push consumers in the past, as you mentioned:

sub, err = s.dialer.JetStream.PullSubscribe(subject, s.subCfg.Name, s.subCfg.PullSubOpts()...)
if err != nil {
    break
}

@jnmoyne
Contributor

jnmoyne commented Jul 26, 2023

We would need help being able to reproduce this, because with an R3 consumer on an R3 stream you should be able to have one server down and still operate. We would need more details about the failure scenario: for example, is it just one of the servers going down and then coming back up later, or more than one server being down? Can each server access its store_dir right away (each server must have its own independent storage directory and be able to access it)? If there is any chance that more than one pod restarts with overlapping outage windows, consider using a cluster of 5+ nodes and R5 for better fault tolerance.

The resetting WAL state warning can happen for many reasons when you have servers going up and down and trying to recover their state through RAFT votes, but it should not result in lost messages.

For deployments on k8s we strongly recommend using the Helm chart (https://github.com/nats-io/k8s/releases/tag/nats-1.0.0-rc.1).

@bruth bruth added the defect (Suspected defect such as a bug or regression) label and removed the 🐞 bug label on Aug 18, 2023
@ro0NL

ro0NL commented May 29, 2024

We experience the same using the official Helm chart; in k8s we don't see any recent restarts or useful logs either, so I'm not sure how to debug it.

I'm betting, though, that it's a memory/resource limit issue of some sort.

@wallyqs wallyqs changed the title from "Resetting WAL state errors" to "Resetting WAL state errors on workqueues" on Aug 28, 2024
@wallyqs
Member

wallyqs commented Aug 28, 2024

I think this was likely fixed via #5506, released as part of v2.10.17. Closing, but feel free to reopen if you can reproduce on v2.10.19.

@wallyqs wallyqs closed this as completed Aug 28, 2024