Resetting WAL state errors on workqueues #4033

Closed
juan131 opened this issue Apr 7, 2023 · 17 comments
Labels
defect (Suspected defect such as a bug or regression) · stale (This issue has had no activity in a while)

Comments

@juan131

juan131 commented Apr 7, 2023

Defect

The NATS server logs are filled with warnings such as the one below, and I'm unable to find any documentation about what causes them:

[37] [WRN] RAFT [cnrtt3eg - C-R3F-7al3bC6S] Resetting WAL state

Versions of nats-server and affected client libraries used:

  • NATS server version: 2.9.15
  • NATS client (Golang library: github.com/nats-io/nats.go v1.19.0)

OS/Container environment:

Linux containers running on a K8s cluster using a 3-replica StatefulSet (pods named nats-0, nats-1 & nats-2). Each replica has its own PVC.

NATS servers running in clustered mode with JetStream enabled. The configuration can be found below:

server_name: $NATS_SERVER_NAME
listen: 0.0.0.0:4222
http: 0.0.0.0:8222

# Logging options
debug: false
trace: false
logtime: false

# Some system overrides
max_connections: 10000
max_payload: 65536

# Clustering definition
cluster {
  name: "nats"
  listen: 0.0.0.0:6222
  routes = [
    nats://nats:6222
  ]
}

# JetStream configuration
jetstream: enabled
jetstream {
  store_dir: /data/jetstream
  max_memory_store: 3G
  max_file_store: 20G
}

There are 4 different streams configured in the cluster with ~50 subjects on each stream. Streams configuration:

             Replicas: 3
              Storage: File

Options:

            Retention: WorkQueue
     Acknowledgements: true
       Discard Policy: New
     Duplicate Window: 12h0m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: unlimited
        Maximum Bytes: unlimited
          Maximum Age: 12h0m0s
 Maximum Message Size: 1.0 KiB
    Maximum Consumers: unlimited

Steps or code to reproduce the issue:

The issue seems to start after one of the NATS servers gets restarted and, once it happens, it doesn't stop (for instance, I can see 20K log lines like this one in the last 12 hours).

Expected result:

According to the JetStream documentation, the system should tolerate the loss of one NATS server given we're using a replication factor of 3.

Actual result:

Some streams are totally unusable when this happens (publishers can't add new messages & subscribers don't receive new messages), while other streams seem to be working as expected.

@derekcollison
Member

We have made many improvements in this area in the upcoming 2.9.16 release. It should be released next week.

In the meantime, you have two general approaches to repairing the raft layer:

  1. Have the leader of the asset in question step down. You can track down the mappings from the raft-layer name to the named asset using /jsz?acc=YOUR_ACCOUNT&consumers=true&raft=true. Then issue nats consumer cluster step-down <stream> <consumer>
  2. Scale the owning stream down to 1 replica and back up to the previous replica count: nats stream update --replicas=1 <stream>. A sketch of this approach follows below.
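
For the second approach, a rough equivalent using the nats.go JetStream management API could look like the sketch below (the stream name, replica count, and server URL are illustrative assumptions, not taken from this issue):

package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://nats:4222") // illustrative URL
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Fetch the current stream config and scale it down to a single replica.
    info, err := js.StreamInfo("FOO-queue")
    if err != nil {
        log.Fatal(err)
    }
    cfg := info.Config
    cfg.Replicas = 1
    if _, err := js.UpdateStream(&cfg); err != nil {
        log.Fatal(err)
    }

    // In practice, wait for the update to settle before scaling back up
    // to the previous replica count.
    cfg.Replicas = 3
    if _, err := js.UpdateStream(&cfg); err != nil {
        log.Fatal(err)
    }
}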

@juan131
Author

juan131 commented Apr 10, 2023

Thanks so much for the information @derekcollison !! I hope the new release helps mitigate these issues 🤞🏼

A couple of follow-up questions:

  • Would it be possible to perform a rolling update on the NATS pods (setting maxUnavailable=1) with my current configuration & trust the system to continue working?
  • Is there anything I can do to improve the stability of the system?

@derekcollison
Member

Feel free to test the RC.7 release candidate for 2.9.16; it is available under Synadia's Docker Hub for nats-server.

In general, for rolling updates we suggest lame-ducking one server at a time: once it has shut down, restart it and wait for /healthz to return ok, then move on to the next server.
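
As an illustration only, here is a minimal Go sketch of the "wait for /healthz" step, assuming the http monitoring port 8222 from the config above and the hypothetical pod host names nats-0/1/2:

package main

import (
    "fmt"
    "net/http"
    "time"
)

// waitHealthy polls a server's monitoring endpoint until /healthz returns HTTP 200.
func waitHealthy(url string) error {
    for i := 0; i < 60; i++ {
        resp, err := http.Get(url)
        if err == nil {
            resp.Body.Close()
            if resp.StatusCode == http.StatusOK {
                return nil
            }
        }
        time.Sleep(2 * time.Second)
    }
    return fmt.Errorf("server at %s did not become healthy in time", url)
}

func main() {
    for _, host := range []string{"nats-0", "nats-1", "nats-2"} {
        // 1. Put this server into lame duck mode (e.g. `nats-server --signal ldm=<pid>`
        //    or by deleting the pod) and let it shut down and restart.
        // 2. Wait until its /healthz reports ok again before touching the next one.
        if err := waitHealthy(fmt.Sprintf("http://%s:8222/healthz", host)); err != nil {
            panic(err)
        }
    }
}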

@juan131
Author

juan131 commented Apr 11, 2023

Thanks @derekcollison ! I'll give the new version a try.

One odd thing I noticed is that, sometimes, a consumer gets "stuck" when one of the NATS servers is restarted (usually when the restarted server is the leader). By that I mean no messages are delivered to its subscribers even though there are millions of messages pending in the stream.

Running nats consumer cluster step-down <stream> <consumer> doesn't help. However, removing the consumer (nats consumer cluster rm <stream> <consumer>), creating it again & recreating the subscriptions does help.

Note: I use explicit-ack push consumers with a queue group of subscribers. The subscriptions are "channel-based", created using the ChanQueueSubscribe function.
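
For context, such a subscription is created roughly like the sketch below (a minimal, illustrative example assuming the nats.go JetStream API; the subject, queue group, and option values mirror the consumer info shown further down and are not the application's exact code):

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

// startWorkers creates the channel-based queue subscription; js comes from nc.JetStream().
func startWorkers(js nats.JetStreamContext) (*nats.Subscription, error) {
    ch := make(chan *nats.Msg, 1024)
    sub, err := js.ChanQueueSubscribe("FOO.BAR", "FOO-BAR-workers", ch,
        nats.Durable("FOO-BAR-workers"), // durable push consumer shared by the queue group
        nats.AckExplicit(),
        nats.AckWait(10*time.Second),
        nats.MaxDeliver(10),
        nats.MaxAckPending(100),
    )
    if err != nil {
        return nil, err
    }
    go func() {
        for msg := range ch {
            // process msg, then acknowledge it explicitly
            if err := msg.Ack(); err != nil {
                log.Printf("ack failed: %v", err)
            }
        }
    }()
    return sub, nil
}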

Do you have any explanation for this behaviour? Is there anything I can do to detect a consumer is "stuck"?

@juan131
Author

juan131 commented Apr 11, 2023

This relates to the behaviour I mentioned above: in the Grafana dashboard I use for monitoring JetStream (based on this dashboard), I noticed that, when this happens, the number of messages with ACK pending soars for that consumer; see:

[Screenshot: 2023-04-11 at 09:16:49]

How can this happen if I'm setting Max Ack Pending: 100 on the consumer?

$ nats consumer info FOO-queue FOO-BAR-workers
Configuration:

                Name: FOO-BAR-workers
    Delivery Subject: _INBOX.5PwawLU8od7pwjLEwcfTPW
      Filter Subject: FOO.BAR
      Deliver Policy: All
 Deliver Queue Group: FOO-BAR-workers
          Ack Policy: Explicit
            Ack Wait: 10s
       Replay Policy: Instant
  Maximum Deliveries: 10
     Max Ack Pending: 100
        Flow Control: false

Cluster Information:

                Name: nats
              Leader: nats-0
             Replica: nats-1, current, seen 0.00s ago
             Replica: nats-2, current, seen 0.00s ago

@derekcollison
Member

We would have to triage your specific setup to gain more insight.

You do have the ability to set max ack pending, which we highly recommend doing for production systems.
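
For example, max ack pending can be pinned down explicitly when the consumer is defined through the management API; a hedged sketch with illustrative names (assuming nats.go), mirroring the consumer shown above:

import (
    "time"

    "github.com/nats-io/nats.go"
)

// ensureConsumer (re)defines the durable push consumer with an explicit
// MaxAckPending cap; js comes from nc.JetStream(), names are illustrative.
func ensureConsumer(js nats.JetStreamContext) error {
    _, err := js.AddConsumer("FOO-queue", &nats.ConsumerConfig{
        Durable:        "FOO-BAR-workers",
        DeliverSubject: nats.NewInbox(), // push consumer delivery subject
        DeliverGroup:   "FOO-BAR-workers",
        FilterSubject:  "FOO.BAR",
        AckPolicy:      nats.AckExplicitPolicy,
        AckWait:        10 * time.Second,
        MaxDeliver:     10,
        MaxAckPending:  100, // cap on outstanding unacknowledged messages
    })
    return err
}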

@derekcollison derekcollison added the stale (This issue has had no activity in a while) label on Jun 11, 2023
@ggarri

ggarri commented Jul 10, 2023

Hi @derekcollison , we experienced the same issue on our side, but in our case we use 5 replicas for the work queue stream, and we are on the latest version, 2.9.19.

In addition, we had one message "lost" from the queue during the period when our cluster was unavailable.

@derekcollison
Member

So you had log messages about resetting WAL? For which servers? What did stream info report for the stream?

@ggarri

ggarri commented Jul 11, 2023

[image attachment]

Our work queue stream's leader was moved to a different instance right after that, but messages were lost in the transition.

@derekcollison
Member

Meaning messages that were in the work queue were lost on a leader transfer?

@ggarri

ggarri commented Jul 11, 2023

How can I make sure of that? I can only tell that the message was successfully sent by our NATS client, and after the leader restart the message was gone from every available replica.

@derekcollison
Member

OK, and the message had not been properly delivered (and acknowledged), yes?

@jnmoyne
Contributor

jnmoyne commented Jul 25, 2023

Could it be that the problem is the consumer running out of deliveries (hitting its maximum delivery count) because it is a push consumer and there is no one listening on the delivery subject at the time (while one of the servers is being restarted, some client applications get disconnected and need to reconnect to other servers)? And could the warning messages you are seeing on the server be related to the server restarting, but be just warnings?

I am saying this because of your consumer info:

                Name: FOO-BAR-workers
    Delivery Subject: _INBOX.5PwawLU8od7pwjLEwcfTPW
      Filter Subject: FOO.BAR
      Deliver Policy: All
 Deliver Queue Group: FOO-BAR-workers

Is it a push consumer with a queue group whose delivery subject is an _INBOX subject? That would not be the right thing to do.

I would recommend using pull consumers rather than push consumers on those work queue streams.
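
For reference, a minimal pull-consumer sketch along those lines (illustrative subject and durable names, assuming the nats.go API):

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

// consumeWorkQueue is a minimal pull-consumer loop; js comes from nc.JetStream().
func consumeWorkQueue(js nats.JetStreamContext) error {
    sub, err := js.PullSubscribe("FOO.BAR", "FOO-BAR-workers")
    if err != nil {
        return err
    }
    for {
        // Fetch up to 10 messages, waiting at most 5 seconds for them.
        msgs, err := sub.Fetch(10, nats.MaxWait(5*time.Second))
        if err != nil && err != nats.ErrTimeout {
            return err
        }
        for _, msg := range msgs {
            // Process the message, then ack it so it is removed from the work queue.
            if err := msg.Ack(); err != nil {
                log.Printf("ack failed: %v", err)
            }
        }
    }
}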

@ggarri

ggarri commented Jul 25, 2023

We are already using pull consumers, since we identified issues with push consumers in the past, as you mentioned:

sub, err = s.dialer.JetStream.PullSubscribe(subject, s.subCfg.Name, s.subCfg.PullSubOpts()...)
if err != nil {
    break
}

@jnmoyne
Contributor

jnmoyne commented Jul 26, 2023

We would need help being able to reproduce this, because with an R3 consumer on an R3 stream you should be able to have one server down and still operate. We would need more details about the failure scenario: for example, is it just one of the servers going down and then coming back up later, or more than one server being down? Can each server access its store_dir right away (each server must have its own independent storage directory and be able to access it)? If there is any chance that more than one pod restarts with overlapping outage windows, consider using a cluster of 5+ nodes and R5 for better fault tolerance.

The resetting WAL state warning can happen for many reasons when you have servers going up and down and trying to recover their state through RAFT votes, but it should not result in lost messages.

For deployments on k8s we strongly recommend using the Helm chart (https://github.com/nats-io/k8s/releases/tag/nats-1.0.0-rc.1).

@bruth bruth added the defect (Suspected defect such as a bug or regression) label and removed the 🐞 bug label on Aug 18, 2023
@ro0NL

ro0NL commented May 29, 2024

We experience the same using the official Helm chart; in k8s we don't see any recent restarts or useful logs either, so I'm not sure how to debug it.

I'm betting, though, that it's a memory/resource limit issue of some sort.

@wallyqs wallyqs changed the title from "Resetting WAL state errors" to "Resetting WAL state errors on workqueues" on Aug 28, 2024
@wallyqs
Member

wallyqs commented Aug 28, 2024

I think this was likely fixed via #5506, released as part of v2.10.17. Closing, but feel free to reopen if you can reproduce on v2.10.19.

@wallyqs wallyqs closed this as completed Aug 28, 2024