nats streaming follower is not restoring channels from leader #1283
Comments
@fowlerp-qlik The original panic (the last entry in the file, since the lines appear to be ordered from newest to oldest) seems to indicate that this server (a follower) received a snapshot from the leader containing a channel whose ID is lower than its own, which should NEVER happen.
Channels have an ID to handle the situation where a channel "foo" is created, deleted due to max inactivity, and then recreated later. Without a unique ID assigned at creation, a restarting server could recover its own snapshot file containing information about a channel "foo" that, at the time of the snapshot, had messages 1 to 10, for instance. But suppose this channel later expired and new messages were then added under the same channel name, say "foo" with messages 1 to 3. If the server restarted and recovered the snapshot, it would think that channel "foo" should have messages 1 to 10 (from the snapshot information), while its own streaming state has only 1 to 3. The reverse could also be true: if the local storage has messages 1 to 20, the server would think it has 10 more messages than it should and would try to delete them. This is why channels are assigned an ID when created, and why that ID is part of the snapshot.
With that being said, it looks like when "nats-streaming-2" received a snapshot from the leader about a channel, say "foo", the snapshot carried an ID of, say, 10, while its own version of that channel name had an ID of, say, 20, which would indicate that it has a newer version of this channel. A function that is supposed to look up the channel with a given ID notices that and returns [...]
Could it be that this server was stopped but its state not cleared, while the two other servers had at one point their state cleared, were restarted, and ran for a while before "nats-streaming-2" was restarted with its previous state?
The NATS cluster is part of a large, live production environment, so a server restart would be disruptive. The Raft log is file based, but our Kubernetes chart specifies a RAM disk (for speed), so if we shut all three NATS Streaming pods down and restart them there will be message loss. Currently the follower seemingly doesn't even try to restore its channels from the leader on a pod restart. Is there any way to prod it to do so?
@fowlerp-qlik I have not suggested restarting the whole cluster, just the one server that had the panic and does not seem to recover channels. Again, I was saying to clear its state (both the RAFT and datastore directories), and it should then recover the whole state from the leader. I have tried to figure out what conditions could lead to the panic and could not find any, unless (now that I know the state is ephemeral due to the RAM disk) that server (nats-streaming-2) was separated from the network but not stopped, so it maintained its state while the two others were restarted at some point with a clean state. But you would likely have noticed that.
Hi. On January 16th we performed a Kubernetes upgrade. This involved rolling the Kubernetes nodes in a given order such that the NATS nodes/pods would roll first, then the NATS Streaming pods would roll, with the two followers first and the leader last.
I see the following sequence:
Is there a way to mitigate or correct this so that nats-streaming-2 restores the channels? I worry that if there is a subsequent leader election, nats-streaming-2 may become the leader, with potentially bad results.
Log file from nats-streaming-2:
nats-streaming-not-attempting-restore.csv