Nats streaming unable to restore messages on startup #1271
This error indicates that this node needed to restore messages based on the snapshot it got and the current state of its store: some messages were missing, and it then needs to get those messages from the leader. If no leader is available, it cannot get those messages back and cannot proceed. This seems to indicate that either connectivity was lost between this node and the rest of the cluster, or the other servers were also restarted and no leader election was possible because none of the restarted nodes was in a state where it could proceed.
Would setting cluster_proceed_on_restore_failure=true be wise even if it means potential loss of messages? Is there any way to avoid or mitigate this issue?
It depends. If you are in a situation where no leader can be elected, then it will allow you to start (with the understanding that some channels may not be recovered). It is a bit better than removing the whole state since some of the channels may not have had any problem. But again, this is a decision that you have to make after judging the impact. And this is likely something that you may not want to leave "on" by default, but just enable in a bad situation.
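For reference, a minimal sketch of how this option could be enabled. The flag name comes from this thread; the config-file key shown (proceed_on_restore_failure inside the cluster block) is an assumption based on how the other cluster_* flags map into the config file, so verify it against your server version:

```
# Command line (flag referenced in this thread):
#   nats-streaming-server -clustered -cluster_proceed_on_restore_failure [other flags]
#
# Equivalent config-file fragment (assumed key name; verify for your version):
streaming {
  cluster {
    # Start even if some channels cannot be fully restored from the
    # snapshot and the leader; messages on those channels may be lost.
    proceed_on_restore_failure: true
  }
}
```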
Make sure that there is a leader when you restart nodes, which also means you should perhaps start by recycling followers instead of the leader.
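A sketch of one way to check which node is currently the leader, assuming the monitoring endpoint is enabled (-m 8222); in clustered mode the /streaming/serverz endpoint reports the node's Raft role:

```sh
# Host name is a placeholder; query each node in turn.
curl -s http://nats-streaming-0:8222/streaming/serverz | grep '"role"'
# The leader reports "role": "Leader"; the other nodes report "Follower".
```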
Any update on this? Should I close?
Could you keep it open a bit longer? We are actively reviewing the logs to see how Kubernetes pod rolling may have impacted leadership and whether we can mitigate somehow.
No problem!
Sequence (diagram from the original issue not reproduced here)
So could there be a weak area where the leader is killed/restarted while the followers are restoring their channel snapshots? |
Yes, you should ensure that you recycle one node at a time and wait for it to be fully recovered/active before moving to the next. Ideally, you would start with a non-leader, though it is understood that leadership may change while a node is being restarted.
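A rough sketch of that procedure for a Kubernetes deployment like the one described in this issue (pod names and the port-forward approach are assumptions; a StatefulSet RollingUpdate already restarts one pod at a time, but it does not know which pod is the leader):

```sh
# 1. Identify the leader via the monitoring endpoint (see the serverz check above).
kubectl port-forward pod/nats-streaming-0 8222:8222 &
curl -s http://localhost:8222/streaming/serverz | grep '"role"'
kill %1  # stop the port-forward

# 2. Recycle the followers first, one at a time, waiting for each to be ready.
#    Readiness must actually reflect streaming recovery; if it does not,
#    watch the pod logs for the server's startup "ready" line instead.
kubectl delete pod nats-streaming-1   # assumed follower
kubectl wait --for=condition=Ready pod/nats-streaming-1 --timeout=300s
kubectl delete pod nats-streaming-2   # assumed follower
kubectl wait --for=condition=Ready pod/nats-streaming-2 --timeout=300s

# 3. Recycle the leader last; a new leader can then be elected from the
#    already-healthy followers while this pod restarts.
kubectl delete pod nats-streaming-0
kubectl wait --for=condition=Ready pod/nats-streaming-0 --timeout=300s
```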
We are using nats-streaming version 0.24.4 running as three pods (Kubernetes). When nats-streaming was deployed, the pods rolled in an order that does not take the nats-streaming leader into account. We have 96 channels. During startup we received about 10 occurrences of
[1] 2022/10/03 19:20:39.135630 [ERR] STREAM: channel "system-events.user-identity" - unable to restore messages (snapshot 75859347/75979326, store 75872433/75962095, cfs 75859347): nats: timeout
every three seconds, and then that nats-streaming pod would abort/exit. Kubernetes would start a new instance, and the same issue would occur again.
Our message store is on a RAM disk, so we eventually shut down all pods and restarted from scratch (losing all messages). This recovered nats-streaming. The nats pods were not rolled during the nats-streaming deployment.
In terms of order, system-events.user-identity is neither the first nor the last channel, based on the order in the nats-streaming channel-creation logs.
What would cause this problem?