Node unable to persist log, but keeps being elected #614

k-jingyang · 2024-09-10T15:48:55Z

Hello,

Recently, we hit a case where the leader node had a bad persistent store:

The leader node was unable to persist logs -> demotes itself -> starts an election -> becomes leader -> unable to persist logs -> ... (the cycle repeats)
- Demotion is caused by https://github.com/hashicorp/raft/blob/main/raft.go#L1271

This cycle caused unstable leadership during the period. For us, this cycle persisted for 10 mins until another node was finally elected leader.

Are there any recommendations or good practices for handling such cases, given that HashiCorp runs its own cloud offerings too?

Also, is there an optimisation that we could make here in the library? I understand there are some nuances to this.
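
To make the question concrete, here is a rough sketch of one application-side guard: wrap the log store and count consecutive write failures, so the application can notice that the node cannot persist and take it out of rotation. Everything here is hypothetical and for illustration only: `batchStore`, `failureTrackingStore`, `Unhealthy` and the threshold are made-up names, and `batchStore` is a stand-in interface, not the library's `LogStore`.

```go
package main

import (
	"errors"
	"sync"
)

// batchStore is a hypothetical stand-in for the write path of a log store.
// It is NOT the library's interface; it only keeps the sketch self-contained.
type batchStore interface {
	StoreBatch(entries [][]byte) error
}

// failureTrackingStore wraps a batchStore and counts consecutive failed writes.
type failureTrackingStore struct {
	inner     batchStore
	threshold int // assumed policy: this many failures in a row means "unhealthy"

	mu          sync.Mutex
	consecutive int
}

// StoreBatch forwards the write and updates the failure counter.
// A single successful write resets the counter to zero.
func (s *failureTrackingStore) StoreBatch(entries [][]byte) error {
	err := s.inner.StoreBatch(entries)

	s.mu.Lock()
	defer s.mu.Unlock()
	if err != nil {
		s.consecutive++
	} else {
		s.consecutive = 0
	}
	return err
}

// Unhealthy reports whether the store has failed at least threshold times in a row.
func (s *failureTrackingStore) Unhealthy() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.consecutive >= s.threshold
}

// brokenStore simulates a persistent store whose writes always fail.
type brokenStore struct{}

func (brokenStore) StoreBatch([][]byte) error {
	return errors.New("simulated disk write failure")
}
```

When `Unhealthy()` turns true, the application could, for example, stop the node so that it no longer campaigns for leadership. Whether something like this belongs in the application or in the library is exactly what I am unsure about.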

Based on my understanding, the current defence against this is that heartbeat timeouts are randomised.
Given a cluster: node A (leader), node B, node C:
- When node A demotes itself, the randomness gives node C a chance to time out earlier and become a candidate before node A does.
- However, if node B hasn't timed out, it will still think that node A is the leader, and will reject node C's vote request.
- In such cases, node A has a natural advantage in winning elections. This is undesirable when node A has a persistent store issue.

k-jingyang · 2024-09-14T12:47:40Z

Hmmm, I realised that this issue has more to do with the eccentricity of our log store, as only StoreLogs was failing and not other StableStore operations.
