PB-7411: Update heartbeatTimeoutSecs,electionTimeoutMillis for the pxcentral mongodb template. #589

sgajawada-px · 2024-07-01T05:05:10Z

For the pxcentral mongodb template: apply a replica set (rs) reconfig that increases heartbeatTimeoutSecs and electionTimeoutMillis.

What this PR does / why we need it:
To fix the px-backup pod crash that occurs when MongoDB goes into a non-writable state.

Which issue(s) this PR fixes (optional)
Closes #PB-7411

Special notes for your reviewer:


lalat-das · 2024-07-02T05:47:21Z

charts/px-central/templates/px-backup/pxcentral-mongodb.yaml

+      echo "This node is currently PRIMARY - will apply rs.conf settings"
+
+      usernameAndPassword="-u ${MONGODB_ROOT_USER} -p ${MONGODB_ROOT_PASSWORD}"
+      settingsToConfigure="${settingsToConfigure}cfg.settings.heartbeatTimeoutSecs = 60; "
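For readers following the diff, here is a minimal sketch of how a settings string like the one above might be turned into a full reconfig script. It is not the chart's actual code: `build_reconfig_js` is a hypothetical helper, the `electionTimeoutMillis` value of 20000 is only illustrative (the excerpt shows only the heartbeat value of 60), and the final `mongosh` call is left commented out because the shell binary and auth flags depend on the image in use.

```shell
#!/usr/bin/env bash
# Sketch only: assembles the JavaScript that would be passed to the mongo shell.
# build_reconfig_js is a hypothetical helper, not a function from the chart.

build_reconfig_js() {
  local heartbeat_secs="$1"    # value for settings.heartbeatTimeoutSecs
  local election_millis="$2"   # value for settings.electionTimeoutMillis (illustrative)
  local settingsToConfigure=""

  # Append one assignment per setting, in the same style as the diff.
  settingsToConfigure="${settingsToConfigure}cfg.settings.heartbeatTimeoutSecs = ${heartbeat_secs}; "
  settingsToConfigure="${settingsToConfigure}cfg.settings.electionTimeoutMillis = ${election_millis}; "

  # Read the current config, apply the assignments, then reconfigure.
  printf 'cfg = rs.conf(); %srs.reconfig(cfg);' "${settingsToConfigure}"
}

js="$(build_reconfig_js 60 20000)"
echo "${js}"

# Assumed invocation, to be run on the PRIMARY only:
# mongosh -u "${MONGODB_ROOT_USER}" -p "${MONGODB_ROOT_PASSWORD}" --quiet --eval "${js}"
```

Printing the script before running it makes it easy to confirm in the pod logs exactly which settings were applied.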


Given the way we consume Mongo, we don't expect many failures of the mongo pods or intentional restarts of the pods in our px-backup namespace. Otherwise, I feel that increasing the heartbeat timeout could have some negative side effects, as listed below.

ChatGPT produced the concerns below about negative side effects of increasing the heartbeat timeout. I know I am raising a blanket question, but please vet whether they apply to us. If we have answers to the points below, we can be more confident.

Delayed Failover:
One of the main trade-offs is that failover times will increase. If a primary node actually goes down, the remaining members will wait longer before initiating an election to choose a new primary. This can lead to longer periods of unavailability for write operations.

Slow Detection of Issues:
Real issues, such as a node genuinely going down, will take longer to be detected. This delay can impact the overall resilience and responsiveness of the cluster in dealing with actual failures.

Impact on Cluster Operations:
Operations that depend on timely heartbeat responses, such as replica set reconfigurations or maintenance tasks, might be affected. The cluster might take longer to stabilize after changes or disruptions.

Potential Data Inconsistency:
If a primary node is slow to respond and is eventually considered down after a longer timeout, there's a risk of split-brain scenarios or data inconsistency if the network partitions and nodes believe they are still part of a majority.
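To put a rough number on the delayed-failover concern: with replica set protocol version 1, MongoDB's documentation describes electionTimeoutMillis (default 10000) as the limit for detecting an unreachable primary, and notes that it influences elections more than heartbeatTimeoutSecs (default 10). A minimal sketch of the arithmetic follows. `extra_failover_wait_ms` is a hypothetical helper and 20000 is an illustrative value, not necessarily the one in this PR.

```shell
#!/usr/bin/env bash
# Sketch only: estimates how much longer secondaries wait before calling
# an election once electionTimeoutMillis is raised above the default.
# extra_failover_wait_ms is a hypothetical helper for illustration.

extra_failover_wait_ms() {
  local new_ms="$1"          # proposed settings.electionTimeoutMillis
  local default_ms=10000     # MongoDB default for settings.electionTimeoutMillis
  echo $(( new_ms - default_ms ))
}

# Illustrative value; substitute the value the template actually sets.
extra_failover_wait_ms 20000
```

This covers only the detection window; the election itself and client reconnection add further time on top of it.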

lalat-das · 2024-07-02T05:49:46Z

Why is this PR raised against 2.7.2? Does the helm repo work this way? I mean, the master branch is meant for the latest released version, and release-named branches are for upcoming releases.

PB-7411: For pxcental mongodb template: Apply replicaset(rs) reconfig…

8f64215

… by increasing the heartbeatTimeoutSecs and electionTimeoutMillis

sgajawada-px requested review from prashanthpx, siva-portworx and ss-px July 1, 2024 05:05

lalat-das self-requested a review July 2, 2024 05:32

lalat-das reviewed Jul 2, 2024
