PB-7411: Update heartbeatTimeoutSecs,electionTimeoutMillis for the pxcentral mongodb template. #589

Draft
wants to merge 1 commit into
base: 2.7.2

sgajawada-px
Contributor

For the pxcentral mongodb template: apply a replica set (rs) reconfig that increases heartbeatTimeoutSecs and electionTimeoutMillis.

What this PR does / why we need it:
Fixes the px-backup pod crash that occurred because MongoDB went into a non-writable state.

Which issue(s) this PR fixes (optional)
Closes #PB-7411

Special notes for your reviewer:
Screenshot from 2024-06-29 20-11-06
Screenshot from 2024-06-29 19-58-51

… by increasing the heartbeatTimeoutSecs and electionTimeoutMillis
echo "This node is currently PRIMARY - will apply rs.conf settings"

usernameAndPassword="-u ${MONGODB_ROOT_USER} -p ${MONGODB_ROOT_PASSWORD}"
settingsToConfigure="${settingsToConfigure}cfg.settings.heartbeatTimeoutSecs = 60; "
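A minimal sketch of how a settings string like the one above might be assembled into a single reconfig command, assuming the template accumulates JavaScript assignments in `settingsToConfigure` and passes them to the shell with `--eval`. The function name `build_reconfig_eval`, its arguments, and the `ELECTION_TIMEOUT_MILLIS` variable are hypothetical; the PR excerpt only shows the `heartbeatTimeoutSecs = 60` assignment, so the election timeout is left as a parameter.

```shell
#!/usr/bin/env bash
# Sketch only: builds the JavaScript that would be passed to the mongo shell.
# build_reconfig_eval and its arguments are hypothetical names, not from the PR.
build_reconfig_eval() {
  local heartbeat_secs="$1"
  local election_millis="$2"
  local settingsToConfigure=""

  # Accumulate one assignment per setting, in the same style as the template.
  settingsToConfigure="${settingsToConfigure}cfg.settings.heartbeatTimeoutSecs = ${heartbeat_secs}; "
  settingsToConfigure="${settingsToConfigure}cfg.settings.electionTimeoutMillis = ${election_millis}; "

  # Read the current config, apply the assignments, then reconfigure.
  printf 'cfg = rs.conf(); %srs.reconfig(cfg);' "${settingsToConfigure}"
}

# Example invocation (assumption: run against the PRIMARY; the election
# timeout value is a placeholder supplied by the caller):
# mongosh --quiet -u "${MONGODB_ROOT_USER}" -p "${MONGODB_ROOT_PASSWORD}" \
#   --eval "$(build_reconfig_eval 60 "${ELECTION_TIMEOUT_MILLIS}")"
```

Keeping the string construction separate from the shell invocation makes it possible to log or review the exact reconfig statement before it is applied.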

@lalat-das lalat-das Jul 2, 2024

Given how we consume Mongo, we don't expect many failures of the mongo pods or intentional restarts of the pods in our px-backup namespace. Otherwise, I feel that increasing the heartbeat timeouts can have some negative side effects, as listed below.

ChatGPT produced the concerns below about negative side effects of increasing the heartbeat timeout. I know I am raising a blanket question, but please vet whether they apply to us. If we have answers to the points below, we can be more confident.

Delayed Failover:
One of the main trade-offs is that failover times will increase. If a primary node actually goes down, the remaining members will wait longer before initiating an election to choose a new primary. This can lead to longer periods of unavailability for write operations.

Slow Detection of Issues:
Real issues, such as a node genuinely going down, will take longer to be detected. This delay can impact the overall resilience and responsiveness of the cluster in dealing with actual failures.

Impact on Cluster Operations:
Operations that depend on timely heartbeat responses, such as replica set reconfigurations or maintenance tasks, might be affected. The cluster might take longer to stabilize after changes or disruptions.

Potential Data Inconsistency:
If a primary node is slow to respond and is eventually considered down after a longer timeout, there's a risk of split-brain scenarios or data inconsistency if the network partitions and nodes believe they are still part of a majority.
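One way to start vetting these concerns is to read the values the replica set is currently running with. According to the MongoDB replica set configuration documentation, under protocol version 1 `electionTimeoutMillis` has more influence on when secondaries call an election than `heartbeatTimeoutSecs`, so the protocol version is worth printing alongside both timeouts. The sketch below only builds the JavaScript to evaluate; `build_inspect_eval` is a hypothetical helper name, not part of the template.

```shell
#!/usr/bin/env bash
# Sketch only: builds JavaScript that prints the replica set's protocol
# version and the two timeout settings discussed in this review.
# build_inspect_eval is a hypothetical helper name.
build_inspect_eval() {
  printf '%s' 'var c = rs.conf(); printjson({protocolVersion: c.protocolVersion, heartbeatTimeoutSecs: c.settings.heartbeatTimeoutSecs, electionTimeoutMillis: c.settings.electionTimeoutMillis});'
}

# Example invocation (assumption: credentials come from the same variables
# the template already uses):
# mongosh --quiet -u "${MONGODB_ROOT_USER}" -p "${MONGODB_ROOT_PASSWORD}" \
#   --eval "$(build_inspect_eval)"
```

Comparing the output before and after the reconfig shows which of the two timeouts actually governs failover in this deployment.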


lalat-das commented Jul 2, 2024

Why is this PR raised against 2.7.2? Does the helm repo work this way? I mean, the master branch is meant for the latest released branch, and the release branch names are for upcoming releases.
