Improve the failover of galera service #289
looks good to me
# filter out system and localhost connections, only consider clients with a port in the host field
# from that point, clients will automatically reconnect to another node
CLIENTS=$(mysql -uroot -p${DB_ROOT_PASSWORD} -nN -e "select id from information_schema.processlist where host like '%:%';")
echo -n "$CLIENTS" | tr '\n' ',' | xargs mysqladmin -uroot -p${DB_ROOT_PASSWORD} kill
can we make this graceful by killing a client only after it has finished its in-progress query, if any?
sadly no, there's no option in mysql to mark a connection for closing after it finished its processing.
I ran a couple of rollouts back and forth while polling the keystone API, and at least I was not able to hit the case where an ongoing query is interrupted. Actually, I was not able to reproduce any API outage during a galera rollout after this fix. So I'm OK to merge this. Thanks @dciabrin for fixing this.
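As an illustration of the kill step discussed in this thread, here is a minimal sketch of how the list of connection ids could be joined and the kill skipped when no client is connected. It is not the code from the PR: `join_ids`, `kill_clients` and the `KILL_CMD` variable are hypothetical names introduced only for this example, and `KILL_CMD` defaults to a dry run so the sketch can be tried without a database.

```shell
# Hypothetical helper (not part of the PR): join newline-separated
# connection ids into the comma-separated list that mysqladmin kill accepts.
join_ids() {
    # paste -s merges all input lines into one; -d, uses a comma as separator
    printf '%s\n' "$1" | paste -sd, -
}

# Hypothetical wrapper: skip the kill entirely when no client is connected.
# KILL_CMD is an assumption; it defaults to a dry run (echo) so that the
# sketch can be exercised without a running database.
kill_clients() {
    kc_ids=$(join_ids "$1")
    if [ -z "$kc_ids" ]; then
        echo "no client connection to kill"
        return 0
    fi
    # Intentionally unquoted so that the default command splits into words
    ${KILL_CMD:-echo mysqladmin kill} "$kc_ids"
}
```

In a real hook, the argument would be the `$CLIENTS` value returned by the processlist query above. The explicit empty check is a design choice: depending on the xargs implementation, an empty input may still run the command once without any id.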
When a galera node is in the process of shutting down (e.g. during a rolling restart caused by a minor update), the node is unable to serve SQL queries, yet it is still connected to clients. This confuses clients, which get an unexpected SQL status [1], and prevents them from retrying their queries, causing unexpected errors down the road.
Improve the pod stop pre-hook to fail over the active endpoint to another pod prior to shutting down the galera server, and kill connected clients to force them to reconnect to the new active endpoint. At this stage, the galera server can be safely shut down, as no client will see its WSREP state update.
Also update the failover script: 1) when no endpoint is available, ensure no traffic is going through any pod. 2) do not trigger an endpoint failover as long as the current endpoint targets a galera node that is still part of the primary partition (i.e. it is still able to serve traffic).
[1] 'WSREP has not yet prepared node for application use'
Jira: OSPRH-11488
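The two failover rules described above can be sketched as a small selection function. This is only an illustration under stated assumptions, not the script from the PR: `choose_endpoint` is a hypothetical name, and its inputs (the members of the primary partition, the current endpoint, and the pod being stopped) are assumed to have been collected beforehand.

```shell
# Hypothetical function (not part of the PR): pick the active endpoint.
#   $1: newline-separated pod names in the primary partition
#   $2: pod currently targeted by the service endpoint (may be empty)
#   $3: pod that is about to stop (may be empty)
# Prints the chosen pod name, or an empty line when no pod can serve
# traffic, which the caller would treat as "block incoming traffic".
choose_endpoint() {
    ce_members=$1
    ce_current=$2
    ce_stopping=$3
    # Drop blank lines and the pod that is about to stop
    ce_candidates=$(printf '%s\n' "$ce_members" |
        awk -v skip="$ce_stopping" 'NF && $0 != skip')
    # Rule 2: keep the current endpoint while it can still serve traffic
    if [ -n "$ce_current" ] &&
        printf '%s\n' "$ce_candidates" | grep -q -x -F -e "$ce_current"; then
        printf '%s\n' "$ce_current"
    else
        # Otherwise take the first remaining member; empty when none is
        # left (rule 1)
        printf '%s\n' "$ce_candidates" | head -n 1
    fi
}
```

For example, with members `galera-0`, `galera-1`, `galera-2` and `galera-0` both the current endpoint and the stopping pod, the function prints `galera-1`; when the only member is the stopping pod, it prints an empty line.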
Rebasing this PR now to fix the kuttl test failure:
Force-pushed from e384bd5 to 128d48d.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dciabrin, gibizer
/cherry-pick 18.0-fr1
@stuggi: once the present PR merges, I will cherry-pick it on top of
Merged commit c6bef3c into openstack-k8s-operators:main.
@stuggi: new pull request created: #290
# select the first available node in the primary partition to be the failover endpoint
NEW_ENDPOINT=$(echo "$MEMBERS" | grep -v "${PODNAME}" | head -1)
if [ -z "${NEW_ENDPOINT}" ]; then
log "No other available node to become the active endpoint."
we should probably return at that point?
unless the intention is to call service_endpoint() with an empty endpoint, meaning "block incoming traffic" (then please add a log msg for that)
NEW_SVC=$(echo "$CURRENT_SVC" | service_endpoint "$NEW_ENDPOINT")
[ $? == 0 ] || return 1

log "Configuring a new active endpoint for service ${SERVICE}: '${CURRENT_ENDPOINT}' -> '${NEW_ENDPOINT}'"
... or this would look like -> (empty)
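One way to address both remarks about the empty endpoint is to render it explicitly in the log line. The sketch below is only an illustration of that idea, not code from the PR: `describe_endpoint` and `endpoint_change_message` are hypothetical helpers, and the placeholder wording is an assumption.

```shell
# Hypothetical helper (not part of the PR): render an endpoint for log
# output, making the "no endpoint" case explicit instead of printing ''.
describe_endpoint() {
    if [ -n "$1" ]; then
        printf "'%s'\n" "$1"
    else
        printf '%s\n' "(none, incoming traffic blocked)"
    fi
}

# Hypothetical message builder, loosely mirroring the log line under review.
#   $1: current endpoint, $2: new endpoint (either may be empty)
endpoint_change_message() {
    printf 'Configuring a new active endpoint: %s -> %s\n' \
        "$(describe_endpoint "$1")" "$(describe_endpoint "$2")"
}
```

The output of `endpoint_change_message` could then be passed to the script's existing `log` function in place of the raw `'${CURRENT_ENDPOINT}' -> '${NEW_ENDPOINT}'` string.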
if echo "${STATUS}" | grep -i -q -e 'failover'; then
mysql_probe_state
if [ $? != 0 ]; then
log_error "Could not probe missing mysql information. Aborting"
"... Aborting during failover" would make this msg look different from the one logged when mysql was started
sorry for being late to the party