Cloning failure after PVC removal and pod deletion of master pod #412

Closed · dougfales opened this issue Oct 18, 2019 · 4 comments

@dougfales (Contributor) commented Oct 18, 2019

We are evaluating the operator for use in some of our apps, and during our testing we uncovered a case where the pod which replaces the original master is unable to rejoin the cluster due to a failed clone-and-init. This bug is reminiscent of #250, although I believe our specific case is slightly different.

We have developed a fix for this issue which I'd be happy to submit as a PR if you think it's an acceptable approach. I'll describe our strategy below.

Steps to Reproduce

  1. Create a cluster with three mysql instances.
  2. Delete the PVC for the master pod and then delete the master pod (a scripted version of this step is sketched after the list).
  3. Observe the cluster in Orchestrator to verify that the new mysql-0 instance is unable to rejoin the cluster and replicate from the new master. The logs from the init container show that cloning did not occur from one of the sibling replicas as it should have.
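
For convenience, here is a minimal client-go sketch of step 2. The namespace default, the cluster name my-cluster, and the PVC name data-my-cluster-mysql-0 are placeholders for whatever your test cluster actually uses, not values prescribed by the operator:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	ctx := context.Background()
	ns := "default" // placeholder namespace

	// Delete the PVC backing the master pod first, then the pod itself, so the
	// StatefulSet recreates mysql-0 with an empty data volume.
	if err := client.CoreV1().PersistentVolumeClaims(ns).Delete(ctx, "data-my-cluster-mysql-0", metav1.DeleteOptions{}); err != nil {
		log.Fatal(err)
	}
	if err := client.CoreV1().Pods(ns).Delete(ctx, "my-cluster-mysql-0", metav1.DeleteOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("deleted PVC and pod for my-cluster-mysql-0")
}
```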

Expected results: The new pod should be restored from one of the other instances during its clone-and-init phase, and then it should rejoin the cluster as a replica.

Actual results: The new pod fails to complete clone-and-init because the code which chooses a clone source does so based on server IDs alone, regardless of pod health.

We have written an e2e test case that reproduces this, which I would of course include in our PR.

Proposed Solution

While discussing this issue, my teammates @stankevich and @eik3, who are leading our evaluation of the mysql-operator, came up with the idea of using k8s services to make finding a clone source more robust. Their idea was to add two new services pointing to the SidecarServerPort: one selecting the healthy replicas and one selecting the master mysql pod.

During RunCloneCommand, we first attempt to clone from a replica by pointing cloneFromSource at the new healthy-replica service. If that fails after some number of retries, we fall back to cloning from the new master service.

This approach makes the sidecar's clone-source detection more robust because we can leverage the work the operator is already doing (labeling healthy pods) to get a healthy clone source, regardless of our serverID. We can also retry the clone operation if needed, which is a nice bonus for large cloning operations (hundreds or thousands of GB) that are more susceptible to being interrupted by a network blip or by the clone source pod dying mid-clone.
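
To make the intended flow concrete, here is a minimal sketch of the replica-first, master-fallback selection. The service names, the /health endpoint, the port number, and the retry count are placeholders for illustration, not the final API:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

const (
	healthyReplicasSvc = "my-cluster-mysql-replicas" // would select only healthy replicas
	masterSvc          = "my-cluster-mysql-master"   // would select the current master
	sidecarServerPort  = 8080                        // placeholder for SidecarServerPort
	cloneRetries       = 3
)

// chooseCloneSource prefers a healthy replica as the clone source and falls
// back to the master service only if the replica service never responds.
func chooseCloneSource() (string, error) {
	for _, svc := range []string{healthyReplicasSvc, masterSvc} {
		for attempt := 1; attempt <= cloneRetries; attempt++ {
			url := fmt.Sprintf("http://%s:%d/health", svc, sidecarServerPort)
			resp, err := http.Get(url)
			if err == nil {
				resp.Body.Close()
				if resp.StatusCode == http.StatusOK {
					return svc, nil
				}
			}
			time.Sleep(2 * time.Second) // back off before the next attempt
		}
	}
	return "", fmt.Errorf("no healthy clone source found")
}

func main() {
	src, err := chooseCloneSource()
	if err != nil {
		fmt.Println("skipping clone:", err)
		return
	}
	fmt.Println("cloning from", src)
	// The actual clone transfer would stream from src here.
}
```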

If you think this solution sounds acceptable, I would love to put together a PR next week with our changes and the test case for your review. Thanks!

@AMecea (Contributor) commented Oct 23, 2019

Hi @dougfales,

This is indeed a known issue. We fixed #250 in v0.2.x and reintroduced it as a regression in v0.3.0 (#326). We plan to fix it, but I haven't had the time to think it through and do so.

Your proposed solution sounds good! But you can use the already created services instead of creating new ones: just add the SidecarServerPort to the master service <cluster-name>-mysql-master and to the nodes service <cluster-name>-mysql (this service selects all healthy nodes, including the master).
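
To illustrate, here is a rough sketch of adding the port to an existing Service spec; 8080 and the port name "sidecar" are just placeholders for the operator's real SidecarServerPort value:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

const sidecarServerPort = 8080 // placeholder for SidecarServerPort

// addSidecarPort appends the sidecar port to a Service spec if it is missing.
func addSidecarPort(svc *corev1.Service) {
	for _, p := range svc.Spec.Ports {
		if p.Port == sidecarServerPort {
			return // already exposed
		}
	}
	svc.Spec.Ports = append(svc.Spec.Ports, corev1.ServicePort{
		Name:       "sidecar",
		Port:       sidecarServerPort,
		TargetPort: intstr.FromInt(sidecarServerPort),
	})
}

func main() {
	// Stand-in for the existing <cluster-name>-mysql-master service, which
	// already exposes the MySQL port and only needs the sidecar port added.
	master := &corev1.Service{
		Spec: corev1.ServiceSpec{
			Ports: []corev1.ServicePort{{Name: "mysql", Port: 3306}},
		},
	}
	addSidecarPort(master)
	fmt.Printf("ports: %+v\n", master.Spec.Ports)
}
```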

So, your PR is welcome!
Thank you!

@dougfales (Contributor, Author)

Thanks @AMecea, that sounds great. I will prepare a PR. And that's a good idea to add the port to the existing services; I'll incorporate that into my changes. Thanks again!

@neumantm commented Jan 30, 2022

Is this issue actually still open, or did #422 solve this?

@dougfales (Contributor, Author)

Thanks for asking @neumantm. It's been a while, but as I recall, #422 solved this issue, and I probably should have closed it when that PR was merged. I'll go ahead and close it now.
