Cloning failure after PVC removal and pod deletion of master pod #412

Closed · dougfales opened this issue Oct 18, 2019 · 4 comments

@dougfales (Contributor) commented Oct 18, 2019

We are evaluating the operator for use in some of our apps, and during our testing we uncovered a case where the pod which replaces the original master is unable to rejoin the cluster due to a failed clone-and-init. This bug is reminiscent of #250, although I believe our specific case is slightly different.

We have developed a fix for this issue which I'd be happy to submit as a PR if you think it's an acceptable approach. I'll describe our strategy below.

Steps to Reproduce

  1. Create a cluster with three mysql instances.
  2. Delete the PVC for the master pod and then delete the master pod (a scripted version of this step is sketched after the list).
  3. Observe the cluster in Orchestrator to verify that the new mysql-0 instance is unable to rejoin the cluster and replicate from the new master. The logs from the init container show that cloning did not occur from one of the sibling replicas as it should have.
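
For convenience, here is a minimal client-go sketch of step 2. The namespace default, the cluster name my-cluster, and the PVC name data-my-cluster-mysql-0 are placeholders for whatever your test cluster actually uses, not values prescribed by the operator:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	ctx := context.Background()
	ns := "default" // placeholder namespace

	// Delete the PVC backing the master pod first, then the pod itself, so the
	// StatefulSet recreates mysql-0 with an empty data volume.
	if err := client.CoreV1().PersistentVolumeClaims(ns).Delete(ctx, "data-my-cluster-mysql-0", metav1.DeleteOptions{}); err != nil {
		log.Fatal(err)
	}
	if err := client.CoreV1().Pods(ns).Delete(ctx, "my-cluster-mysql-0", metav1.DeleteOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("deleted PVC and pod for my-cluster-mysql-0")
}
```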

Expected results: The new pod should be restored from one of the other instances during its clone-and-init phase, and then it should rejoin the cluster as a replica.

Actual results: The new pod fails to complete clone-and-init because the code which chooses a clone source does so based on server IDs alone, regardless of pod health.

We have written an e2e test case that reproduces this, which I would of course include in our PR.

Proposed Solution

While discussing this issue, my teammates @stankevich and @eik3, who are leading our evaluation of the mysql-operator, came up with the idea of using k8s services to make finding a clone source more robust. Their idea was to add two new services pointing to the SidecarServerPort: one selecting the healthy replicas and one selecting the master mysql pod.

During RunCloneCommand, we first attempt to clone from a replica by pointing cloneFromSource at the new healthy-replica service. If that fails after some number of retries, we fall back to cloning from the new master service.

This approach makes the sidecar's clone-source detection more robust because we can leverage the work the operator is already doing (labeling healthy pods) to get a healthy clone source, regardless of our serverID. We can also retry the clone operation if needed, which is a nice bonus for large cloning operations (hundreds or thousands of GB) that are more susceptible to being interrupted by a network blip or by the clone source pod dying mid-clone.
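
To make the intended flow concrete, here is a minimal sketch of the replica-first, master-fallback selection. The service names, the /health endpoint, the port number, and the retry count are placeholders for illustration, not the final API:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

const (
	healthyReplicasSvc = "my-cluster-mysql-replicas" // would select only healthy replicas
	masterSvc          = "my-cluster-mysql-master"   // would select the current master
	sidecarServerPort  = 8080                        // placeholder for SidecarServerPort
	cloneRetries       = 3
)

// chooseCloneSource prefers a healthy replica as the clone source and falls
// back to the master service only if the replica service never responds.
func chooseCloneSource() (string, error) {
	for _, svc := range []string{healthyReplicasSvc, masterSvc} {
		for attempt := 1; attempt <= cloneRetries; attempt++ {
			url := fmt.Sprintf("http://%s:%d/health", svc, sidecarServerPort)
			resp, err := http.Get(url)
			if err == nil {
				resp.Body.Close()
				if resp.StatusCode == http.StatusOK {
					return svc, nil
				}
			}
			time.Sleep(2 * time.Second) // back off before the next attempt
		}
	}
	return "", fmt.Errorf("no healthy clone source found")
}

func main() {
	src, err := chooseCloneSource()
	if err != nil {
		fmt.Println("skipping clone:", err)
		return
	}
	fmt.Println("cloning from", src)
	// The actual clone transfer would stream from src here.
}
```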

If you think this solution sounds acceptable, I would love to put together a PR next week with our changes and the test case for your review. Thanks!

@AMecea (Contributor) commented Oct 23, 2019

Hi @dougfales,

This is indeed a known issue. We fixed #250 in v0.2.x and reintroduced it as a regression in v0.3.0 (#326). We plan to fix it, but I haven't had the time to think it through and do so.

Your proposed solution sounds good! But you can use the already created services instead of creating new ones: just add the SidecarServerPort to the master service <cluster-name>-mysql-master and to the nodes service <cluster-name>-mysql (this service selects all healthy nodes, including the master).
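
To illustrate, here is a rough sketch of adding the port to an existing Service spec; 8080 and the port name "sidecar" are just placeholders for the operator's real SidecarServerPort value:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

const sidecarServerPort = 8080 // placeholder for SidecarServerPort

// addSidecarPort appends the sidecar port to a Service spec if it is missing.
func addSidecarPort(svc *corev1.Service) {
	for _, p := range svc.Spec.Ports {
		if p.Port == sidecarServerPort {
			return // already exposed
		}
	}
	svc.Spec.Ports = append(svc.Spec.Ports, corev1.ServicePort{
		Name:       "sidecar",
		Port:       sidecarServerPort,
		TargetPort: intstr.FromInt(sidecarServerPort),
	})
}

func main() {
	// Stand-in for the existing <cluster-name>-mysql-master service, which
	// already exposes the MySQL port and only needs the sidecar port added.
	master := &corev1.Service{
		Spec: corev1.ServiceSpec{
			Ports: []corev1.ServicePort{{Name: "mysql", Port: 3306}},
		},
	}
	addSidecarPort(master)
	fmt.Printf("ports: %+v\n", master.Spec.Ports)
}
```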

So, your PR is welcome!
Thank you!

@dougfales (Contributor, Author)

Thanks @AMecea, that sounds great. I will prepare a PR. And that's a good idea to add the port to the existing services; I'll incorporate that into my changes. Thanks again!

@neumantm commented Jan 30, 2022

Is this issue actually still open, or did #422 solve this?

@dougfales (Contributor, Author)

Thanks for asking @neumantm. It's been a while, but as I recall, #422 solved this issue, and I probably should have closed it when that PR was merged. I'll go ahead and close it now.
