Cloning failure after PVC removal and pod deletion of master pod #412
Comments
Hi @dougfales, this is indeed a known issue. We fixed #250 in version … Your proposed solution sounds good! But you can use the already created services, you don't have to create new ones; just add the port to the existing ones. So, your PR is welcome!
Thanks @AMecea, that sounds great. I will prepare a PR. And that's a good idea to add the port to the existing services; I'll incorporate that into my changes. Thanks again!
Is this issue actually still open, or did #422 solve this?
We are evaluating the operator for use in some of our apps, and during our testing we uncovered a case where the pod which replaces the original master is unable to rejoin the cluster due to a failed `clone-and-init`. This bug is reminiscent of #250, although I believe our specific case is slightly different.

We have developed a fix for this issue which I'd be happy to submit as a PR if you think it's an acceptable approach. I'll describe our strategy below.
Steps to Reproduce
After the master pod's PVC is removed and the pod is deleted, the replacement `mysql-0` instance is unable to rejoin the cluster and replicate from the new master. The logs from the init container show that cloning did not occur from one of the sibling replicas as it should have.

Expected results: The new pod should be restored from one of the other instances during its `clone-and-init` phase, and then it should rejoin the cluster as a replica.

Actual results: The new pod fails to complete `clone-and-init` because the clone source is chosen based on server IDs alone, regardless of pod health.

We have written an e2e test case which reproduces this, which I would of course include in our PR.
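To make the failure mode concrete, here is a minimal, hypothetical Go sketch of clone-source selection driven purely by server IDs. It is an illustration of the problem described above, not the operator's actual sidecar code; the naming scheme and the base server ID are assumptions.

```go
package main

import "fmt"

// cloneSourceFor is a simplified, hypothetical illustration of clone-source
// selection that depends only on the server ID (derived from the pod ordinal),
// with no check of pod health. Not the operator's actual code.
func cloneSourceFor(serverID int, cluster, namespace string) string {
	const baseServerID = 100 // assumed server ID of mysql-0

	if serverID == baseServerID {
		// The first pod is assumed to have nothing to clone from, so a
		// recreated mysql-0 comes up empty even though healthy replicas
		// with a full copy of the data still exist.
		return ""
	}
	// Higher-ordinal pods always clone from mysql-0, whether or not it is
	// healthy or holds any data.
	return fmt.Sprintf("%s-mysql-0.%s-mysql-nodes.%s", cluster, cluster, namespace)
}

func main() {
	// A freshly recreated mysql-0 (server ID 100) gets no clone source at all.
	fmt.Printf("clone source for mysql-0: %q\n", cloneSourceFor(100, "my-cluster", "default"))
	// mysql-1 is pointed at mysql-0 regardless of mysql-0's health.
	fmt.Printf("clone source for mysql-1: %q\n", cloneSourceFor(101, "my-cluster", "default"))
}
```

Because the decision is a pure function of the server ID, the recreated pod never considers the healthy replicas that still hold a full copy of the data.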
Proposed Solution
In discussing this issue with my teammates @stankevich and @eik3, who are leading our evaluation of the mysql-operator, they came up with the idea of using Kubernetes Services to make finding a clone source more robust. Their idea was to add two new Services pointing to the `SidecarServerPort`: one containing the healthy replicas and one containing the master MySQL pod.
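As a rough illustration of what these two Services might look like, here is a hedged client-go sketch; the Service names, label selectors, and port value are assumptions for illustration, not the operator's actual labels or constants.

```go
package clonesvc

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// sidecarServerPort stands in for the operator's SidecarServerPort constant.
const sidecarServerPort = 8080

// sidecarService builds one of the two proposed Services: the selector picks
// either the healthy replica pods or the current master, and the single port
// targets the sidecar server. All label keys/values here are hypothetical.
func sidecarService(cluster, namespace, role string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      cluster + "-mysql-" + role, // e.g. "...-healthy-replicas" or "...-master"
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{
				"app.kubernetes.io/name":     "mysql",
				"app.kubernetes.io/instance": cluster,
				"role":                       role, // assumed label maintained by the operator
			},
			Ports: []corev1.ServicePort{{
				Name:       "sidecar",
				Port:       sidecarServerPort,
				TargetPort: intstr.FromInt(sidecarServerPort),
			}},
		},
	}
}
```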
During `RunCloneCommand`, we then attempt to clone from a replica first by pointing `cloneFromSource` at the new healthy-replica Service. If that fails after some number of retries, we try to clone from the new master Service.

This approach makes the sidecar's clone source detection more robust because we can leverage the work the operator is already doing (labeling healthy pods) to get a healthy clone source, regardless of our server ID. We can also retry the clone operation if needed, which is a nice bonus for large cloning operations (hundreds or thousands of GBs) that are more susceptible to being interrupted by a network blip or by the clone source pod dying during the clone.
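A minimal sketch of the retry-then-fall-back flow described above, assuming hypothetical Service DNS names and a placeholder `cloneFrom` helper standing in for the sidecar's actual cloning logic; the real `RunCloneCommand` would use the operator's own configuration, logging, and error handling.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const cloneRetries = 3 // assumed retry budget per clone source

// cloneFrom stands in for the sidecar's actual cloning logic (e.g. streaming
// a backup from the source's sidecar port); it is only a placeholder here.
func cloneFrom(sourceHost string) error {
	fmt.Printf("cloning from %s...\n", sourceHost)
	return errors.New("not implemented in this sketch")
}

// cloneWithRetries tries a single source a few times before giving up, which
// also helps very large clones survive transient network blips.
func cloneWithRetries(sourceHost string) error {
	var err error
	for attempt := 1; attempt <= cloneRetries; attempt++ {
		if err = cloneFrom(sourceHost); err == nil {
			return nil
		}
		time.Sleep(time.Duration(attempt) * time.Second)
	}
	return fmt.Errorf("all attempts to clone from %s failed: %w", sourceHost, err)
}

// runCloneCommand sketches the proposed source selection: prefer the Service
// that fronts healthy replicas, and fall back to the Service that fronts the
// current master. The Service names are assumptions for illustration.
func runCloneCommand(cluster, namespace string) error {
	replicasSvc := fmt.Sprintf("%s-mysql-healthy-replicas.%s", cluster, namespace)
	masterSvc := fmt.Sprintf("%s-mysql-master.%s", cluster, namespace)

	if err := cloneWithRetries(replicasSvc); err == nil {
		return nil
	}
	// No healthy replica could serve the clone; try the master instead.
	return cloneWithRetries(masterSvc)
}

func main() {
	if err := runCloneCommand("my-cluster", "default"); err != nil {
		fmt.Println("clone failed:", err)
	}
}
```

Retrying a fixed number of times per source before falling back keeps the behavior predictable while still giving large clones a chance to survive transient failures.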
If you think this solution sounds acceptable, I would love to put together a PR next week with our changes and the test case for your review. Thanks!