Clone from k8s services instead of individual hosts #422
Conversation
This change adds the SidecarServerPort to the MasterService and introduces one new service, HealthyReplicasService, so that we can try to clone from replicas first, then fall back to master.

* Added a test suite for RunCloneCommand logic, along with a mock backup server.
* Added checks for service availability when cloning.
* Added "fail fast" logic when unexpected errors occur during cloning/download.
* Added dataDir cleanup code so that interrupted cloning does not leave dataDir in an inconsistent state.
* Added e2e test demonstrating cloning failure when PVC is removed and pod recreated.
* Changed the connect timeout from the default of 30s to 5s so that an empty k8s service will not cause cloning attempts to hang unnecessarily for 30s.
pkg/sidecar/server.go
return &http.Transport{
	Proxy: http.ProxyFromEnvironment,
	DialContext: (&net.Dialer{
		Timeout: time.Duration(dialTimeout) * time.Second,
I could not find a better way to adjust the timeout for the initial connect to an empty service. The problem we are trying to solve: when a k8s service is empty, the connect operation hangs until the timeout is reached. The default of 30s meant that we had to wait 30s before trying the next service.
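The idea can be sketched as a small helper that builds an `http.Transport` with a short dial timeout; `newTransport` and its parameter name are illustrative, not necessarily the PR's exact code.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newTransport builds an http.Transport whose TCP connect phase gives up
// after dialTimeout seconds instead of Go's default of 30s. When a k8s
// service has no endpoints, the connect hangs for the full dial timeout,
// so a short timeout lets the caller fall through to the next clone
// source quickly.
func newTransport(dialTimeout int) *http.Transport {
	return &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		DialContext: (&net.Dialer{
			Timeout: time.Duration(dialTimeout) * time.Second,
		}).DialContext,
	}
}

func main() {
	// Only the dial (connect) phase is bounded here; per-request timeouts
	// would be configured separately on the http.Client.
	client := &http.Client{Transport: newTransport(5)}
	fmt.Println(client.Transport != nil)
}
```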
// nolint: gosec
xtbkCmd := exec.Command("xtrabackup", "--prepare",
// nolint: gosec
xtbkCmd := exec.Command(xtrabackupCommand, "--prepare",
By replacing these with a variable, I can disable them as needed in unit tests. For example, when we pass a valid xbstream which is not necessarily also a valid xtrabackup directory.
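The pattern being described is a package-level command-name variable that tests can override; `prepareBackup` is a hypothetical wrapper added here for illustration.

```go
package main

import (
	"fmt"
	"os/exec"
)

// xtrabackupCommand is a variable rather than a string literal so that
// unit tests can swap in a harmless command when the test fixture is a
// valid xbstream but not a full xtrabackup directory.
var xtrabackupCommand = "xtrabackup"

// prepareBackup builds (but does not run) the prepare step against dataDir.
func prepareBackup(dataDir string) *exec.Cmd {
	// nolint: gosec
	return exec.Command(xtrabackupCommand, "--prepare", "--target-dir="+dataDir)
}

func main() {
	// In a unit test one would set: xtrabackupCommand = "true"
	cmd := prepareBackup("/var/lib/mysql")
	fmt.Println(cmd.Args)
}
```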
return os.RemoveAll(lfPath)
}

func cleanDataDir() {
This was done to avoid leaving the dataDir in an unknown or partially restored state. The case we ran into was a large backup (hundreds of GBs) being interrupted midstream due to a network blip. On subsequent pod restarts, mysql would continually fail to start because the PVC held only a partial backup.
I will try to get the linter issues resolved today.
@AMecea I've managed to get the linter errors fixed and tests to pass in drone, but please consider this a WIP for now. We are still working out some issues in our own environment, so there will likely be a few more tweaks to this before it should be considered for review/merging. I'll ping you again when we feel like this is ready for your review.
When no services are available, we want to die and let k8s re-initialize the pod, *unless* this is the initial pod, in which case it is valid to allow initialization to continue without cloning from a service or bucket.
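One common way to detect "the initial pod" in a StatefulSet is by its ordinal, since pod hostnames end in `-<ordinal>`. This is an assumption about the detection mechanism, sketched for illustration; the PR may identify the first pod differently.

```go
package main

import (
	"fmt"
	"strings"
)

// isInitialPod reports whether hostname names the first pod of a
// StatefulSet (ordinal 0). Pod 0 is allowed to initialize empty, with no
// clone source; any other pod should fail so k8s recreates it.
func isInitialPod(hostname string) bool {
	return strings.HasSuffix(hostname, "-0")
}

func main() {
	fmt.Println(isInitialPod("my-cluster-mysql-0")) // true
	fmt.Println(isInitialPod("my-cluster-mysql-2")) // false
}
```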
Looks good overall and we'll include the changes in the upcoming 0.4 release. Thanks for your contribution!
Sounds great! Thank you @calind! 👍
Hi @calind I just noticed that I never re-requested your review after I made the requested changes last month. I'm re-submitting for your review now. Let me know if you want any other changes and I'll be happy to make them. Thanks!
Hi @dougfales, I'm sorry for the delay, we were focused on something else lately 😞 I will do a few more tests and maybe some code cleanup, and I will do a new release that contains those changes. Thank you for this PR, you did a great job 👍 !
Thanks for merging @AMecea! It was a pleasure working on this PR! 😄 👍 We'll keep an eye out for the next release.
This PR is an attempt to make the cloning process, especially during/after failover, more resilient by using k8s services to locate a source to clone from.
The Problem
The issue that sparked this is #412. In short, if you delete the master's PVC and delete the master pod, then the master will never be re-cloned and can never re-join the cluster.
Solution: Clone From a Service
Our solution is based on the idea that we can use k8s services to point to the sidecar backup service on the master and on healthy replicas. To do that, we simply added the `SidecarServerPort` to the existing master service. We had to add one new service, `HealthyReplicasSVC`, to point to healthy replicas only (excluding the master). The logic for cloning then becomes:

1. Try to clone from a healthy replica via `HealthyReplicasSVC`.
2. If no healthy replica is available, fall back to the master service.
3. If neither service is available, fall back to the init bucket URL, if one is configured.
If during the course of these attempts there is an error, we return with the error immediately. This causes the `clone-and-init` command to `exit(1)`, which results in the pod initialization being retried later by k8s. This "fail fast" approach is intentional; we did not want to embed a bunch of retry logic in this function when k8s is already there to try again in a much more visible way.

Tests
We created an e2e test case which demonstrates the failure described in the issue here.
We have also added unit tests for the functionality in `RunCloneCommand`. These tests rely on a mock backup server, defined in `appclone_fakeserver_test.go`. This mock server runs on two separate ports because that was the simplest way we could think of to emulate two separate k8s services in a unit test environment.

Caveats/Questions
We are still testing this approach ourselves, but so far it fixes the original issue and seems not to cause any new ones.
However, looking at the cloning logic has raised a few questions for us about what the operator should do in some cases. For instance:
If for some reason both replica and master services are temporarily unavailable, and the cluster has an init bucket URL, we will clone from that bucket. This may not always make sense, e.g. if someone has a very old init bucket URL. This might be something to consider for future documentation or improvement.
It seems possible that, at a specific point during a failover, the healthy replica service could be empty while the master service is not (the window is small, but it could happen). In that case, is it acceptable that this code would fall back to the master service, putting extra strain on the master from `xtrabackup`? We think this is acceptable for now, but I'd like to make this scenario less likely in the future.

If these services were unavailable and there is no init bucket URL configured, an empty replica might be initialized when what we really wanted was a cloned replica. This replica would then likely fail to replicate from the master, and I assume it would be killed and recreated eventually by the operator. This is probably a scenario to cover with an e2e test case eventually.
Notes
I will attempt to leave some comments throughout the code as some of the changes are much clearer in context.
Thanks again for considering this PR @AMecea. Please let us know if you want any changes or if you think there's a better way to go about this solution.
Thanks to @stankevich and @eik3 for helping put this together.