MachineHealthCheck issues handling machines right after the control plane becomes accessible #3026
/milestone v0.3.x
fwiw I believe "2" was originally to avoid scenarios exactly like what Andy describes in "1". Changing lastUpdated means the machine is making progress, so the last lastUpdated change (provisioned -> running) is actually the real nodeStartupTimeout starting point.
What about waiting for Status.NodeRef to exist before starting to health-check a machine?
@fabriziopandini that's actually what this issue is about 😄. The "node startup timeout" is how long to wait for a node ref. There are two different timeouts that MHC handles - node startup timeout, and node health conditions.
@ncdc yeah, we are probably saying the same thing. I was mostly focused on
Probably worth discussing this issue on the community call, but I'll add my thoughts here as they are at the moment. When I think of the NodeStartupTimeout, I expect that to be how long the cloud provider has to start the machine and for the machine to join the cluster, so perhaps it makes sense for the check to only start that timer once the machine reaches a certain phase. The alternative is to wait for the cluster to be ready before processing an MHC; we already check that it is not paused, so this could be a reasonable addition as far as I can see. I don't see any real issues or complications with this approach at the moment. It short-circuits all of the logic earlier and prevents us from checking individual machines when we know none of them will be healthy yet.
AFAIK this only means that the cluster infrastructure is ready (ELB, VPC, etc)...
I think it makes sense to ignore machines that are still in the earlier phases.
@JoelSpeed re waiting for the cluster to be ready: we have said previously that we wouldn't write logic against phases, for better or for worse 😄, so I'll translate to what we could check (rough sketch below):
And, FWIW, there is one other combination that we set a phase for:
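For illustration, here is a rough sketch of what those field checks (rather than phases) could look like against the v1alpha3 Machine API; the helper names are made up, only the fields are real, and this is not the actual MHC code:

```go
// Sketch only: Machine field checks that roughly correspond to the phases discussed above.
package mhcsketch

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
)

// hasBootstrapData: bootstrap data has been generated for the machine.
func hasBootstrapData(m *clusterv1.Machine) bool {
	return m.Status.BootstrapReady || m.Spec.Bootstrap.DataSecretName != nil
}

// infraReady: the infrastructure provider has marked the machine's infrastructure
// as ready (for CAPA, the EC2 instance is running).
func infraReady(m *clusterv1.Machine) bool {
	return m.Status.InfrastructureReady
}

// hasNode: the machine has been linked to a Kubernetes node.
func hasNode(m *clusterv1.Machine) bool {
	return m.Status.NodeRef != nil
}
```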
Ahh, I thought it was set to provisioned once the Machine had a cloud provider instance created, as in, the EC2 RunInstances request has succeeded and the EC2 instance is now starting <-- this is basically where we want the node startup timeout to start from, that call to the cloud provider. I agree with the logic you've suggested above.
InfrastructureMachines today can only report a "ready" bool back to the Machine controller. For CAPA specifically, ready is true iff the EC2 instance's state is Running. When we start adding conditions, we may consider defining additional ones beyond a single ready bool. The kubeadm bootstrapper does wait for the cluster infra to be ready before proceeding with bootstrap data generation, and if other bootstrappers do the same, then yes, that does act as a significant guard. It appears that I omitted the cluster infra ready check from https://cluster-api.sigs.k8s.io/developer/providers/bootstrap.html (not sure if that was on purpose because I didn't think it was a hard requirement, or an accidental omission, though...).
When we have conditions added, we should probably think about doing something like this w.r.t. nodeStartupTimeout:
- For control plane machines, don't start the clock until the "cluster infrastructure ready" condition is true. For example, on AWS, it can take a long time for the ELB's DNS name to be resolvable; this time should not count against node startup.
- For non control plane machines, don't start the clock until the cluster's "control plane initialized" condition is true.
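A minimal sketch of that idea, assuming the conditions exist on the Cluster under the hypothetical names used below (nothing here is settled API):

```go
// Sketch only: where the nodeStartupTimeout clock would start under the proposal above.
package mhcsketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/cluster-api/util"
)

// conditionTrueTime returns when the named condition last transitioned to true.
func conditionTrueTime(conds clusterv1.Conditions, t clusterv1.ConditionType) (time.Time, bool) {
	for _, c := range conds {
		if c.Type == t && c.Status == corev1.ConditionTrue {
			return c.LastTransitionTime.Time, true
		}
	}
	return time.Time{}, false
}

// nodeStartupClockStart picks the baseline for nodeStartupTimeout: the machine's
// creation time, pushed forward to when the relevant cluster condition became true.
func nodeStartupClockStart(cluster *clusterv1.Cluster, machine *clusterv1.Machine) time.Time {
	start := machine.CreationTimestamp.Time

	// Control plane machines wait on cluster infrastructure (ELB, VPC, DNS);
	// worker machines wait on the control plane being initialized.
	gate := clusterv1.ConditionType("ControlPlaneInitialized") // hypothetical name
	if util.IsControlPlaneMachine(machine) {
		gate = clusterv1.ConditionType("ClusterInfrastructureReady") // hypothetical name
	}

	if t, ok := conditionTrueTime(cluster.Status.Conditions, gate); ok && t.After(start) {
		start = t
	}
	return start
}
```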
+1 Waiting for conditions
+1 waiting for conditions too
+1 waiting for conditions |
If I got all of this thread right, the idea is to have a condition at the machine level that signals when the bootstrap data is ready, and to use this info to start the timer for nodeStartupTimeout. Is that right?
I think we want to wait to have
What @vincepri wrote. Adding logic around bootstrap ready will be problematic now, since bootstrap data is technically optional (although some providers, such as CAPA, require it).
/milestone v0.3.9
/milestone v0.3.10
/milestone v0.4.0
Thinking more about the various scenarios here... but first a definition/question: what should "control plane available" mean?
- Brand new cluster, first control plane machine
- Control plane machine join
- Worker machine join
- Thoughts on the comparison for nodeStartupTimeout
There's a lot to unpack in here. I'd recommend reading through a couple of times before commenting 😄. Thanks!
It seems like we're talking about two separate signals here and conflating them a bit. Maybe it would be good to track a timeout for machine infrastructure readiness independent from nodeStartupTimeout.
I think we need to be cautious about supporting automated remediation for the initial control plane instance, unless we have a good way to ensure it is actually the initial control plane instance and we didn't somehow end up in a situation where the control plane has gone away and we are re-creating it. Otherwise we could essentially get into a scenario where the underlying cluster has been completely deleted along with all its data and we bring things back as if everything is fine. I don't necessarily think we want to rely on Status for this, since it could lead to issues. This situation gets even more complicated in the case that the control plane is recreated while existing workers are left around.
+1, I think this will likely help improve things in the case that control plane availability is impacted for one reason or another, and to account for the experimental retry logic we have in the kubeadm bootstrapper. I do wonder though if we need to be concerned about being able to detect flapping availability of the control plane. I think the above scenario works if we assume that control plane availability could potentially blip, but if we hit a scenario where control plane availability flaps continuously we might end up in a situation where we never hit the timeout. We could potentially track this through a field in the status, or just by keeping track of it through some in-memory state of the controller. I don't think it's anything that would need to be persisted across a restart.
Leaning towards this definition: continue health checking throughout the lifecycle of the cluster. One example of where this would become useful is during upgrades or other maintenance operations, where the control plane might become unavailable; we probably want to offset any new node join or machine creation from when the control plane becomes available again.
This is going a little bit further, although probably worth discussing. We should probably understand the possible implications of having health checks or remediation kick off after the cluster has already been initialized. How do we avoid re-initializing the control plane if the first control plane machine fails? What do we expect remediation to do?
Does this work in all cases? Today, IIRC (need to double check), when creating a new cluster, worker nodes get created immediately after the control plane has been initialized, but if the control plane isn't yet available for one reason or another (for example, AWS ELBs are notoriously slow to come up and to propagate DNS names), they might not be able to join the cluster in time. The hybrid approach described in the alternative section seems reasonable.
Ok, so it sounds like:
Re control plane availability flapping, what would you expect as a user if you're trying to bootstrap new nodes and the control plane is going up and down? If infra is provisioned for a machine, and it can't bootstrap due to control plane unavailability, presumably it's going to sit there (potentially costing $) until MHC deletes it. I think I'd want the system to delete the busted machine and keep trying with new ones. This is somewhat why I like our current controlPlaneInitialized, because it means the control plane was initially up and running, and it would allow us to compare nodeStartupTimeout to now() - machineCreationTimestamp...
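Roughly, that comparison could look like this (illustrative only, not actual controller code):

```go
// Sketch only: once the control plane has been initialized at least once,
// measure node startup from the machine's creation timestamp.
package mhcsketch

import (
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
)

func exceededNodeStartupTimeout(m *clusterv1.Machine, controlPlaneInitialized bool, timeout time.Duration, now time.Time) bool {
	if m.Status.NodeRef != nil {
		return false // already joined; the startup timeout no longer applies
	}
	if !controlPlaneInitialized {
		return false // don't start the clock until the control plane has come up once
	}
	return now.Sub(m.CreationTimestamp.Time) > timeout
}
```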
+1, I don't think it's a concern if using ControlPlaneInitialized, only if relying on ControlPlane availability.
So do we want to add a ControlPlaneInitialized condition on the Cluster?
Uh oh, getting confused, I thought we were about to add ControlPlaneAvailable 😄 and use that to offset all machines (other than the first).
Available would probably include Initialized?
I wrote about adding Available in #3779 before @detiber brought up flapping + MHC + the potential for an infinite timeout. I think it is useful to have one indicator that says "the first control plane member has come online successfully". This would be a write-once condition (aka ControlPlaneInitialized). I think it is also useful to have Available (apiserver is functional), write-many (can change between true, false, and maybe unknown). And it's also good to have Ready (superset of Available, indicates all members are functional), write-many also. WDYT?
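Spelled out as condition types, purely for illustration (names and semantics are only what's proposed above, not final API):

```go
// Sketch: the three control plane indicators discussed above, as condition types.
package mhcsketch

import clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"

const (
	// Write-once: the first control plane member came online successfully.
	ControlPlaneInitializedCondition clusterv1.ConditionType = "ControlPlaneInitialized"
	// Write-many: the apiserver is currently functional.
	ControlPlaneAvailableCondition clusterv1.ConditionType = "ControlPlaneAvailable"
	// Write-many: superset of Available; all control plane members are functional.
	ControlPlaneReadyCondition clusterv1.ConditionType = "ControlPlaneReady"
)
```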
+1 to this from me. Just want to confirm on the last point: this means that we aren't (initially at least) going to read or use machine infra ready in any of the calculations for nodeStartupTimeout? We will basically be checking the cluster infra for ready and the cluster for initialized, and only if both of those are true, checking machines against nodeStartupTimeout. In what scenarios could cluster infra be not ready, but the cluster be initialized? Is there not a dependency here which means we would only need to check cluster initialized?
Based on suggestions above, that is correct.
Yes
That's really up to an infra provider, but I could envision things like some sort of networking issue that the infra provider detects and changes the cluster infra readiness to false. We can probably optimize here, and omit the cluster infra readiness check, as it could be subject to the same control plane available flapping issue that Jason described above.
I added #3798 for the ControlPlaneInitialized condition |
This is a good circuit breaker. If the underlying infrastructure is having issues, and the InfraCluster reconciler can detect it and stop Machines from being marked unhealthy and remediated (possibly causing even more failure), isn't that a good enough reason to keep the behavior?
Counter-argument: if an infra machine powers on (cluster infra is ready) but fails to bootstrap (cluster infra flaps to not ready), shouldn't we remediate and mark the machine unhealthy?
If that happens I'd assume there is something wrong, and we should probably delay/stop operations?
I think subsequent infra machine creation would be delayed (e.g. we don't create ec2 instances if cluster infra is not ready), but any existing machines should be marked unhealthy, right?
I've picked this back up and had a chat with @vincepri and here's what we're thinking:
- Add the previously-discussed ControlPlaneInitialized condition to Cluster. This replaces cluster.status.controlPlaneInitialized from v1alpha3. It is a write-once value, set to true whenever the control plane provider sets status.initialized (TBD if we convert that to a condition), or when at least one control plane machine has a node ref (when not using a control plane provider). This is #3798.
- needsRemediation logic - easy checks:
- needsRemediation logic - when we don't yet have a node ref:
The remaining logic (checking node conditions) remains unchanged. What do you all think about this?
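For concreteness, a minimal sketch of that flow, assuming the write-once ControlPlaneInitialized condition from #3798; the node-condition helper is just a placeholder for the existing checks, and this is not the actual patch:

```go
// Sketch only: the proposed needsRemediation flow.
package mhcsketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
)

// nodeHasUnhealthyConditions stands in for the existing node-condition checks,
// which this proposal leaves unchanged.
func nodeHasUnhealthyConditions(m *clusterv1.Machine) bool { return false }

func needsRemediation(cluster *clusterv1.Cluster, m *clusterv1.Machine, nodeStartupTimeout time.Duration, now time.Time) bool {
	if m.Status.NodeRef != nil {
		// Node exists: fall through to the unchanged unhealthy-node checks.
		return nodeHasUnhealthyConditions(m)
	}

	// No node ref yet: only count time after the control plane was first initialized,
	// so slow control plane bring-up doesn't eat into nodeStartupTimeout.
	var initTime time.Time
	initialized := false
	for _, c := range cluster.Status.Conditions {
		if c.Type == "ControlPlaneInitialized" && c.Status == corev1.ConditionTrue {
			initTime, initialized = c.LastTransitionTime.Time, true
			break
		}
	}
	if !initialized {
		return false
	}

	start := m.CreationTimestamp.Time
	if initTime.After(start) {
		start = initTime
	}
	return now.Sub(start) > nodeStartupTimeout
}
```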
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle frozen |
What steps did you take and what happened:
The MachineHealthCheck node startup timeout check is based on .status.lastUpdated, which only changes when we adjust .status.phase.
In this case (with a 10m nodeStartupTimeout), it took ~8m19s for the control plane to be accessible, leaving only 1m41s for the machine to get a node ref...
However, because the machine's lastUpdated time changes as it changes phases (e.g. going from pending to provisioning because the control plane is now up), the deadline for getting a node ref is essentially extended each time lastUpdated changes.
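Roughly, the check in question behaves like this (simplified sketch, not the exact controller code):

```go
// Simplified sketch of the behavior described above.
// Because .status.lastUpdated is refreshed on every phase change (pending ->
// provisioning -> provisioned -> ...), the deadline below slides forward each time
// the machine changes phase, effectively extending nodeStartupTimeout.
package mhcsketch

import (
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
)

func exceededStartupTimeout(m *clusterv1.Machine, nodeStartupTimeout time.Duration, now time.Time) bool {
	if m.Status.NodeRef != nil || m.Status.LastUpdated == nil {
		return false
	}
	return now.Sub(m.Status.LastUpdated.Time) > nodeStartupTimeout
}
```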
What did you expect to happen:
Anything else you would like to add:
Environment:
Kubernetes version (use kubectl version): v1.18.1
/kind bug
/area health
cc @JoelSpeed