Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bootstrap failure detection #603

Closed
ncdc opened this issue May 8, 2020 · 13 comments · Fixed by #1232
Closed

Bootstrap failure detection #603

ncdc opened this issue May 8, 2020 · 13 comments · Fixed by #1232
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone

Comments

@ncdc
Copy link
Contributor

ncdc commented May 8, 2020

/kind feature

Describe the solution you'd like
If bootstrapping doesn't succeed, set AzureMachine.Status.FailureReason and FailureMessage to indicate there was an error.

Anything else you would like to add:
More information in kubernetes-sigs/cluster-api#2554.

We may need to amend the bootstrap provider contract to require them to write a sentinel file indicating success to a specific location because not all bootstrap providers will necessarily use cloud-init and have that as a consistent means for checking success/failure. I will probably write up a separate proposal for that.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label May 8, 2020
@ncdc
Copy link
Contributor Author

ncdc commented May 8, 2020

We will need a way for the CAPZ controller to be able to determine the status for a given AzureMachine. Does Azure have any services where the bootstrap logic in the VM could send an update to an Azure service (such as a message queue) to indicate bootstrap success? Or maybe some way to label or tag the Azure VM? This would need to be properly secured, presumably.

@alexeldeib
Copy link
Contributor

alexeldeib commented May 10, 2020

I think the credentials in azure.json (or MSI) should do the trick for worker authentication, assuming we setup RBAC correctly.

There's an existing boot diagnostics option on VMs that streams cloud init + boot time logs into a plain text file in a storage account. Uploading a sentinel file that way or tagging a VM would probably be doable.

If you're thinking more like a message queue (perhaps to build on later?), Service Bus is a pretty traditional pub/sub message broker and Event Grid is a more loosely coupled event source/sink model with handlers.

edit: I now see the AWS discussion from the CAPA issue, missed it in the CAPI issue

@nader-ziada
Copy link
Contributor

azure also supports the option to add a tag to the VM

@ncdc
Copy link
Contributor Author

ncdc commented May 11, 2020

Whatever we do, we need to make sure that the information is available for inspection any time, and that CAPZ doesn't have to be running to receive a notification. I want to avoid a situation where bootstrapping fails, but CAPZ crashes/goes offline for whatever reason, and because it's down, we miss the notification that bootstrapping failed. It sounds like the options you suggested are all feasible, as long as messages are persisted until consumption.

@nader-ziada
Copy link
Contributor

/assign

@CecileRobertMichon
Copy link
Contributor

@nader-ziada the tags are only visible to the Azure user, that might be one downside of using tags (ie. if we wanted to ever gather data on percentage of bootstrap failures/success).

@CecileRobertMichon
Copy link
Contributor

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 4, 2020
@CecileRobertMichon CecileRobertMichon modified the milestones: next, v0.4.x Jul 10, 2020
@CecileRobertMichon
Copy link
Contributor

/assign @jackfrancis
/assign

Jack and I are going to start putting together a CAPZ proposal / design doc for different ways we can implement this in Azure. Will drop a link to a doc here for collaboration once it exists.

@k8s-ci-robot
Copy link
Contributor

@CecileRobertMichon: GitHub didn't allow me to assign the following users: jackfrancis.

Note that only kubernetes-sigs members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @jackfrancis
/assign

Jack and I are going to start putting together a CAPZ proposal / design doc for different ways we can implement this in Azure. Will drop a link to a doc here for collaboration once it exists.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jackfrancis
Copy link
Contributor

/assign

@jackfrancis
Copy link
Contributor

@CecileRobertMichon and I have written up our investigation and conclusions here:

https://docs.google.com/document/d/1U0GxvO6ltgIINMjpQz96UD4bExlN2h21wyyn-3ENezc

I think next steps are to produce a concrete proposal for implemeting a capz-specific, named Azure VM Extension to include with capz-provisioned AzureMachine vms.

@nader-ziada nader-ziada modified the milestones: v0.4.8, v0.4.9 Sep 3, 2020
@devigned
Copy link
Contributor

devigned commented Feb 1, 2021

@CecileRobertMichon is this still open?

@CecileRobertMichon
Copy link
Contributor

yes, the last part is to add the script in the extension that checks for the sentinel file and set conditions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants