Enable configuration and disabling of boot diagnostics #2401

Closed
JoelSpeed opened this issue Jun 20, 2022 · 8 comments · Fixed by #2528
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@JoelSpeed
Contributor

/kind feature

Describe the solution you'd like
[A clear and concise description of what you want to happen.]

Currently, as per the implementation from #901, boot diagnostics are enabled by default and use an Azure managed storage account. This storage account limits the boot logs to 1 GB per machine and costs $0.05 per GB per month.

You can also use your own storage account for boot diagnostics, which allows you to set custom retention policies for the data and gives you more direct access to the underlying files should you need it.

I would like to propose adding configuration options so that users can supply their own storage account if they need to, and can also disable boot diagnostics entirely if they are not interested in them.

In large deployments, e.g. a managed service running thousands of Machines, this data is costly and likely not always required (how often do we expect boot failures?), so I think it would be good to have the option to disable diagnostics to save money.
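For a rough sense of scale (illustrative numbers only, not from the issue): at $0.05 per GB per month, a fleet of 5,000 Machines each retaining the full 1 GB of boot logs would cost on the order of 5,000 × 1 GB × $0.05 = $250 per month purely for diagnostics storage.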

I would expect an API along the lines of the following, though I'm happy to discuss alternatives if others have suggestions:

diagnostics:
  boot:
    storageAccountType: AzureManaged | CustomerManaged | Disabled # defaults to AzureManaged for backwards compatibility
    customerManaged: # only valid when the account type is CustomerManaged (a discriminated union)
      storageAccountName: <blah>
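For illustration, a rough sketch of what Go API types behind this YAML shape could look like (the type and field names here are hypothetical and are not taken from the issue or the eventual PR):

// Package v1beta1 sketches hypothetical CAPZ API types for the proposal above.
package v1beta1

// BootDiagnosticsStorageAccountType enumerates where boot diagnostics data is stored.
type BootDiagnosticsStorageAccountType string

const (
	// AzureManagedDiagnosticsStorage uses an Azure managed storage account (the current default behaviour).
	AzureManagedDiagnosticsStorage BootDiagnosticsStorageAccountType = "AzureManaged"
	// CustomerManagedDiagnosticsStorage uses a storage account supplied by the user.
	CustomerManagedDiagnosticsStorage BootDiagnosticsStorageAccountType = "CustomerManaged"
	// DisabledDiagnosticsStorage disables boot diagnostics entirely.
	DisabledDiagnosticsStorage BootDiagnosticsStorageAccountType = "Disabled"
)

// Diagnostics configures VM diagnostics.
type Diagnostics struct {
	// Boot configures boot diagnostics; unset defaults to AzureManaged for backwards compatibility.
	Boot *BootDiagnostics `json:"boot,omitempty"`
}

// BootDiagnostics is a discriminated union keyed on StorageAccountType.
type BootDiagnostics struct {
	// StorageAccountType selects the discriminated union member.
	StorageAccountType BootDiagnosticsStorageAccountType `json:"storageAccountType"`
	// CustomerManaged may only be set when StorageAccountType is CustomerManaged.
	CustomerManaged *CustomerManagedBootDiagnostics `json:"customerManaged,omitempty"`
}

// CustomerManagedBootDiagnostics holds settings for a user-supplied storage account.
type CustomerManagedBootDiagnostics struct {
	// StorageAccountName is the name of the user-supplied storage account.
	StorageAccountName string `json:"storageAccountName"`
}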

Happy to volunteer to implement this if this is something maintainers feel comfortable supporting.

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • cluster-api-provider-azure version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 20, 2022
@CecileRobertMichon
Contributor

No objections on my part to making this configurable while keeping it enabled by default to preserve backwards compatibility.

For the API, we might want to use storageUri instead of the storage account name, since that's what the Azure APIs use. I would also avoid the terms "CustomerManaged" and "AzureManaged" and instead stick with the official terms, i.e. Managed and UserManaged: https://docs.microsoft.com/en-us/azure/virtual-machines/boot-diagnostics.

The way the Azure API works is as follows, for reference:

// DiagnosticsProfile specifies the boot diagnostic settings state.
// Minimum api-version: 2015-06-15.
type DiagnosticsProfile struct {
	// BootDiagnostics - Boot Diagnostics is a debugging feature which allows you to view Console Output
	// and Screenshot to diagnose VM status.
	// NOTE: If storageUri is being specified then ensure that the storage account is in the same region
	// and subscription as the VM.
	// You can easily view the output of your console log.
	// Azure also enables you to see a screenshot of the VM from the hypervisor.
	BootDiagnostics *BootDiagnostics `json:"bootDiagnostics,omitempty"`
}

// BootDiagnostics boot Diagnostics is a debugging feature which allows you to view Console Output and
// Screenshot to diagnose VM status.
// You can easily view the output of your console log.
// Azure also enables you to see a screenshot of the VM from the hypervisor.
type BootDiagnostics struct {
	// Enabled - Whether boot diagnostics should be enabled on the Virtual Machine.
	Enabled *bool `json:"enabled,omitempty"`
	// StorageURI - Uri of the storage account to use for placing the console output and screenshot.
	// If storageUri is not specified while enabling boot diagnostics, managed storage will be used.
	StorageURI *string `json:"storageUri,omitempty"`
}
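For illustration, a rough sketch of how a CAPZ-side spec using the suggested Managed/UserManaged terminology could be mapped onto the DiagnosticsProfile shape above (the spec type, its field names, and the SDK API version in the import path are assumptions, not taken from the actual implementation in #2528):

// Package diagnostics is a hypothetical sketch, not the actual CAPZ code.
package diagnostics

import (
	// DiagnosticsProfile / BootDiagnostics as quoted above; the API version
	// in this import path is an assumption.
	"github.com/Azure/azure-sdk-for-go/services/compute/mgmt/2021-07-01/compute"
)

// BootDiagnosticsSpec is a hypothetical CAPZ-side representation using the
// Managed/UserManaged terminology suggested above.
type BootDiagnosticsSpec struct {
	StorageAccountType string  // "Managed", "UserManaged", or "Disabled"
	StorageURI         *string // only used when StorageAccountType is "UserManaged"
}

// boolPtr returns a pointer to b.
func boolPtr(b bool) *bool { return &b }

// ToDiagnosticsProfile converts the spec into the Azure API shape above.
func ToDiagnosticsProfile(spec *BootDiagnosticsSpec) *compute.DiagnosticsProfile {
	// No spec: keep today's default of Azure managed boot diagnostics.
	if spec == nil {
		return &compute.DiagnosticsProfile{
			BootDiagnostics: &compute.BootDiagnostics{Enabled: boolPtr(true)},
		}
	}
	switch spec.StorageAccountType {
	case "Disabled":
		return &compute.DiagnosticsProfile{
			BootDiagnostics: &compute.BootDiagnostics{Enabled: boolPtr(false)},
		}
	case "UserManaged":
		// The storage account must be in the same region and subscription as
		// the VM (see the SDK comment above).
		return &compute.DiagnosticsProfile{
			BootDiagnostics: &compute.BootDiagnostics{
				Enabled:    boolPtr(true),
				StorageURI: spec.StorageURI,
			},
		}
	default: // "Managed": omit StorageURI so Azure uses a managed storage account.
		return &compute.DiagnosticsProfile{
			BootDiagnostics: &compute.BootDiagnostics{Enabled: boolPtr(true)},
		}
	}
}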

@JoelSpeed
Contributor Author

That makes sense to me. On the StorageURI front, the only question I have is: since that's a common format (https://<account-name>.blob.core.windows.net/), I was thinking we could make it easier for users by doing that string substitution for them, rather than having them provide the full URI. Is that not something you would normally do in CAPZ?
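For illustration, the substitution being suggested would amount to something like the following (assuming the default public-cloud blob endpoint format; as the next comment notes, that format is not guaranteed for every account):

package main

import "fmt"

// blobEndpointForAccount derives the default public-cloud blob endpoint from a
// storage account name; it assumes the classic *.blob.core.windows.net format.
func blobEndpointForAccount(accountName string) string {
	return fmt.Sprintf("https://%s.blob.core.windows.net/", accountName)
}

func main() {
	// Prints: https://mydiagaccount.blob.core.windows.net/
	fmt.Println(blobEndpointForAccount("mydiagaccount"))
}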

@CecileRobertMichon
Contributor

It would be something we could do if we were confident no other formats were possible, but given https://docs.microsoft.com/en-us/azure/storage/common/storage-account-overview#azure-dns-zone-endpoints-preview that may already not be true.

@JoelSpeed
Contributor Author

Cool, I'll look into working up a PR over the next few days

@CecileRobertMichon
Contributor

/assign @JoelSpeed

@damdo
Member

damdo commented Aug 2, 2022

Following up on the discussion, here is a PR that implements the agreed design: #2528

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2022
@JoelSpeed
Contributor Author

/remove-lifecycle stale

This is waiting on review

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 1, 2022