
Proposal: Cluster Autoscaler Debugging Snapshot #4346

Closed
jayantjain93 opened this issue Sep 21, 2021 · 4 comments · Fixed by #4552
Labels
area/cluster-autoscaler kind/feature Categorizes issue or PR as related to a new feature.

Comments

@jayantjain93
Contributor

jayantjain93 commented Sep 21, 2021

Cluster Autoscaler Debugging Snapshot [Toil Reduction]

Author:

Introduction

With the growing number of large, autoscaled clusters we are increasingly dealing with complex cases that are very hard to debug. One major difficulty is that we log information about what decision Cluster Autoscaler (CA) took as well as the results of various intermediate steps that lead to this decision, but we don't log the data this decision was based on. The reason for this is that CA is internally simulating the behavior of the entire cluster and to fully understand any given decision we need to know the exact state of relevant nodes and pods (possibly all nodes and pods in the cluster) as well as certain k8s objects. The volume of logging required to capture all that data would be prohibitive.

This document proposes introducing a new "snapshot" feature to Cluster Autoscaler. This feature would allow an engineer debugging an autoscaler issue to manually trigger CA to dump its internal state.

Proposal

The snapshot tool will use a manual trigger mechanism based on an HTTP endpoint: a request to the endpoint causes CA to collect the debugging information during its following run-cycle and to return that information as a JSON-parsable HTTP response.
[Screenshot attached in the original issue: 2021-09-20, 7:34 PM]

Trigger

The debugging snapshot is captured only when it receives a trigger from the user. A manual trigger is used instead of automated data collection because capturing the snapshot can take a long time in a large cluster, affecting the performance of CA and increasing latency. The HTTP request serves as the trigger: an HTTP endpoint is created in CA which receives the trigger as an API call. This allows parameters to be passed for better extension of the trigger, and makes it easy to return an error code if the request fails.
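
As a rough illustration only (the handler name, synchronization scheme, and response shape below are assumptions, not the final design), the trigger endpoint could simply arm a flag that the next run-cycle checks, and hold the HTTP request open until the snapshot body is ready:

// Hypothetical sketch of the trigger endpoint; names and structure are illustrative.
package debuggingsnapshot

import (
	"net/http"
	"sync"
)

// TriggerHandler arms a snapshot for the next CA run-cycle and waits for the result.
type TriggerHandler struct {
	mu        sync.Mutex
	// triggered would be read by the main CA loop; when true, the next
	// run-cycle builds a snapshot and sends the JSON body on result.
	triggered bool
	result    chan []byte
}

func NewTriggerHandler() *TriggerHandler {
	return &TriggerHandler{result: make(chan []byte, 1)}
}

// ServeHTTP receives the manual trigger as an API call and returns the
// JSON snapshot, or an error code if the request is cancelled first.
func (t *TriggerHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	t.mu.Lock()
	t.triggered = true
	t.mu.Unlock()

	select {
	case body := <-t.result:
		w.Header().Set("Content-Type", "application/json")
		w.Write(body)
	case <-r.Context().Done():
		http.Error(w, "snapshot request cancelled", http.StatusRequestTimeout)
	}
}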

Method of Formatting

The data in the snapshot will be very large, and not all of the fields are always relevant. This creates a trade-off between how much data we include and how readable that data is. We want to avoid a situation where a debugging snapshot is missing the fields we need, but that may not always be possible.
The proposed approach is to make the snapshot parsable JSON: all the relevant data collected is encapsulated as JSON fields, and the snapshot contains all the elements that can be captured.
Such a file may not be easily human readable, but additional tooling (e.g. jq) can extract a readable format from this “full” snapshot. This bridges the gap between how much data can be pushed into the snapshot and its readability. It also gives a quick turnaround time for creating a “new readable” view, without requiring code changes or long waits for Kubernetes releases, and it makes it possible to regenerate a readable view from older data.
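
For illustration only (the field names below are assumptions, not the final schema), the full snapshot could be a single JSON document marshalled from one top-level struct, which tooling such as jq can then cut down to a readable view (e.g. jq '.NodeInfo[].Name' snapshot.json against this hypothetical shape):

// Hypothetical sketch of the snapshot shape; all field names are illustrative.
package debuggingsnapshot

import (
	"encoding/json"
	"time"
)

type PodSummary struct {
	Namespace string `json:"Namespace"`
	Name      string `json:"Name"`
}

type NodeInfoSummary struct {
	Name   string            `json:"Name"`
	Labels map[string]string `json:"Labels"`
	Taints []string          `json:"Taints"`
	Pods   []PodSummary      `json:"Pods"`
}

// DebuggingSnapshot is the single top-level object returned over HTTP.
type DebuggingSnapshot struct {
	NodeInfo                        []NodeInfoSummary `json:"NodeInfo"`
	UnschedulablePodsCanBeScheduled []PodSummary      `json:"UnschedulablePodsCanBeScheduled"`
	Timestamp                       time.Time         `json:"Timestamp"`
}

// Marshal produces the JSON body; readability tooling (jq, custom scripts)
// works on this output rather than requiring changes to CA itself.
func (s *DebuggingSnapshot) Marshal() ([]byte, error) {
	return json.Marshal(s)
}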

Data Collection and Workflow

There are two data points proposed (so far) to be collected.

  1. List of all NodeInfo. This will be collected in static_autoscaler:RunOnce(), after we have added all the upcoming nodes. It will contain all the nodes, including properties such as resources, labels, and taints, as well as all pods scheduled on each node and all the properties related to each pod.

  2. List of all UnschedulablePodsCanBeScheduled. These are collected in filter_out_schedulable:Process(). This is the list of Pods that CA considers will fit on existing/upcoming nodes, and hence does not consider as part of the scale-up.

The final list of data points and the location where they are set may change based on implementation details.
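
A rough sketch of how these collection points could hand data to the snapshotter (the interface name, method names, and guard below are assumptions made for illustration; as noted above, the final shape may differ):

// Hypothetical sketch of the two collection hooks; all names are illustrative.
package debuggingsnapshot

// NodeInfoEntry and PodEntry stand in for the full NodeInfo / Pod objects.
type NodeInfoEntry struct {
	Name string
	Pods []PodEntry
}

type PodEntry struct {
	Namespace string
	Name      string
}

// Snapshotter is a stand-in for the in-core debugging-snapshot interface.
type Snapshotter interface {
	IsDataCollectionAllowed() bool
	SetNodeInfo(nodes []NodeInfoEntry)
	SetUnschedulablePodsCanBeScheduled(pods []PodEntry)
}

// Hook as it might appear in static_autoscaler RunOnce(), after the upcoming
// nodes have been added to the simulated cluster state.
func captureNodeInfo(s Snapshotter, nodes []NodeInfoEntry) {
	if s.IsDataCollectionAllowed() {
		s.SetNodeInfo(nodes)
	}
}

// Hook as it might appear in filter_out_schedulable Process(), for the pods
// CA expects to fit on existing/upcoming nodes and so excludes from scale-up.
func capturePodsThatFit(s Snapshotter, pods []PodEntry) {
	if s.IsDataCollectionAllowed() {
		s.SetUnschedulablePodsCanBeScheduled(pods)
	}
}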

There are also design decisions on how the interface will be used. We need an adaptive interface that can capture all data fields and is extensible for each cloud provider.

There will be one interface in-core which cloud providers extend to add extra values from within the cloud provider. For each data field added, a new function is created on the cloud-provider interface with the correct data type for that field.

type SnapshotInterface interface {
	SetData1(d Data1)
}

type CloudProviderSnapshotInterface interface {
	SnapshotInterface
	SetData2(d Data2)
	SetData3(d Data3)
}

// implementation of the cloud-provider interface on some concrete type b
func (someClass *b) SetData2(d Data2) {
	// do something with d
}

This actively avoids using a generic function with an interface{} argument, such as:
func AddExtraData(d interface{}) { }

This is done to increase the readability of the code and to keep the functions strongly typed.

@jayantjain93 jayantjain93 added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 21, 2021
@elmiko
Contributor

elmiko commented Sep 21, 2021

i think this sounds like an interesting idea and i could certainly see using it occasionally, i do have some concerns about the size of the data and how it gets returned to the user.

couple questions,

  1. would this feature be gated by a command line flag?
  2. how does collecting on a large cluster, or cluster with many pending pods, affect the resource usage of the autoscaler? (eg will users need to add requests to their autoscaler deployments)
  3. what happens in the event of broken download connection, would the user restart the process?
  4. have you considered options where the data is compressed before being returned to the user?

@jayantjain93
Contributor Author

  1. would this feature be gated by a command line flag?
     Yes. This would be gated by a command line flag, which will be disabled by default.
  2. how does collecting on a large cluster, or cluster with many pending pods, affect the resource usage of the autoscaler? (eg will users need to add requests to their autoscaler deployments)
     I have some estimates, but they may be biased towards GCE. I found pods take about 4.5 KB to represent, whereas nodes take about 100 KB. I think each cloud provider may have to run their own benchmarks for their scalability limits.
  3. what happens in the event of broken download connection, would the user restart the process?
     Yes. That's the expectation. The user would get a status representing the state and would need to resend the request.
  4. have you considered options where the data is compressed before being returned to the user?
     I hadn't included many optimisations as part of v1. I would be happy to include this as an add-on (a rough sketch of one option is below).
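
For reference, a minimal sketch of what such a compression add-on could look like (purely illustrative, not part of the v1 proposal): a wrapper that gzips the snapshot response when the client advertises support.

// Hypothetical add-on sketch: gzip-compressing the snapshot response when the
// client sends "Accept-Encoding: gzip". Not part of the v1 proposal.
package debuggingsnapshot

import (
	"compress/gzip"
	"net/http"
	"strings"
)

type gzipResponseWriter struct {
	http.ResponseWriter
	writer *gzip.Writer
}

func (g *gzipResponseWriter) Write(b []byte) (int, error) {
	return g.writer.Write(b)
}

// withGzip wraps the snapshot handler and compresses the body when supported.
func withGzip(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
			next(w, r)
			return
		}
		w.Header().Set("Content-Encoding", "gzip")
		gz := gzip.NewWriter(w)
		defer gz.Close()
		next(&gzipResponseWriter{ResponseWriter: w, writer: gz}, r)
	}
}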

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2021
@jayantjain93
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 27, 2021