
Proposal: Cluster Autoscaler Debugging Snapshot #4346

Closed
jayantjain93 opened this issue Sep 21, 2021 · 4 comments · Fixed by #4552
Labels
area/cluster-autoscaler kind/feature Categorizes issue or PR as related to a new feature.

Comments

@jayantjain93
Contributor

jayantjain93 commented Sep 21, 2021

Cluster Autoscaler Debugging Snapshot [Toil Reduction]

Author:

Introduction

With the growing number of large, autoscaled clusters we are increasingly dealing with complex cases that are very hard to debug. One major difficulty is that we log information about what decision Cluster Autoscaler (CA) took as well as the results of various intermediate steps that lead to this decision, but we don't log the data this decision was based on. The reason for this is that CA is internally simulating the behavior of the entire cluster and to fully understand any given decision we need to know the exact state of relevant nodes and pods (possibly all nodes and pods in the cluster) as well as certain k8s objects. The volume of logging required to capture all that data would be prohibitive.

This document proposes introducing a new "snapshot" feature to Cluster Autoscaler. This feature would allow an engineer debugging an autoscaler issue to manually trigger CA to dump its internal state.

Proposal

The snapshot tool will use a manual trigger mechanism based on an HTTP endpoint: a request to the endpoint causes CA to collect the debugging information during its following run-cycle and to return that information as a JSON-parsable HTTP response.
[Screenshot attached in the original issue: 2021-09-20, 7:34 PM]

Trigger

The debugging snapshot is captured only when it receives a trigger from the user. A manual trigger is used instead of automated data collection because capturing the snapshot can take a long time in a large cluster, affecting the performance of CA and increasing latency. The HTTP request serves as the trigger: an HTTP endpoint is created in CA which receives the trigger as an API call. This allows parameters to be passed for better extension of the trigger, and makes it easy to return an error code if the request fails.
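
As a rough illustration only (the handler name, synchronization scheme, and response shape below are assumptions, not the final design), the trigger endpoint could simply arm a flag that the next run-cycle checks, and hold the HTTP request open until the snapshot body is ready:

// Hypothetical sketch of the trigger endpoint; names and structure are illustrative.
package debuggingsnapshot

import (
	"net/http"
	"sync"
)

// TriggerHandler arms a snapshot for the next CA run-cycle and waits for the result.
type TriggerHandler struct {
	mu        sync.Mutex
	// triggered would be read by the main CA loop; when true, the next
	// run-cycle builds a snapshot and sends the JSON body on result.
	triggered bool
	result    chan []byte
}

func NewTriggerHandler() *TriggerHandler {
	return &TriggerHandler{result: make(chan []byte, 1)}
}

// ServeHTTP receives the manual trigger as an API call and returns the
// JSON snapshot, or an error code if the request is cancelled first.
func (t *TriggerHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	t.mu.Lock()
	t.triggered = true
	t.mu.Unlock()

	select {
	case body := <-t.result:
		w.Header().Set("Content-Type", "application/json")
		w.Write(body)
	case <-r.Context().Done():
		http.Error(w, "snapshot request cancelled", http.StatusRequestTimeout)
	}
}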

Method of Formatting

The data in the snapshot will be very large, and not all of the fields are always relevant. This creates a trade-off between how much data we include and how readable that data is. We want to avoid a situation where a debugging snapshot is missing the fields we need, but that may not always be possible.
The proposed approach is to make the snapshot parsable JSON: all the relevant data collected is encapsulated as JSON fields, and the snapshot contains all the elements that can be captured.
Such a file may not be easily human readable, but additional tooling (e.g. jq) can extract a readable format from this “full” snapshot. This bridges the gap between how much data can be pushed into the snapshot and its readability. It also gives a quick turnaround time for creating a “new readable” view, without requiring code changes or long waits for Kubernetes releases, and it makes it possible to regenerate a readable view from older data.
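
For illustration only (the field names below are assumptions, not the final schema), the full snapshot could be a single JSON document marshalled from one top-level struct, which tooling such as jq can then cut down to a readable view (e.g. jq '.NodeInfo[].Name' snapshot.json against this hypothetical shape):

// Hypothetical sketch of the snapshot shape; all field names are illustrative.
package debuggingsnapshot

import (
	"encoding/json"
	"time"
)

type PodSummary struct {
	Namespace string `json:"Namespace"`
	Name      string `json:"Name"`
}

type NodeInfoSummary struct {
	Name   string            `json:"Name"`
	Labels map[string]string `json:"Labels"`
	Taints []string          `json:"Taints"`
	Pods   []PodSummary      `json:"Pods"`
}

// DebuggingSnapshot is the single top-level object returned over HTTP.
type DebuggingSnapshot struct {
	NodeInfo                        []NodeInfoSummary `json:"NodeInfo"`
	UnschedulablePodsCanBeScheduled []PodSummary      `json:"UnschedulablePodsCanBeScheduled"`
	Timestamp                       time.Time         `json:"Timestamp"`
}

// Marshal produces the JSON body; readability tooling (jq, custom scripts)
// works on this output rather than requiring changes to CA itself.
func (s *DebuggingSnapshot) Marshal() ([]byte, error) {
	return json.Marshal(s)
}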

Data Collection and Workflow

There are two data points proposed (so far) to be collected.

  1. List of all NodeInfo. This will be collected in static_autoscaler:RunOnce(), after we have added all the upcoming nodes. It will contain all the nodes, including properties such as resources, labels, and taints, as well as all pods scheduled on each node and all the properties related to each pod.

  2. List of all UnschedulablePodsCanBeScheduled. These are collected in filter_out_schedulable:Process(). This is the list of Pods that CA considers will fit on existing/upcoming nodes, and hence does not consider as part of the scale-up.

The final list of data points and the location where they are set may change based on implementation details.
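
A rough sketch of how these collection points could hand data to the snapshotter (the interface name, method names, and guard below are assumptions made for illustration; as noted above, the final shape may differ):

// Hypothetical sketch of the two collection hooks; all names are illustrative.
package debuggingsnapshot

// NodeInfoEntry and PodEntry stand in for the full NodeInfo / Pod objects.
type NodeInfoEntry struct {
	Name string
	Pods []PodEntry
}

type PodEntry struct {
	Namespace string
	Name      string
}

// Snapshotter is a stand-in for the in-core debugging-snapshot interface.
type Snapshotter interface {
	IsDataCollectionAllowed() bool
	SetNodeInfo(nodes []NodeInfoEntry)
	SetUnschedulablePodsCanBeScheduled(pods []PodEntry)
}

// Hook as it might appear in static_autoscaler RunOnce(), after the upcoming
// nodes have been added to the simulated cluster state.
func captureNodeInfo(s Snapshotter, nodes []NodeInfoEntry) {
	if s.IsDataCollectionAllowed() {
		s.SetNodeInfo(nodes)
	}
}

// Hook as it might appear in filter_out_schedulable Process(), for the pods
// CA expects to fit on existing/upcoming nodes and so excludes from scale-up.
func capturePodsThatFit(s Snapshotter, pods []PodEntry) {
	if s.IsDataCollectionAllowed() {
		s.SetUnschedulablePodsCanBeScheduled(pods)
	}
}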

There are also design decisions on how the interface will be used. We need an adaptive interface that can capture all data fields and is extensible for each cloud provider.

There will be one interface in-core which cloud providers extend to add extra values from within the cloud provider. For each data field added, a new function is created on the cloud-provider interface with the correct data type for that field.

type SnapshotInterface interface {
	SetData1(d Data1)
}

type CloudProviderSnapshotInterface interface {
	SnapshotInterface
	SetData2(d Data2)
	SetData3(d Data3)
}

// implementation of the cloud-provider interface on some concrete type b
func (someClass *b) SetData2(d Data2) {
	// do something with d
}

This actively avoids using a generic function with an interface{} argument, such as:
func AddExtraData(d interface{}) { }

This is done to increase the readability of the code and to keep the functions strongly typed.

@jayantjain93 jayantjain93 added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 21, 2021
@elmiko
Contributor

elmiko commented Sep 21, 2021

i think this sounds like an interesting idea and i could certainly see using it occasionally, i do have some concerns about the size of the data and how it gets returned to the user.

couple questions,

  1. would this feature be gated by a command line flag?
  2. how does collecting on a large cluster, or cluster with many pending pods, affect the resource usage of the autoscaler? (eg will users need to add requests to their autoscaler deployments)
  3. what happens in the event of broken download connection, would the user restart the process?
  4. have you considered options where the data is compressed before being returned to the user?

@jayantjain93
Contributor Author

  1. would this feature be gated by a command line flag?
     Yes. This would be gated by a command line flag, which will be disabled by default.
  2. how does collecting on a large cluster, or cluster with many pending pods, affect the resource usage of the autoscaler? (eg will users need to add requests to their autoscaler deployments)
     I have some estimates, but they may be biased towards GCE. I found pods take about 4.5 KB to represent, whereas nodes take about 100 KB. I think each cloud provider may have to run their own benchmarks for their scalability limits.
  3. what happens in the event of broken download connection, would the user restart the process?
     Yes. That's the expectation. The user would get a status representing the state and would need to resend the request.
  4. have you considered options where the data is compressed before being returned to the user?
     I hadn't included many optimisations as part of v1. I would be happy to include this as an add-on (a rough sketch of one option is below).
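
For reference, a minimal sketch of what such a compression add-on could look like (purely illustrative, not part of the v1 proposal): a wrapper that gzips the snapshot response when the client advertises support.

// Hypothetical add-on sketch: gzip-compressing the snapshot response when the
// client sends "Accept-Encoding: gzip". Not part of the v1 proposal.
package debuggingsnapshot

import (
	"compress/gzip"
	"net/http"
	"strings"
)

type gzipResponseWriter struct {
	http.ResponseWriter
	writer *gzip.Writer
}

func (g *gzipResponseWriter) Write(b []byte) (int, error) {
	return g.writer.Write(b)
}

// withGzip wraps the snapshot handler and compresses the body when supported.
func withGzip(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
			next(w, r)
			return
		}
		w.Header().Set("Content-Encoding", "gzip")
		gz := gzip.NewWriter(w)
		defer gz.Close()
		next(&gzipResponseWriter{ResponseWriter: w, writer: gz}, r)
	}
}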

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2021
@jayantjain93
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 27, 2021