From 16d11e80eb8c4352b63c2917a38db6b5dc0ab782 Mon Sep 17 00:00:00 2001
From: Peter Hunt <pehunt@redhat.com>
Date: Thu, 28 Jan 2021 20:08:04 -0500
Subject: [PATCH] sig-node: Add KEP for cAdvisor-less, CRI-full Container and
 Pod Stats

Signed-off-by: Peter Hunt <pehunt@redhat.com>
---
 .../2371-cri-pod-container-stats/README.md    | 397 ++++++++++++++++++
 .../2371-cri-pod-container-stats/kep.yaml     |  23 +
 2 files changed, 420 insertions(+)
 create mode 100644 keps/sig-node/2371-cri-pod-container-stats/README.md
 create mode 100644 keps/sig-node/2371-cri-pod-container-stats/kep.yaml

diff --git a/keps/sig-node/2371-cri-pod-container-stats/README.md b/keps/sig-node/2371-cri-pod-container-stats/README.md
new file mode 100644
index 000000000000..29870b5addc3
--- /dev/null
+++ b/keps/sig-node/2371-cri-pod-container-stats/README.md
@@ -0,0 +1,397 @@
+<!-- toc -->
+- [cAdvisor-less, CRI-full Container and Pod Stats](#cadvisor-less-cri-full-container-and-pod-stats)
+  - [Summary](#summary)
+  - [Current State](#current-state)
+    - [Current Fulfiller of Summary API Stats &amp; Future Proposal](#current-fulfiller-of-summary-api-stats--future-proposal)
+  - [Motivation](#motivation)
+    - [Goals](#goals)
+    - [Non-Goals](#non-goals)
+  - [Proposal](#proposal)
+      - [CRI Implementation](#cri-implementation)
+        - [ContainerStats additions](#containerstats-additions)
+        - [PodStats CRI additions](#podstats-cri-additions)
+      - [Kubelet](#kubelet)
+      - [cAdvisor](#cadvisor)
+    - [User Stories [optional]](#user-stories-optional)
+      - [Story 1](#story-1)
+      - [Story 2](#story-2)
+    - [Risks and Mitigations](#risks-and-mitigations)
+  - [Graduation Criteria](#graduation-criteria)
+  - [Implementation History](#implementation-history)
+  - [Drawbacks [optional]](#drawbacks-optional)
+  - [Alternatives [optional]](#alternatives-optional)
+    - [Notes [optional]](#notes-optional)
+<!-- /toc -->
+
+# cAdvisor-less, CRI-full Container and Pod Stats
+
+## Summary
+
+Kubelet relies on the [summary API](https://github.com/kubernetes/kubernetes/blob/release-1.20/staging/src/k8s.io/kubelet/pkg/apis/stats/v1alpha1/types.go) to provide node level stats, pod level stats and container level stats. 
+
+The [CRI API](https://github.com/kubernetes/cri-api) currently does not provide enough metrics to fully supply all the fields for the summary API. As such, today some summary API metrics are fetched from cAdvisor while others originate from calling into the CRI implementation. This results in an unclear origin of metrics, duplication of work done between cAdvisor and CRI, and performance implications. 
+
+This KEP aims to enhance the CRI API with enough metrics to be able to supplement the summary API directly from CRI alleviating the need for cAdvisor to collect pod and container level stats thereby ensuring only entity (CRI runtime) will collect container and pod level metrics.
+
+
+## Current State
+
+Summary API has two interfaces:
+* [cAdvisor stats provider](https://github.com/kubernetes/kubernetes/blob/release-1.20/pkg/kubelet/stats/cadvisor_stats_provider.go)
+    * Calls cAdvisor directly to obtain node, pod, and container stats
+* [CRI stats provider](https://github.com/kubernetes/kubernetes/blob/release-1.20/pkg/kubelet/stats/cri_stats_provider.go#L54)
+    * Calls CRI implementation to provide some minimum container level stats, but still relies on cAdvisor for the majority of container level stats + pod level stats + node level stats
+
+### Current Fulfiller of Summary API Stats & Future Proposal 
+
+Below is a table describing which stats come from what source now, as well a proposal of which should come from where in the future:
+
+| Top level object         | Field                | Currently provided by: | Proposed to be provided by: | 
+|--------------------------|----------------------|------------------------|---------------------|
+| InterfaceStats (Network) | RxBytes              | cAdvisor               | CRI                 |
+|                          | RxErrors             | cAdvisor               | CRI                 |
+|                          | TxBytes              | cAdvisor               | CRI                 |
+|                          | TxErrors             | cAdvisor               | CRI                 |
+| CPUStats                 | UsageNanoCores       | cAdvisor               | CRI                 |
+|                          | UsageCoreNanoSeconds | CRI                    | CRI                 |
+| MemoryStats              | AvailableBytes       | cAdvisor               | CRI                 |
+|                          | UsageBytes           | cAdvisor               | CRI                 |
+|                          | WorkingSetBytes      | CRI                    | CRI                 |
+|                          | RSSBytes             | cAdvisor               | CRI                 |
+|                          | PageFaults           | cAdvisor               | CRI                 |
+|                          | MajorPageFaults      | cAdvisor               | CRI                 |
+| AcceleratorStats         | Make                 | cAdvisor               | cAdvisor or N/A     |
+|                          | Model                | cAdvisor               | cAdvisor or N/A     |
+|                          | ID                   | cAdvisor               | cAdvisor or N/A     |
+|                          | MemoryTotal          | cAdvisor               | cAdvisor or N/A     |
+|                          | MemoryUsed           | cAdvisor               | cAdvisor or N/A     |
+|                          | DutyCycle            | cAdvisor               | cAdvisor or N/A     |
+| VolumeStats              | All Fields           | Kubelet                | Kubelet             |
+| Ephemeral Storage        | All Fields           | Kubelet                | Kubelet             |
+| Rootfs.FsStats           | AvailableBytes       | cAdvisor or N/A        | CRI or N/A          |
+|                          | CapacityBytes        | cAdvisor or N/A        | CRI or N/A          |
+|                          | UsedBytes            | CRI                    | CRI                 |
+|                          | InodesFree           | cAdvisor or N/A        | CRI or N/A          |
+|                          | Inodes               | cAdvisor or N/A        | CRI or N/A          |
+|                          | InodesUsed           | CRI                    | CRI                 |
+| UserDefinedMetrics       | All Fields           | cAdvisor               | cAdvisor or N/A     |
+
+
+## Motivation
+
+We want to avoid using cAdvisor for container & pod level stats and move metric collection to CRI runtime for the following reasons:
+
+* cAdvisor and metric dependency: CRI mission is not fully fulfilled - container runtime is not fully plugable 
+* It would be ideal to break the monolithic design of cAdvisor (i.e. cAdvisor currently needs to be aware of the underlying container runtimes) 
+* Duplicate stats are collected by both cAdvisor and the CRI runtime 
+    * Can lead to:
+        * Different information from different sources
+        * Confusion: Unclear origin of a given metric
+        * Performance degradations (increased CPU / Memory / etc) [xref][perf-issue]
+* Stats should be reported by the container runtime which knows behavior of the container/pod the best.
+* cAdvisor only supports runtimes that run processes on the host, not e.g. VM based runtime like Kata Containers
+* cAdvisor has big list of external dependencies since it needs to support multiple container runtimes itself - leads to large support surface
+
+[perf-issue]: https://github.com/kubernetes/kubernetes/issues/51798
+
+### Goals
+
+* Summary API CRI stats provider will use CRI runtime for all pod and container level stats; cAdvisor will only be used for node/machine stats.
+* Update the CRI to be able to report all of the pod-level stats that the Kubelet is not responsible for generating.
+* Add ability for cAdvisor to stop collecting pod and container level stats; any other dependencies on pod/container level stats in kubelet will use CRI stats.
+* Update the Kubelet to only gather the required metrics from each respective source of truth.
+
+### Non-Goals
+
+- Have CRI support Volume Plugin stats (responsibility of the Kubelet).
+- Have CRI support Ephemeral Storage stats (responsibility of the Kubelet).
+- Have CRI support metrics of the rootfs (responsibility of cAdvisor).
+- Have CRI support Accelerator metrics stats (deprecated already in summary API).
+- Have CRI support UserDefinedMetrics.
+
+## Proposal
+
+This KEP encompasses three different components, all related to the goal of preventing duplicated work in gathering stats, as well as clarifying which component is responsible for what. These three pieces will be described in terms of the three components in play: CRI runtime, Kubelet, cAdvisor:
+
+The high level steps to achieve desired goal are:
+
+1. The [`ContainerStats`](https://github.com/kubernetes/cri-api/blob/release-1.20/pkg/apis/runtime/v1/api.proto#L1303-L1312) CRI message needs to be extended to include all of the fields from Summary's API's [`ContainerStats`][#summary-container-stats-object].
+2. Add CRI Pod Level Stats message to CRI protobuf that includes all [Pod Level Stats][#summary-pod-stats-object] metrics from Summary API.
+3. Add support for the new CRI additions in supported container runtimes (CRI-O and containerd).
+4. Switch Kubelet's CRI stats provider from querying container and pod level stats from cAdvisor to newly added CRI pod and container level stats
+5. cAdvisor should stop collecting container and pod level stats. If any other components need container or pod level stats from cAdvisor, the CRI implementation should be queried instead.
+
+Below we describe each step in further detail.
+
+
+[summary-pod-stats-object]: https://github.com/kubernetes/kubernetes/blob/release-1.20/staging/src/k8s.io/kubelet/pkg/apis/stats/v1alpha1/types.go#L102-L132
+
+#### CRI Implementation
+The CRI implementation will need to be extended to support reporting the full set of container-level from the [Summary API][#summary-container-stats-object].
+
+
+
+##### ContainerStats additions
+Currently, the CRI endpoints `{,List}ContainerStats` report the following fields for each container:
+- CPU
+    - `usage_core_nano_seconds`
+- Memory
+    - `working_set_bytes`
+- Filesystem
+    - `inodes_used`
+    - `used_bytes`
+
+These correspond to some fields of the [ContainerStats][#summary-container-stats-object] object reported by cAdvisor. Each of the CRI Stats fields will have to be extended to support the remaining fields. Here are some concrete additions the CRI will need, as well as notes showing what fields correspond to what between CRI and cAdvisor (note: a table version of this can be seen above):
+```
+-// CpuUsage provides the CPU usage information.
+-message CpuUsage {
+-    // Timestamp in nanoseconds at which the information were collected. Must be > 0.
++    // Corresponds to Stats Summary API CPUStats Time field
+-    int64 timestamp = 1;
+-    // Cumulative CPU usage (sum across all cores) since object creation.
++    // Corresponds to Stats Summary API CPUStats UsageCoreNanoSeconds
+-    UInt64Value usage_core_nano_seconds = 2;
++    // Total CPU usage (sum of all cores) averaged over the sample window.
++    UInt64Value usage_nano_seconds = 3;
+-}
+
+-// MemoryUsage provides the memory usage information.
+-message MemoryUsage {
+-    // Timestamp in nanoseconds at which the information were collected. Must be > 0.
++    // Corresponds to Stats Summary API MemoryStats Time field
+-    int64 timestamp = 1;
+-    // The amount of working set memory in bytes.
++    // Corresponds to Stats Summary API MemoryStats WorkingSetBytes field
+-    UInt64Value working_set_bytes = 2;
++    // Available memory for use.  This is defined as the memory limit - workingSetBytes.
++    // If memory limit is undefined, the available bytes is omitted.
++    UInt64Value available_bytes = 3;
++    // Total memory in use. This includes all memory regardless of when it was accessed.
++    UInt64Value usage_bytes
++    // The amount of working set memory. This includes recently accessed memory,
++    // dirty memory, and kernel memory. WorkingSetBytes is <= UsageBytes
++    UInt64Value working_set_bytes = 4;
++    // The amount of anonymous and swap cache memory (includes transparent
++    // hugepages).
++    UInt64Value rss_bytes = 5;
++    // Cumulative number of minor page faults.
++    Uint64Value page_faults = 6;
++    // Cumulative number of major page faults.
++    Uint64Value major_page_faults = 6;
+-}
+
+-// FilesystemUsage provides the filesystem usage information.
+-message FilesystemUsage {
+-    // Timestamp in nanoseconds at which the information were collected. Must be > 0.
++    // Corresponds to Stats Summary API FsStats Time field
+-    int64 timestamp = 1;
+-    // The unique identifier of the filesystem.
++    // Does not correspond to any field in Stats Summary API FsStats
+-    FilesystemIdentifier fs_id = 2;
+-    // UsedBytes represents the bytes used for images on the filesystem.
+-    // This may differ from the total bytes used on the filesystem and may not
+-    // equal CapacityBytes - AvailableBytes.
+-    // Corresponds to Stats Summary API FsStats UsedBytes field
+-    UInt64Value used_bytes = 3;
+-    // InodesUsed represents the inodes used by the images.
+-    // This may not equal InodesCapacity - InodesAvailable because the underlying
+-    // filesystem may also be used for purposes other than storing images.
+-    // Corresponds to Stats Summary API FsStats InodesUsed field
+-    UInt64Value inodes_used = 4;
++    // TODO: Unclear how the remaining fields relate to container stats. Is it filled in cAdvisor?
++    // AvailableBytes represents the storage space available (bytes) for the filesystem.
++    UInt64Value available_bytes = 5;
++    // CapacityBytes represents the total capacity (bytes) of the filesystems underlying storage.
++    UInt64Value capacity_bytes = 6;
++    // InodesFree represents the free inodes in the filesystem.
++    UInt64Value inodes_free = 7;
++    // Inodes represents the total inodes in the filesystem.
++    UInt64Value inodes = 8;
+-}
+```
+
+**TODO**: `UserDefinedMetrics` aren't supported by CRI and are cAdvisor specific, does it make sense to even support them in new CRI interfaces?
+
+All together, below is the proposition for the new `ContainerStats` object:
+```
+-// ContainerStats provides the resource usage statistics for a container.
+-message ContainerStats {
+-    // Information of the container.
++    // Corresponds to the Stats Summary API ContainerStats Name field
+-    ContainerAttributes attributes = 1;
+-    // CPU usage gathered from the container.
++    // Corresponds to Stats Summary API ContainerStats CPUStats field
+-    CpuUsage cpu = 2;
+-    // Memory usage gathered from the container.
++    // Corresponds to Stats Summary API ContainerStats MemoryStats field
+-    MemoryUsage memory = 3;
+-    // Usage of the writable layer.
++    // Corresponds to Stats Summary API ContainerStats Rootfs field
+-    FilesystemUsage writable_layer = 4;
++    // Stats pertaining to container logs usage of filesystem resources
++    // Logs.UsedBytes is the number of bytes used for the container logs.
++    FilesystemUsage logs = 5;
+-}
+```
+**TODO**: in Stats Summary API ContainerStats object, there's a time field. Do we want to add this to `ContainerStats`?
+
+Note the omission of the Stats Summary API ContainerStats Accelerators field. With its [deprecation](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/1867-disable-accelerator-usage-metrics/README.md), it was deemed not worth implementing in the CRI.
+
+[summary-container-stats-object]: https://github.com/kubernetes/kubernetes/blob/release-1.20/staging/src/k8s.io/kubelet/pkg/apis/stats/v1alpha1/types.go#L135-L160
+
+##### PodStats CRI additions
+Previously, pod level stats were deemed the sole responsibility of the Kubelet. Currently, kubelet fetches pod level metrics by making use of cAdvisor to lookup the pod sandbox stats.
+
+
+Instead of doing so, this KEP proposes introducing a new set of CRI API endpoints, structured similarly to the `ContainerStats` endpoints, dedicated to obtaining `PodSandbox` stats.
+
+They will be defined as follows:
+
+```
+// Runtime service defines the public APIs for remote pod runtimes
+service RuntimeService {
+    ...
+    // PodSandboxStats returns stats of the pod. If the pod does not
+    // exist, the call returns an error.
+    rpc PodSandboxStats(PodSandboxStatsRequest) returns (PodSandboxStatsResponse) {}
+    // ListPodSandboxStats returns stats of all running pods.
+    rpc ListPodSandboxStats(ListPodSandboxStatsRequest) returns (ListPodSandboxStatsResponse) {}
+    ...
+}
+...
+
+message PodSandboxStatsRequest{
+    // ID of the pod for which to retrieve stats.
+    string pod_id = 1;
+}
+
+message PodSandboxStatsResponse {
+    // Stats of the pod.
+    PodSandboxStats stats = 1;
+}
+
+message ListPodSandboxStatsRequest{
+    // Filter for the list request.
+    PodSandboxStatsFilter filter = 1;
+}
+
+message ListPodSandboxStatsResponse {
+    // Stats of the pod.
+    repeated PodSandboxStats stats = 1;
+}
+
+// PodSandboxStats provides the resource usage statistics for a pod.
+message PodSandboxStats {
+    // Information of the pod.
+    // Corresponds to PodRef in SummaryAPI
+    PodSandboxAttributes attributes = 1;
+    // CPU usage gathered from the pod.
+    // Corresponds to Stats SummaryAPI CPUStats field
+    CpuUsage cpu = 2;
+    // Memory usage gathered from the pod.
+    // Corresponds to Stats SummaryAPI MemoryStats field
+    MemoryUsage memory = 3;
+    // TODO: do we want a start time field?
+    // The time at which data collection for the pod-scoped (e.g. network) stats was (re)started.
+    // int64 timestamp = 1;
+    // Stats of containers in the measured pod.
+    repeated ContainerStats containers
+    // Stats pertaining to CPU resources consumed by pod cgroup (which includes all containers' resource usage and pod overhead).
+    NetworkStats network = 4;
+    // Note the specific omission of VolumeStats and EphemeralStorage
+    // Each of these fields will be calculated Kubelet-level
+    // ProcessStats pertaining to processes.
+    ProcessStats process = 5;
+}
+
+// NetworkStats contains data about network resources.
+message NetworkStats {
+    // The time at which these stats were updated.
+    int64 timestamp = 1;
+
+    // Stats for the default interface, if found
+    InterfaceStats default_interface = 2;
+
+    repeated InterfaceStats interfaces = 3;
+}
+
+// InterfaceStats contains resource value data about interface.
+type InterfaceStats struct {
+    // The name of the interface
+    string name = 1;
+    // Cumulative count of bytes received.
+    Uint64Value rx_bytes = 2;
+    // Cumulative count of receive errors encountered.
+    Uint64Value rx_errors = 2;
+    // Cumulative count of bytes transmitted.
+    Uint64Value tx_bytes = 2;
+    // Cumulative count of transmit errors encountered.
+    Uint64Value tx_errors = 2;
+}
+
+// ProcessStats are stats pertaining to processes.
+message ProcessStats {
+    // Number of processes
+    Uint64Value process_count = 1;
+}
+```
+
+#### Kubelet
+
+Once all required CRI changes are completed, Kubelet can update it's CRI stats provider to stop fetching metrics from cAdvisor and instead obtain the metrics from the CRI for container and pods.
+
+* TODO: How to handle migration to newly created "full" CRI stats provider?
+    * Option #1: Modify existing CRI stats provider by removing all usage of cAdvisor. May lead to backwards compatibility issue due to possible differences between CRI and cAdvisor metric collection
+    * Option #2: Create a new stats provider backed by a feature gate that provides backward compatibility and allows transition to fully CRI backed stats provider 
+
+Additional work may be required to evaluate other kubelet components (e.g. eviction, preemption, etc) that may be relying on container or pod level metrics. Ideally all components will rely on summary API thereby alleviating need for cAdvisor for container and pod level stats. This is also a requirement to be able to disable cAdvisor container metrics collection.
+
+#### cAdvisor
+
+Once CRI and Kubelet stats provider level changes are in place, we can evaluate disabling cAdvisor from collecting container and pod level stats. We may need to introduce new ability in cAdvisor to disable on-demand collection for certain cgroup hierarchies to ensure we can continue using cAdvisor for only node/machine level stats. 
+
+### User Stories [optional]
+
+#### Story 1
+As a cluster admin, I would like my node to be as performant as possible, and not waste resources by duplicating stats collection by CRI and cAdvisor.
+
+#### Story 2
+As a node maintainer, I would like to clarify the sources of truth for each stat the Kubelet reports to the Summary API, so that I can better find which component is to investigate for any issues.
+
+
+### Risks and Mitigations
+
+- To properly move to CRI stats, it is likely there are some metrics that we'll want to not support (Accelerator/UserDefined). We should be careful to not break entities that rely on these metrics.
+- A large part of this work is changing the source of the Summary API metrics. In doing so, there is a risk that collecting from a new source will change how the stats look in aggregate (and risk bugs popping up in new areas).
+- cAdvisor has a long history of collecting these stats. There is a risk that changing the source of the stats to the CRI implementation can cause performance regressions
+as cAdvisor is fine tuned to perform in an adequate manner.
+    - CRI implementations should do performance regression analyses to ensure the change does not regress too much.
+
+## Graduation Criteria
+
+
+## Implementation History
+
+
+## Drawbacks [optional]
+
+CRI runtimes will each have to implement additional interface to support full stats, rather than all metric collection being unified by cAdvisor. 
+
+Note: This is by design as this will enable to decouple runtime implementation details further from kubelet.
+
+## Alternatives [optional]
+
+Instead of teaching CRI how to do *everything* cAdvisor does, we could instead have cAdvisor not do the work the CRI stats end up doing (specifically when reporting disk stats, which are the most expensive operation to report). However, this doesn't address the anti-pattern of having multiple parties confusingly responsible for a wide array of metrics and other issues described.
+
+### Notes [optional]
+
+**TODOs**:
+
+* The fields in PodSandboxStats will largely mirror ContainerStats, any room to simplify that?
+* Should we support [User Defined metrics](https://github.com/google/cadvisor/blob/master/docs/application_metrics.md)? This feature is alpha and does not appear to be widely used?
+* There are some fields that are defined in SummaryAPI (e.g time in a lot). Do we add those?
+* How do we tell cAdvisor to not collect container metrics? Below are some options:
+* - Give it a bogus container storage/cgroup root
+* - Flag to cAdvisor
+* - Configuration field in cAdvisor
+* Who collects container log metrics? Kubelet/cAdvisor/CRI?
diff --git a/keps/sig-node/2371-cri-pod-container-stats/kep.yaml b/keps/sig-node/2371-cri-pod-container-stats/kep.yaml
new file mode 100644
index 000000000000..062a134d4011
--- /dev/null
+++ b/keps/sig-node/2371-cri-pod-container-stats/kep.yaml
@@ -0,0 +1,23 @@
+kep-number: 2371
+title: cAdvisor-less, CRI-full Container and Pod Stats
+authors:
+  - "@haircommander"
+  - "@bobbypage"
+owning-sig: sig-node
+participating-sigs:
+reviewers:
+  - "@derekwaynecarr"
+  - "@mrunalp"
+  - "TODO containerd reviewer"
+approvers:
+  - TBD
+editor: TBD
+creation-date: 2021-01-27
+last-updated: 2021-01-27
+status: provisional
+see-also:
+  - TODO
+replaces:
+  - TODO/N/A?
+superseded-by:
+  - N/A