
Calico node spikes with memory usage after upgrading to 3.20 #4810

Closed
invidian opened this issue Aug 5, 2021 · 10 comments

Comments

@invidian
Contributor

invidian commented Aug 5, 2021

Expected Behavior

Memory usage of the calico-node DaemonSet remains stable.

Current Behavior

Memory usage is now spiky, occasionally triggering OOM on small nodes. This graph shows memory usage after updating from 3.19.1 to 3.20.0.
(screenshot: Selection_444)

Steps to Reproduce (for bugs)

  1. helm repo add flexkube https://flexkube.github.io/charts/
  2. helm upgrade --install --wait -n kube-system calico flexkube/calico
  3. Wait.
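
To watch the DaemonSet's memory while waiting, something like the following does the job (this assumes metrics-server is installed and that the chart uses the usual k8s-app=calico-node label; adjust the selector if not):

  # Poll per-pod memory usage of calico-node every 30 seconds
  watch -n 30 kubectl top pod -n kube-system -l k8s-app=calico-node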

Context

Your Environment

  • Calico version
    Image: docker.io/calico/kube-controllers:v3.20.0
    Image: docker.io/calico/cni:v3.20.0
    Image: docker.io/calico/cni:v3.20.0
    Image: docker.io/calico/pod2daemon-flexvol:v3.20.0
    Image: docker.io/calico/node:v3.20.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T20:59:07Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
  • Operating System and version:
Flatcar Container Linux by Kinvolk 2905.2.1 (Oklo), kernel 5.10.55-flatcar

My config:

data:
  calico_backend: bird
  cni_network_config: |-
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "calico",
          "log_level": "info",
          "log_file_path": "/var/log/calico/cni/cni.log",
          "datastore_type": "kubernetes",
          "nodename": "__KUBERNETES_NODE_NAME__",
          "mtu": __CNI_MTU__,
          "nodename_file_optional": true,
          "ipam": {
              "type": "calico-ipam"
          },
          "policy": {
              "type": "k8s"
          },
          "kubernetes": {
              "kubeconfig": "__KUBECONFIG_FILEPATH__"
          }
        },
        {
          "type": "portmap",
          "snat": true,
          "capabilities": {"portMappings": true}
        },
        {
          "type": "bandwidth",
          "capabilities": {"bandwidth": true}
        }
      ]
    }
  typha_service_name: none
  veth_mtu: "1430"
@caseydavenport
Member

@invidian thanks for the report - could you share a little bit more information about your setup?

Where is this cluster running? Are you using iptables or BPF mode? VXLAN or BGP? What query are you using to measure calico/node memory usage?

I'd like to attempt to reproduce this as closely as possible.

@caseydavenport
Member

@invidian if possible, it would be really helpful if you could collect a felix memory profile as well.

You can do that by:

  • Setting the FelixConfiguration option DebugMemoryProfilePath=/path/to/profile
  • Sending a SIGUSR1 to felix to trigger generation of a pprof: e.g., sudo pkill -USR1 calico-felix within calico/node.

That should help me identify where the memory is being used.
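
Concretely, that could look something like the sketch below. The profile path and the pod name are placeholders, and the camelCase field name debugMemoryProfilePath is my reading of the DebugMemoryProfilePath option, so adjust as needed:

  # Point Felix at a profile path on the default FelixConfiguration
  calicoctl patch felixconfiguration default --patch='{"spec":{"debugMemoryProfilePath":"/tmp/felix-mem.pprof"}}'

  # Trigger a profile dump inside the calico-node pod on the affected node
  kubectl exec -n kube-system <calico-node-pod> -- pkill -USR1 calico-felix

  # Copy the profile off the pod (requires tar in the container)
  kubectl cp kube-system/<calico-node-pod>:/tmp/felix-mem.pprof ./felix-mem.pprof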

@invidian
Contributor Author

@caseydavenport thanks for having a look. I'll try to collect some information next week. It seems GitHub didn't send any notifications about your comments to me, perhaps because of the outage yesterday.

@george-angel
Contributor

Hello, we are seeing the same behaviour:

(screenshot: 2021-08-23-104809_2560x1368_scrot)

Calico v3.20.0, using WireGuard; the graph shows the container_memory_working_set_bytes metric.
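
A query along these lines should reproduce a graph like the one above (the Prometheus URL, namespace, and pod name pattern are assumptions, so adjust them for your deployment):

  # Per-pod working-set memory of calico-node via the Prometheus HTTP API
  curl -sG 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=max by (pod) (container_memory_working_set_bytes{namespace="kube-system", pod=~"calico-node.*"})'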

Will work on getting your pprof 👍

@george-angel
Contributor

(screenshot: 2021-08-23-153024_2528x632_scrot)

It has only been running for 3 hours, but you can see the gradual increase.

felix-mem.pb.gz

Thanks very much to Shaun for helping me get the pprof.

@george-angel
Contributor

After 18 hours:

(screenshot: 2021-08-24-083041_2528x631_scrot)

felix-mem-2021-08-24.pb.gz

@fasaxc
Member

fasaxc commented Nov 9, 2021

Cracking the profiles yields the attached memory diff. Since most of the memory is leaked from startm(), which is the runtime function that starts system threads, I think this is likely to be a duplicate of #5018, which has already been fixed in v3.21.0 (and the upcoming v3.20.3 patch release).

Memory profile diff
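
For anyone who wants to dig into the uploaded profiles themselves, the standard pprof tooling reads them directly; a minimal sketch using the two attachments above:

  # Top allocation sites in the 18-hour profile, diffed against the earlier one
  go tool pprof -top -diff_base=felix-mem.pb.gz felix-mem-2021-08-24.pb.gz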

@fasaxc
Member

fasaxc commented Nov 9, 2021

@george-angel @invidian Please try v3.21.0 and let us know if that resolves the issue.

@invidian
Contributor Author

invidian commented Nov 9, 2021

I'm testing v3.21.0 right now and things look good so far. When I downgrade to v3.20.2, I definitely see a sudden increase in the go_threads and go_goroutines metrics. See the graph below:
(screenshot: Selection_498)
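
For reference, go_threads and go_goroutines come straight from Felix's Prometheus endpoint (it has to be enabled via prometheusMetricsEnabled in FelixConfiguration; 9091 should be the default port), so they are easy to spot-check on a node:

  # Spot-check Felix's Go runtime thread and goroutine counts
  curl -s http://localhost:9091/metrics | grep -E '^go_(threads|goroutines) '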

I think it can be closed in favor of #5018 then!

@fasaxc
Member

fasaxc commented Nov 9, 2021

Great, thanks for letting us know.

@fasaxc fasaxc closed this as completed Nov 9, 2021