
Calico node spikes with memory usage after upgrading to 3.20 #4810

Closed
invidian opened this issue Aug 5, 2021 · 10 comments

Comments

@invidian
Contributor

invidian commented Aug 5, 2021

Expected Behavior

Memory usage of the calico-node DaemonSet remains stable.

Current Behavior

Memory usage is now spiky, occasionally triggering OOM on small nodes. This graph shows memory usage after updating from 3.19.1 to 3.20.0.
(screenshot: Selection_444)

Steps to Reproduce (for bugs)

  1. helm repo add flexkube https://flexkube.github.io/charts/
  2. helm upgrade --install --wait -n kube-system calico flexkube/calico
  3. Wait.
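
To watch the DaemonSet's memory while waiting, something like the following does the job (this assumes metrics-server is installed and that the chart uses the usual k8s-app=calico-node label; adjust the selector if not):

  # Poll per-pod memory usage of calico-node every 30 seconds
  watch -n 30 kubectl top pod -n kube-system -l k8s-app=calico-node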

Context

Your Environment

  • Calico version
    Image: docker.io/calico/kube-controllers:v3.20.0
    Image: docker.io/calico/cni:v3.20.0
    Image: docker.io/calico/cni:v3.20.0
    Image: docker.io/calico/pod2daemon-flexvol:v3.20.0
    Image: docker.io/calico/node:v3.20.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T20:59:07Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
  • Operating System and version:
Flatcar Container Linux by Kinvolk 2905.2.1 (Oklo), kernel 5.10.55-flatcar

My config:

data:
  calico_backend: bird
  cni_network_config: |-
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "calico",
          "log_level": "info",
          "log_file_path": "/var/log/calico/cni/cni.log",
          "datastore_type": "kubernetes",
          "nodename": "__KUBERNETES_NODE_NAME__",
          "mtu": __CNI_MTU__,
          "nodename_file_optional": true,
          "ipam": {
              "type": "calico-ipam"
          },
          "policy": {
              "type": "k8s"
          },
          "kubernetes": {
              "kubeconfig": "__KUBECONFIG_FILEPATH__"
          }
        },
        {
          "type": "portmap",
          "snat": true,
          "capabilities": {"portMappings": true}
        },
        {
          "type": "bandwidth",
          "capabilities": {"bandwidth": true}
        }
      ]
    }
  typha_service_name: none
  veth_mtu: "1430"
@caseydavenport
Member

@invidian thanks for the report - could you share a little bit more information about your setup?

Where is this cluster running? Are you using iptables or BPF mode? VXLAN or BGP? What query are you using to measure calico/node memory usage?

I'd like to attempt to reproduce this as closely as possible.

@caseydavenport
Member

@invidian if possible, it would be really helpful if you could collect a felix memory profile as well.

You can do that by:

  • Setting the FelixConfiguration option DebugMemoryProfilePath=/path/to/profile
  • Sending a SIGUSR1 to felix to trigger generation of a pprof: e.g., sudo pkill -USR1 calico-felix within calico/node.

That should help me identify where the memory is being used.
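
Concretely, that could look something like the sketch below. The profile path and the pod name are placeholders, and the camelCase field name debugMemoryProfilePath is my reading of the DebugMemoryProfilePath option, so adjust as needed:

  # Point Felix at a profile path on the default FelixConfiguration
  calicoctl patch felixconfiguration default --patch='{"spec":{"debugMemoryProfilePath":"/tmp/felix-mem.pprof"}}'

  # Trigger a profile dump inside the calico-node pod on the affected node
  kubectl exec -n kube-system <calico-node-pod> -- pkill -USR1 calico-felix

  # Copy the profile off the pod (requires tar in the container)
  kubectl cp kube-system/<calico-node-pod>:/tmp/felix-mem.pprof ./felix-mem.pprof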

@invidian
Contributor Author

@caseydavenport thanks for having a look. I'll try to collect some information next week. It seems GitHub didn't send any notifications about your comments to me, perhaps because of the outage yesterday.

@george-angel
Contributor

Hello, we are seeing the same behaviour:

(screenshot: 2021-08-23-104809_2560x1368_scrot)

Calico v3.20.0, using WireGuard; the graph shows the container_memory_working_set_bytes metric.
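
A query along these lines should reproduce a graph like the one above (the Prometheus URL, namespace, and pod name pattern are assumptions, so adjust them for your deployment):

  # Per-pod working-set memory of calico-node via the Prometheus HTTP API
  curl -sG 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=max by (pod) (container_memory_working_set_bytes{namespace="kube-system", pod=~"calico-node.*"})'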

Will work on getting your pprof 👍

@george-angel
Contributor

(screenshot: 2021-08-23-153024_2528x632_scrot)

It has only been running for 3 hours, but you can see the gradual increase.

felix-mem.pb.gz

Thanks very much to Shaun for helping me get the pprof.

@george-angel
Contributor

After 18 hours:

(screenshot: 2021-08-24-083041_2528x631_scrot)

felix-mem-2021-08-24.pb.gz

@fasaxc
Member

fasaxc commented Nov 9, 2021

Cracking the profiles yields the attached memory diff. Since most of the memory is leaked from startm(), which is the runtime function that starts system threads, I think this is likely to be a duplicate of #5018, which has already been fixed in v3.21.0 (and the upcoming v3.20.3 patch release).

Memory profile diff
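
For anyone who wants to dig into the uploaded profiles themselves, the standard pprof tooling reads them directly; a minimal sketch using the two attachments above:

  # Top allocation sites in the 18-hour profile, diffed against the earlier one
  go tool pprof -top -diff_base=felix-mem.pb.gz felix-mem-2021-08-24.pb.gz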

@fasaxc
Member

fasaxc commented Nov 9, 2021

@george-angel @invidian Please try v3.21.0 and let us know if that resolves the issue.

@invidian
Contributor Author

invidian commented Nov 9, 2021

I'm testing v3.21.0 right now and things look good so far. When I downgrade to v3.20.2, I definitely see a sudden increase in the go_threads and go_goroutines metrics. See the graph below:
(screenshot: Selection_498)
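
For reference, go_threads and go_goroutines come straight from Felix's Prometheus endpoint (it has to be enabled via prometheusMetricsEnabled in FelixConfiguration; 9091 should be the default port), so they are easy to spot-check on a node:

  # Spot-check Felix's Go runtime thread and goroutine counts
  curl -s http://localhost:9091/metrics | grep -E '^go_(threads|goroutines) '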

I think it can be closed in favor of #5018 then!

@fasaxc
Member

fasaxc commented Nov 9, 2021

Great, thanks for letting us know.

@fasaxc fasaxc closed this as completed Nov 9, 2021