
[Release-1.29] - K3s etcd snapshot reconcile consumes excessive memory when a large number of snapshots are present #10560

Closed
brandond opened this issue Jul 24, 2024 · 1 comment
@brandond (Member) commented Jul 24, 2024

Backport fix for

@brandond brandond changed the title [Release-1.29] - Multiple simultaneous snapshots result in silent failure and/or corruption of at least one snapshot [Release-1.29] - K3s etcd snapshot reconcile consumes excessive memory when a large number of snapshots are present Jul 24, 2024
@caroline-suse-rancher caroline-suse-rancher added this to the v1.29.8+k3s1 milestone Jul 29, 2024
@caroline-suse-rancher caroline-suse-rancher moved this from New to Next Up in K3s Development Jul 29, 2024
@aganesh-suse commented

Validated on release-1.29 branch with commit d416975

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"

$ uname -m
x86_64

Setup Size: 4 GB memory, 2 vCPU, 30 GB disk.

Cluster Configuration:

HA: 3 servers / 1 agent

Config.yaml:

token: xxxx
cluster-init: true
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1
node-label:
- k3s-upgrade=server

etcd-snapshot-retention: 255
etcd-snapshot-schedule-cron: "* * * * *"
etcd-s3: true
etcd-s3-access-key: <access_key>
etcd-s3-secret-key: <secret_key>
etcd-s3-bucket: <bucket>
etcd-s3-folder: <folder>
etcd-s3-region: <region>
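
With `etcd-snapshot-schedule-cron: "* * * * *"` and `etcd-snapshot-retention: 255`, scheduled snapshots accumulate one per minute, so the retention cap is reached in just over four hours. A trivial sanity check of that arithmetic (a sketch, not part of the original test run):

```shell
# 255 snapshots at one per minute -> time to hit the retention cap, in hours
awk 'BEGIN { printf "%.2f hours to reach 255 snapshots\n", 255 / 60 }'
```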

Testing Steps

  1. Copy config.yaml:
$ sudo mkdir -p /etc/rancher/k3s && sudo cp config.yaml /etc/rancher/k3s
  2. Install k3s:
curl -sfL https://get.k3s.io | sudo INSTALL_K3S_COMMIT='d416975b02a38f6a8f691c40d7633c2941b7a4cd' sh -s - server
  3. Verify cluster status:
kubectl get nodes -o wide
kubectl get pods -A
  4. Apply the k3s extra snapshot metadata:
kubectl apply -f https://gist.githubusercontent.com/aganesh-suse/52c3d6c3d7fe70141fa3a49431ac0032/raw/20039a159ab0f5fce1930f5ec12f6afc2b034784/k3s-etcd-snapshot-extra-metadata.yaml
  5. Monitor the memory usage of k3s.service while taking snapshots every minute, up to 255 snapshots:
# Take ON_DEMAND_SNAPSHOT_COUNT on-demand snapshots back to back
for (( I=0; I < "${ON_DEMAND_SNAPSHOT_COUNT}"; I++ ))
do
    sudo k3s etcd-snapshot save
    sleep 5
done

# Append the k3s lines from top(1) to a log once per second
write_mem_usage_k3s_to_file () {
    while true; do
        top -b -n 1 | grep k3s | tee -a top-output.log
        sleep 1
    done
}

# Live-plot the k3s resident set size (MiB) with ttyplot
ttyplot_k3s_memory () {
    K3S_PID=$(pgrep -o k3s)   # oldest matching process, avoids matching grep itself
    while :; do grep -oP '^VmRSS:\s+\K\d+' /proc/$K3S_PID/status \
    | numfmt --from-unit Ki --to-unit Mi; sleep 1; done | ttyplot -u Mi
}

P.S.: We run out of disk space before running out of memory at ~280 snapshots, so the snapshot count is capped at 255 for testing purposes.
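The Max/Avg memory figures reported below can be derived from `top-output.log`. A minimal sketch of that aggregation, assuming top's default column layout (RES in KiB is field 6) and using a couple of fabricated sample lines in place of a real log:

```shell
# Fabricated sample of top(1) batch output lines for the k3s process
cat > top-output.log <<'EOF'
 1234 root  20  0 4100000 1879000 50000 S  5.0 45.0  10:00.00 k3s-server
 1234 root  20  0 4000000 1770000 50000 S  4.0 42.0  10:01.00 k3s-server
EOF

# Convert RES (field 6, KiB) to MiB and report the max and average
awk '{ res = $6 / 1024
       if (res > max) max = res
       sum += res; n++ }
     END { printf "Max: %d MiB, Avg: %d MiB\n", max, sum / n }' top-output.log
```

On a real log, `%MEM` (field 10) could be aggregated the same way to produce the %M column.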

Replication Results:

  • k3s version used for replication:
$ k3s -v
k3s version v1.29.7+k3s1 (f246bbc3)
go version go1.22.5

Validation Results:

  • k3s version used for validation:
$ k3s -v
k3s version v1.29.7+k3s-d416975b (d416975b)
go version go1.22.5

Memory Usage Comparison Results

Comparison of %MEM, max memory used (MB), and average memory (MB) between the released version and the latest commit, at snapshot counts of 100, 120, 150, 170, 200, 230, 250, and 255.

| Snapshots | v1.29.7 %MEM | v1.29.7 Max (MB) | v1.29.7 Avg (MB) | d416975b %MEM | d416975b Max (MB) | d416975b Avg (MB) |
|-----------|--------------|------------------|------------------|---------------|-------------------|-------------------|
| 100       | 45           | 1879             | 1770             | 44            | 1739              | 1645              |
| 120       | 53           | 2079             | 1956             | 44            | 1876              | 1788              |
| 150       | 59           | 2529             | 2244             | 52            | 2073              | 1984              |
| 170       | 60           | 2612             | 2429             | 58            | 2263              | 2169              |
| 200       | 70           | 2890             | 2670             | 64            | 2480              | 2427              |
| 230       | 72           | 2894             | 2728             | 62            | 2489              | 2261              |
| 250       | 70           | 3233             | 2863             | 60            | 2662              | 2418              |
| 255       | 78           | 3233             | 2840             | 70            | 2786              | 2573              |
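The per-row reduction implied by the Avg columns can be checked mechanically. A sketch, with the table rows copied into a temp file (note these percentages come from the Avg columns only, not the %MEM figures):

```shell
# Columns: snapshot count, v1.29.7 Avg (MB), d416975b Avg (MB)
cat > results.txt <<'EOF'
100 1770 1645
120 1956 1788
150 2244 1984
170 2429 2169
200 2670 2427
230 2728 2261
250 2863 2418
255 2840 2573
EOF

# Relative reduction: (old - new) / old, per row
awk '{ printf "%s snapshots: %.1f%% less avg memory\n", $1, 100 * ($2 - $3) / $2 }' results.txt
```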

Observations

Up to about 120 snapshots, the new commit uses roughly 2% less memory than the released version.
On average, the gap widens to ~5% by 200 snapshots, and to 10% or more beyond 200 snapshots.

@github-project-automation github-project-automation bot moved this from Next Up to Done Issue in K3s Development Aug 13, 2024