
[BUG] Running out of disk space during upgrade from v0.6 or v0.7 where the default disks are 32 GB #2161

Closed
przemyslavic opened this issue Mar 26, 2021 · 1 comment
przemyslavic commented Mar 26, 2021

Describe the bug
In versions 0.6 and 0.7 there was no dedicated repository machine; epirepo was installed on the Kubernetes master VM, whose disk is 32 GB by default. When upgrading to develop, the machine runs out of disk space, which causes Docker to delete images at random to free up space. The upgrade process is then aborted with an error when it tries to tag images that no longer exist.
It looks like we need to clean up old and unnecessary images and packages to free disk space before downloading the new requirements. Otherwise the disk has to be extended before the upgrade, which will not be an easy solution.
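A minimal sketch of such a pre-upgrade cleanup could look like the following. This is an illustration only, not the actual epicli implementation; the choice of commands (image prune, yum cache cleanup) is an assumption about what would reclaim space on these nodes.

```shell
# Hypothetical pre-upgrade cleanup sketch (illustration only -- the
# commands are assumptions, not the actual epicli implementation).

# Reclaim space from unused Docker images, if Docker is present.
if command -v docker >/dev/null 2>&1; then
    docker image prune -af || true
fi

# Drop cached packages left behind by the old epirepo
# (RHEL/CentOS family; Debian/Ubuntu would use 'apt-get clean').
if command -v yum >/dev/null 2>&1; then
    yum clean all || true
fi

# Report how much space is left on the root filesystem, in kilobytes.
avail_kb=$(df -Pk / | awk 'NR==2 {print $4}')
echo "available_kb=${avail_kb}"
```

Note that `docker image prune -af` removes all images not referenced by a container, so it would have to run only after the old cluster components are confirmed safe to rebuild from the new epirepo.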

How to reproduce
Steps to reproduce the behavior:

  1. Deploy a 0.6 (or 0.7) cluster with the Kubernetes master component enabled - run epicli apply from the v0.6/v0.7 branch
  2. Upgrade the cluster to the develop branch - run epicli upgrade from the develop branch

Expected behavior
The cluster is upgraded successfully.

Environment

  • Cloud provider: [all]
  • OS: [all]

epicli version: [epicli --version]

Additional context

2021-03-25T17:28:16.6926428Z 17:28:16 INFO cli.engine.ansible.AnsibleCommand - TASK [image_registry : Tag k8s.gcr.io/kube-scheduler:v1.15.10 image with ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com:5000/k8s.gcr.io/kube-scheduler:v1.15.10] ***
2021-03-25T17:28:18.2723196Z 17:28:18 ERROR cli.engine.ansible.AnsibleCommand - fatal: [ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com]: FAILED! => {"changed": true, "cmd": ["docker", "tag", "k8s.gcr.io/kube-scheduler:v1.15.10", "ec2-xx-xx-xx-xx.eu-west-3.compute.amazonaws.com:5000/k8s.gcr.io/kube-scheduler:v1.15.10"], "delta": "0:00:00.057406", "end": "2021-03-25 17:28:18.042186", "msg": "non-zero return code", "rc": 1, "start": "2021-03-25 17:28:17.984780", "stderr": "Error response from daemon: No such image: k8s.gcr.io/kube-scheduler:v1.15.10", "stderr_lines": ["Error response from daemon: No such image: k8s.gcr.io/kube-scheduler:v1.15.10"], "stdout": "", "stdout_lines": []}
[ec2-user@ec2-xx-xx-xx-xx ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        1.8G     0  1.8G   0% /dev
tmpfs           1.9G     0  1.9G   0% /dev/shm
tmpfs           1.9G   18M  1.9G   1% /run
tmpfs           1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/nvme0n1p2   30G   25G  5.5G  82% /
tmpfs           373M     0  373M   0% /run/user/1000
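The "No such image" failure in the log above only surfaces mid-upgrade, at tag time. A pre-flight existence check could fail fast instead. A minimal sketch (assumed, not part of the epicli codebase):

```shell
# Illustrative pre-flight check (not from the epicli codebase): verify
# that an image is still present locally before attempting to tag it,
# so a deleted image is reported up front instead of surfacing as
# "No such image" in the middle of the upgrade.
image="k8s.gcr.io/kube-scheduler:v1.15.10"

if command -v docker >/dev/null 2>&1 \
   && docker image inspect "$image" >/dev/null 2>&1; then
    status=present
else
    status=missing
fi
echo "status=$status"
```

Run against each image in the registry manifest, this would turn a random mid-task abort into a clear list of images that need to be re-pulled.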

DoD checklist

  • Changelog updated (if affected version was released)
  • COMPONENTS.md updated / doesn't need to be updated
  • Automated tests passed (QA pipelines)
    • apply
    • upgrade
  • Case covered by automated test (if possible)
  • Idempotency tested
  • Documentation updated / doesn't need to be updated
  • All conversations in PR resolved
przemyslavic (Collaborator, Author) commented:

The fix works in the sense that it removes unnecessary files, packages and images. For some older clusters whose repository disks are 32 GB it is still insufficient (the OS disks have to be extended there), but it brings additional improvements anyway, so it is worth applying.
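For the clusters where cleanup alone is not enough, a free-space check before the upgrade would tell operators up front that the OS disk needs extending. A sketch, where the 10 GB threshold is an assumed value and not an epicli default:

```shell
# Illustrative threshold check (the 10 GB figure is an assumption, not
# an epicli default): warn before upgrading when the root filesystem is
# low on space, since 32 GB disks may need to be extended first.
min_kb=$((10 * 1024 * 1024))   # ~10 GB expressed in kilobytes
avail_kb=$(df -Pk / | awk 'NR==2 {print $4}')

if [ "$avail_kb" -lt "$min_kb" ]; then
    echo "WARNING: only ${avail_kb} KB free; consider extending the OS disk"
else
    echo "OK: ${avail_kb} KB free"
fi
```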

@mkyc mkyc closed this as completed Apr 1, 2021