
Support combining SSDs into a raid #5085
Merged: 2 commits merged into main on Aug 6, 2021

Conversation

@csweichel (Contributor)

Useful for GKE deployments
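
For context, a minimal sketch of the general technique the PR title describes (combining a node's local SSDs into a single RAID-0 volume). This is not the PR's actual script; the device names, array name, and mount point below are assumptions:

# Sketch only: combine two local SSDs into one RAID-0 array and mount it.
# Device names, array name, and mount point are assumptions, not taken from this PR.
DISKS=(/dev/sdb /dev/sdc)
mdadm --create /dev/md0 --level=0 --raid-devices="${#DISKS[@]}" "${DISKS[@]}"
mkfs.ext4 -F /dev/md0                  # format the assembled array
mkdir -p /mnt/disks/raid0
mount /dev/md0 /mnt/disks/raid0        # mount it for workspace storage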

@roboquat roboquat requested a review from mrsimonemms August 6, 2021 07:24
@roboquat roboquat added the size/M label Aug 6, 2021
@mrsimonemms (Contributor) left a comment

/lgtm

@roboquat (Contributor) commented Aug 6, 2021

LGTM label has been added.

Git tree hash: cdd2fb617237261659421cd5d2f86682773fd08f

@csweichel (Contributor, Author)

/approve no-issue

@roboquat (Contributor) commented Aug 6, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csweichel, MrSimonEmms

Associated issue requirement bypassed by: csweichel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [MrSimonEmms,csweichel]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@roboquat roboquat merged commit 3b93768 into main Aug 6, 2021
@roboquat roboquat deleted the cw/ssd-raid branch August 6, 2021 07:33
@jankeromnes (Contributor)

@csweichel @aledbf @mrsimonemms Hi!

FYI, this change seems to be related to a cluster incident in which a ws-daemon pod keeps crashlooping (this has happened several times since this PR was deployed).

The symptoms are:

  • A ws-daemon pod keeps crashlooping, triggering an alert

  • kubectl describe shows this event:

$ kubectl describe pod ws-daemon-zc64j
[...]
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Warning  BackOff  112s (x390 over 86m)  kubelet  Back-off restarting failed container
  • Getting the logs for all containers shows that it is the first init container, raid-local-disks, that fails with:
...
+ DISK_DEV_SUFFIX=(c d e f g h i j)
+ MAX_NUM_DISKS=8
+ NUM_DISKS=0
+ declare -a DISKS
++ seq 0 7
+ for i in `seq 0 $((MAX_NUM_DISKS-1))`
+ CURR_DISK=/dev/sdc
+ '[' '!' -b /dev/sdc ']'
+ break
+ '[' 0 -eq 0 ']'
+ echo 'no local disks detected!'
+ exit 1
no local disks detected!
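
(For reference, the trace above corresponds roughly to a detection loop like the following sketch; the actual raid-local-disks script may differ in its details.)

#!/bin/bash
# Rough reconstruction from the xtrace above, not the actual script:
# scan /dev/sdc through /dev/sdj for block devices and fail if none are found.
set -euxo pipefail

DISK_DEV_SUFFIX=(c d e f g h i j)
MAX_NUM_DISKS=8
NUM_DISKS=0
declare -a DISKS

for i in $(seq 0 $((MAX_NUM_DISKS - 1))); do
    CURR_DISK="/dev/sd${DISK_DEV_SUFFIX[$i]}"
    # stop at the first suffix that is not a block device
    if [ ! -b "$CURR_DISK" ]; then
        break
    fi
    DISKS[$NUM_DISKS]="$CURR_DISK"
    NUM_DISKS=$((NUM_DISKS + 1))
done

if [ "$NUM_DISKS" -eq 0 ]; then
    echo "no local disks detected!"
    exit 1
fi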

@jankeromnes (Contributor)

Update: We now believe the code is behaving as expected (also taking #5096 into account), i.e. ws-daemon intentionally fails to init when setupSSDRaid is enabled but there are no disks for it.

The incident looks more like a configuration problem: there is a headless pool in EU, which is unexpected, and it seems to both enable setupSSDRaid and have no disks for it, leading its single ws-daemon to crashloop.
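
For illustration, one way to sanity-check a pool before enabling setupSSDRaid is to verify that its nodes actually report local SSDs. A minimal sketch, assuming GKE's local-SSD node label (the label name is an assumption, not something defined by this PR):

# List nodes together with the assumed GKE local-SSD label; pools whose nodes
# show no value here should not have setupSSDRaid enabled.
kubectl get nodes -L cloud.google.com/gke-local-ssd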
