osd: make osd pod sleep when osds are flapping
When OSDs flap, Ceph stops the OSD daemon if it is marked down more than
5 times in 600 seconds. But the OSD pod restarts and marks the OSD `up` again.
This causes the PGs mapped to these OSDs to peer, and while the PGs are peering,
IO to these PGs is blocked.

So we need to ensure that if Ceph marks an OSD `down` due to flapping, the OSD
pod does not restart and mark the OSD `up` again.

This PR adds a sleep to the OSD pod if the container returned with a 0 exit code.
The default behavior is to sleep for 24 hours, but the user can configure the
interval from the Ceph cluster spec.
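
For reference, the "marked down more than 5 times in 600 seconds" threshold comes from Ceph's own markdown tracking, not from Rook. A sketch of the Ceph options involved, with their assumed defaults (this PR does not change them):

[osd]
# Assumed Ceph defaults: an OSD daemon that is marked down more than
# osd_max_markdown_count times within osd_max_markdown_period seconds
# stops itself instead of rejoining the cluster.
osd_max_markdown_count = 5
osd_max_markdown_period = 600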

Signed-off-by: sp98 <[email protected]>
(cherry picked from commit 4eb9f62)
Signed-off-by: sp98 <[email protected]>
sp98 committed Sep 19, 2023
1 parent f8d5b3d commit 68578db
Showing 12 changed files with 132 additions and 33 deletions.
1 change: 1 addition & 0 deletions Documentation/CRDs/Cluster/ceph-cluster-crd.md
@@ -85,6 +85,7 @@ For more details on the mons and when to choose a number other than `3`, see the
* `onlyApplyOSDPlacement`: Whether the placement specific to OSDs is merged with the `all` placement. If `false`, the OSD placement will be merged with the `all` placement. If `true`, only the OSD placement will be applied and the `all` placement will be ignored. The placement for OSDs is computed from several different places depending on the type of OSD:
* For non-PVCs: `placement.all` and `placement.osd`
* For PVCs: `placement.all` and inside the storageClassDeviceSets from the `placement` or `preparePlacement`
* `flappingRestartIntervalHours`: Defines the time for which an OSD pod will sleep before restarting if it stopped due to flapping. Flapping occurs when OSDs are marked `down` by Ceph more than 5 times in 600 seconds. A flapping OSD is kept `down` since it likely has a bad disk or another issue that needs investigation. The default is 24 hours. Once the underlying issue is fixed, the OSD pod can be restarted manually (a minimal example spec follows this diff).
* `disruptionManagement`: The section for configuring management of daemon disruptions
* `managePodBudgets`: if `true`, the operator will create and manage PodDisruptionBudgets for OSD, Mon, RGW, and MDS daemons. OSD PDBs are managed dynamically via the strategy outlined in the [design](https://github.com/rook/rook/blob/master/design/ceph/ceph-managed-disruptionbudgets.md). The operator will block eviction of OSDs by default and unblock them safely when drains are detected.
* `osdMaintenanceTimeout`: is a duration in minutes that determines how long an entire failureDomain like `region/zone/host` will be held in `noout` (in addition to the default DOWN/OUT interval) when it is draining. This is only relevant when `managePodBudgets` is `true`. The default value is `30` minutes.
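
For illustration, a minimal sketch of setting the new field in a `CephCluster` spec. The cluster name, namespace, and storage selection here are assumptions, and other required fields are omitted:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  # ... cephVersion, mon, dataDirHostPath, etc. ...
  storage:
    useAllNodes: true
    useAllDevices: true
    # Keep a flapping OSD down for 6 hours instead of the default 24
    flappingRestartIntervalHours: 6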
17 changes: 17 additions & 0 deletions Documentation/CRDs/specification.md
@@ -11359,6 +11359,23 @@ OSDStore
<em>(Optional)</em>
</td>
</tr>
<tr>
<td>
<code>flappingRestartIntervalHours</code><br/>
<em>
int
</em>
</td>
<td>
<em>(Optional)</em>
<p>FlappingRestartIntervalHours defines the time for which an OSD pod that exited with a zero exit code will sleep before restarting.
This is needed for OSD flapping, where OSD daemons are marked down more than 5 times in 600 seconds by Ceph.
Preventing the OSD pods from restarting immediately in such scenarios prevents Rook from marking the OSD as <code>up</code> and thus
triggering peering of the PGs mapped to the OSD.
The interval defaults to 24 hours if no value is provided. The user needs to manually restart the OSD pod if the
underlying flapping issue is fixed before the restart interval elapses.</p>
</td>
</tr>
</tbody>
</table>
<h3 id="ceph.rook.io/v1.StoreType">StoreType
3 changes: 3 additions & 0 deletions deploy/charts/rook-ceph/templates/resources.yaml
@@ -2663,6 +2663,9 @@ spec:
nullable: true
type: array
x-kubernetes-preserve-unknown-fields: true
flappingRestartIntervalHours:
description: FlappingRestartIntervalHours defines the time for which an OSD pod that exited with a zero exit code will sleep before restarting. This is needed for OSD flapping, where OSD daemons are marked down more than 5 times in 600 seconds by Ceph. Preventing the OSD pods from restarting immediately in such scenarios prevents Rook from marking the OSD `up` and thus triggering peering of the PGs mapped to the OSD. The interval defaults to 24 hours if no value is provided. The user needs to manually restart the OSD pod if the underlying flapping issue is fixed before the restart interval elapses.
type: integer
nodes:
items:
description: Node is a storage node
2 changes: 2 additions & 0 deletions deploy/examples/cluster.yaml
@@ -261,6 +261,8 @@ spec:
# deviceFilter: "^sd."
# when onlyApplyOSDPlacement is false, will merge both placement.All() and placement.osd
onlyApplyOSDPlacement: false
# Time for which an OSD pod will sleep before restarting if it stopped due to flapping
# flappingRestartIntervalHours: 24
# The section for configuring management of daemon disruptions during upgrade or fencing.
disruptionManagement:
# If true, the operator will create and manage PodDisruptionBudgets for OSD, Mon, RGW, and MDS daemons. OSD PDBs are managed dynamically
3 changes: 3 additions & 0 deletions deploy/examples/crds.yaml
@@ -2661,6 +2661,9 @@ spec:
nullable: true
type: array
x-kubernetes-preserve-unknown-fields: true
flappingRestartIntervalHours:
description: FlappingRestartIntervalHours defines the time for which an OSD pod that exited with a zero exit code will sleep before restarting. This is needed for OSD flapping, where OSD daemons are marked down more than 5 times in 600 seconds by Ceph. Preventing the OSD pods from restarting immediately in such scenarios prevents Rook from marking the OSD `up` and thus triggering peering of the PGs mapped to the OSD. The interval defaults to 24 hours if no value is provided. The user needs to manually restart the OSD pod if the underlying flapping issue is fixed before the restart interval elapses.
type: integer
nodes:
items:
description: Node is a storage node
8 changes: 8 additions & 0 deletions pkg/apis/ceph.rook.io/v1/types.go
@@ -2662,6 +2662,14 @@ type StorageScopeSpec struct {
StorageClassDeviceSets []StorageClassDeviceSet `json:"storageClassDeviceSets,omitempty"`
// +optional
Store OSDStore `json:"store,omitempty"`
// +optional
// FlappingRestartIntervalHours defines the time for which an OSD pod that exited with a zero exit code will sleep before restarting.
// This is needed for OSD flapping, where OSD daemons are marked down more than 5 times in 600 seconds by Ceph.
// Preventing the OSD pods from restarting immediately in such scenarios prevents Rook from marking the OSD as `up` and thus
// triggering peering of the PGs mapped to the OSD.
// The interval defaults to 24 hours if no value is provided. The user needs to manually restart the OSD pod if the
// underlying flapping issue is fixed before the restart interval elapses.
FlappingRestartIntervalHours int `json:"flappingRestartIntervalHours"`
}

// OSDStore is the backend storage type used for creating the OSDs
18 changes: 13 additions & 5 deletions pkg/operator/ceph/cluster/osd/osd.go
@@ -21,6 +21,7 @@ import (
"bufio"
"context"
"fmt"
"regexp"
"sort"
"strconv"
"strings"
@@ -586,13 +587,11 @@ func (c *Cluster) getOSDInfo(d *appsv1.Deployment) (OSDInfo, error) {
}

 locationFound := false
-for _, a := range container.Args {
+for _, a := range container.Command {
 	locationPrefix := "--crush-location="
-	if strings.HasPrefix(a, locationPrefix) {
+	if strings.Contains(a, locationPrefix) {
 		locationFound = true
-		// Extract the same CRUSH location as originally determined by the OSD prepare pod
-		// by cutting off the prefix: --crush-location=
-		osd.Location = a[len(locationPrefix):]
+		osd.Location = getLocationWithRegex(a)
 	}
 }

@@ -871,3 +870,12 @@ func (c *Cluster) waitForHealthyPGs() (bool, error) {

return true, nil
}

func getLocationWithRegex(input string) string {
rx := regexp.MustCompile(`--crush-location="(.+?)"`)
match := rx.FindStringSubmatch(input)
if len(match) == 2 {
return strings.TrimSpace(match[1])
}
return ""
}
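
Since the CRUSH location now lives inside a single `bash -c` command string (quoted via `%q` in spec.go below), it is recovered with a regex rather than a prefix match on individual args. A minimal, self-contained round-trip sketch with hypothetical values, duplicating the helper above for illustration:

package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same logic as getLocationWithRegex above: extract the double-quoted value
// that follows --crush-location= from the joined command string.
func getLocationWithRegex(input string) string {
	rx := regexp.MustCompile(`--crush-location="(.+?)"`)
	match := rx.FindStringSubmatch(input)
	if len(match) == 2 {
		return strings.TrimSpace(match[1])
	}
	return ""
}

func main() {
	// spec.go builds the flag with fmt.Sprintf("--crush-location=%q", osd.Location),
	// so the value is double-quoted inside the single command string.
	cmd := fmt.Sprintf(`ceph-osd --id 0 --crush-location=%q --setuser ceph`, "root=default host=node1")
	fmt.Println(getLocationWithRegex(cmd)) // prints: root=default host=node1
}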
11 changes: 11 additions & 0 deletions pkg/operator/ceph/cluster/osd/osd_test.go
@@ -813,3 +813,14 @@ func TestReplaceOSDForNewStore(t *testing.T) {
assert.Nil(t, c.replaceOSD)
})
}

func TestGetLocationWithRegex(t *testing.T) {
location := getLocationWithRegex("")
assert.Equal(t, "", location)

location = getLocationWithRegex(`ceph-osd --crush-location="root=default host=node" --default-log-to-stderr=true`)
assert.Equal(t, "root=default host=node", location)

location = getLocationWithRegex(`ceph-osd --crush-location="" --default-log-to-stderr=true`)
assert.Equal(t, "", location)
}
54 changes: 44 additions & 10 deletions pkg/operator/ceph/cluster/osd/spec.go
@@ -22,6 +22,7 @@ import (
"path"
"path/filepath"
"strconv"
"strings"

"github.com/pkg/errors"
cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
@@ -61,16 +62,36 @@ const (
// DmcryptMetadataType is a portion of the device mapper name for the encrypted OSD on PVC block
DmcryptMetadataType = "db-dmcrypt"
// DmcryptWalType is a portion of the device mapper name for the encrypted OSD on PVC wal
-DmcryptWalType        = "wal-dmcrypt"
-bluestoreBlockName    = "block"
-bluestoreMetadataName = "block.db"
-bluestoreWalName      = "block.wal"
-tempEtcCephDir        = "/etc/temp-ceph"
-osdPortv1             = 6801
-osdPortv2             = 6800
+DmcryptWalType            = "wal-dmcrypt"
+bluestoreBlockName        = "block"
+bluestoreMetadataName     = "block.db"
+bluestoreWalName          = "block.wal"
+tempEtcCephDir            = "/etc/temp-ceph"
+osdPortv1                 = 6801
+osdPortv2                 = 6800
+defaultOSDRestartInterval = 24
)

const (
cephOSDStart = `
function sigterm() {
echo "SIGTERM received"
exit
}
trap sigterm SIGTERM
%s %s & wait $!
# capture the daemon's exit code; a bare `wait` would always return 0,
# and it must be read before any other command (such as an assignment) resets $?
rc=$?
RESTART_INTERVAL=%d
if [ $rc -eq 0 ]; then
touch /tmp/osd-sleep
echo "OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for $RESTART_INTERVAL hours. Restart the pod manually once the flapping issue is fixed"
sleep "$RESTART_INTERVAL"h & wait
exit $rc
fi
exit $rc`

activateOSDOnNodeCode = `
set -o errexit
set -o pipefail
@@ -400,7 +421,7 @@ func (c *Cluster) makeDeployment(osdProps osdProperties, osd OSDInfo, provisionC
"--fsid", c.clusterInfo.FSID,
"--setuser", "ceph",
"--setgroup", "ceph",
fmt.Sprintf("--crush-location=%s", osd.Location),
fmt.Sprintf("--crush-location=%q", osd.Location),
}...)

// Ceph expects initial weight as float value in tera-bytes units
@@ -598,8 +619,7 @@ func (c *Cluster) makeDeployment(osdProps osdProperties, osd OSDInfo, provisionC
InitContainers: initContainers,
Containers: []v1.Container{
{
-Command: command,
-Args:    args,
+Command: osdStartScript(command, args, c.spec.Storage.FlappingRestartIntervalHours),
Name: "osd",
Image: c.spec.CephVersion.Image,
ImagePullPolicy: controller.GetContainerImagePullPolicy(c.spec.CephVersion.ImagePullPolicy),
@@ -1396,3 +1416,17 @@ func (c *Cluster) getOSDServicePorts() []v1.ServicePort {

return ports
}

// osdStartScript wraps the OSD command in a bash script that sleeps instead of
// exiting when the daemon stops cleanly (e.g. after Ceph marks a flapping OSD down).
func osdStartScript(cmd, args []string, interval int) []string {
// A zero interval means flappingRestartIntervalHours was not set; use the 24h default.
osdRestartInterval := defaultOSDRestartInterval
if interval != 0 {
osdRestartInterval = interval
}

return []string{
"/bin/bash",
"-c",
"-x",
fmt.Sprintf(cephOSDStart, strings.Join(cmd, " "), strings.Join(args, " "), osdRestartInterval),
}
}
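
For illustration, a sketch of what the container command resolves to. This fragment is not part of the diff, reuses the identifiers defined in this file, and the OSD arguments are hypothetical:

// Hypothetical usage: the interval argument is 0 when flappingRestartIntervalHours
// is unset, so osdStartScript falls back to defaultOSDRestartInterval (24).
cmd := osdStartScript(
	[]string{"ceph-osd"},
	[]string{"--id", "0", fmt.Sprintf("--crush-location=%q", "root=default host=node1")},
	0,
)
// cmd == ["/bin/bash", "-c", "-x", "<cephOSDStart with the joined ceph-osd command and RESTART_INTERVAL=24>"]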
5 changes: 2 additions & 3 deletions pkg/operator/ceph/cluster/osd/spec_test.go
@@ -178,7 +178,6 @@ func testPodDevices(t *testing.T, dataDir, deviceName string, allDevices bool) {
cont := deployment.Spec.Template.Spec.Containers[0]
assert.Equal(t, spec.CephVersion.Image, cont.Image)
assert.Equal(t, 8, len(cont.VolumeMounts))
assert.Equal(t, "ceph-osd", cont.Command[0])
verifyEnvVar(t, cont.Env, "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES", "134217728", true)

// Test OSD on PVC with LVM
@@ -434,15 +433,15 @@ func testPodDevices(t *testing.T, dataDir, deviceName string, allDevices bool) {
deployment, err = c.makeDeployment(osdProp, osd, dataPathMap)
assert.NoError(t, err)
for _, flag := range defaultTuneFastSettings {
-assert.Contains(t, deployment.Spec.Template.Spec.Containers[0].Args, flag)
+assert.Contains(t, deployment.Spec.Template.Spec.Containers[0].Command[3], flag)
}

// Test tune Slow settings when OSD on PVC
osdProp.tuneSlowDeviceClass = true
deployment, err = c.makeDeployment(osdProp, osd, dataPathMap)
assert.NoError(t, err)
for _, flag := range defaultTuneSlowSettings {
-assert.Contains(t, deployment.Spec.Template.Spec.Containers[0].Args, flag)
+assert.Contains(t, deployment.Spec.Template.Spec.Containers[0].Command[3], flag)
}

// Test shareProcessNamespace presence
34 changes: 26 additions & 8 deletions pkg/operator/ceph/controller/spec.go
@@ -31,6 +31,7 @@ import (
"github.com/rook/rook/pkg/clusterd"
"github.com/rook/rook/pkg/daemon/ceph/client"
"github.com/rook/rook/pkg/operator/ceph/config"
opconfig "github.com/rook/rook/pkg/operator/ceph/config"
"github.com/rook/rook/pkg/operator/ceph/config/keyring"
"github.com/rook/rook/pkg/operator/k8sutil"
"github.com/rook/rook/pkg/util/display"
@@ -75,6 +76,26 @@ type daemonConfig struct {
var logger = capnslog.NewPackageLogger("github.com/rook/rook", "ceph-spec")

var (
osdLivenessProbeScript = `
outp="$(ceph --admin-daemon %s %s 2>&1)"
rc=$?
if [ $rc -ne 0 ] && [ ! -f /tmp/osd-sleep ]; then
echo "ceph daemon health check failed with the following output:"
echo "$outp" | sed -e 's/^/> /g'
exit $rc
fi
`

livenessProbeScript = `
outp="$(ceph --admin-daemon %s %s 2>&1)"
rc=$?
if [ $rc -ne 0 ]; then
echo "ceph daemon health check failed with the following output:"
echo "$outp" | sed -e 's/^/> /g'
exit $rc
fi
`

cronLogRotate = `
CEPH_CLIENT_ID=%s
PERIODICITY=%s
@@ -619,6 +640,10 @@ func StoredLogAndCrashVolumeMount(varLogCephDir, varLibCephCrashDir string) []v1
// that it can be called, and that it returns 0
func GenerateLivenessProbeExecDaemon(daemonType, daemonID string) *v1.Probe {
confDaemon := getDaemonConfig(daemonType, daemonID)
probeScript := livenessProbeScript
if daemonType == opconfig.OsdType {
probeScript = osdLivenessProbeScript
}

return &v1.Probe{
ProbeHandler: v1.ProbeHandler{
@@ -637,14 +662,7 @@ func GenerateLivenessProbeExecDaemon(daemonType, daemonID string) *v1.Probe {
"-i",
"sh",
"-c",
-fmt.Sprintf(`outp="$(ceph --admin-daemon %s %s 2>&1)"
-rc=$?
-if [ $rc -ne 0 ]; then
-echo "ceph daemon health check failed with the following output:"
-echo "$outp" | sed -e 's/^/> /g'
-exit $rc
-fi`,
-confDaemon.buildSocketPath(), confDaemon.buildAdminSocketCommand()),
+fmt.Sprintf(probeScript, confDaemon.buildSocketPath(), confDaemon.buildAdminSocketCommand()),
},
},
},
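
While the pod sleeps, the `/tmp/osd-sleep` sentinel makes the OSD liveness probe succeed even though the daemon is down, so the kubelet does not kill the sleeping container. Once the underlying issue is fixed, the pod can be restarted manually; for example, assuming the usual `app=rook-ceph-osd` and `ceph-osd-id` pod labels and the `rook-ceph` namespace:

kubectl -n rook-ceph delete pod -l app=rook-ceph-osd,ceph-osd-id=0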
9 changes: 2 additions & 7 deletions pkg/operator/ceph/controller/spec_test.go
@@ -159,13 +159,8 @@ func TestGenerateLivenessProbeExecDaemon(t *testing.T) {
"-i",
"sh",
"-c",
`outp="$(ceph --admin-daemon /run/ceph/ceph-osd.0.asok status 2>&1)"
rc=$?
if [ $rc -ne 0 ]; then
echo "ceph daemon health check failed with the following output:"
echo "$outp" | sed -e 's/^/> /g'
exit $rc
fi`}
fmt.Sprintf(osdLivenessProbeScript, "/run/ceph/ceph-osd.0.asok", "status"),
}

assert.Equal(t, expectedCommand, probe.ProbeHandler.Exec.Command)
assert.Equal(t, livenessProbeInitialDelaySeconds, probe.InitialDelaySeconds)
Expand Down
