Create health condition for degraded status / quorum issues #360

Merged · 5 commits · Jul 13, 2022
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -10,8 +10,8 @@ Changelog for Cass Operator, new PRs should update the `main / unreleased` secti
```

## unreleased
* [ENHANCEMENT] [#360](https://github.com/k8ssandra/cass-operator/pull/360) If the datacenter quorum check reports an unhealthy state, set the Status Condition DatacenterHealthy to False (DBPE-2283)
* [BUGFIX] [#335](https://github.com/k8ssandra/cass-operator/issues/335) Cleanse label values derived from the cluster name, which can contain illegal characters. Include the app.kubernetes.io/created-by label.

* [BUGFIX] [#330](https://github.com/k8ssandra/cass-operator/issues/330) Apply correct updates to Service labels and annotations through additionalServiceConfig (they are now validated and don't allow reserved prefixes).

## v1.11.0
2 changes: 1 addition & 1 deletion Makefile
@@ -119,7 +119,7 @@ lint: ## Run golangci-lint against code.
golangci-lint run ./...

.PHONY: test
test: manifests generate fmt vet envtest ## Run tests.
test: manifests generate fmt vet lint envtest ## Run tests.
# Old unit tests first - these use mocked client / fakeclient
go test ./pkg/... -coverprofile cover-pkg.out
# Then the envtest ones
4 changes: 4 additions & 0 deletions apis/cassandra/v1beta1/cassandradatacenter_types.go
@@ -328,6 +328,10 @@ const (
DatacenterRollingRestart DatacenterConditionType = "RollingRestart"
DatacenterValid DatacenterConditionType = "Valid"
DatacenterDecommission DatacenterConditionType = "Decommission"

// DatacenterHealthy indicates if QUORUM can be reached from all deployed nodes.
// If this check fails, certain operations such as scaling up will not proceed.
DatacenterHealthy DatacenterConditionType = "Healthy"
)

type DatacenterCondition struct {
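Not part of the PR diff: a minimal sketch of how a client of the operator might read the new condition, assuming the published module path github.com/k8ssandra/cass-operator/apis/cassandra/v1beta1 and a Status.Conditions slice whose entries expose Type and Status fields; the isDatacenterHealthy helper is hypothetical.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"

	api "github.com/k8ssandra/cass-operator/apis/cassandra/v1beta1"
)

// isDatacenterHealthy treats a missing Healthy condition as healthy, since
// operator versions before this PR never set it; only an explicit False is
// reported as unhealthy.
func isDatacenterHealthy(dc *api.CassandraDatacenter) bool {
	for _, cond := range dc.Status.Conditions {
		if cond.Type == api.DatacenterHealthy {
			return cond.Status != corev1.ConditionFalse
		}
	}
	return true
}
```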
36 changes: 33 additions & 3 deletions pkg/reconciliation/reconcile_racks.go
@@ -623,12 +623,16 @@ func (rc *ReconciliationContext) CheckPodsReady(endpointData httphelper.CassMeta

// step 3 - get all nodes up
// if the cluster isn't healthy, that's ok, but go back to step 1
if !rc.isClusterHealthy() {
clusterHealthy := rc.isClusterHealthy()
if err := rc.updateHealth(clusterHealthy); err != nil {
return result.Error(err)
}

if !clusterHealthy {
rc.ReqLogger.Info(
"cluster isn't healthy",
)
// FIXME this is one spot I've seen get spammy, should we raise this number?
return result.RequeueSoon(2)
return result.RequeueSoon(5)
}

needsMoreNodes, err := rc.startAllNodes(endpointData)
@@ -1161,6 +1165,30 @@ func (rc *ReconciliationContext) UpdateStatus() result.ReconcileResult {
return result.Continue()
}

func (rc *ReconciliationContext) updateHealth(healthy bool) error {
updated := false
dcPatch := client.MergeFrom(rc.Datacenter.DeepCopy())

if !healthy {
updated = rc.setCondition(
api.NewDatacenterCondition(
api.DatacenterHealthy, corev1.ConditionFalse))

Contributor: If the cluster is degraded or unhealthy, should we just requeue?

Contributor Author: We already do that elsewhere; this is just to add the status (as I thought the DBPE-2283 wish was). If we requeue here, then we could never recover from the wrong status.

} else {
updated = rc.setCondition(
api.NewDatacenterCondition(
api.DatacenterHealthy, corev1.ConditionTrue))
}

if updated {
err := rc.Client.Status().Patch(rc.Ctx, rc.Datacenter, dcPatch)
if err != nil {
return err
}
}

return nil
}

func hasBeenXMinutes(x int, sinceTime time.Time) bool {
xMinutesAgo := time.Now().Add(time.Minute * time.Duration(-x))
return sinceTime.Before(xMinutesAgo)
@@ -1279,6 +1307,8 @@ func (rc *ReconciliationContext) deleteStuckNodes() (bool, error) {
return false, nil
}

// isClusterHealthy does a LOCAL_QUORUM query to the Cassandra pods and returns true if all the pods were able to
// respond without error.
func (rc *ReconciliationContext) isClusterHealthy() bool {
pods := FilterPodListByCassNodeState(rc.clusterPods, stateStarted)

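The body of isClusterHealthy is collapsed in the diff above, so here is a minimal sketch of the aggregation its doc comment describes, under the assumption that a per-pod quorum probe is available; probeQuorum is a hypothetical stand-in for the management-api call the operator actually performs, which is not shown here.

```go
package example

import corev1 "k8s.io/api/core/v1"

// allPodsAnswerQuorum reports whether every started Cassandra pod answers a
// quorum-level probe; a single failing pod marks the whole datacenter as
// unhealthy, which the caller turns into DatacenterHealthy=False and a requeue.
func allPodsAnswerQuorum(startedPods []*corev1.Pod, probeQuorum func(*corev1.Pod) error) bool {
	for _, pod := range startedPods {
		if err := probeQuorum(pod); err != nil {
			return false
		}
	}
	return true
}
```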