Use healthcheck cron job to determine cluster readiness #743

Status: Merged (16 commits, merged Mar 30, 2021)

Changes from 6 commits
53 changes: 29 additions & 24 deletions pkg/common/cluster/clusterutil.go
@@ -207,41 +207,46 @@ func waitForClusterReadyWithOverrideAndExpectedNumberOfNodes(clusterID string, l
 	return nil
 }
 
-// PollClusterHealth looks at CVO data to determine if a cluster is alive/healthy or not
-// param clusterID: If specified, Provider will be discovered through OCM. If the empty string,
-// assume we are running in a cluster and use in-cluster REST config instead.
-func PollClusterHealth(clusterID string, logger *log.Logger) (status bool, failures []string, err error) {
-	logger = logging.CreateNewStdLoggerOrUseExistingLogger(logger)
-
-	logger.Print("Polling Cluster Health...\n")
-
-	var restConfig *rest.Config
-	var providerType string
-
+func ClusterConfig(clusterID string) (restConfig *rest.Config, providerType string, err error) {
 	if clusterID == "" {
 		if restConfig, err = rest.InClusterConfig(); err != nil {
-			logger.Printf("Error getting in-cluster REST config: %v\n", err)
-			return false, nil, nil
+			return nil, "", fmt.Errorf("error getting in-cluster rest config: %w", err)
 		}
 
 		// FIXME: Is there a way to discover this from within the cluster?
 		// For now, ocm and rosa behave the same, so hardcode either.
 		providerType = "ocm"
-	} else {
-		provider, err := providers.ClusterProvider()
+		return
+	}
 
-		if err != nil {
-			return false, nil, fmt.Errorf("error getting cluster provisioning client: %v", err)
-		}
+	provider, err := providers.ClusterProvider()
+	if err != nil {
+		return nil, "", fmt.Errorf("error getting cluster provisioning client: %w", err)
+	}
+	providerType = provider.Type()
 
-		restConfig, err = getRestConfig(provider, clusterID)
-		if err != nil {
-			logger.Printf("Error generating Rest Config: %v\n", err)
-			return false, nil, nil
-		}
+	restConfig, err = getRestConfig(provider, clusterID)
+	if err != nil {
+		return nil, "", fmt.Errorf("error generating rest config: %w", err)
+	}
 
-		providerType = provider.Type()
-	}
+	return
+}
+
+// PollClusterHealth looks at CVO data to determine if a cluster is alive/healthy or not
+// param clusterID: If specified, Provider will be discovered through OCM. If the empty string,
+// assume we are running in a cluster and use in-cluster REST config instead.
+func PollClusterHealth(clusterID string, logger *log.Logger) (status bool, failures []string, err error) {
+	logger = logging.CreateNewStdLoggerOrUseExistingLogger(logger)
+
+	logger.Print("Polling Cluster Health...\n")
+
+	restConfig, providerType, err := ClusterConfig(clusterID)
+	if err != nil {
+		logger.Printf("Error getting cluster config: %v\n", err)
+		return false, nil, nil
+	}
 
 	kubeClient, err := kubernetes.NewForConfig(restConfig)
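
For reference, a minimal sketch (not part of the diff) of how the extracted ClusterConfig helper is consumed; the newKubeClient wrapper is illustrative, while the import path and alias match the ones used in pkg/e2e/e2e.go below:

package example

import (
	"fmt"

	clusterutil "github.com/openshift/osde2e/pkg/common/cluster"
	"k8s.io/client-go/kubernetes"
)

// newKubeClient builds a kube clientset from the refactored helper. An empty
// clusterID selects the in-cluster REST config (provider "ocm"); otherwise the
// provider and REST config are resolved through OCM.
func newKubeClient(clusterID string) (*kubernetes.Clientset, string, error) {
	restConfig, providerType, err := clusterutil.ClusterConfig(clusterID)
	if err != nil {
		return nil, "", fmt.Errorf("error getting cluster config: %w", err)
	}
	kubeClient, err := kubernetes.NewForConfig(restConfig)
	if err != nil {
		return nil, "", fmt.Errorf("error building kube clientset: %w", err)
	}
	return kubeClient, providerType, nil
}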
67 changes: 67 additions & 0 deletions pkg/common/cluster/healthchecks/healthcheckjob.go
@@ -0,0 +1,67 @@
package healthchecks

import (
	"context"
	"fmt"
	"log"

	"github.com/openshift/osde2e/pkg/common/logging"
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// CheckHealthcheckJob uses the `osd-cluster-ready` healthcheck Job in the openshift-monitoring namespace to determine cluster readiness.
func CheckHealthcheckJob(k8sClient *kubernetes.Clientset, ctx context.Context, logger *log.Logger) (bool, error) {
	logger = logging.CreateNewStdLoggerOrUseExistingLogger(logger)

	logger.Print("Checking whether the osd-cluster-ready healthcheck Job has succeeded...")

	bv1C := k8sClient.BatchV1()
	namespace := "openshift-monitoring"
	name := "osd-cluster-ready"
	jobs, err := bv1C.Jobs(namespace).List(ctx, metav1.ListOptions{})

Review thread:
Member: I'm not super familiar with client-go, but why doesn't bv1C.Jobs(namespace).Get(ctx, name, metav1.GetOptions{}) work here?
Contributor (author): Sigh... It does. I just did it a dumb way because I too am not very familiar with client-go.
Contributor (author): Wait! On second thought, it does need to be a list. I need the resourceVersion field of the list itself in order to initiate the watch later. This allows me to detect the creation and deletion of jobs matching my criteria since I performed the list.
Member: Ah, I understand. If you use a zero resourceVersion, do you even need the initial get/list?
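
A minimal sketch of the Get-based lookup discussed above, using the bv1C, namespace, name, and ctx already in scope in this function (illustrative only; unlike List, Get does not return a collection resourceVersion to anchor the watch):

	job, err := bv1C.Jobs(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, fmt.Errorf("failed getting job %s/%s: %w", namespace, name, err)
	}
	if job.Status.Succeeded > 0 {
		// The healthcheck job already succeeded; no need to watch.
		return true, nil
	}

The merged code keeps the List call above and passes jobs.ResourceVersion to the Watch below.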

	if err != nil {
		return false, fmt.Errorf("failed listing jobs: %w", err)
	}
	for _, job := range jobs.Items {
		if job.Name != name {
			continue
		}
		if job.Status.Succeeded > 0 {
			logger.Println("Healthcheck job has already succeeded")
			return true, nil
		}
		logger.Println("Healthcheck job has not yet succeeded, watching...")
	}
	watcher, err := bv1C.Jobs(namespace).Watch(ctx, metav1.ListOptions{
		ResourceVersion: jobs.ResourceVersion,
		FieldSelector:   "metadata.name=osd-cluster-ready",
	})
	if err != nil {
		return false, fmt.Errorf("failed watching job: %w", err)
	}
	for {
		select {
		case event := <-watcher.ResultChan():
			switch event.Type {
			case watch.Added:
				fallthrough
			case watch.Modified:
				job := event.Object.(*batchv1.Job)
				if job.Status.Succeeded > 0 {

Review thread:
Member: I think we need to handle "completed but failed" as well. ...But how we handle it is different before vs after openshift/configure-alertmanager-operator#143:
  • Before: Failed Job means you fail here.
  • After: Failed Job will be deleted and reinstated (see below).
Need to think through this some more.
Contributor (author): I'll wait to hear your updated thoughts on how this should work before altering this logic.
Member: Talked this through with @jharrington22 just now. We're going to rework the readiness side so the picture is clear. Specific to this case, I think for osde2e's purposes "completed but failed" will translate to "cluster not ready, ain't ever gonna be". But I'm also pretty sure we're going to use prometheus to carry that state... which would mean substantial changes in this PR. Stay tuned -- I hope to have something written up by tomorrow.
Member: Keep an eye on OSD-6646 starting here.
Member: "Completed but failed" should be treated as "cluster not ready, ain't ever gonna be". I'll post more info in a top-level comment.

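One possible shape for the "completed but failed" handling discussed above, sketched against the job value already in scope in this case (not what this revision merges; it uses only packages the file already imports):

	// Treat a terminal Failed condition as "cluster not ready, never will be".
	for _, cond := range job.Status.Conditions {
		if cond.Type == batchv1.JobFailed && cond.Status == "True" {
			return false, fmt.Errorf("healthcheck job failed: %s: %s", cond.Reason, cond.Message)
		}
	}

The code as shown in this revision checks only for success: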
					return true, nil
				}
			case watch.Deleted:
				return false, fmt.Errorf("cluster readiness job deleted before becoming ready")

Review thread:
Member: We should accommodate deletion of the Job and keep looping. As soon as openshift/configure-alertmanager-operator#143 lands, deletion will be a normal part of the flow (see here). A bit more explanation: We'll delete the Job if it fails. This does not mean health checks failed -- it means something went horribly wrong under the covers, like we failed to talk to k8s or something.
Contributor (author): If something went that wrong, I doubt that we'd want to proceed with testing, so I think that this behavior is probably okay.
Member: Sorry, I wasn't very clear: a failing Job is not unusual early in the cluster's life. A common thing we used to see was prometheus not being up yet. Evictions can also be a cause. The point of having the Job owned by some kind of controller was so we could mitigate this kind of thing. That said, we could keep quite a bit of that level of retry logic in the Job itself via backoffLimit/activeDeadlineSeconds rather than having the controller react to failure. That would allow both sides to consider a failed Job "fatal" in whatever sense is appropriate.
Member (@2uasimojo, Mar 19, 2021): Update on this: For now, a deletion should never happen (except manually). So please consider it an immediate failure (as if the cluster will never be ready) -- in other words, the logic is correct as written. Ideally, please ping me and let me have a poke at the cluster if you do see this case.
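
For context on the backoffLimit/activeDeadlineSeconds idea above, a sketch of how the osd-cluster-ready Job could carry its own retry budget so that a Failed Job is genuinely terminal (the values are illustrative and are not taken from this PR or from configure-alertmanager-operator):

	// Illustrative Job spec fragment: retry the pod a few times, then give up for good.
	backoffLimit := int32(4)             // pod retries before the Job is marked Failed
	activeDeadline := int64(2 * 60 * 60) // hard two-hour cap on the whole Job
	spec := batchv1.JobSpec{
		BackoffLimit:          &backoffLimit,
		ActiveDeadlineSeconds: &activeDeadline,
	}
	_ = spec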

			case watch.Error:
				return false, fmt.Errorf("watch returned error event: %v", event)
			default:
				logger.Printf("Unrecognized event type while watching for healthcheck job updates: %v", event.Type)
			}
		case <-ctx.Done():
			return false, fmt.Errorf("healthcheck watch context cancelled while still waiting for success")
		}
	}
}
24 changes: 23 additions & 1 deletion pkg/e2e/e2e.go
@@ -4,6 +4,7 @@ package e2e
 import (
 	"bytes"
 	"compress/gzip"
+	"context"
 	"encoding/json"
 	"encoding/xml"
 	"fmt"
@@ -19,6 +20,7 @@ import (
 	"github.com/hpcloud/tail"
 	junit "github.com/joshdk/go-junit"
 	vegeta "github.com/tsenart/vegeta/lib"
+	"k8s.io/client-go/kubernetes"
 
 	pd "github.com/PagerDuty/go-pagerduty"
 	"github.com/onsi/ginkgo"
@@ -32,6 +34,7 @@ import (
 	"github.com/openshift/osde2e/pkg/common/aws"
 	"github.com/openshift/osde2e/pkg/common/cluster"
 	clusterutil "github.com/openshift/osde2e/pkg/common/cluster"
+	"github.com/openshift/osde2e/pkg/common/cluster/healthchecks"
 	"github.com/openshift/osde2e/pkg/common/clusterproperties"
 	"github.com/openshift/osde2e/pkg/common/config"
 	"github.com/openshift/osde2e/pkg/common/events"
@@ -121,7 +124,26 @@ var _ = ginkgo.SynchronizedBeforeSuite(func() []byte {
 		log.Printf("Error while adding upgrade version property to cluster via OCM: %v", err)
 	}
 
-	err = clusterutil.WaitForClusterReady(cluster.ID(), nil)
+	clusterConfig, _, err := clusterutil.ClusterConfig(cluster.ID())
+	if err != nil {
+		log.Printf("Failed looking up cluster config for healthcheck: %v", err)
+	}
+	kubeClient, err := kubernetes.NewForConfig(clusterConfig)
+	if err != nil {
+		log.Printf("Error generating Kube Clientset: %v\n", err)
+	}
+	ctx, cancel := context.WithTimeout(context.Background(), time.Hour*2)

Review thread:
Member: Did you want to make this timeout configurable? It is in c-am-o FWIW. (This could be a separate PR.)
Contributor (author): I thought about it, but I don't think we have a real use-case for configuring it. YAGNI?
Member (@2uasimojo, Mar 23, 2021): IIRC there's a zillion other config knobs around this, including the clean count, success sleep, and failure sleep. But I'm sure you're right -- are the other knobs ever used? [Later] Actually, the use case is for testing, e.g. if you want to make sure the timeout code path works properly without having to wait 2h or edit code. I made extensive use of this in osd-cluster-ready itself when it had a similar tunable.
Contributor (author): @2uasimojo I've added a config knob for this and tested it.
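
The config knob the author mentions is not visible in this revision of the diff. A hypothetical sketch of wiring it through viper, replacing the hardcoded two-hour timeout above (config.Tests.ClusterReadyTimeout is a placeholder name, not the PR's actual key):

	// Hypothetical: let configuration override the readiness timeout.
	timeout := 2 * time.Hour
	if configured := viper.GetDuration(config.Tests.ClusterReadyTimeout); configured > 0 {
		timeout = configured
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)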

+	defer cancel()
+	if viper.GetString(config.Tests.SkipClusterHealthChecks) != "" {
+		log.Println("WARNING: Skipping cluster health checks is no longer supported, as they no longer introduce delay into the build. Ignoring your request to skip them.")
+	}
+	ready, err := healthchecks.CheckHealthcheckJob(kubeClient, ctx, nil)
+	if !ready && err == nil {
+		err = fmt.Errorf("Cluster not ready")
+	}
+	if ready {
+		log.Println("Cluster is healthy and ready for testing")
+	}
 	events.HandleErrorWithEvents(err, events.HealthCheckSuccessful, events.HealthCheckFailed).ShouldNot(HaveOccurred(), "cluster failed health check")
 	if err != nil {
 		getLogs()