feat(lifecycle-operator): adapt WorkloadVersionReconciler logic to use ObservabilityTimeout for workload deployment #3160

Merged · 6 commits · Mar 4, 2024
1 change: 1 addition & 0 deletions Makefile
@@ -34,6 +34,7 @@ integration-test:
chainsaw test --test-dir ./test/chainsaw/testanalysis/
chainsaw test --test-dir ./test/chainsaw/testcertificate/
chainsaw test --test-dir ./test/chainsaw/non-blocking-deployment/
chainsaw test --test-dir ./test/chainsaw/timeout-failure-deployment/

.PHONY: integration-test-local #these tests should run on a real cluster!
integration-test-local:
11 changes: 11 additions & 0 deletions docs/docs/components/lifecycle-operator/deployment-flow.md
@@ -122,6 +122,17 @@ If any of these activities fail,
the `KeptnApp` issues the `AppDeployErrored` event
and terminates the deployment.

> **Note**
By default, Keptn observes the state of the Kubernetes workloads
for 5 minutes.
After this timeout is exceeded, the deployment phase is considered
`Failed` from Keptn's viewpoint, and Keptn does not proceed with
the post-deployment phases (tasks, evaluations, or promotion).
This timeout can be changed cluster-wide via the
`observabilityTimeout` field of the
[KeptnConfig](../../reference/crd-reference/config.md)
resource.
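
For illustration, a `KeptnConfig` that shortens this window to 2 minutes could look like the following sketch (the `observabilityTimeout` field comes from this change; the metadata values and the `v1alpha1` API version are assumptions based on the getting-started example later in this PR):

```yaml
apiVersion: options.keptn.sh/v1alpha1
kind: KeptnConfig
metadata:
  name: keptn-config       # illustrative name
  namespace: keptn-system
spec:
  # Consider a workload deployment Failed after 2 minutes instead of the default 5m
  observabilityTimeout: 2m
```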

```shell
AppDeploy
AppDeployStarted
11 changes: 11 additions & 0 deletions docs/docs/components/lifecycle-operator/keptn-apps.md
@@ -28,6 +28,17 @@ The `KeptnWorkload` resources are created automatically
and without delay by the mutating webhook
as soon as the workload manifest is applied.

> **Note**
By default, Keptn observes the state of the Kubernetes workloads
for 5 minutes.
After this timeout is exceeded, the deployment phase is considered
`Failed` from Keptn's viewpoint, and Keptn does not proceed with
the post-deployment phases (tasks, evaluations, or promotion).
This timeout can be changed cluster-wide via the
`observabilityTimeout` field of the
[KeptnConfig](../../reference/crd-reference/config.md)
resource.

## Keptn Applications

A [KeptnApp](../../reference/crd-reference/app.md)
1 change: 1 addition & 0 deletions docs/docs/getting-started/observability.md
@@ -74,6 +74,7 @@ metadata:
spec:
OTelCollectorUrl: 'jaeger-collector.keptn-system.svc.cluster.local:4317'
keptnAppCreationRequestTimeoutSeconds: 30
observabilityTimeout: 5m
```

Apply the file and wait for Keptn to pick up the new configuration:
14 changes: 14 additions & 0 deletions docs/docs/guides/otel.md
@@ -161,6 +161,20 @@ kubectl port-forward deployment/metrics-operator 9999 -n keptn-system

You can access the metrics from your browser at: `http://localhost:9999`

## Define a timeout for workload observability

A deployment can fail for various reasons, for example because the
container image cannot be found.
By default, Keptn observes the state of the Kubernetes workloads
for 5 minutes.
After this timeout is exceeded, the deployment phase is considered
`Failed` from Keptn's viewpoint, and Keptn does not proceed with
the post-deployment phases (tasks, evaluations, or promotion).
This timeout can be changed cluster-wide via the
`observabilityTimeout` field of the
[KeptnConfig](../reference/crd-reference/config.md)
resource.

## Advanced tracing configurations in Keptn: Linking traces

In Keptn you can connect multiple traces, for instance to connect deployments
@@ -521,3 +521,11 @@ func (w KeptnWorkloadVersion) GetEventAnnotations() map[string]string {
"workloadVersionName": w.Name,
}
}

func (w *KeptnWorkloadVersion) SetDeploymentStartTime() {
w.Status.DeploymentStartTime = metav1.NewTime(time.Now().UTC())
}

func (w *KeptnWorkloadVersion) IsDeploymentStartTimeSet() bool {
return !w.Status.DeploymentStartTime.IsZero()
}
@@ -113,12 +113,15 @@ func TestKeptnWorkloadVersion(t *testing.T) {

require.False(t, workload.IsEndTimeSet())
require.False(t, workload.IsStartTimeSet())
require.False(t, workload.IsDeploymentStartTimeSet())

workload.SetStartTime()
workload.SetEndTime()
workload.SetDeploymentStartTime()

require.True(t, workload.IsEndTimeSet())
require.True(t, workload.IsStartTimeSet())
require.True(t, workload.IsDeploymentStartTimeSet())

require.Equal(t, []attribute.KeyValue{
common.AppName.String("appname"),
@@ -52,6 +52,7 @@ func TestKeptnWorkloadVersionReconciler_reconcileDeployment_FailedReplicaSet(t *
keptnState, err := r.reconcileDeployment(context.TODO(), workloadVersion)
require.Nil(t, err)
require.Equal(t, apicommon.StateProgressing, keptnState)
require.False(t, workloadVersion.Status.DeploymentStartTime.IsZero())
}

func TestKeptnWorkloadVersionReconciler_reconcileDeployment_UnavailableReplicaSet(t *testing.T) {
@@ -70,6 +71,51 @@ func TestKeptnWorkloadVersionReconciler_reconcileDeployment_UnavailableReplicaSe
keptnState, err := r.reconcileDeployment(context.TODO(), workloadVersion)
require.NotNil(t, err)
require.Equal(t, apicommon.StateUnknown, keptnState)
require.True(t, workloadVersion.Status.DeploymentStartTime.IsZero())
}

func TestKeptnWorkloadVersionReconciler_reconcileDeployment_WorkloadDeploymentTimedOut(t *testing.T) {

rep := int32(1)
replicaset := makeReplicaSet("myrep", "default", &rep, 0)
workloadVersion := makeWorkloadVersionWithRef(replicaset.ObjectMeta, "ReplicaSet")

fakeClient := testcommon.NewTestClient(replicaset, workloadVersion)

fakeRecorder := record.NewFakeRecorder(100)

r := &KeptnWorkloadVersionReconciler{
Client: fakeClient,
Config: config.Instance(),
EventSender: eventsender.NewK8sSender(fakeRecorder),
}

r.Config.SetObservabilityTimeout(metav1.Duration{
Duration: 5 * time.Second,
})

keptnState, err := r.reconcileDeployment(context.TODO(), workloadVersion)
require.Nil(t, err)
require.Equal(t, apicommon.StateProgressing, keptnState)
require.False(t, workloadVersion.Status.DeploymentStartTime.IsZero())

// move the start time backwards so the timeout check triggers
workloadVersion.Status.DeploymentStartTime = metav1.Time{
Time: workloadVersion.Status.DeploymentStartTime.Add(-10 * time.Second),
}

err = r.Client.Status().Update(context.TODO(), workloadVersion)
require.Nil(t, err)

keptnState, err = r.reconcileDeployment(context.TODO(), workloadVersion)
require.Nil(t, err)
require.Equal(t, apicommon.StateFailed, keptnState)
require.False(t, workloadVersion.Status.DeploymentStartTime.IsZero())

event := <-fakeRecorder.Events
require.Contains(t, event, workloadVersion.GetName(), "wrong workloadVersion")
require.Contains(t, event, workloadVersion.GetNamespace(), "wrong namespace")
require.Contains(t, event, "has reached timeout", "wrong message")
}

func TestKeptnWorkloadVersionReconciler_reconcileDeployment_FailedStatefulSet(t *testing.T) {
@@ -86,6 +132,7 @@ func TestKeptnWorkloadVersionReconciler_reconcileDeployment_FailedStatefulSet(t
keptnState, err := r.reconcileDeployment(context.TODO(), workloadVersion)
require.Nil(t, err)
require.Equal(t, apicommon.StateProgressing, keptnState)
require.False(t, workloadVersion.Status.DeploymentStartTime.IsZero())
}

func TestKeptnWorkloadVersionReconciler_reconcileDeployment_UnavailableStatefulSet(t *testing.T) {
@@ -104,6 +151,7 @@ func TestKeptnWorkloadVersionReconciler_reconcileDeployment_UnavailableStatefulS
keptnState, err := r.reconcileDeployment(context.TODO(), workloadVersion)
require.NotNil(t, err)
require.Equal(t, apicommon.StateUnknown, keptnState)
require.True(t, workloadVersion.Status.DeploymentStartTime.IsZero())
}

func TestKeptnWorkloadVersionReconciler_reconcileDeployment_FailedDaemonSet(t *testing.T) {
@@ -120,6 +168,7 @@ func TestKeptnWorkloadVersionReconciler_reconcileDeployment_FailedDaemonSet(t *t
keptnState, err := r.reconcileDeployment(context.TODO(), workloadVersion)
require.Nil(t, err)
require.Equal(t, apicommon.StateProgressing, keptnState)
require.False(t, workloadVersion.Status.DeploymentStartTime.IsZero())
}

func TestKeptnWorkloadVersionReconciler_reconcileDeployment_UnavailableDaemonSet(t *testing.T) {
@@ -136,6 +185,7 @@ func TestKeptnWorkloadVersionReconciler_reconcileDeployment_UnavailableDaemonSet
keptnState, err := r.reconcileDeployment(context.TODO(), workloadVersion)
require.NotNil(t, err)
require.Equal(t, apicommon.StateUnknown, keptnState)
require.True(t, workloadVersion.Status.DeploymentStartTime.IsZero())
}

func TestKeptnWorkloadVersionReconciler_reconcileDeployment_ReadyReplicaSet(t *testing.T) {
@@ -153,6 +203,7 @@ func TestKeptnWorkloadVersionReconciler_reconcileDeployment_ReadyReplicaSet(t *t
keptnState, err := r.reconcileDeployment(context.TODO(), workloadVersion)
require.Nil(t, err)
require.Equal(t, apicommon.StateSucceeded, keptnState)
require.False(t, workloadVersion.Status.DeploymentStartTime.IsZero())
}

func TestKeptnWorkloadVersionReconciler_reconcileDeployment_ReadyStatefulSet(t *testing.T) {
@@ -170,6 +221,7 @@ func TestKeptnWorkloadVersionReconciler_reconcileDeployment_ReadyStatefulSet(t *
keptnState, err := r.reconcileDeployment(context.TODO(), workloadVersion)
require.Nil(t, err)
require.Equal(t, apicommon.StateSucceeded, keptnState)
require.False(t, workloadVersion.Status.DeploymentStartTime.IsZero())
}

func TestKeptnWorkloadVersionReconciler_reconcileDeployment_ReadyDaemonSet(t *testing.T) {
@@ -186,6 +238,7 @@ func TestKeptnWorkloadVersionReconciler_reconcileDeployment_ReadyDaemonSet(t *te
keptnState, err := r.reconcileDeployment(context.TODO(), workloadVersion)
require.Nil(t, err)
require.Equal(t, apicommon.StateSucceeded, keptnState)
require.False(t, workloadVersion.Status.DeploymentStartTime.IsZero())
}

func TestKeptnWorkloadVersionReconciler_reconcileDeployment_UnsupportedReferenceKind(t *testing.T) {
@@ -199,6 +252,7 @@ func TestKeptnWorkloadVersionReconciler_reconcileDeployment_UnsupportedReference
keptnState, err := r.reconcileDeployment(context.TODO(), workloadVersion)
require.ErrorIs(t, err, controllererrors.ErrUnsupportedWorkloadVersionResourceReference)
require.Equal(t, apicommon.StateUnknown, keptnState)
require.True(t, workloadVersion.Status.DeploymentStartTime.IsZero())
}

func makeReplicaSet(name string, namespace string, wanted *int32, available int32) *appsv1.ReplicaSet {
@@ -2,6 +2,7 @@ package keptnworkloadversion

import (
"context"
"time"

argov1alpha1 "github.com/argoproj/argo-rollouts/pkg/apis/rollouts/v1alpha1"
klcv1beta1 "github.com/keptn/lifecycle-toolkit/lifecycle-operator/apis/lifecycle/v1beta1"
@@ -15,6 +16,16 @@ func (r *KeptnWorkloadVersionReconciler) reconcileDeployment(ctx context.Context
var isRunning bool
var err error

if r.isDeploymentTimedOut(workloadVersion) {
workloadVersion.Status.DeploymentStatus = apicommon.StateFailed
err = r.Client.Status().Update(ctx, workloadVersion)
if err != nil {
return apicommon.StateUnknown, err
}
r.EventSender.Emit(apicommon.PhaseWorkloadDeployment, "Warning", workloadVersion, apicommon.PhaseStateFinished, "has reached timeout", workloadVersion.GetVersion())
return workloadVersion.Status.DeploymentStatus, nil
}

switch workloadVersion.Spec.ResourceReference.Kind {
case "ReplicaSet":
isRunning, err = r.isReplicaSetRunning(ctx, workloadVersion.Spec.ResourceReference, workloadVersion.Namespace)
@@ -29,10 +40,14 @@ func (r *KeptnWorkloadVersionReconciler) reconcileDeployment(ctx context.Context
if err != nil {
return apicommon.StateUnknown, err
}

if !workloadVersion.IsDeploymentStartTimeSet() {
workloadVersion.SetDeploymentStartTime()
workloadVersion.Status.DeploymentStatus = apicommon.StateProgressing
}

if isRunning {
workloadVersion.Status.DeploymentStatus = apicommon.StateSucceeded
} else {
workloadVersion.Status.DeploymentStatus = apicommon.StateProgressing
}

err = r.Client.Status().Update(ctx, workloadVersion)
@@ -42,6 +57,16 @@ func (r *KeptnWorkloadVersionReconciler) reconcileDeployment(ctx context.Context
return workloadVersion.Status.DeploymentStatus, nil
}

func (r *KeptnWorkloadVersionReconciler) isDeploymentTimedOut(workloadVersion *klcv1beta1.KeptnWorkloadVersion) bool {
if !workloadVersion.IsDeploymentStartTimeSet() {
return false
}

deploymentDeadline := workloadVersion.Status.DeploymentStartTime.Add(r.Config.GetObservabilityTimeout().Duration)
currentTime := time.Now().UTC()
return currentTime.After(deploymentDeadline)
}

func (r *KeptnWorkloadVersionReconciler) isReplicaSetRunning(ctx context.Context, resource klcv1beta1.ResourceReference, namespace string) (bool, error) {
rep := appsv1.ReplicaSet{}
err := r.Client.Get(ctx, types.NamespacedName{Name: resource.Name, Namespace: namespace}, &rep)
4 changes: 2 additions & 2 deletions test/chainsaw/non-blocking-deployment/chainsaw-test.yaml
@@ -16,7 +16,7 @@ spec:
- name: step-01
try:
- script:
content: ./verify-keptnconfig.sh
content: ./../common/verify-keptnconfig.sh
- sleep:
duration: 30s
- name: step-02
@@ -32,7 +32,7 @@ spec:
- name: step-04
try:
- script:
content: ./verify-keptnconfig.sh
content: ./../common/verify-keptnconfig.sh
- sleep:
duration: 30s
- name: step-05
50 changes: 50 additions & 0 deletions test/chainsaw/timeout-failure-deployment/00-assert.yaml
@@ -0,0 +1,50 @@
apiVersion: lifecycle.keptn.sh/v1beta1
kind: KeptnAppVersion
metadata:
name: podtato-head-0.1.0-6b86b273
spec:
appName: podtato-head
revision: 1
version: 0.1.0
workloads:
- name: podtato-head-entry
version: 0.1.0
status:
currentPhase: AppDeploy
postDeploymentEvaluationStatus: Deprecated
postDeploymentStatus: Deprecated
preDeploymentEvaluationStatus: Succeeded
preDeploymentStatus: Succeeded
promotionStatus: Deprecated
status: Failed
workloadOverallStatus: Failed
---
apiVersion: lifecycle.keptn.sh/v1beta1
kind: KeptnWorkloadVersion
metadata:
generation: 1
name: podtato-head-podtato-head-entry-0.1.0
spec:
app: podtato-head
version: 0.1.0
workloadName: podtato-head-podtato-head-entry
status:
currentPhase: WorkloadDeploy
deploymentStatus: Failed
postDeploymentEvaluationStatus: Deprecated
postDeploymentStatus: Deprecated
preDeploymentEvaluationStatus: Succeeded
preDeploymentStatus: Succeeded
status: Failed
---
apiVersion: v1
kind: Pod
metadata:
annotations:
keptn.sh/app: podtato-head
keptn.sh/version: 0.1.0
keptn.sh/workload: podtato-head-entry
labels:
component: podtato-head-entry
status:
phase: Pending
28 changes: 28 additions & 0 deletions test/chainsaw/timeout-failure-deployment/00-install.yaml
@@ -0,0 +1,28 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: podtato-head-entry
labels:
app: podtato-head
spec:
selector:
matchLabels:
component: podtato-head-entry
template:
metadata:
labels:
component: podtato-head-entry
annotations:
keptn.sh/app: podtato-head
keptn.sh/workload: podtato-head-entry
keptn.sh/version: 0.1.0
spec:
containers:
- name: server
image: ghcr.io/podtato-head/entry:non-existing
imagePullPolicy: Always
ports:
- containerPort: 9000
env:
- name: PODTATO_PORT
value: "9000"