
clusteroperator: Report when OLM reaches "level" and check syncs #748

Conversation

smarterclayton
Collaborator

Cluster operators are expected to report the version of the payload
they are included in once they are "deployed", and also to keep the
cluster operator object created. Have the OLM operator keep CO up
to date, report the payload version once it hits available, and use
the count of successful syncs from the queueInformers as a probabilistic
measurement of "available" (i.e. is the operator able to retire syncs).
A future change should add a "health over time" metric or a "has
successfully synced all InstallPlans at least once" metric to replace
the current estimation.
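A minimal sketch of the approach described above, assuming a hypothetical pair of sync counters fed by the queueinformers; the actual OLM helpers and condition wiring may differ, but the openshift/api ClusterOperator types are the real ones:

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// available treats the operator as "available" once every recorded sync has
// completed successfully -- a coarse, probabilistic stand-in for "level".
func available(successfulSyncs, totalSyncs int) bool {
	return totalSyncs > 0 && successfulSyncs == totalSyncs
}

// clusterOperatorStatus builds the status written to the OLM ClusterOperator:
// the Available condition reflects the sync counts, and the payload version is
// only reported once the operator is considered available.
func clusterOperatorStatus(payloadVersion string, successfulSyncs, totalSyncs int) configv1.ClusterOperatorStatus {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorAvailable,
		Status:             configv1.ConditionFalse,
		LastTransitionTime: metav1.Now(),
	}
	status := configv1.ClusterOperatorStatus{}
	if available(successfulSyncs, totalSyncs) {
		cond.Status = configv1.ConditionTrue
		status.Versions = []configv1.OperandVersion{{Name: "operator", Version: payloadVersion}}
	}
	status.Conditions = append(status.Conditions, cond)
	return status
}

func main() {
	// "4.1.0" is an illustrative payload version, not taken from the PR.
	fmt.Printf("%+v\n", clusterOperatorStatus("4.1.0", 10, 10))
}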

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 11, 2019
@ecordell
Member

/test unit

@ecordell
Member

/retest

Member

@ecordell ecordell left a comment


/lgtm

Thanks for getting us up to code, @smarterclayton

We should have something similar for catalog-operator; we'll need a new top-level clusterversion for reporting that.

I'll put something on the backlog to get better reporting of progressing. Sync success rate doesn't tell you much.
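A rough sketch (not the OLM implementation) of the kind of sync bookkeeping this relies on: each queueinformer sync bumps a counter, and the success ratio is read back when building ClusterOperator conditions. It illustrates why the signal is coarse, as noted above.

package main

import (
	"fmt"
	"sync/atomic"
)

// syncCounts is a hypothetical counter pair; the names are illustrative.
type syncCounts struct {
	total     int64
	succeeded int64
}

// record is called after each sync attempt with the sync's error, if any.
func (c *syncCounts) record(err error) {
	atomic.AddInt64(&c.total, 1)
	if err == nil {
		atomic.AddInt64(&c.succeeded, 1)
	}
}

// successRatio is the coarse signal discussed above: it shows that syncs are
// being retired, but not whether every InstallPlan has ever synced cleanly.
func (c *syncCounts) successRatio() float64 {
	t := atomic.LoadInt64(&c.total)
	if t == 0 {
		return 0
	}
	return float64(atomic.LoadInt64(&c.succeeded)) / float64(t)
}

func main() {
	var c syncCounts
	c.record(nil)
	c.record(fmt.Errorf("transient error"))
	fmt.Printf("success ratio: %.2f\n", c.successRatio())
}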

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 11, 2019
@ecordell
Member

/retest

@smarterclayton
Collaborator Author

Looks like a legit failure

/lgtm cancel

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 12, 2019
@smarterclayton
Collaborator Author

smarterclayton commented Mar 12, 2019

--- FAIL: TestCSVCopyWatchingAllNamespaces (220.56s)
	operator_groups_e2e_test.go:1129: Creating CRD
	operator_groups_e2e_test.go:1136: Getting default operator group 'global-operators' installed via operatorgroup-default.yaml openshift-operators
	operator_groups_e2e_test.go:1144: Waiting on operator group to have correct status
	operator_groups_e2e_test.go:1159: Creating CSV
	operator_groups_e2e_test.go:1229: wait for CSV to succeed
	csv_e2e_test.go:226: Pending (RequirementsUnknown): requirements not yet checked
	csv_e2e_test.go:226: Installing (InstallSucceeded): waiting for install components to report healthy
	csv_e2e_test.go:226: Installing (InstallWaiting): installing: Waiting: waiting for deployment operator-deploymentsjjqt to become ready: Waiting for rollout to finish: 0 of 2 updated replicas are available...
	csv_e2e_test.go:226: [line above repeated 20 more times while waiting for the rollout]
	csv_e2e_test.go:226: Succeeded (InstallSucceeded): install strategy completed with no errors
	operator_groups_e2e_test.go:1233: Waiting for operator namespace csv to have annotations
	operator_groups_e2e_test.go:1252: Found CSV count of 32
	operator_groups_e2e_test.go:1254: Create other namespace
	operator_groups_e2e_test.go:1268: Waiting to ensure copied CSV shows up in other namespace
	operator_groups_e2e_test.go:1286: Modifying operator group to no longer watch all namespaces
	require.go:794: 
			Error Trace:	operator_groups_e2e_test.go:1312
			Error:      	Received unexpected error:
			            	timed out waiting for the condition
			Test:       	TestCSVCopyWatchingAllNamespaces
	operator_groups_e2e_test.go:1293: Re-modifying operator group to be watching all namespaces

I think these are the errors within the window of the test failure, but I can't see anything obvious.

E0312 13:39:46.351352       1 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"b-hlwmw-stable.158b39be5e6d3bfd", GenerateName:"", Namespace:"b-4wxcz", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"b-4wxcz", Name:"b-hlwmw-stable", UID:"3889b57c-44cc-11e9-a299-0e1cd9533b18", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"42171", FieldPath:""}, Reason:"NoOperatorGroup", Message:"csv in namespace with no operatorgroups", Source:v1.EventSource{Component:"operator-lifecycle-manager", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf1a0b8894dd67fd, ext:1185390129874, loc:(*time.Location)(0x1f4f7a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf1a0b8894dd67fd, ext:1185390129874, loc:(*time.Location)(0x1f4f7a0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "b-hlwmw-stable.158b39be5e6d3bfd" is forbidden: unable to create new content in namespace b-4wxcz because it is being terminated' (will not retry!)
E0312 13:39:46.625748       1 queueinformer_operator.go:175] Sync "b-4wxcz/b-hlwmw-stable" failed: error transitioning ClusterServiceVersion: csv in namespace with no operatorgroups and error updating CSV status: error updating ClusterServiceVersion status: Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com "b-hlwmw-stable": StorageError: invalid object, Code: 4, Key: /openshift.io/operators.coreos.com/clusterserviceversions/b-4wxcz/b-hlwmw-stable, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 3889b57c-44cc-11e9-a299-0e1cd9533b18, UID in object meta:
time="2019-03-12T13:39:48Z" level=error msg="Create for new CSV failed: clusterserviceversions.operators.coreos.com \"another-csv-t8j44\" is forbidden: unable to create new content in namespace c-qd258 because it is being terminated"
time="2019-03-12T13:39:59Z" level=error msg="Create for new CSV failed: namespaces \"d-99ndr\" not found"
time="2019-03-12T13:40:01Z" level=warning msg="error adding operatorgroup annotations" csv=b-hlwmw-stable error="Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com \"b-hlwmw-stable\": StorageError: invalid object, Code: 4, Key: /openshift.io/operators.coreos.com/clusterserviceversions/openshift-monitoring/b-hlwmw-stable, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 38a4023a-44cc-11e9-8851-12831077e07e, UID in object meta: " namespace=openshift-monitoring operatorGroup=openshift-cluster-monitoring
time="2019-03-12T13:40:02Z" level=warning msg="failed to annotate CSVs in operatorgroup after group change" error="Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com \"b-hlwmw-stable\": StorageError: invalid object, Code: 4, Key: /openshift.io/operators.coreos.com/clusterserviceversions/openshift-monitoring/b-hlwmw-stable, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 38a4023a-44cc-11e9-8851-12831077e07e, UID in object meta: " namespace=openshift-monitoring operatorGroup=openshift-cluster-monitoring
E0312 13:40:02.225778       1 queueinformer_operator.go:175] Sync "openshift-monitoring/openshift-cluster-monitoring" failed: Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com "b-hlwmw-stable": StorageError: invalid object, Code: 4, Key: /openshift.io/operators.coreos.com/clusterserviceversions/openshift-monitoring/b-hlwmw-stable, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 38a4023a-44cc-11e9-8851-12831077e07e, UID in object meta:
time="2019-03-12T13:40:12Z" level=error msg="Create for new CSV failed: namespaces \"b-4wxcz\" not found"
time="2019-03-12T13:40:20Z" level=warning msg="unhealthy component: Waiting: waiting for deployment operator-deploymentsjjqt to become ready: Waiting for deployment spec update to be observed...\n" csv=another-csv-t8j44 id=0MJqe namespace=openshift-operators phase=Succeeded strategy=deployment
time="2019-03-12T13:40:21Z" level=warning msg="needs reinstall: Waiting: waiting for deployment operator-deploymentsjjqt to become ready: Waiting for rollout to finish: 1 out of 2 new replicas have been updated...\n" csv=another-csv-t8j44 id=aB9VM namespace=openshift-operators phase=Failed strategy=deployment
time="2019-03-12T13:43:20Z" level=warning msg="unhealthy component: Waiting: waiting for deployment operator-deploymentsjjqt to become ready: Waiting for deployment spec update to be observed...\n" csv=another-csv-t8j44 id=xwt+9 namespace=openshift-operators phase=Succeeded strategy=deployment
time="2019-03-12T13:43:20Z" level=warning msg="needs reinstall: Waiting: waiting for deployment operator-deploymentsjjqt to become ready: Waiting for deployment spec update to be observed...\n" csv=another-csv-t8j44 id=jiRdD namespace=openshift-operators phase=Failed strategy=deployment
time="2019-03-12T13:43:26Z" level=error msg="Create for new CSV failed: namespaces \"openshift-operators-pxxqq\" not found"

@smarterclayton
Collaborator Author

This is really weird; the error is consistent in this PR. Trying to spot what I changed - I might have slowed the sync loop down such that we're now flaking more (that would be my first guess). Have we seen this test flake before?

@smarterclayton
Collaborator Author

Hrm, error also happened in https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/operator-framework_operator-lifecycle-manager/750/pull-ci-operator-framework-operator-lifecycle-manager-master-e2e-aws-olm/1225/

So I think that flake isn't caused by this PR, but this PR might cause it to happen more frequently.

@smarterclayton
Collaborator Author

/retest

1 similar comment
@smarterclayton
Collaborator Author

/retest

@njhale
Member

njhale commented Mar 13, 2019

I have a pretty good idea of what's causing the flakes with TestUpdateCSVWithOwnedAPIService at least. I should have a PR with a fix by sometime tomorrow morning.

@njhale
Member

njhale commented Mar 14, 2019

#758 opened - should help out some.

@ecordell
Member

/retest

@ecordell ecordell force-pushed the report_level_status branch from 012ff64 to 1c10730 on March 15, 2019 11:50
@ecordell
Member

/retest

@ecordell
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 15, 2019
@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ecordell

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit da201ca into operator-framework:master Mar 15, 2019
@ecordell ecordell mentioned this pull request Mar 18, 2019
@ecordell ecordell added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 19, 2019