
clusteroperator: Report when OLM reaches "level" and check syncs #748

Conversation

smarterclayton
Collaborator

Cluster operators are expected to report the version of the payload
they are included in once they are "deployed", and also to keep the
cluster operator object created. Have the OLM operator keep CO up
to date, report the payload version once it hits available, and use
the count of successful syncs from the queueInformers as a probabilistic
measurement of "available" (i.e. is the operator able to retire syncs).
A future change should add a "health over time" metric or a "has
successfully synced all InstallPlans at least once" metric to replace
the current estimation.
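A minimal sketch of the approach described above, assuming a hypothetical pair of sync counters fed by the queueinformers; the actual OLM helpers and condition wiring may differ, but the openshift/api ClusterOperator types are the real ones:

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// available treats the operator as "available" once every recorded sync has
// completed successfully -- a coarse, probabilistic stand-in for "level".
func available(successfulSyncs, totalSyncs int) bool {
	return totalSyncs > 0 && successfulSyncs == totalSyncs
}

// clusterOperatorStatus builds the status written to the OLM ClusterOperator:
// the Available condition reflects the sync counts, and the payload version is
// only reported once the operator is considered available.
func clusterOperatorStatus(payloadVersion string, successfulSyncs, totalSyncs int) configv1.ClusterOperatorStatus {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorAvailable,
		Status:             configv1.ConditionFalse,
		LastTransitionTime: metav1.Now(),
	}
	status := configv1.ClusterOperatorStatus{}
	if available(successfulSyncs, totalSyncs) {
		cond.Status = configv1.ConditionTrue
		status.Versions = []configv1.OperandVersion{{Name: "operator", Version: payloadVersion}}
	}
	status.Conditions = append(status.Conditions, cond)
	return status
}

func main() {
	// "4.1.0" is an illustrative payload version, not taken from the PR.
	fmt.Printf("%+v\n", clusterOperatorStatus("4.1.0", 10, 10))
}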

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 11, 2019
@ecordell
Member

/test unit

@ecordell
Member

/retest

Member

@ecordell ecordell left a comment


/lgtm

Thanks for getting us up to code, @smarterclayton

We should have something similar for catalog-operator; we'll need a new top-level clusterversion for reporting that.

I'll put something on the backlog to get better reporting of progressing. Sync success rate doesn't tell you much.
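A rough sketch (not the OLM implementation) of the kind of sync bookkeeping this relies on: each queueinformer sync bumps a counter, and the success ratio is read back when building ClusterOperator conditions. It illustrates why the signal is coarse, as noted above.

package main

import (
	"fmt"
	"sync/atomic"
)

// syncCounts is a hypothetical counter pair; the names are illustrative.
type syncCounts struct {
	total     int64
	succeeded int64
}

// record is called after each sync attempt with the sync's error, if any.
func (c *syncCounts) record(err error) {
	atomic.AddInt64(&c.total, 1)
	if err == nil {
		atomic.AddInt64(&c.succeeded, 1)
	}
}

// successRatio is the coarse signal discussed above: it shows that syncs are
// being retired, but not whether every InstallPlan has ever synced cleanly.
func (c *syncCounts) successRatio() float64 {
	t := atomic.LoadInt64(&c.total)
	if t == 0 {
		return 0
	}
	return float64(atomic.LoadInt64(&c.succeeded)) / float64(t)
}

func main() {
	var c syncCounts
	c.record(nil)
	c.record(fmt.Errorf("transient error"))
	fmt.Printf("success ratio: %.2f\n", c.successRatio())
}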

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 11, 2019
@ecordell
Member

/retest

@smarterclayton
Collaborator Author

Looks like a legit failure

/lgtm cancel

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 12, 2019
@smarterclayton
Collaborator Author

smarterclayton commented Mar 12, 2019

--- FAIL: TestCSVCopyWatchingAllNamespaces (220.56s)
	operator_groups_e2e_test.go:1129: Creating CRD
	operator_groups_e2e_test.go:1136: Getting default operator group 'global-operators' installed via operatorgroup-default.yaml openshift-operators
	operator_groups_e2e_test.go:1144: Waiting on operator group to have correct status
	operator_groups_e2e_test.go:1159: Creating CSV
	operator_groups_e2e_test.go:1229: wait for CSV to succeed
	csv_e2e_test.go:226: Pending (RequirementsUnknown): requirements not yet checked
	csv_e2e_test.go:226: Installing (InstallSucceeded): waiting for install components to report healthy
	csv_e2e_test.go:226: Installing (InstallWaiting): installing: Waiting: waiting for deployment operator-deploymentsjjqt to become ready: Waiting for rollout to finish: 0 of 2 updated replicas are available...
	csv_e2e_test.go:226: [line above repeated 20 more times while waiting for the rollout]
	csv_e2e_test.go:226: Succeeded (InstallSucceeded): install strategy completed with no errors
	operator_groups_e2e_test.go:1233: Waiting for operator namespace csv to have annotations
	operator_groups_e2e_test.go:1252: Found CSV count of 32
	operator_groups_e2e_test.go:1254: Create other namespace
	operator_groups_e2e_test.go:1268: Waiting to ensure copied CSV shows up in other namespace
	operator_groups_e2e_test.go:1286: Modifying operator group to no longer watch all namespaces
	require.go:794: 
			Error Trace:	operator_groups_e2e_test.go:1312
			Error:      	Received unexpected error:
			            	timed out waiting for the condition
			Test:       	TestCSVCopyWatchingAllNamespaces
	operator_groups_e2e_test.go:1293: Re-modifying operator group to be watching all namespaces

I think these are the errors within the window of the test failure, but I can't see anything obvious.

E0312 13:39:46.351352       1 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"b-hlwmw-stable.158b39be5e6d3bfd", GenerateName:"", Namespace:"b-4wxcz", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"b-4wxcz", Name:"b-hlwmw-stable", UID:"3889b57c-44cc-11e9-a299-0e1cd9533b18", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"42171", FieldPath:""}, Reason:"NoOperatorGroup", Message:"csv in namespace with no operatorgroups", Source:v1.EventSource{Component:"operator-lifecycle-manager", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf1a0b8894dd67fd, ext:1185390129874, loc:(*time.Location)(0x1f4f7a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf1a0b8894dd67fd, ext:1185390129874, loc:(*time.Location)(0x1f4f7a0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "b-hlwmw-stable.158b39be5e6d3bfd" is forbidden: unable to create new content in namespace b-4wxcz because it is being terminated' (will not retry!)
E0312 13:39:46.625748       1 queueinformer_operator.go:175] Sync "b-4wxcz/b-hlwmw-stable" failed: error transitioning ClusterServiceVersion: csv in namespace with no operatorgroups and error updating CSV status: error updating ClusterServiceVersion status: Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com "b-hlwmw-stable": StorageError: invalid object, Code: 4, Key: /openshift.io/operators.coreos.com/clusterserviceversions/b-4wxcz/b-hlwmw-stable, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 3889b57c-44cc-11e9-a299-0e1cd9533b18, UID in object meta:
time="2019-03-12T13:39:48Z" level=error msg="Create for new CSV failed: clusterserviceversions.operators.coreos.com \"another-csv-t8j44\" is forbidden: unable to create new content in namespace c-qd258 because it is being terminated"
time="2019-03-12T13:39:59Z" level=error msg="Create for new CSV failed: namespaces \"d-99ndr\" not found"
time="2019-03-12T13:40:01Z" level=warning msg="error adding operatorgroup annotations" csv=b-hlwmw-stable error="Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com \"b-hlwmw-stable\": StorageError: invalid object, Code: 4, Key: /openshift.io/operators.coreos.com/clusterserviceversions/openshift-monitoring/b-hlwmw-stable, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 38a4023a-44cc-11e9-8851-12831077e07e, UID in object meta: " namespace=openshift-monitoring operatorGroup=openshift-cluster-monitoring
time="2019-03-12T13:40:02Z" level=warning msg="failed to annotate CSVs in operatorgroup after group change" error="Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com \"b-hlwmw-stable\": StorageError: invalid object, Code: 4, Key: /openshift.io/operators.coreos.com/clusterserviceversions/openshift-monitoring/b-hlwmw-stable, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 38a4023a-44cc-11e9-8851-12831077e07e, UID in object meta: " namespace=openshift-monitoring operatorGroup=openshift-cluster-monitoring
E0312 13:40:02.225778       1 queueinformer_operator.go:175] Sync "openshift-monitoring/openshift-cluster-monitoring" failed: Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com "b-hlwmw-stable": StorageError: invalid object, Code: 4, Key: /openshift.io/operators.coreos.com/clusterserviceversions/openshift-monitoring/b-hlwmw-stable, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 38a4023a-44cc-11e9-8851-12831077e07e, UID in object meta:
time="2019-03-12T13:40:12Z" level=error msg="Create for new CSV failed: namespaces \"b-4wxcz\" not found"
time="2019-03-12T13:40:20Z" level=warning msg="unhealthy component: Waiting: waiting for deployment operator-deploymentsjjqt to become ready: Waiting for deployment spec update to be observed...\n" csv=another-csv-t8j44 id=0MJqe namespace=openshift-operators phase=Succeeded strategy=deployment
time="2019-03-12T13:40:21Z" level=warning msg="needs reinstall: Waiting: waiting for deployment operator-deploymentsjjqt to become ready: Waiting for rollout to finish: 1 out of 2 new replicas have been updated...\n" csv=another-csv-t8j44 id=aB9VM namespace=openshift-operators phase=Failed strategy=deployment
time="2019-03-12T13:43:20Z" level=warning msg="unhealthy component: Waiting: waiting for deployment operator-deploymentsjjqt to become ready: Waiting for deployment spec update to be observed...\n" csv=another-csv-t8j44 id=xwt+9 namespace=openshift-operators phase=Succeeded strategy=deployment
time="2019-03-12T13:43:20Z" level=warning msg="needs reinstall: Waiting: waiting for deployment operator-deploymentsjjqt to become ready: Waiting for deployment spec update to be observed...\n" csv=another-csv-t8j44 id=jiRdD namespace=openshift-operators phase=Failed strategy=deployment
time="2019-03-12T13:43:26Z" level=error msg="Create for new CSV failed: namespaces \"openshift-operators-pxxqq\" not found"

@smarterclayton
Collaborator Author

This is really weird; the error is consistent in this PR. Trying to spot what I changed - I might have slowed the sync loop down such that we're now flaking more (that would be my first guess). Have we seen this test flake before?

@smarterclayton
Collaborator Author

Hrm, error also happened in https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/operator-framework_operator-lifecycle-manager/750/pull-ci-operator-framework-operator-lifecycle-manager-master-e2e-aws-olm/1225/

So I think that flake isn't caused by this PR, but this PR might cause it to happen more frequently.

@smarterclayton
Collaborator Author

/retest

1 similar comment
@smarterclayton
Collaborator Author

/retest

@njhale
Member

njhale commented Mar 13, 2019

I have a pretty good idea of what's causing the flakes with TestUpdateCSVWithOwnedAPIService at least. I should have a PR with a fix by sometime tomorrow morning.

@njhale
Member

njhale commented Mar 14, 2019

#758 opened - should help out some.

@ecordell
Member

/retest

@ecordell ecordell force-pushed the report_level_status branch from 012ff64 to 1c10730 on March 15, 2019 11:50
@ecordell
Member

/retest

@ecordell
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 15, 2019
@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ecordell

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit da201ca into operator-framework:master Mar 15, 2019
@ecordell ecordell mentioned this pull request Mar 18, 2019
@ecordell ecordell added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 19, 2019