clusteroperator: Report when OLM reaches "level" and check syncs #748
/test unit
/retest
/lgtm
Thanks for getting us up to code, @smarterclayton
We should have something similar for catalog-operator; we'll need a new top-level clusterversion for reporting that.
I'll put something on the backlog to get better reporting of progressing. Sync success rate doesn't tell you much.
/retest
Looks like a legit failure.
/lgtm cancel
I think these are the errors within the window of the test failing, but I can't see anything obvious
0cf9ff6 to cc76831
This is really weird, the error is consistent in this PR. Trying to spot what I changed. My first guess is that I slowed the sync loop down enough that we're now flaking more. Have we seen this test flake before?
cc76831 to 012ff64
Hrm, the error also happened in https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/operator-framework_operator-lifecycle-manager/750/pull-ci-operator-framework-operator-lifecycle-manager-master-e2e-aws-olm/1225/ so I think that flake is not caused by this PR, but this PR might make it happen more frequently.
/retest
/retest
I have a pretty good idea of what's causing the flakes with |
#758 opened - should help out some.
/retest
Cluster operators are expected to report the version of the payload they are included in once they are "deployed", and also to keep the cluster operator object created. Have the OLM operator keep the CO up to date, report the payload version once it hits available, and use the count of successful syncs from the queueInformers as a probabilistic measurement of "available" (i.e. is the operator able to retire syncs?). A future change should add a "health over time" metric or a "has successfully synced all InstallPlans at least once" metric to replace the current estimation.
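To make the estimation described above concrete, here is a minimal Go sketch of the idea, not the code in this PR. Every name in it (`syncTracker`, `evaluate`, `status`, `errSyncFailed`) and the `threshold` parameter are assumptions made for illustration. It counts sync outcomes over a window, treats a high enough success ratio as "available", and only then advances the reported payload version.

```go
package main

import (
	"errors"
	"sync"
)

// errSyncFailed is a stand-in for whatever error a real sync handler returns.
var errSyncFailed = errors.New("sync failed")

// syncTracker counts sync outcomes between status checks (hypothetical type).
type syncTracker struct {
	mu        sync.Mutex
	successes int
	failures  int
}

// Record notes the result of one sync attempt; a nil error is a success.
func (t *syncTracker) Record(err error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if err != nil {
		t.failures++
		return
	}
	t.successes++
}

// Drain returns the counts gathered since the last call and resets them.
func (t *syncTracker) Drain() (successes, failures int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	successes, failures = t.successes, t.failures
	t.successes, t.failures = 0, 0
	return successes, failures
}

// status is a simplified view of what would be written to the cluster
// operator object (hypothetical; the real object carries conditions).
type status struct {
	Available bool
	Version   string
}

// evaluate derives the next status from one window of sync counts.
// threshold is an assumed tunable: the minimum success ratio that is
// treated as "available". With no syncs in the window there is no signal,
// so the previous status is kept unchanged.
func evaluate(prev status, successes, failures int, payloadVersion string, threshold float64) status {
	total := successes + failures
	if total == 0 {
		return prev
	}
	next := prev
	next.Available = float64(successes)/float64(total) >= threshold
	if next.Available {
		// Only report the payload version once the operator looks available.
		next.Version = payloadVersion
	}
	return next
}
```

A caller would invoke `Record` from each sync handler and call `Drain` plus `evaluate` on a timer, writing the result to the cluster operator object. As noted in the review, a success ratio says little about progress, which is why the description treats it as a temporary estimate.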
012ff64 to 1c10730
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ecordell