
Consistent and Correct Conditions #5076

Open · 6 tasks
jonjohnsonjr opened this issue Aug 6, 2019 · 22 comments
Labels: area/API (API objects and controllers) · kind/cleanup (cleaning up code, process, or technical debt) · lifecycle/frozen (should not be auto-closed due to staleness) · triage/accepted (should be fixed post-triage)

@jonjohnsonjr (Contributor) commented Aug 6, 2019

Not all of our conditions follow our condition conventions.

We also don't propagate conditions in a consistent (or even correct) way: #4937

I'm opening this to track some work around wrangling our conditions.

  1. Resources that reconcile child resources should have a condition for each child that reflects the child's top-level "happy state" (TLHS) condition.
    1. This makes it easy to propagate conditions from our children.
  2. A resource's own TLHS condition should depend on those child conditions via ConditionSets.
    1. This makes it easy to propagate conditions to our parent.
  3. We should always update a resource's observed generation to indicate that we've attempted to reconcile its spec.
    1. This allows our parent to know if our status is up to date.
  4. If the observed generation of a child resource does not match its current generation, we should mark it as NotReconciled, since its state does not yet reflect its spec.
    1. This ensures we don’t reflect stale child statuses.
    2. This might not make sense in some cases (e.g. a Revision reconciling a Deployment would flip back and forth between Ready and Unknown, given that the KPA will be constantly bumping the Deployment's Generation).
  5. If there is a non-terminal error during reconciliation, Ready should become Unknown.
    1. This ensures we aren’t prematurely Ready and that we aren’t swallowing errors.
  6. If the observed generation does match, and we did not encounter an error, we should propagate the child’s TLHS condition into our own.
    1. This is the happy path.

In general, everything should look something like this:
[image: diagram of the general parent/child condition flow]

Here's Service (a bit simplified) as an example. Service is pretty close to correct, but it's not unconditionally bumping ObservedGeneration or updating its status with reconciliation errors (bolded items aren't correct yet):
[image: Service condition propagation diagram]

We'll also need to rework the Revision and KPA and SKS conditions to fit better into this world, since they're all over the place.

TODO

  • Consistently update ObservedGeneration
  • Surface reconciliation errors in conditions.
  • Modify conditions to be in terms of child resources.
  • Consistently propagate child conditions.
  • Attempt to refactor/code-gen this stuff so we don't regress.
  • Update sample-controller with parent->child best practices.

/area API

@jonjohnsonjr jonjohnsonjr added the kind/question Further information is requested label Aug 6, 2019
@knative-prow-robot knative-prow-robot added the area/API API objects and controllers label Aug 6, 2019
@markusthoemmes (Contributor)

Thanks for the writeup @jonjohnsonjr. That makes a lot of sense to me!

@jonjohnsonjr (Contributor, Author)

@JRBANCEL pointed out some more inconsistencies in our lifecycle Mark* methods:

$ rg -e 'func .* Mark[^\(]+' -o --no-filename | sort | uniq
func (cs *CertificateStatus) MarkNotReady
func (cs *CertificateStatus) MarkReady
func (cs *CertificateStatus) MarkResourceNotOwned
func (cs *CertificateStatus) MarkUnknown
func (cs *ConfigurationStatus) MarkLatestCreatedFailed
func (cs *ConfigurationStatus) MarkLatestReadyDeleted
func (cs *ConfigurationStatus) MarkResourceNotConvertible
func (cs *ConfigurationStatus) MarkRevisionCreationFailed
func (is *IngressStatus) MarkLoadBalancerPending
func (is *IngressStatus) MarkLoadBalancerReady
func (is *IngressStatus) MarkNetworkConfigured
func (is *IngressStatus) MarkResourceNotOwned
func (pas *PodAutoscalerStatus) MarkActivating
func (pas *PodAutoscalerStatus) MarkActive
func (pas *PodAutoscalerStatus) MarkInactive
func (pas *PodAutoscalerStatus) MarkResourceFailedCreation
func (pas *PodAutoscalerStatus) MarkResourceNotOwned
func (rs *RevisionStatus) MarkActivating
func (rs *RevisionStatus) MarkActive
func (rs *RevisionStatus) MarkContainerExiting
func (rs *RevisionStatus) MarkContainerHealthy
func (rs *RevisionStatus) MarkContainerMissing
func (rs *RevisionStatus) MarkDeploying
func (rs *RevisionStatus) MarkInactive
func (rs *RevisionStatus) MarkProgressDeadlineExceeded
func (rs *RevisionStatus) MarkResourceNotConvertible
func (rs *RevisionStatus) MarkResourceNotOwned
func (rs *RevisionStatus) MarkResourcesAvailable
func (rs *RevisionStatus) MarkResourcesUnavailable
func (rs *RouteStatus) MarkCertificateNotOwned
func (rs *RouteStatus) MarkCertificateNotReady
func (rs *RouteStatus) MarkCertificateProvisionFailed
func (rs *RouteStatus) MarkCertificateReady
func (rs *RouteStatus) MarkConfigurationFailed
func (rs *RouteStatus) MarkConfigurationNotReady
func (rs *RouteStatus) MarkIngressNotConfigured
func (rs *RouteStatus) MarkMissingTrafficTarget
func (rs *RouteStatus) MarkResourceNotConvertible
func (rs *RouteStatus) MarkRevisionFailed
func (rs *RouteStatus) MarkRevisionNotReady
func (rs *RouteStatus) MarkServiceNotOwned
func (rs *RouteStatus) MarkTrafficAssigned
func (rs *RouteStatus) MarkUnknownTrafficError
func (ss *ServiceStatus) MarkConfigurationNotOwned
func (ss *ServiceStatus) MarkConfigurationNotReconciled
func (ss *ServiceStatus) MarkResourceNotConvertible
func (ss *ServiceStatus) MarkRevisionNameTaken
func (ss *ServiceStatus) MarkRouteNotOwned
func (ss *ServiceStatus) MarkRouteNotReconciled
func (ss *ServiceStatus) MarkRouteNotYetReady
func (sss *ServerlessServiceStatus) MarkActivatorEndpointsPopulated
func (sss *ServerlessServiceStatus) MarkActivatorEndpointsRemoved
func (sss *ServerlessServiceStatus) MarkEndpointsNotOwned
func (sss *ServerlessServiceStatus) MarkEndpointsNotReady
func (sss *ServerlessServiceStatus) MarkEndpointsReady

The naming is inconsistent: sometimes NotReady means Unknown, other times NotReady means failed. It's not clear from the method name what state you're actually setting.

For simple methods that just set the state, I'd propose following this pattern, since it's the most prevalent:

MarkFooReady -> Ready=True
MarkFooNotReady -> Ready=Unknown
MarkFooFailed -> Ready=False

There are a lot of convenience methods that make setting certain statuses more legible; we should probably leave those alone for now.
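A minimal sketch of that three-method pattern, using a hypothetical FooStatus (not an actual Knative type):

```go
package main

import "fmt"

// FooStatus is a hypothetical status type used only to illustrate the
// proposed naming convention; these are not real Knative lifecycle methods.
type FooStatus struct {
	ready string // condition status: "True", "Unknown", or "False"
}

// MarkFooReady -> Ready=True
func (s *FooStatus) MarkFooReady() { s.ready = "True" }

// MarkFooNotReady -> Ready=Unknown (still in progress, not failed)
func (s *FooStatus) MarkFooNotReady() { s.ready = "Unknown" }

// MarkFooFailed -> Ready=False (terminal failure)
func (s *FooStatus) MarkFooFailed() { s.ready = "False" }

func main() {
	var s FooStatus
	s.MarkFooNotReady()
	fmt.Println(s.ready) // Unknown
	s.MarkFooReady()
	fmt.Println(s.ready) // True
}
```

The point of the convention is that the method name alone tells you which of the three condition states you're setting.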

@ZhiminXiang

MarkFooReady -> Ready=True
MarkFooNotReady -> Ready=Unknown
MarkFooFailed -> Ready=False

We also need MarkFooNotOwned for the case where the Foo resource is not owned by the parent resource, and MarkFooNotReconciled for the case where generation != observedGeneration. We probably want to add these two functions to our standard.
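Those two additional methods might look like this, again with a hypothetical FooStatus (not the actual Knative lifecycle API):

```go
package main

import "fmt"

// Cond is a simplified stand-in for a status condition.
type Cond struct {
	Status string
	Reason string
}

// FooStatus is a hypothetical status type for illustration only.
type FooStatus struct{ Ready Cond }

// MarkFooNotOwned: Foo exists but is not owned by this parent.
// That's a terminal failure, so Ready=False.
func (s *FooStatus) MarkFooNotOwned() {
	s.Ready = Cond{Status: "False", Reason: "NotOwned"}
}

// MarkFooNotReconciled: Foo's generation != observedGeneration, so its
// status is stale. Not failed, just not done yet, so Ready=Unknown.
func (s *FooStatus) MarkFooNotReconciled() {
	s.Ready = Cond{Status: "Unknown", Reason: "NotReconciled"}
}

func main() {
	var s FooStatus
	s.MarkFooNotReconciled()
	fmt.Println(s.Ready.Status, s.Ready.Reason) // Unknown NotReconciled
	s.MarkFooNotOwned()
	fmt.Println(s.Ready.Status, s.Ready.Reason) // False NotOwned
}
```

Note the asymmetry: NotOwned is False because waiting won't fix it, while NotReconciled is Unknown because it resolves itself once the child's controller catches up.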

@nimakaviani (Contributor)

@dgerd opened an issue and a proposal (#5780) on how to deal with condition types and marking resource status. PR #5804 already implements that proposal for the revision lifecycle, and I plan to continue the refactoring for the other resource status markers as well.

@nimakaviani (Contributor)

Also see #5804 (comment) for cross-referencing.

@dgerd commented Dec 2, 2019

As part of #4937 I took a look at our current reconciler loops and put together a few state machines to help visualize the potential condition sets you can end up with. A few thoughts about them:

  1. We should not have terminal states that exit without updating status. These are all opportunities to provide additional information/context back to the client, and I can't think of a reason why we wouldn't want to. The reconciler knows something went wrong, but we are not sharing that information back.
    • As an aside, it is very easy to return an error in the reconcilers and end up with more of these. This may be an opportunity to extend our controller framework to ensure that an error/success always gets propagated back.
  2. As mentioned in the issue referenced above, ObservedGeneration is inconsistent across resources. The only consistent guidance I have found from reading K8s docs and issues is that controllers should report the most recent generation seen in status when they post status. Our API specification states: "The latest metadata.generation that the reconciler has observed. If observedGeneration is updated, conditions MUST be updated with current status in the same transaction." If we make all reconciler exits post status, this would allow us to just bump observedGeneration from the controller framework at the beginning of the reconcile loop. The current state of the world is:
    • KService - Incremented at the end
    • Configuration - Incremented after v1alpha1 conversion, but before everything else
    • Route - Incremented on traffic errors, or at the end
  3. We use "Unknown" for (at least) four things. We should stop using it for (4).
    1. Initialization
    2. The observedGeneration of a child resource does not match (i.e. child not yet reconciled)
    3. We did not get around to looking at this resource (i.e. exited early due to previous dependency failing or not yet ready)
    4. Unknown error returned about object and we are not sure how to proceed
  4. We should stop posting Reason and Message for "True" conditions and should always post a Reason and Message for False/Unknown conditions; we are mostly consistent about this already. We have wording in the API specification about always including a message for failures, but we have no wording about reasons.
  5. We should be more consistent about wording in "Reason". For example, we do both "FooNotOwned" and "NotOwned" depending on the resource and condition. I would be in favor of removing the stutter, as the condition name already reflects the resource type.

I haven't put together a Revision reconciler state machine yet, but here are the Service, Route, and Config ones.

KService Reconciler
[image: KService reconciler state machine]

Route Reconciler
[image: Route reconciler state machine]

Configuration Reconciler
[image: Configuration reconciler state machine]

@knative-housekeeping-robot

Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

@knative-prow-robot knative-prow-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 2, 2020
@knative-housekeeping-robot

Stale issues rot after 30 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.
Rotten issues close after an additional 30 days of inactivity.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle rotten

@knative-prow-robot knative-prow-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 1, 2020
@knative-housekeeping-robot

Rotten issues close after 30 days of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/close

@knative-prow-robot (Contributor)

@knative-housekeeping-robot: Closing this issue.

In response to this:

Rotten issues close after 30 days of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vagababov (Contributor)

/remove-lifecycle rotten
/reopen
/lifecycle-frozen

@knative-prow-robot knative-prow-robot removed the kind/question Further information is requested label Mar 22, 2021
@github-actions

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 24, 2021
@markusthoemmes (Contributor)

/remove-lifecycle stale

@knative-prow-robot knative-prow-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 24, 2021
@dprotaso dprotaso modified the milestones: Needs Triage, 0.24.x, 0.25.x Jun 24, 2021
@dprotaso (Member)

/lifecycle frozen

@knative-prow-robot knative-prow-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jun 24, 2021
@dprotaso dprotaso modified the milestones: 0.25.x, 0.26.x Aug 26, 2021
@dprotaso dprotaso modified the milestones: 0.26.x, 0.27.x Oct 6, 2021
@dprotaso dprotaso modified the milestones: v1.0.0, v1.1.0 Nov 3, 2021
@dprotaso dprotaso moved this from Now to Next in Serving API Roadmap Nov 4, 2021
@dprotaso dprotaso modified the milestones: v1.1.0, v1.2.0 Jan 4, 2022
@dprotaso dprotaso modified the milestones: v1.2.0, v1.3.0 Feb 2, 2022
@dprotaso dprotaso modified the milestones: v1.3.0, v1.4.0 Mar 10, 2022
@dprotaso dprotaso modified the milestones: v1.4.0, v1.5.0 May 10, 2022
@dprotaso dprotaso modified the milestones: v1.5.0, Icebox Jul 28, 2022
Projects
Status: Ready To Work