
Consistent and Correct Conditions #5076

Open · 6 tasks
jonjohnsonjr opened this issue Aug 6, 2019 · 22 comments
Labels: area/API (API objects and controllers) · kind/cleanup (cleaning up code, process, or technical debt) · lifecycle/frozen (should not be auto-closed due to staleness) · triage/accepted (should be fixed post-triage)

@jonjohnsonjr (Contributor) commented Aug 6, 2019

Not all of our conditions follow our condition conventions.

We also don't propagate conditions in a consistent (or even correct) way: #4937

I'm opening this to track some work around wrangling our conditions.

  1. Resources that reconcile child resources should have a condition for each child that reflects the child's top-level "happy state" (TLHS) condition.
    1. This makes it easy to propagate conditions from our children.
  2. A resource's own TLHS condition should depend on those child conditions via ConditionSets.
    1. This makes it easy to propagate conditions to our parent.
  3. We should always update a resource's observed generation to indicate that we've attempted to reconcile its spec.
    1. This allows our parent to know if our status is up to date.
  4. If the observed generation of a child resource does not match its current generation, we should mark it as NotReconciled, since its state does not yet reflect its spec.
    1. This ensures we don’t reflect stale child statuses.
    2. This might not make sense in some cases (e.g. a Revision reconciling a Deployment would flip back and forth between Ready and Unknown, given that the KPA will be constantly bumping the Deployment's Generation).
  5. If there is a non-terminal error during reconciliation, Ready should become Unknown.
    1. This ensures we aren’t prematurely Ready and that we aren’t swallowing errors.
  6. If the observed generation does match, and we did not encounter an error, we should propagate the child’s TLHS condition into our own.
    1. This is the happy path.

In general, everything should look something like this:
[image: diagram of the general parent/child condition flow]

Here's Service (a bit simplified) as an example. Service is pretty close to correct, but it's not unconditionally bumping ObservedGeneration or updating its status with reconciliation errors (bolded items aren't correct yet):
[image: Service condition propagation diagram]

We'll also need to rework the Revision and KPA and SKS conditions to fit better into this world, since they're all over the place.

TODO

  • Consistently update ObservedGeneration
  • Surface reconciliation errors in conditions.
  • Modify conditions to be in terms of child resources.
  • Consistently propagate child conditions.
  • Attempt to refactor/code-gen this stuff so we don't regress.
  • Update sample-controller with parent->child best practices.

/area API

@jonjohnsonjr jonjohnsonjr added the kind/question Further information is requested label Aug 6, 2019
@knative-prow-robot knative-prow-robot added the area/API API objects and controllers label Aug 6, 2019
@markusthoemmes (Contributor)

Thanks for the writeup @jonjohnsonjr. That makes a lot of sense to me!

@jonjohnsonjr (Contributor, Author)

@JRBANCEL pointed out some more inconsistencies in our lifecycle Mark* methods:

$ rg -e 'func .* Mark[^\(]+' -o --no-filename | sort | uniq
func (cs *CertificateStatus) MarkNotReady
func (cs *CertificateStatus) MarkReady
func (cs *CertificateStatus) MarkResourceNotOwned
func (cs *CertificateStatus) MarkUnknown
func (cs *ConfigurationStatus) MarkLatestCreatedFailed
func (cs *ConfigurationStatus) MarkLatestReadyDeleted
func (cs *ConfigurationStatus) MarkResourceNotConvertible
func (cs *ConfigurationStatus) MarkRevisionCreationFailed
func (is *IngressStatus) MarkLoadBalancerPending
func (is *IngressStatus) MarkLoadBalancerReady
func (is *IngressStatus) MarkNetworkConfigured
func (is *IngressStatus) MarkResourceNotOwned
func (pas *PodAutoscalerStatus) MarkActivating
func (pas *PodAutoscalerStatus) MarkActive
func (pas *PodAutoscalerStatus) MarkInactive
func (pas *PodAutoscalerStatus) MarkResourceFailedCreation
func (pas *PodAutoscalerStatus) MarkResourceNotOwned
func (rs *RevisionStatus) MarkActivating
func (rs *RevisionStatus) MarkActive
func (rs *RevisionStatus) MarkContainerExiting
func (rs *RevisionStatus) MarkContainerHealthy
func (rs *RevisionStatus) MarkContainerMissing
func (rs *RevisionStatus) MarkDeploying
func (rs *RevisionStatus) MarkInactive
func (rs *RevisionStatus) MarkProgressDeadlineExceeded
func (rs *RevisionStatus) MarkResourceNotConvertible
func (rs *RevisionStatus) MarkResourceNotOwned
func (rs *RevisionStatus) MarkResourcesAvailable
func (rs *RevisionStatus) MarkResourcesUnavailable
func (rs *RouteStatus) MarkCertificateNotOwned
func (rs *RouteStatus) MarkCertificateNotReady
func (rs *RouteStatus) MarkCertificateProvisionFailed
func (rs *RouteStatus) MarkCertificateReady
func (rs *RouteStatus) MarkConfigurationFailed
func (rs *RouteStatus) MarkConfigurationNotReady
func (rs *RouteStatus) MarkIngressNotConfigured
func (rs *RouteStatus) MarkMissingTrafficTarget
func (rs *RouteStatus) MarkResourceNotConvertible
func (rs *RouteStatus) MarkRevisionFailed
func (rs *RouteStatus) MarkRevisionNotReady
func (rs *RouteStatus) MarkServiceNotOwned
func (rs *RouteStatus) MarkTrafficAssigned
func (rs *RouteStatus) MarkUnknownTrafficError
func (ss *ServiceStatus) MarkConfigurationNotOwned
func (ss *ServiceStatus) MarkConfigurationNotReconciled
func (ss *ServiceStatus) MarkResourceNotConvertible
func (ss *ServiceStatus) MarkRevisionNameTaken
func (ss *ServiceStatus) MarkRouteNotOwned
func (ss *ServiceStatus) MarkRouteNotReconciled
func (ss *ServiceStatus) MarkRouteNotYetReady
func (sss *ServerlessServiceStatus) MarkActivatorEndpointsPopulated
func (sss *ServerlessServiceStatus) MarkActivatorEndpointsRemoved
func (sss *ServerlessServiceStatus) MarkEndpointsNotOwned
func (sss *ServerlessServiceStatus) MarkEndpointsNotReady
func (sss *ServerlessServiceStatus) MarkEndpointsReady

The naming is inconsistent: sometimes NotReady means Unknown, other times NotReady means failed. It's not clear from the method name what state you're actually setting.

For simple methods that just set the state, I'd propose following this pattern, since it's the most prevalent:

MarkFooReady -> Ready=True
MarkFooNotReady -> Ready=Unknown
MarkFooFailed -> Ready=False

There are a lot of convenience methods that make setting certain statuses more legible; we should probably leave those alone for now.
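A minimal sketch of that three-method pattern, using a hypothetical FooStatus (not an actual Knative type):

```go
package main

import "fmt"

// FooStatus is a hypothetical status type used only to illustrate the
// proposed naming convention; these are not real Knative lifecycle methods.
type FooStatus struct {
	ready string // condition status: "True", "Unknown", or "False"
}

// MarkFooReady -> Ready=True
func (s *FooStatus) MarkFooReady() { s.ready = "True" }

// MarkFooNotReady -> Ready=Unknown (still in progress, not failed)
func (s *FooStatus) MarkFooNotReady() { s.ready = "Unknown" }

// MarkFooFailed -> Ready=False (terminal failure)
func (s *FooStatus) MarkFooFailed() { s.ready = "False" }

func main() {
	var s FooStatus
	s.MarkFooNotReady()
	fmt.Println(s.ready) // Unknown
	s.MarkFooReady()
	fmt.Println(s.ready) // True
}
```

The point of the convention is that the method name alone tells you which of the three condition states you're setting.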

@ZhiminXiang

MarkFooReady -> Ready=True
MarkFooNotReady -> Ready=Unknown
MarkFooFailed -> Ready=False

We also need MarkFooNotOwned for the case where the Foo resource is not owned by the parent resource, and MarkFooNotReconciled for the case where generation != observedGeneration. We probably want to add these two functions to our standard.
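Those two additional methods might look like this, again with a hypothetical FooStatus (not the actual Knative lifecycle API):

```go
package main

import "fmt"

// Cond is a simplified stand-in for a status condition.
type Cond struct {
	Status string
	Reason string
}

// FooStatus is a hypothetical status type for illustration only.
type FooStatus struct{ Ready Cond }

// MarkFooNotOwned: Foo exists but is not owned by this parent.
// That's a terminal failure, so Ready=False.
func (s *FooStatus) MarkFooNotOwned() {
	s.Ready = Cond{Status: "False", Reason: "NotOwned"}
}

// MarkFooNotReconciled: Foo's generation != observedGeneration, so its
// status is stale. Not failed, just not done yet, so Ready=Unknown.
func (s *FooStatus) MarkFooNotReconciled() {
	s.Ready = Cond{Status: "Unknown", Reason: "NotReconciled"}
}

func main() {
	var s FooStatus
	s.MarkFooNotReconciled()
	fmt.Println(s.Ready.Status, s.Ready.Reason) // Unknown NotReconciled
	s.MarkFooNotOwned()
	fmt.Println(s.Ready.Status, s.Ready.Reason) // False NotOwned
}
```

Note the asymmetry: NotOwned is False because waiting won't fix it, while NotReconciled is Unknown because it resolves itself once the child's controller catches up.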

@nimakaviani (Contributor)

@dgerd opened an issue and a proposal (#5780) on how to deal with condition types and marking resource status. PR #5804 already implements that proposal for the revision lifecycle, and I plan to continue the refactoring for the other resource status markers as well.

@nimakaviani (Contributor)

Also see #5804 (comment) for cross-referencing.

@dgerd commented Dec 2, 2019

As part of #4937 I took a look at our current reconciler loops and put together a few state machines to help visualize the potential condition sets you can end up with. A few thoughts about them:

  1. We should not have terminal states that exit without updating status. These are all opportunities to provide additional information/context back to the client, and I can't think of a reason why we wouldn't want to. The reconciler knows something went wrong, but we are not sharing that information back.
    • As an aside, it is very easy to return an error in the reconcilers and end up with more of these. This may be an opportunity to extend our controller framework to ensure that an error/success always gets propagated back.
  2. As mentioned in the issue referenced above, ObservedGeneration is inconsistent across resources. The only consistent guidance I have found from reading K8s docs and issues is that controllers should report the most recent generation seen in status when they post status. Our API specification states: "The latest metadata.generation that the reconciler has observed. If observedGeneration is updated, conditions MUST be updated with current status in the same transaction." If we make all reconciler exits post status, this would allow us to just bump observedGeneration from the controller framework at the beginning of the reconcile loop. The current state of the world is:
    • KService - Incremented at the end
    • Configuration - Incremented after v1alpha1 conversion, but before everything else
    • Route - Incremented on traffic errors, or at the end
  3. We use "Unknown" for (at least) four things. We should stop using it for (4).
    1. Initialization
    2. The observedGeneration of a child resource does not match (i.e. child not yet reconciled)
    3. We did not get around to looking at this resource (i.e. exited early due to previous dependency failing or not yet ready)
    4. Unknown error returned about object and we are not sure how to proceed
  4. We should stop posting Reason and Message for "True" conditions and should always post a Reason and Message for False/Unknown conditions; we are mostly consistent about this already. We have wording in the API specification about always including a message for failures, but we have no wording about reasons.
  5. We should be more consistent about wording in "Reason". For example, we do both "FooNotOwned" and "NotOwned" depending on the resource and condition. I would be in favor of removing the stutter, as the condition name already reflects the resource type.

I haven't put together a Revision reconciler state machine yet, but here are the Service, Route, and Config ones.

KService Reconciler
[image: KService reconciler state machine]

Route Reconciler
[image: Route reconciler state machine]

Configuration Reconciler
[image: Configuration reconciler state machine]

@knative-housekeeping-robot

Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

@knative-prow-robot knative-prow-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 2, 2020
@knative-housekeeping-robot

Stale issues rot after 30 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.
Rotten issues close after an additional 30 days of inactivity.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle rotten

@knative-prow-robot knative-prow-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 1, 2020
@knative-housekeeping-robot

Rotten issues close after 30 days of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/close

@knative-prow-robot (Contributor)

@knative-housekeeping-robot: Closing this issue.

In response to this:

Rotten issues close after 30 days of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vagababov (Contributor)

/remove-lifecycle rotten
/reopen
/lifecycle-frozen

@knative-prow-robot knative-prow-robot removed the kind/question Further information is requested label Mar 22, 2021
@github-actions

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 24, 2021
@markusthoemmes (Contributor)

/remove-lifecycle stale

@knative-prow-robot knative-prow-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 24, 2021
@dprotaso dprotaso modified the milestones: Needs Triage, 0.24.x, 0.25.x Jun 24, 2021
@dprotaso (Member)

/lifecycle frozen

@knative-prow-robot knative-prow-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jun 24, 2021
@dprotaso dprotaso modified the milestones: 0.25.x, 0.26.x Aug 26, 2021
@dprotaso dprotaso modified the milestones: 0.26.x, 0.27.x Oct 6, 2021
@dprotaso dprotaso modified the milestones: v1.0.0, v1.1.0 Nov 3, 2021
@dprotaso dprotaso moved this from Now to Next in Serving API Roadmap Nov 4, 2021
@dprotaso dprotaso modified the milestones: v1.1.0, v1.2.0 Jan 4, 2022
@dprotaso dprotaso modified the milestones: v1.2.0, v1.3.0 Feb 2, 2022
@dprotaso dprotaso modified the milestones: v1.3.0, v1.4.0 Mar 10, 2022
@dprotaso dprotaso modified the milestones: v1.4.0, v1.5.0 May 10, 2022
@dprotaso dprotaso modified the milestones: v1.5.0, Icebox Jul 28, 2022
Projects
Status: Ready To Work