Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat] Enable regular reconciliation of promise workflows #219

Closed
3 tasks done
abangser opened this issue Aug 9, 2024 · 3 comments
Closed
3 tasks done

[feat] Enable regular reconciliation of promise workflows #219

abangser opened this issue Aug 9, 2024 · 3 comments

Comments

@abangser
Copy link
Member

abangser commented Aug 9, 2024

With Kratix being a Kubernetes controller, it is expected that it could reconcile at any point. The most common triggers are:

  1. Changes
  2. Restarts of the controller

It is also common to have a regular interval for that reconciliation to enable drift detection and re-alignment in between the active triggers mentioned above.

Kratix should have a sensible default for regular reconciliations and also allow user override of this number within a reasonable range.

Assumptions

  • The forced-reconciliation should not happen very often by default
    • The re-execution of the pipeline is an expensive operation
    • In the future, we will have healthchecks that would tell Kratix when something has drifted
  • User will want to configure the interval
  • Re-reconciling on a Promise should not re-reconcile all of its resources
    • we may want to add support for this in the future

Scenarios

Feature: Default reconciliation time
  Scenario: Promise workflow is run at least every 10 hours
    Given the Kratix manager has not restarted in the last 10 hours
    When 10 hours have passed
    Then the Promise workflows should be triggered 

    Given the Promise workflow has run recently
    When the planned 10 hour reconciliation time occurs
    Then the Promise workflow does not run unnecessarily

Feature: documentation
  Given a user wanting to know more about Kratix reconciliation
  When they go to the docs
  Then they can find information on the default values

Note: While valuable there are a few outstanding aspects that will make this more complete. Referencing these in the docs will be helpful. These will all be covered in additional pieces of work:

  1. This will not affect the statestores unless the workflow output documents have changed. I.e. if someone has deleted a file from git, but the expected files have not changed, this will not rewrite the file to git.
  2. This will not rerun resource pipelines as those will be done in a different story

Engineering notes

  • We should not try to "detect" drift, but rather just re-execute the pipelines every X hours.

TODO List

  • Validate the writer will rewrite if we update a file in the state store
    • We tested this and found that files were not rewritten so this needs to be actively ensured as part of a different story
  • Validate behaviour when pipeline is in failure state (we are checking on the PipelineCompleted condition)
    • The pipeline will not get rerun. This was deemed the right call since if the pipeline is in crashloopbackoff or another failure mode it most likely should not be rerun without a human intervention.
  • Validate if our requeue logic will be noisy on manager restarts
    • no, in anything it is not effective enough! If the kratix controller restarts, there may be less reconciliation loops scheduled as the 10 hours is recalculated on controller boot. Therefore, in a worst case scenario, if the controller is restarted every 9h and 59m we will not have this job ever automatically run.
@aclevername aclevername self-assigned this Oct 2, 2024
@abangser abangser self-assigned this Oct 2, 2024
@richcooper95 richcooper95 self-assigned this Oct 2, 2024
@ChunyiLyu ChunyiLyu self-assigned this Oct 2, 2024
@abangser
Copy link
Member Author

abangser commented Oct 4, 2024

This should be configurable by the operator. 👈 this will be covered in #228 which is where configurability is introduced.

@ChunyiLyu
Copy link
Member

This can be achieved by customizing manager cache options. See: https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/cache/cache.go#L170

@kirederik kirederik assigned SaphMB and unassigned richcooper95 Oct 7, 2024
@aclevername aclevername assigned richcooper95 and unassigned kirederik Oct 9, 2024
@abangser abangser assigned abangser and unassigned richcooper95 Oct 10, 2024
@abangser abangser changed the title [feat] Enable regular reconciliation loops to identify/fix drift [feat] Enable regular reconciliation of promise workflows Oct 14, 2024
@SaphMB
Copy link
Member

SaphMB commented Oct 15, 2024

Documentation

abangser added a commit that referenced this issue Oct 15, 2024
This is the first of a handful of features which will combine to
provide regular reconciliation across the full Kratix platform.

Right now the reconcilation time is set to 10 hours but will be
configurable in the future.

There are two notes here to be aware of:
1. If the kratix controller restarts, there may be *more*
   reconciliation loops scheduled. That is why this is saying
   "at least" every 10 hours rather than "at most" or "exactly".
2. The workflows will run, but if the declarative outputs are not
   changed, the write to the statestore will not be triggered.
   Making sure to reconcile the actual contents of the statestore
   with the declarede contents is an upcoming piece of work to
   in this stream.

closes #219

Co-authored-by: Sapphire Mason-Brown <[email protected]>
Co-authored-by: Derik Evangelista <[email protected]>
Co-authored-by: Rich Barton-Cooper <[email protected]>
abangser added a commit that referenced this issue Oct 15, 2024
This is the first of a handful of features which will combine to
provide regular reconciliation across the full Kratix platform.

Right now the reconcilation time is set to 10 hours but will be
configurable in the future.

There are two notes here to be aware of:
1. If the kratix controller restarts, there may be *more*
   reconciliation loops scheduled. That is why this is saying
   "at least" every 10 hours rather than "at most" or "exactly".
2. The workflows will run, but if the declarative outputs are not
   changed, the write to the statestore will not be triggered.
   Making sure to reconcile the actual contents of the statestore
   with the declarede contents is an upcoming piece of work to
   in this stream.

closes #219

Co-authored-by: Sapphire Mason-Brown <[email protected]>
Co-authored-by: Derik Evangelista <[email protected]>
Co-authored-by: Rich Barton-Cooper <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants