Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor reconcilers and introduce v1beta2 API #311

Merged
merged 10 commits into from
Feb 16, 2023
Merged

Conversation

darkowlzz
Copy link
Contributor

@darkowlzz darkowlzz commented Sep 13, 2022

This change refactors image-reflector-controller to implement the new unified standards of flux controllers.
It uses the result finalizer from fluxcd/pkg#329 for computing the status conditions of
ImageRepository and ImagePolicy objects such that they are kstatus compliant.

Following are some highlights of the changes:

  • Introduce v1beta2 API which removes the old status condition readiness API and replaces them with the new conditions.Setter API from https://pkg.go.dev/github.com/fluxcd/pkg/[email protected]/conditions#Setter, along with other minor validation checks in the API fields.
  • Cleanup the controllers/ directory by moving the code that are not directly/strictly related to reconciliation operation into the internal/ directory, along with their tests.
  • Refactor the ImageRepository and ImagePolicy to use the helpers from https://pkg.go.dev/github.com/fluxcd/pkg/runtime.
  • The logs, events and notifications have been standardized for useful information to be helpful in debugging.
  • Introduce latest tags in ImageRepository status to list the 10 latest scanned tags.
  • Introduce observed previous image in ImagePolicy status to keep track of the previous image so that it can provide update path information.
  • The reconciliation business logic of both the reconcilers have been divided into smaller functions that are tested separately, out of the reconciliation loop.
  • The existing controller e2e tests weren't changed much in order to use them to ensure the behavior remains the same as much as possible. They are slightly modified to check the kstatus conditions after every run.
  • Cloud provider contextual login is configured per object with .spec.provider in ImageRepository.

The new status conditions under various scenarios can be seen in:

Instructions for testing

To test this, checkout this branch and build the container image with make docker-build IMG=<image-name>. Before deploying the container image, update/install the CRDs to v1beta2 API with make install or apply the CRD definition manifests in config/crd/bases. Create image objects and test.

Fix: #259
Fix: #166

@stefanprodan
Copy link
Member

@darkowlzz please delete the v1alpha CRDs, we haven't kept them around for the others controllers, so it's time to remove them here too.

@stefanprodan
Copy link
Member

For ImageRepository In propose that we add the latest ten tags fetched from the registry. The tag list would help users debug polices by looking up .status.lastScanResult.latestTags e.g.:

status:
  canonicalImageName: ghcr.io/stefanprodan/podinfo
  conditions:
  - lastTransitionTime: "2022-09-12T11:59:49Z"
    message: successful scan, found 34 tags
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: Ready
  lastScanResult:
    scanTime: "2022-09-12T11:59:49Z"
    tagCount: 34
    latestTags: '6.2.0, 6.1.8, 6.1.7, 6.1.6, 6.1.5, 6.1.4, 6.1.3, 6.1.2, 6.1.1, 6.1.0, ...'
  observedExclusionList:
  - ^.*\.sig$
  observedGeneration: 1

@stefanprodan stefanprodan added the enhancement New feature or request label Sep 13, 2022
@stefanprodan stefanprodan added this to the GA milestone Sep 13, 2022
@stefanprodan
Copy link
Member

@darkowlzz does this PR solves #166?

@darkowlzz
Copy link
Contributor Author

does this PR solves #166?

@stefanprodan yes. In

// Was ready before and is ready now, but the scan results have changed.
if conditions.IsReady(oldObj) && conditions.IsReady(obj) &&
(conditions.GetMessage(oldObj, meta.ReadyCondition) != conditions.GetMessage(obj, meta.ReadyCondition)) {
r.eventLogf(ctx, obj, corev1.EventTypeNormal, ready.Reason, ready.Message)
it tries to only send notification when the result changes. I'll do some more testing to make sure it's handled well.

@darkowlzz darkowlzz force-pushed the v1beta2-refactor branch 2 times, most recently from 03166ea to 2b8520b Compare September 13, 2022 23:49
@darkowlzz
Copy link
Contributor Author

darkowlzz commented Sep 13, 2022

Removed alpha APIs and fixed the fuzz test failure along with some other minor changes to reduce the verbosity of the test logs from badger and test registry server.

Added a rough implementation of latest tags in ImageRepository status:

status:
  canonicalImageName: ghcr.io/stefanprodan/podinfo
  conditions:
  - lastTransitionTime: "2022-09-13T23:23:46Z"
    message: successful scan, found 34 tags
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: Ready
  lastScanResult:
    latestTags: latest, 6.2.0, 6.1.8, 6.1.7, 6.1.6, 6.1.5, 6.1.4, 6.1.3, 6.1.2, 6.1.1
      ...
    scanTime: "2022-09-13T23:48:52Z"
    tagCount: 34
  observedExclusionList:
  - ^.*\.sig$
  observedGeneration: 1

Looks like we may have to remove latest from it.

Also added previous tag in the new ImagePolicy version events to address fluxcd/image-automation-controller#437 by including previous tag in the update event.
When it's first policy apply, it results in event:

Latest image tag for 'ghcr.io/stefanprodan/podinfo' resolved to: 5.2.0

When there's an update, it results in event:

Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.2.0 to 5.2.1

This requires introducing a new field in ImagePolicy, observedPreviousImage to keep track of the previous image in order to prevent duplicate events on subsequent reconciliation without any actual update.

status:
  conditions:
  - lastTransitionTime: "2022-09-13T23:03:29Z"
    message: Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.2.0
      to 5.2.1
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: Ready
  latestImage: ghcr.io/stefanprodan/podinfo:5.2.1
  observedGeneration: 1
  observedPreviousImage: ghcr.io/stefanprodan/podinfo:5.2.0

These are implemented in the last commit. Need some feedback and help in improving the messaging.
I still need to test them more and write some more unit tests.

@stefanprodan
Copy link
Member

Looks like we may have to remove latest from it.

No we don't, we just trim the list to max 10 items without altering them.

Need some feedback and help in improving the messaging.

I suggest we do:

Latest image tag for 'ghcr.io/stefanprodan/podinfo' resolved to 5.2.0

Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.2.0 to 5.2.1

@tun0
Copy link

tun0 commented Sep 14, 2022

Fix: fluxcd/image-automation-controller#437

I don't see how this would enhance the data available through https://pkg.go.dev/github.com/fluxcd/image-automation-controller/pkg/update? But, I could very well be overlooking something given my very limited Go knowledge 😉

@stefanprodan
Copy link
Member

@tun0 you would setup alerting for ImagePolicy objects, and those will tell you "Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.2.0 to 5.2.1", then the ImageAutomation alert will tell you "Image 'ghcr.io/stefanprodan/podinfo' updated from 5.2.1".

@tun0
Copy link

tun0 commented Sep 14, 2022

The ImagePolicy stuff doesn't have any workload specific information though. The policy could've gone from 5.2.0 to 5.2.1, which won't say anything about a workload that was still on 5.1.0 for some reason. Perhaps it's a new resource that was just added to git with an outdated tag. Ideally I'd be able to see detailed version transitions per workload, not per policy. In the end, the people looking these notifications, in our case at least, care about workloads, not image policies.

@stefanprodan
Copy link
Member

OK so that issue is wrongly placed here, @darkowlzz please move it or create a new one in IAC.

@darkowlzz
Copy link
Contributor Author

Okay, my bad. I moved it to image-reflector yesterday because of the misunderstanding. I'll move it to IAC again and add some more context around it. Thanks @tun0 for the clarification.

@darkowlzz
Copy link
Contributor Author

darkowlzz commented Sep 14, 2022

Did some polishing of the latest tags and previous - new tag implementation with tests. Everything looked fine on some testing.
I'll do some more testing while writing spec docs next.

I need some more help with the emitted events for ImageRepository. A new ImageRepository which is reconciled for the first time emits two events:

29s         Normal   Succeeded   imagerepository/podinfo   successful scan, next scan in 5m0s
29s         Normal   Succeeded   imagerepository/podinfo   successful scan, found 33 tags

The first event is only a K8s event and it's also logged to provide user a signal that reconciliation took place and when the next scan would take place. Since the scanning is done only at specific periods, I think it's useful to log when it'll actually scan whenever a reconciliation runs.
The second event is a K8s event and a notification controller alert. This is used to notify when there's an update in the scan result.

The first part of both the events look redundant if one checks the k8s events. Maybe we can change one of them to be different?

Also, when ImageRepository is triggered to reconcile before the scan time, maybe due to an update in the interval value or some other configuration, it does a no-op reconciliation and emits K8s event and logs:

31m         Normal   Succeeded   imagerepository/podinfo   no change in repository configuration since last scan, next scan in 2m41.993960864s

Just highlighting this behavior to get any feedback on this.

@stefanprodan
Copy link
Member

no change in repository configuration since last scan, next scan in 2m

This is confusing to me, what configuration are we referring to? If this is a no-op because the tags in the cache match the ones from upstream, then let’s log only:

no new tags found since last scan, next scan in 2m

@darkowlzz
Copy link
Contributor Author

This is confusing to me, what configuration are we referring to?

It's referring to the image URL and the exclusion list at the moment. When they are changed, it performs a scan even if it's not time for a scan based on previous results and resets the scan timer.

Regarding the no-op reconciliation, saying no new tags found since last scan doesn't seem right because the no-op is because it's not scan time, not because of no new tags.

@stefanprodan
Copy link
Member

successful scan, found 33 tags

What happens at the next scan if the same tags are fetched, do we still issue an event or do we log only?

@darkowlzz
Copy link
Contributor Author

What happens at the next scan if the same tags are fetched, do we still issue an event or do we log only?

Then it logs and emits k8s only event: successful scan, next scan in 5m0s .

@stefanprodan
Copy link
Member

Then it logs and emits k8s only event: successful scan, next scan in 5m0s .

Let's make that "no new tags found, next scan in 5m"

@darkowlzz darkowlzz force-pushed the v1beta2-refactor branch 3 times, most recently from 8d98d7f to 77c39d1 Compare September 15, 2022 23:00
@darkowlzz
Copy link
Contributor Author

The code should be ready for review now while I'm still working on the spec docs.
fluxcd/pkg#329 should be ready in its current state. I'll address the remaining comments before marking the PR ready.

Regarding the ImageRepository events discussed above, now it only emits single event per reconciliation with different messages.
When a new repository is created and it scans successfully:

successful scan, found 211 tags

This is also sent to the notification controller.

On subsequent reconciliation at the scan interval with no new tags:

no new tags found, next scan in 1h0m0s

This is not sent to the notification controller. It's only logged and recorded as k8s event.

On reconciliation triggered before the scan interval without any configuration change that affect the scan result:

no change in repository configuration since last scan, next scan in 58m39.516114621s

This is also not sent to the notification controller. It's only logged and recorded as k8s event.

@darkowlzz darkowlzz marked this pull request as ready for review September 15, 2022 23:17
docs/spec/v1beta2/imagerepositories.md Outdated Show resolved Hide resolved
docs/spec/v1beta2/imagerepositories.md Outdated Show resolved Hide resolved
Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add validation to all v1.Duration fields, example here: #314

@darkowlzz
Copy link
Contributor Author

darkowlzz commented Feb 7, 2023

I added a getter for the Exclusion List that's in the ImageRepository API so that we avoid the issues we had with notification-controller new API fields with defaults and did some upgrade testing to see if anything breaks. Sharing the details below so that we can observe and discuss about it.

Manual upgrade test

With the current latest version of the controller and API, I created an ImageRepository and an ImagePolicy as below:

---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageRepository
metadata:
  name: podinfo
  namespace: default
spec:
  image: ghcr.io/stefanprodan/podinfo
  interval: 10m
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImagePolicy
metadata:
  name: podinfo
  namespace: default
spec:
  imageRepositoryRef:
    name: podinfo
  policy:
    semver:
      range: 5.0.x

This resulted in

apiVersion: image.toolkit.fluxcd.io/v1beta1 
kind: ImageRepository                                                                 
...
spec:                                                                                                                                                                        
  image: ghcr.io/stefanprodan/podinfo
  interval: 10m                                                                                                                                                              
status:                                                                               
  canonicalImageName: ghcr.io/stefanprodan/podinfo 
  conditions:
  - lastTransitionTime: "2023-02-07T18:15:54Z"                                                                                                                               
    message: successful scan, found 41 tags                 
    reason: ReconciliationSucceeded
    status: "True"                                                                    
    type: Ready
  lastScanResult:
    scanTime: "2023-02-07T18:15:54Z"
    tagCount: 41
  observedGeneration: 1
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImagePolicy
...
spec:
  imageRepositoryRef:
    name: podinfo
  policy:
    semver:
      range: 5.0.x
status:
  conditions:
  - lastTransitionTime: "2023-02-07T18:16:38Z"
    message: 'Latest image tag for ''ghcr.io/stefanprodan/podinfo'' resolved to: 5.0.3'
    reason: ReconciliationSucceeded
    status: "True"
    type: Ready
  latestImage: ghcr.io/stefanprodan/podinfo:5.0.3
  observedGeneration: 1

Updated the API and the controller, not the object manifests themselves, the objects became:

apiVersion: image.toolkit.fluxcd.io/v1beta2                                                    
kind: ImageRepository                                                                 
...
spec:                
  exclusionList:  
  - ^.*\.sig$    
  image: ghcr.io/stefanprodan/podinfo
  interval: 10m
  provider: generic
status:    
  canonicalImageName: ghcr.io/stefanprodan/podinfo                                                                     
  conditions:
  - lastTransitionTime: "2023-02-07T19:18:01Z"                                                                         
    message: 'successful scan: found 41 tags'                                                                          
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: Ready                                
  lastScanResult:                              
    latestTags:                                
    - latest                                   
    - 6.3.3                                    
    - 6.3.2                                                
    - 6.3.1                                                                                                                                                                                   
    - 6.3.0                                    
    - 6.2.3                                                
    - 6.2.2                                                
    - 6.2.1                                                
    - 6.2.0                                                
    - 6.1.8                                                
    scanTime: "2023-02-07T19:18:01Z"                       
    tagCount: 41                                           
  observedExclusionList:                                   
  - ^.*\.sig$                                              
  observedGeneration: 1
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
...
spec:
  imageRepositoryRef:
    name: podinfo
  policy:
    semver:
      range: 5.0.x
status:
  conditions:
  - lastTransitionTime: "2023-02-07T19:18:01Z"
    message: Latest image tag for 'ghcr.io/stefanprodan/podinfo' resolved to 5.0.3
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: Ready
  latestImage: ghcr.io/stefanprodan/podinfo:5.0.3
  observedGeneration: 1

NOTE: Both the objects got converted to v1beta2 and some new default fields were added. In case of autologin, the user will have to make sure that the automatically populated .spec.provider in ImageRepository has the appropriate value for their provider.

Updated the ImagePolicy semver range to see some new status fields:

apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
...
spec:
  imageRepositoryRef:
    name: podinfo
  policy:
    semver:
      range: 6.x
status:
  conditions:
  - lastTransitionTime: "2023-02-07T19:22:15Z"
    message: Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.0.3
      to 6.3.3
    observedGeneration: 2
    reason: Succeeded
    status: "True"
    type: Ready
  latestImage: ghcr.io/stefanprodan/podinfo:6.3.3
  observedGeneration: 2
  observedPreviousImage: ghcr.io/stefanprodan/podinfo:5.0.3

Previous image is being tracked in the status now which is used to construct update event message that shows the progression of the versions.

- Removes SetImageRepositoryReadiness() and SetImagePolicyReadiness().
- Adds GetConditions() and SetConditions() to make the image API objects
  conditions.Setter.
- Adds a default value in the ImageRepository.Spec.ExclusionList to
  exclude .sig artifacts and limits the number of max items in the list
  to 25 with API validation.
- Adds GetExclusionList() for ExclusionList to get the exclusion list
  with default when empty.
- Adds ImageRepository.Status.ObservedExclusionList to record the
  observed ExclusionList.

Signed-off-by: Sunny <[email protected]>
The controllers package is filled with a lot of unrelated code.
Organizing them by moving them out of controllers would help manage them
and the controllers better.

- Move the secrets related code to internal/secret package along with
  the secret related tests. These tests have testdata which are also
  moved to the new package.
- Test registry server tests are moved to internal/test package.
- tlsserver test server and tests are moved to internal/test package.

Signed-off-by: Sunny <[email protected]>
Refactor the two reconcilers to add support for kstatus based status
conditions and break down the reconciler into smaller easily testable
units.
It also fixes a lot of hidden bugs in the behavior of the reconcilers.

A few of the fixed issues and change in behavior are:
- Avoid writing to the ImageRepository spec for adding default exclusion
  list values. The default is now set via the CRD API.
- Bug in the exclusion list filter. Multiple patterns in the list didn't
  filter the tags for all the patterns.
- Updating image or the exclusion list of ImageRepository didn't force
  scan. It waited for the scan period before scanning. Now, if they are
  updated, scan is performed immediately.
- For an existing ImageRepository, if the image is changed, the old
  scan result is now deleted as that no longer belongs to the new
  image.

Also, silence badger and test registry server logs.

Signed-off-by: Sunny <[email protected]>
Introduce new field in ImagePolicy status, observedPreviousImage,
to store the previous image. This is needed to show image update
version change in alerts.

Introduce new field in ImageRepository status, latestTags, to store the
10 recent tags. This would help users debug image policy better.

Signed-off-by: Sunny <[email protected]>
In alignment with OCIRepository API in source-controller, introduce per
object spec.provider field for setting the contextual login  provider
option. Deprecate the autologin flags and error log when the flags are
used.

WARNING: This is a breaking change for the existing users with the flag
set.

Signed-off-by: Sunny <[email protected]>
- Update the image API go import.
- Update the ImageRepository in tests to use spec.provider for setting
  cloud provider option.
- Update test set up to use the CRDs from the current repo, patching the
  image CRDs downloaded along with the flux install manifest.

Signed-off-by: Sunny <[email protected]>
@tun0
Copy link

tun0 commented Feb 9, 2023

  • Introduce observed previous image in ImagePolicy status to keep track of the previous image so that it can provide update path information.

The way I read this, is that this is about latest from the point-of-view of the policy. Our existing Flux v1 notifications show the "actual" previous image, as in: what was actually configured in git for the specific workload.

The difference is subtle, but especially for new workloads it could be confusing:

  • previousTag: foo
  • newTag: bar
  • create new workload X with "ancient" tag: qux
    This would trigger a notification for X mentioning updating from foo to bar, whereas it should actually say from qux to bar.

I understand this is a bit of a corner-case, but on the other hand I also kinda treat this as feature parity with Flux v1.

@darkowlzz
Copy link
Contributor Author

  • create new workload X with "ancient" tag: qux
    This would trigger a notification for X mentioning updating from foo to bar, whereas it should actually say from qux to bar.

@tun0 Sounds like a confusion we had before, earlier in the thread. This notification is only for the image policy, independent of the workload. We'll add similar notification in image-automation-controller that'll be specific to the workload and would show the previously used tag that was in git and the new tag.

@tun0
Copy link

tun0 commented Feb 9, 2023

  • create new workload X with "ancient" tag: qux
    This would trigger a notification for X mentioning updating from foo to bar, whereas it should actually say from qux to bar.

@tun0 Sounds like a confusion we had before, earlier in the thread. This notification is only for the image policy, independent of the workload. We'll add similar notification in image-automation-controller that'll be specific to the workload and would show the previously used tag that was in git and the new tag.

Ah, right. Makes sense!

- ImageRepository spec doc
- ImagePolicy spec doc

Signed-off-by: Sunny <[email protected]>
Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @darkowlzz 🥇

Copy link
Member

@somtochiama somtochiama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I tested this on GKE with GCR and Artifact Registry and it worked fine.
Thank you @darkowlzz 🌹

@tun0
Copy link

tun0 commented Sep 1, 2023

We'll add similar notification in image-automation-controller that'll be specific to the workload and would show the previously used tag that was in git and the new tag.

What's the status of this?

@darkowlzz
Copy link
Contributor Author

@tun0 nothing as of now. We got busy with Flux GitOps GA which included GitRepository, Kustomization and Receiver API only. image-automation-controller was not a part of it. Since we are over that now, the other controllers will get more attention gradually. So, it's one of the next things, but there's no set plans for it yet. Since you're interested, we can request you to try it as we develop that in the coming months.

@tun0
Copy link

tun0 commented Sep 1, 2023

@darkowlzz: Thanks for the update. Looking forward to the request 😉 I'll gladly assist where possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Roll out new unified standards to Image Reflector Controller Limit scanning success events to be less spammy
6 participants