Image Update Automation stops working when git clone takes a long time #296

Open
1 task done
Tracked by #2593
martinzellner opened this issue Jan 19, 2022 · 9 comments

@martinzellner

Describe the bug

For our git repo with a lot of commits, the image update automation stopped working.
Because increasing the timeouts for the git clone [1] did not help, we decided to squash a large part of the Git history, which dramatically reduced the time to clone the repo and also fixed the image update automation. Of course, this comes at the cost of losing the commit history.

Therefore we kindly ask if it would be possible to enable the use of git's shallow clone functionality [2], which would enable faster cloning without having to squash the git history.

[1] https://fluxcd.io/docs/components/source/api/#source.toolkit.fluxcd.io/v1beta1.GitRepository
[2] https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---depthltdepthgt

Steps to reproduce

  1. Install Flux with image update automation
  2. Increase the git history to a significant size such that a full git clone takes > 8 seconds
  3. Observe image automation failing
  4. Squash the commits in the git repository to achieve shorter cloning times
  5. Observe image automation working again
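
To check whether a repository falls into the range described in step 2, the clone time can be measured directly and compared against a shallow clone. The following is only a minimal sketch, assuming the `git` CLI is installed; the function names are hypothetical, and this is not how Flux itself clones repositories.

```python
import subprocess
import time
from typing import List, Optional


def build_clone_cmd(url: str, dest: str, depth: Optional[int] = None) -> List[str]:
    """Build a `git clone` command; pass depth=1 for a shallow clone."""
    cmd = ["git", "clone"]
    if depth is not None:
        # --depth truncates the history to the given number of commits.
        cmd += ["--depth", str(depth)]
    return cmd + [url, dest]


def time_command(cmd: List[str]) -> float:
    """Run cmd to completion and return the elapsed wall-clock seconds."""
    start = time.monotonic()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.monotonic() - start
```

For example, `time_command(build_clone_cmd(url, "/tmp/full"))` versus `time_command(build_clone_cmd(url, "/tmp/shallow", depth=1))`, where `url` and both target directories are placeholders, gives a rough idea of how much of the clone time is due to history size.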

Expected behavior

We would like Flux to use Git's shallow clone functionality [1].

[1] https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---depthltdepthgt

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

N/A

Flux check

N/A

Git provider

Bitbucket

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@stefanprodan stefanprodan transferred this issue from fluxcd/flux2 Jan 19, 2022
@squaremo
Member

Using shallow clones makes it difficult or impossible to switch branches when someone specifies a "push branch"; see #177 and its motivation, #164, for background on that. It's still unclear to me whether this is a limitation or a bug in the go-git library, and whether it can be worked around. Back when I wrote that PR I concluded it was expedient not to bother with the complication of shallow clones :-/

I'm surprised that it only needs latency of >8s to fail. That suggests there's something else going on here.

3. Observe image automation failing

Failing how? Does it fail to clone the repo but continue running? Does it crash or stall? Or something else?

@uqix

uqix commented Jan 20, 2022

Similar situation here:
after this log line,

{
  "level": "error",
  "ts": "2022-01-19T13:01:34.563Z",
  "logger": "controller.imageupdateautomation",
  "msg": "Reconciler error",
  "reconciler group": "image.toolkit.fluxcd.io",
  "reconciler kind": "ImageUpdateAutomation",
  "name": "image-update-automation",
  "namespace": "demo",
  "error": "unable to clone: failed to connect to some.gitlab.com: Connection timed out"
}

the image-automation-controller pod silently stopped working (it keeps running, but produces no further logs or reconciliations), while the corresponding ImagePolicy still got the right LatestImage.

FYI, we use the default timeout settings everywhere in our Flux resources. Our GitLab instance does have performance problems now and then, which could cause the clone to time out.

@hiddeco
Member

hiddeco commented Jan 20, 2022

Would one of you be able to try out the image from #297?

This brings the timeout logic back to the shape it was in around the time of the 0.15.x release (which pretty much amounts to none), while still including some of the other Git-related improvements.

If this resolves the issue, we need to have another look at how libgit2 reacts to the (cancelling) callbacks.

@hiddeco
Member

hiddeco commented Feb 7, 2022

With the release of Flux v0.26.2, we would like to kindly ask folks with issues to update to the latest image releases. Since we changed our build process around libgit2 for the source-controller and image-automation-controller, we have observed that some of the issues described here have vanished (as confirmed by others in fluxcd/source-controller#439 (comment)).

@pjbgf
Member

pjbgf commented Mar 22, 2022

@martinzellner thank you very much for reporting this. Today we are releasing version v0.21.0 which introduces an experimental transport that should fix the issue in which the controller stops working in some scenarios.

The experimental transport needs to be opted-in by setting the environment variable EXPERIMENTAL_GIT_TRANSPORT to true in the controller's Deployment.
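
As a rough illustration of what that opt-in could look like, the sketch below builds a strategic merge patch that adds the variable to the controller's Deployment. The container name `manager` is an assumption about the default install, and the helper name is hypothetical; check your own Deployment before applying anything like this.

```python
import json


def experimental_transport_patch(container: str = "manager") -> dict:
    """Build a strategic merge patch enabling the experimental transport.

    The container name is an assumption; it must match the container
    in the controller's Deployment for the patch to merge correctly.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container,
                            "env": [
                                # Kubernetes env values must be strings.
                                {"name": "EXPERIMENTAL_GIT_TRANSPORT", "value": "true"}
                            ],
                        }
                    ]
                }
            }
        }
    }


# Print the patch as JSON so it can be saved to a file or reviewed.
print(json.dumps(experimental_transport_patch(), indent=2))
```

If the Flux components are themselves managed from Git (for example after a bootstrap), a direct change to the Deployment is likely to be reverted on the next reconciliation, so the same patch would need to live in the repository instead.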

Could you test it again using v0.21.0 with the experimental transport enabled, and let us know how you get on?

@pjbgf pjbgf added this to the GA milestone May 27, 2022
@pjbgf
Member

pjbgf commented May 27, 2022

This should be fixed as part of Managed Transport being made the default. The latest release candidates with these changes:

ghcr.io/fluxcd/source-controller:rc-4b3e0f9a
ghcr.io/fluxcd/image-automation-controller:rc-48bcca59

Closing for lack of activity - happy to reopen in case others report recurrence.

@pjbgf pjbgf closed this as completed May 27, 2022
@steviestainz

I believe that this issue is still present in v0.32.0 as part of the 2.0.0 release candidate. We have a large kustomize tree in a ~72MB git repository source (when zipped) containing a total of 24 image policies referred to by policy markers in a single kustomization.yaml.

The ImagePolicies are matching the correct tag to be applied. Whenever the ImageUpdateAutomation runs, the controller downloads the source branch, unpacks these files into /tmp/somefolder/ (82MB on disk), and quickly deletes the files when the total usage within this folder is approaching ~100MB.

Because my target branch is different from the source, I think it then downloads the latest commit for comparison, but then also deletes those files immediately.

Finally the logs indicate that no changes were applied.

2023-04-26T11:46:24.139Z debug ImageUpdateAutomation/bitbucket-hotfix-dev.flux-system - fetching git repository
2023-04-26T11:46:24.139Z debug ImageUpdateAutomation/bitbucket-hotfix-dev.flux-system - attempting to clone git repository
2023-04-26T11:46:37.270Z debug ImageUpdateAutomation/bitbucket-hotfix-dev.flux-system - updating with setters according to image policies
2023-04-26T11:46:37.274Z debug ImageUpdateAutomation/bitbucket-hotfix-dev.flux-system - ran updates to working dir
2023-04-26T11:46:37.274Z info ImageUpdateAutomation/bitbucket-hotfix-dev.flux-system - no changes made in working directory; no commit

Typically the process outlined above takes between 13 and 15s for our repository, but no further log output is generated (note that DEBUG is enabled above).
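
For reference, the split between the clone step and the update step can be read off the log timestamps. This is just a small sketch using the three distinct timestamps from the log excerpt above; the helper name is hypothetical.

```python
from datetime import datetime

# Timestamp layout used in the controller log lines quoted above.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, TS_FORMAT) - datetime.strptime(start, TS_FORMAT)
    return delta.total_seconds()


clone_started = "2023-04-26T11:46:24.139Z"    # "attempting to clone git repository"
setters_started = "2023-04-26T11:46:37.270Z"  # "updating with setters ..."
run_finished = "2023-04-26T11:46:37.274Z"     # "no changes made in working directory"

print(f"clone:  {elapsed_seconds(clone_started, setters_started):.3f}s")  # clone:  13.131s
print(f"update: {elapsed_seconds(setters_started, run_finished):.3f}s")   # update: 0.004s
```

In this run nearly all of the time is spent between the clone attempt and the start of the setter step, and the update step itself finishes within a few milliseconds.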

If I define a different Kustomization tree from a smaller git repo for a subset of the ImagePolicy markers (28MB zip archive) then commits are made successfully to the target branch. Would you consider reopening this please @pjbgf?

@pjbgf pjbgf reopened this Apr 26, 2023
@steviestainz

Thank you for reopening the issue. To validate whether it is the same problem, I tested this against a shallow clone of the original branch, removing some ~50MB of unnecessary files, which reduced the processing time to only 6 seconds. I still see the same behaviour, namely "no changes made in working directory; no commit". Could I just confirm whether using multiple ImagePolicy markers in the same manifest(s) is supported? e.g.

images:
  - name: myregistry/myproject/myapplication-module1 # {"$imagepolicy": "flux-system:myapplication-module1:name"}
    newTag: "1.5.0"                                  # {"$imagepolicy": "flux-system:myapplication-module1:tag"}
  - name: myregistry/myproject/myapplication-module2 # {"$imagepolicy": "flux-system:myapplication-module2:name"}
    newTag: "5.2.0_07"                               # {"$imagepolicy": "flux-system:myapplication-module2:tag"}
  - name: myregistry/myproject/myapplication-module3 # {"$imagepolicy": "flux-system:myapplication-module3:name"}
    newTag: "3.46.2"                                 # {"$imagepolicy": "flux-system:myapplication-module3:tag"}
  - name: myregistry/myproject/myapplication-module4 # {"$imagepolicy": "flux-system:myapplication-module4:name"}
    newTag: "2.2.5"                                  # {"$imagepolicy": "flux-system:myapplication-module4:tag"}

@kingdonb
Member

kingdonb commented Jul 7, 2023

Just want to check in with users who reported here. Are there any users actively tracking this issue who can say if it persists in Flux 2.0.0, which has been released this week?

@steviestainz The multiple imagepolicy markers in the same manifest are definitely supported. But one of your tags looks like a possibly invalid semver. 5.2.0_07 does not parse as semver if I remember correctly... is this build metadata? I guess I just haven't seen this before, I thought that would probably not match with the semver rule and any pattern you might have passed in, but you said all of your policies were matching correctly so that probably isn't it.
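
A quick way to check the tags from the example against the SemVer 2.0.0 grammar is sketched below, using the regular expression suggested on semver.org. Note the assumption: this only tests the published grammar, and the parser Flux actually uses may be more lenient or stricter, so treat the result as a hint rather than a statement about Flux's behaviour.

```python
import re

# Regular expression suggested by semver.org for SemVer 2.0.0
# (anchors omitted because fullmatch() is used below).
SEMVER_RE = re.compile(
    r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?"
)


def is_semver(tag: str) -> bool:
    """Return True if tag is a valid SemVer 2.0.0 version string."""
    return SEMVER_RE.fullmatch(tag) is not None


# Tags taken from the kustomization example earlier in this thread.
for tag in ["1.5.0", "5.2.0_07", "3.46.2", "2.2.5"]:
    print(tag, is_semver(tag))
```

Under this grammar `5.2.0_07` is rejected because `_` is not an allowed separator; build metadata would have to be written with a `+`, as in `5.2.0+07`.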
