Improve observability and error reporting in Flux #2812
Comments
Hi @2opremio, I am looking forward to contributing to Flux and would like to participate in GSoC 2020. How can I get started on this issue? |
Hi @supra08! Thanks a lot for the interest. I would start by reading the issues and the text above (the issue list is not exhaustive so you may want to dive through the other flux issues), using Flux, reproducing the problems mentioned and thinking about a strategy to improve things. I don't know the logistic details of GSoC. I would think that the project needs to be accepted first? @dholbach will probably know better. |
I am using fluxcd + helm-operator with fluxcloud. I deploy a container and the HelmRelease goes into a failed state with reason: HelmUpgradeFailed, even though the HelmRelease YAML file sets wait: true. Fluxcd always reports to fluxcloud "result":{"default:helmrelease/myapp":{"Status":"success" It's almost impossible to work with failure notifications this way: fluxcd executes kubectl -f release/myapp.yaml without checking the result of the deployed HelmRelease. |
Regarding events: Kube events also have a pretty decent ecosystem of tools now that can monitor them, like Kube watch and Argo Events. Another option could be a webhook setting that shoots off a CloudEvent to a specified endpoint. |
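To make the webhook idea above concrete, here is a minimal sketch in Go of posting a CloudEvent in structured JSON mode. It is not something Flux provides: the `SyncResult` payload, the `source` and `type` values, and the function names are all hypothetical, and a local test server stands in for the receiving endpoint. Only the `specversion`, `id`, `source` and `type` attributes and the `application/cloudevents+json` content type come from the CloudEvents v1.0 specification.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// CloudEvent carries the four required CloudEvents v1.0 context
// attributes plus an optional JSON payload.
type CloudEvent struct {
	SpecVersion     string          `json:"specversion"`
	ID              string          `json:"id"`
	Source          string          `json:"source"`
	Type            string          `json:"type"`
	DataContentType string          `json:"datacontenttype,omitempty"`
	Data            json.RawMessage `json:"data,omitempty"`
}

// SyncResult is a hypothetical payload describing one resource's outcome.
type SyncResult struct {
	Resource string `json:"resource"`
	Status   string `json:"status"`
	Error    string `json:"error,omitempty"`
}

// newSyncEvent wraps a SyncResult in a CloudEvent.
// The source and type values are illustrative assumptions.
func newSyncEvent(id string, res SyncResult) (CloudEvent, error) {
	data, err := json.Marshal(res)
	if err != nil {
		return CloudEvent{}, fmt.Errorf("encoding sync result: %w", err)
	}
	return CloudEvent{
		SpecVersion:     "1.0",
		ID:              id,
		Source:          "/flux/sync",           // hypothetical
		Type:            "io.example.flux.sync", // hypothetical
		DataContentType: "application/json",
		Data:            data,
	}, nil
}

// postEvent sends the event to url in structured content mode and
// treats any non-2xx response as a failure.
func postEvent(client *http.Client, url string, ev CloudEvent) error {
	body, err := json.Marshal(ev)
	if err != nil {
		return fmt.Errorf("encoding event: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return fmt.Errorf("building request: %w", err)
	}
	req.Header.Set("Content-Type", "application/cloudevents+json")
	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("posting event: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("receiver returned %s", resp.Status)
	}
	return nil
}

func main() {
	// A local test server stands in for the user-configured endpoint.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Println("receiver got Content-Type:", r.Header.Get("Content-Type"))
		w.WriteHeader(http.StatusAccepted)
	}))
	defer srv.Close()

	ev, err := newSyncEvent("example-1", SyncResult{
		Resource: "default:helmrelease/myapp",
		Status:   "failed",
		Error:    "HelmUpgradeFailed",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if err := postEvent(srv.Client(), srv.URL, ev); err != nil {
		fmt.Println("error:", err)
	}
}
```

Carrying the failure reason in the payload is the point of the sketch: a receiver could then alert on `status` instead of trusting a blanket "success".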
I was really looking forward to implementing flux in our clusters, but the inability to know what's going on or what went wrong is a real deal breaker.
:( |
Hello everyone. I'm trying to write a proposal for new metrics, following what was described in that issue, which you can see here. Any feedback, from users or maintainers, on the proposed metrics would be awesome. If anyone would like to add anything (especially use cases), that would be great too! |
Observability has been a primary goal of Flux v2's redesign efforts. All of the controllers now implement Prometheus metrics around their various reconciliation topics, and expose reconciliation status details as events on their respective APIs' CRDs. Flux v1 is in maintenance mode now, and is not adding any new features unless they are critical. As Flux contrib efforts have been focused on Flux v2, the Flux project has moved to a new repo, fluxcd/flux2. In the interest of reducing the number of open issues not directly related to supporting Flux v1 in maintenance mode, and respecting that you may have moved on already, I will go ahead and close out this issue for now. |
Users have been requesting better observability and error reporting in Flux for a while.
We should improve that situation, letting users monitor the state of Flux and diagnose any problems easily.
Right now we provide a few alternatives which aren't 100% satisfactory and should be improved:
An event API, which was originally aimed at getting notifications on Weave Cloud's Deploy UI. This API isn't properly documented and no official integrations exist. There is FluxCloud, which is great, but it isn't maintained by us and only really covers the Slack notifications use case. We should consider revamping the API to make integrations easier (e.g. using WebSub) or, at the very least, document it. Related issues: Report errors at kubernetes level #2695
Logs. Very often users end up needing to grep logs in order to know what's happening. This is hairy and often confusing due to the quality of errors. Users ideally shouldn't need to grep logs. And, if they end up needing to, the error messages should be clear (right now they aren't in many situations). Related issues: Internal error: git repo is not configured #874
Metrics. Flux does provide some Prometheus metrics as documented at https://docs.fluxcd.io/en/1.17.1/references/monitoring.html . However, those metrics are not sufficient to diagnose problems and create alarms, and we don't provide a dashboard for them. Related issues: Provide Grafana Dashboard with Flux Prometheus metrics #2792, Export image data as Prometheus metrics #2793, Create metric for flux manifest errors #2199
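As an illustration of the kind of metric requested in #2199, here is a minimal sketch in Go of a per-file manifest error counter rendered in the Prometheus text exposition format. It uses only the standard library to stay self-contained; a real implementation would more likely use a Prometheus client library. The metric name `flux_manifest_errors_total`, the `source` label and the type names are hypothetical, not existing Flux metrics.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// errorCounter counts manifest errors per source file.
type errorCounter struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newErrorCounter() *errorCounter {
	return &errorCounter{counts: make(map[string]uint64)}
}

// Inc records one manifest error for the given source file.
func (c *errorCounter) Inc(source string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[source]++
}

// Render returns the counter in the Prometheus text exposition format,
// one sample per source file, sorted for stable output.
func (c *errorCounter) Render() string {
	c.mu.Lock()
	defer c.mu.Unlock()

	sources := make([]string, 0, len(c.counts))
	for s := range c.counts {
		sources = append(sources, s)
	}
	sort.Strings(sources)

	var b strings.Builder
	// Hypothetical metric name; Flux does not expose this today.
	b.WriteString("# HELP flux_manifest_errors_total Manifest errors seen during sync.\n")
	b.WriteString("# TYPE flux_manifest_errors_total counter\n")
	for _, s := range sources {
		// %q quoting is adequate for plain ASCII paths; other input
		// would need the escaping rules of the exposition format.
		fmt.Fprintf(&b, "flux_manifest_errors_total{source=%q} %d\n", s, c.counts[s])
	}
	return b.String()
}

func main() {
	c := newErrorCounter()
	c.Inc("releases/myapp.yaml")
	c.Inc("releases/myapp.yaml")
	c.Inc("namespaces/dev.yaml")
	fmt.Print(c.Render())
}
```

A counter labelled by file would let an alert fire on any increase and point directly at the offending manifest, instead of requiring a grep through the logs.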
On top of that:
The errors reported by fluxctl are sometimes not very intuitive (this happens transitively, since in many cases it gets the same errors fluxd prints in the logs). Related issues: Improve error message of fluxctl release: 'Error: no changes made in repo' #2839
We should give better errors and warnings when users write inconsistent update annotations. This is particularly confusing for HelmRelease workloads. Related: Give feedback on incorrect annotations #2354 (comment)