Improve observability and error reporting in Flux #2812

2opremio · 2020-02-03T13:02:23Z

Users have been requesting better observability and error reporting in Flux for a while.

We should improve that situation, letting users monitor the state of Flux and diagnose any problems easily.

Right now we provide a few alternatives which aren't 100% satisfactory and should be improved:

An event API, which was originally aimed at delivering notifications to Weave Cloud's Deploy UI. This API isn't properly documented and no official integrations exist. There is FluxCloud, which is great, but it isn't maintained by us and only really covers the Slack notifications use case. We should consider revamping the API to make integrations easier (e.g. using WebSub) or, at the very least, document it. Related issues: Report errors at kubernetes level #2695
Logs. Very often users end up needing to grep logs in order to know what's happening. This is hairy and often confusing due to the quality of the errors. Ideally, users shouldn't need to grep logs. And, if they do end up needing to, the error messages should be clear (right now they aren't in many situations). Related issues: Internal error: git repo is not configured #874
Metrics. Flux does provide some Prometheus metrics, as documented at https://docs.fluxcd.io/en/1.17.1/references/monitoring.html . However, those metrics are not sufficient to diagnose problems and create alarms, and we don't provide a dashboard for them. Related issues: Provide Grafana Dashboard with Flux Prometheus metrics #2792, Export image data as Prometheus metrics #2793, Create metric for flux manifest errors #2199
On top of that:
The errors reported by fluxctl are sometimes not very intuitive (this happens transitively, since in many cases it gets the same errors fluxd prints in the logs). Related issues: Improve error message of fluxctl release: 'Error: no changes made in repo' #2839
We should give better errors and warnings when users write inconsistent update annotations. This is particularly confusing for HelmRelease workloads. Related: Give feedback on incorrect annotations #2354 (comment)
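
To illustrate the kind of metric that would make alarms possible (see #2199), here is a minimal sketch of a manifest error counter rendered in the Prometheus text exposition format. The metric name `flux_manifest_errors_total`, the `path` label and the `render_counter` helper are hypothetical, chosen only for illustration. They are not metrics or functions that Flux provides.

```python
from collections import Counter

# Hypothetical metric name; not an existing Flux metric.
METRIC = "flux_manifest_errors_total"


def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    samples maps a manifest path to the number of errors seen for it.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for path, value in sorted(samples.items()):
        # Label values must escape backslashes, double quotes and newlines.
        escaped = (
            path.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        lines.append(f'{name}{{path="{escaped}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical sample data: errors counted per manifest file.
errors = Counter()
errors["releases/myapp.yaml"] += 2
errors["releases/other.yaml"] += 1

print(render_counter(METRIC, "Manifest errors seen during sync.", errors), end="")
```

A counter with a per-file label like this would let an alert fire on any increase and point directly at the offending manifest, instead of requiring a log search.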


supra08 · 2020-02-12T01:12:37Z

Hi @2opremio, I am looking forward to contributing to Flux and would like to participate in GSoC 2020. How can I get started on this issue?

2opremio · 2020-02-12T16:26:11Z

Hi @supra08! Thanks a lot for the interest. I would start by reading the issues and the text above (the issue list is not exhaustive so you may want to dive through the other flux issues), using Flux, reproducing the problems mentioned and thinking about a strategy to improve things.

I don't know the logistic details of GSoC. I would think that the project needs to be accepted first? @dholbach will probably know better.

dholbach · 2020-02-12T16:41:10Z

Yes: https://summerofcode.withgoogle.com/how-it-works/#timeline

omkarprabhu-98 · 2020-03-12T09:55:06Z

@hiddeco @2opremio

For logging, would creating a central log be a good option, e.g. using fluentd to collect logs, storing them in Elasticsearch and providing an interface through Kibana?
A Grafana dashboard with Prometheus would allow displaying metrics. Which other metrics would need to be captured to indicate a problem?
To look into improving error reporting, is there a list of possible errors that is already defined?
For the event API, is it about broadcasting events, like when a commit is pushed, when a sync starts, etc.?

c4m4 · 2020-03-28T08:00:38Z

I am using fluxcd+helm-operator with fluxcloud. When I deploy a container, the helmrelease goes into a failed state with reason: HelmUpgradeFailed, even though I have the attribute wait: true in the helmrelease yaml file.

Fluxcd always reports to fluxcloud "result":{"default:helmrelease/myapp":{"Status":"success"

It's almost impossible to work with failure notifications this way.

fluxcd executes kubectl -f release/myapp.yaml without checking any result of the deployed helmrelease.
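
A minimal sketch of the check being asked for here: report success only when the release's own status says so. The status layout below (a `Released` condition with `status` and `reason` fields) is an assumption made for illustration, and `notification_status` is a hypothetical helper, not part of Flux, helm-operator or fluxcloud.

```python
def notification_status(helm_release, condition_type="Released"):
    """Derive a notification status from a HelmRelease-like object.

    The condition layout is assumed for illustration; adjust it to the
    actual status fields of the resource being inspected.
    """
    conditions = helm_release.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") != condition_type:
            continue
        if cond.get("status") == "True":
            return "success"
        # The condition exists but is not true: surface the reason.
        return f"failed: {cond.get('reason', 'unknown reason')}"
    # No matching condition observed yet, so do not claim success.
    return "unknown"


release = {
    "status": {
        "conditions": [
            {"type": "Released", "status": "False", "reason": "HelmUpgradeFailed"},
        ]
    }
}
print(notification_status(release))
```

The point of the sketch is the ordering: the notification is derived after inspecting the resource status, rather than from the fact that the manifest was applied without error.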

RichiCoder1 · 2020-03-29T19:53:07Z

Regarding events, Kube events also have a pretty decent ecosystem of tools now that can monitor them, like Kube watch and Argo Events.

Another option could be a webhook setting that shoots off a CloudEvent to a specified endpoint.
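
A rough sketch of what that webhook option could look like, using only the standard library. The envelope follows the required CloudEvents 1.0 attributes (`specversion`, `id`, `source`, `type`). The event type, source, payload fields and both helper functions are hypothetical and do not exist in Flux.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib import request


def build_cloud_event(event_type, source, data):
    """Build a CloudEvents 1.0 envelope as a dict (JSON structured mode)."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "type": event_type,
        "source": source,
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": data,
    }


def send_event(endpoint, event, timeout=5):
    """POST the event to the configured webhook endpoint."""
    req = request.Request(
        endpoint,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/cloudevents+json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.status


event = build_cloud_event(
    "io.fluxcd.sync.failed",   # hypothetical event type
    "/clusters/example/flux",  # hypothetical source
    {"workload": "default:helmrelease/myapp", "error": "HelmUpgradeFailed"},
)
# send_event("https://example.invalid/hook", event) would deliver it;
# here the event is only printed so the sketch runs without a network.
print(json.dumps(event, indent=2))
```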

azelezni · 2020-04-06T14:02:35Z

I was really looking forward to implementing flux in our clusters, but the inability to know what's going on or what went wrong is a real deal breaker.

Log structure is terrible: every log message seems to have a different structure. Some have an info field, some have an error field, some have both err and output. Even with logstash+elasticsearch, parsing the logs is a nightmare.
The errors in the logs aren't very informative. Even when there is an error message, it doesn't always specify which file caused the error.
Available metrics don't help a lot; I still need to search the logs to find what's wrong.
No support for notifications of any kind (not even a simple webhook).

:(
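
To make the log structure complaint concrete, here is a minimal sketch of what a consistent schema could look like: every record carries the same keys, and the error and the offending file are always in the same place. The field names (`ts`, `level`, `component`, `msg`, `err`, `manifest`) and the sample messages are assumptions for illustration, not Flux's actual log format.

```python
import json
import logging
from datetime import datetime, timezone


class JSONFormatter(logging.Formatter):
    """Emit every record as JSON with the same set of keys."""

    def format(self, record):
        entry = {
            "ts": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "component": record.name,
            "msg": record.getMessage(),
            # Always present (null when not applicable) so parsers see one schema.
            "err": getattr(record, "err", None),
            "manifest": getattr(record, "manifest", None),
        }
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
log = logging.getLogger("sync")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

log.info("sync started")
log.error(
    "manifest could not be parsed",
    extra={"err": "invalid YAML", "manifest": "releases/myapp.yaml"},
)
```

With a fixed schema like this, a logstash or Elasticsearch pipeline needs a single JSON parser and can filter on `err` or `manifest` directly.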

ArthurSens · 2020-07-08T00:12:42Z

Hello everyone.
As shown at the link above, I've asked for some guidance and best practices when implementing new metrics with the CNCF-SIG Observability team.

I'm trying to write a proposal for new metrics, following the guidance given on that issue, which you can see here.

Any feedback, from users or maintainers, on the proposed metrics would be awesome. If anyone would like to add anything (especially use cases), that would be great too!

kingdonb · 2021-02-26T20:06:38Z

Observability has been a primary goal of Flux v2's redesign efforts. All of the controllers now implement Prometheus metrics around their various reconciliation topics, and expose reconciliation status details as events on their respective APIs' CRDs.

Flux v1 is in maintenance mode now, and is not adding any new features unless they are critical.

As Flux contribution efforts have been focused on Flux v2, the Flux project has moved to a new repo, fluxcd/flux2.

In the interest of reducing the number of open issues not directly related to supporting Flux v1 in maintenance mode, and respecting you may have moved on already, I will go ahead and close out this issue for now.

2opremio added the ☂️ umbrella issue, help wanted, and enhancement labels on Feb 3, 2020

sa-spag mentioned this issue Feb 11, 2020

Improve and/or extend available Prometheus metrics fluxcd/helm-operator#281

Closed

cmanzi mentioned this issue Feb 13, 2020

Improve logging #2851

Closed

2opremio mentioned this issue Feb 20, 2020

Flux aborts synchronization on manifest syntax errors #2861

Closed

prometherion mentioned this issue Mar 2, 2020

Field "ts" in logs is a string not a timestamp #2820

Closed

ArthurSens mentioned this issue Jul 6, 2020

Guidelines for developers on how to implement new metrics cncf/tag-observability#18

Open

kingdonb closed this as completed Feb 26, 2021
