This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Improve observability and error reporting in Flux #2812

Closed
2opremio opened this issue Feb 3, 2020 · 9 comments

@2opremio
Contributor

2opremio commented Feb 3, 2020

Users have been requesting better observability and error reporting in Flux for a while.

We should improve that situation, letting users monitor the state of Flux and diagnose any problems easily.

Right now we provide a few alternatives which aren't 100% satisfactory and should be improved:

@supra08

supra08 commented Feb 12, 2020

Hi @2opremio, I am looking forward to contributing to Flux and would like to participate in GSoC 2020. How can I get started on this issue?

@2opremio
Contributor Author

Hi @supra08! Thanks a lot for the interest. I would start by reading the issues and the text above (the issue list is not exhaustive so you may want to dive through the other flux issues), using Flux, reproducing the problems mentioned and thinking about a strategy to improve things.

I don't know the logistical details of GSoC. I would think that the project needs to be accepted first? @dholbach will probably know better.

@dholbach
Member

@omkarprabhu-98

omkarprabhu-98 commented Mar 12, 2020

@hiddeco @2opremio

  1. For logging, would a central log be a good option, for example using fluentd to collect the logs, storing them in Elasticsearch and providing an interface through Kibana?
  2. A Grafana dashboard with Prometheus would allow displaying metrics. Which other metrics would need to be captured to indicate a problem?
  3. To look into improving error reporting, is there a list of possible errors that is already defined?
  4. For the event API, is it about broadcasting events, such as when a commit is pushed and a sync starts?

@c4m4

c4m4 commented Mar 28, 2020

I am using fluxcd+helm-operator with fluxcloud. I deploy a container and the helmrelease ends up in a failed state with reason: HelmUpgradeFailed, even though the helmrelease yaml file sets the attribute wait: true.

Fluxcd always reports to fluxcloud "result":{"default:helmrelease/myapp":{"Status":"success"

It's almost impossible to rely on failure notifications this way.

fluxcd executes kubectl -f release/myapp.yaml without checking the result of the deployed helmrelease.

@RichiCoder1

Regarding events, Kubernetes events now also have a pretty decent ecosystem of tools that can monitor them, like kubewatch and Argo Events.

Another option could be a webhook setting that shoots off a CloudEvent to a specified endpoint.

@azelezni

azelezni commented Apr 6, 2020

I was really looking forward to implementing flux in our clusters, but the inability to know what's going on or what went wrong is a real deal breaker.

  1. Log structure is terrible: every log message seems to have a different structure. Some have an info field, some have an error field, and some have both err and output. Even with logstash+elasticsearch, parsing the logs is a nightmare.
  2. The errors in the logs aren't very informative. Even if there is an error message, it doesn't always specify what file caused the error.
  3. Available metrics don't help a lot, I still need to search the logs to find what's wrong.
  4. No support for notification of any kind (even a simple webhook).

:(
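
As an illustration of the first point, this is roughly the kind of normalization a consumer has to write today. It is a minimal Go sketch that assumes each log line has already been decoded into key/value pairs; the field names `info`, `error`, `err` and `output` are the ones mentioned above, while the `record` type, the `normalize` helper and the sample messages are hypothetical.

```go
package main

import "fmt"

// record is a hypothetical uniform shape for a log entry.
type record struct {
	Level   string
	Message string
	Output  string
}

// normalize maps the differently named fields onto one shape.
// Any non-empty "error" or "err" field marks the entry as an error;
// otherwise the "info" field is used as the message.
func normalize(fields map[string]string) record {
	rec := record{Level: "info", Output: fields["output"]}
	for _, key := range []string{"error", "err"} {
		if msg := fields[key]; msg != "" {
			rec.Level = "error"
			rec.Message = msg
			return rec
		}
	}
	rec.Message = fields["info"]
	return rec
}

func main() {
	// Made-up samples covering the three shapes described above.
	samples := []map[string]string{
		{"info": "sync started"},
		{"error": "clone failed"},
		{"err": "apply failed", "output": "stderr text"},
	}
	for _, s := range samples {
		fmt.Printf("%+v\n", normalize(s))
	}
}
```

If every message carried the same set of fields, none of this mapping would be needed on the consumer side.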

@ArthurSens

Hello everyone.
As shown at the link above, I've asked for some guidance and best practices when implementing new metrics with the CNCF-SIG Observability team.

I'm trying to write a proposal for new metrics, following the advice given on that issue, which you can see here.

Any feedback on the proposed metrics, from users or maintainers, would be awesome. If anyone would like to add anything (especially use cases), that would be great too!

@kingdonb
Member

Observability has been a primary goal of Flux v2's redesign efforts. All of the controllers now implement Prometheus metrics around their various reconciliation topics, and expose reconciliation status details as events on their respective APIs' CRDs.

Flux v1 is in maintenance mode now, and is not adding any new features unless they are critical.

As Flux contrib efforts have been focused on Flux v2, the Flux project has moved to a new repo, fluxcd/flux2.

In the interest of reducing the number of open issues not directly related to supporting Flux v1 in maintenance mode, and respecting you may have moved on already, I will go ahead and close out this issue for now.


9 participants