
Expose metrics #258

Closed
lkysow opened this issue Sep 7, 2018 · 26 comments
Labels
feature New functionality/enhancement

Comments

@lkysow
Member

lkysow commented Sep 7, 2018

Via @mechastorm, they would like Atlantis to expose metrics around:

  • number of plans/applies
  • number of errors encountered
  • time when a plan ran successfully after an error was detected (that would be our MTTR - mean time to recover)
@psalaberria002
Contributor

Prometheus please

@majormoses
Contributor

Prometheus please

I'd like to focus on providing an endpoint /metrics that returns a JSON response. This allows it to be scraped and transformed by monitoring solutions, rather than requiring you to run Prometheus to get metrics out. In the long term, building in native support for Prometheus, Graphite, statsd, etc. might be nice, but to get the most bang for your buck I think the initial implementation should be inclusive rather than rely on a single common piece of tech. Just my $0.02.

@psalaberria002
Contributor

Any progress on this one?

@lkysow
Member Author

lkysow commented Nov 1, 2018

Nope!

@lkysow lkysow added the feature New functionality/enhancement label Apr 4, 2019
@kent-b
Contributor

kent-b commented Apr 9, 2019

Here's a basic RFC for this.
https://docs.google.com/document/d/1GwCvqEzQx0B-tEtq4T4H_LJ_7IddIP_ItmlM1zUTG2I/edit

@gwkunze

gwkunze commented May 8, 2019

https://openmetrics.io/ could be an option, although it's still in its infancy

@psalaberria002
Contributor

@lkysow How do you think metrics should be collected and exposed? Any preference?

I think we should use an existing library for collecting metrics (Prometheus, OpenMetrics in the future?, ...) and not reinvent the wheel. There are hundreds of Prometheus exporters, so you just need a sidecar to expose them in your preferred format or to send your metrics to a metrics store.

@xbglowx

xbglowx commented Sep 25, 2019

Prometheus please

I'd like to focus on providing an endpoint /metrics that returns a JSON response. This allows it to be scraped and transformed by monitoring solutions, rather than requiring you to run Prometheus to get metrics out. In the long term, building in native support for Prometheus, Graphite, statsd, etc. might be nice, but to get the most bang for your buck I think the initial implementation should be inclusive rather than rely on a single common piece of tech. Just my $0.02.

You could default to exposing as JSON and give the option (URL parameter) to change the format to something else, e.g. Prometheus. Consul and Nomad allow for this.

@psalaberria002
Contributor

psalaberria002 commented Sep 25, 2019

@xbglowx How do they do metric collection internally? Have they reimplemented Counters, Gauges, Histograms, etc?

Edit: Ok, they are using https://github.com/armon/go-metrics which could be an option. I am gonna give that a try.

@psalaberria002
Contributor

That library only supports Gauges and Counters. And personally I don't like that it tries to deal with all kinds of sinks. I don't think that logic should be built into Atlantis. Sidecar exporters solve the issue in a much cleaner manner.

@lkysow
Member Author

lkysow commented Nov 1, 2019

@caryyu please use the reactions on the post rather than adding comments.

@waltervargas

Datadog integration?

@cep21
Contributor

cep21 commented Aug 30, 2020

In lieu of metrics support, how are people currently monitoring their atlantis deploy to make sure it's healthy?

@mwarkentin
Contributor

Another option could be to log metrics in some structured format like EMF: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html

At least in AWS this would be easy to parse out into Cloudwatch metrics. Not sure if any other tools have added support for the spec.

@tewing

tewing commented Oct 13, 2020

It's going to be tough to get approval to use Atlantis without a prometheus metrics endpoint. I'm wondering how others are monitoring Atlantis uptime?

@majormoses
Contributor

majormoses commented Oct 14, 2020

Absolutely agree that we should add this, it's just not something that IMO ranks as the most important problem for Atlantis to solve at this moment. That is not to say my opinion is important 😉. If you or your org feel it is, this is OSS and someone (props to @psalaberria002 for taking a swing at it) can invest developer time or hire a contractor to build the feature. As Luke said, always vote with 👍/👎 on the main comment to show your support for/opposition to an issue, as that is what GitHub lets you sort on.

I know everyone (self included) loves data, but in the absence of data I can offer anecdotal advice on real usage. I have run Atlantis at multiple orgs for years and have had 0 problems from an uptime perspective. We ran it on fairly typical instances (something like a t or m medium/large instance class EC2 instance). If we have a lot going on then we see some elevated CPU (terraform), but every time terraform was the cause, and resources are released after terraform finishes executing. I have not observed any memory, file descriptor, or other resource leaks in a number of years. I can't say that about many projects that do offer such metrics 😆.

Standard resource monitoring and an HTTP health check have so far worked out pretty well for me. I mostly used CloudWatch on the (E|A)LB (which also offloaded TLS) and Sensu for your standard resources (disk, memory, cpu, network, etc), but those could be just about anything. I found it to be much more CPU bound, so if you really wanted to tune it I would stick with a c-class instance instead.

I think if I had to pick one metric to wish for, it would be the longest-running plan, to catch times when we have been rate limited (I am looking at you, GitHub). We do have plans to move our Atlantis instance into k8s next quarter; I will let you know what we end up changing, if anything.

Personal Plea/Rant to the Industry: in my experience there is no single monitoring system that covers everything and does it best. They all have their strengths and weaknesses. Saying someone can't use a solution because it is not supported by a specific monitoring product is ludicrous at an engineering organization. There are always options; while it might not be sexy, running a sidecar for something like Atlantis can work just fine for many use cases. I had to build monitoring for production Docker setups before there were projects like Prometheus, Docker monitoring APIs, docker exec, etc. We always found clever ways to meet the needs of our customers regardless of where their apps are. Eventually the solutions mature over time and we replace the clever hacks as they are no longer needed.

@nishkrishnan
Contributor

We have metrics support in our fork; however, it uses statsd in the form of github.com/lyft/gostats.

Here is the commit:
lyft/atlantis@37c200f

If there are enough likes on this, seeing as it's already implemented, I can just upstream it for others to build upon/use. If it helps, I can also write a tutorial on how to set up statsd with Atlantis. I know people were expressing their desire for Prometheus, but this is already done and used in production, so it could be a starting point at least.
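
For anyone wanting to try the statsd route before a tutorial exists, the wire protocol itself is tiny: a counter increment is just `name:value|c` sent over UDP. A minimal sketch (the metric name is invented; 8125 is the conventional statsd port):

```go
package main

import (
	"fmt"
	"net"
)

// statsdCounter formats a statsd counter-increment line.
func statsdCounter(name string, n int) string {
	return fmt.Sprintf("%s:%d|c", name, n)
}

func main() {
	// UDP is fire-and-forget: Dial and Write succeed even if no statsd
	// daemon is listening on the other end.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Fprint(conn, statsdCounter("atlantis.plans", 1))
	fmt.Println(statsdCounter("atlantis.plans", 1)) // atlantis.plans:1|c
}
```
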

@cep21
Contributor

cep21 commented Apr 2, 2021

Awesome work. Thanks @nishkrishnan

@haarchri

@nishkrishnan would be great to see your work here as PR ;)

@smitthakkar96

@nishkrishnan any plans to open a PR?

@nishkrishnan
Contributor

Yeah, will do. Sorry about that, I must have missed all this stuff.

@nishkrishnan
Contributor

#2147

@yoonsio
Contributor

yoonsio commented Apr 17, 2022

I have PR open to support Prometheus metrics: #2204

@nuno-silva

#2204 is now merged, so I believe this can be closed :) (thanks @yoonsio )

@ekhaydarov

Thanks for the work @yoonsio. However, in 0.19.8, trying to implement

  metrics:
    prometheus:
      endpoint: /metrics

we get an error of

Error: initializing server: parsing /etc/atlantis/repos.yaml file: yaml: unmarshal errors:
  line 87: field prometheus not found in type raw.Metrics

which is strange because in #2204 we can clearly see prometheus being added to metrics here

@nitrocode
Member

@ekhaydarov can you try 0.19.9 ?

Also, the whitespace in your yaml sample seems off. Just like policies and repos, metrics is also a root-level key. It should probably be documented as such.

type GlobalCfg struct {
    Repos      []Repo              `yaml:"repos" json:"repos"`
    Workflows  map[string]Workflow `yaml:"workflows" json:"workflows"`
    PolicySets PolicySets          `yaml:"policies" json:"policies"`
    Metrics    Metrics             `yaml:"metrics" json:"metrics"`
}

metrics:
  prometheus:
    endpoint: /metrics
