
Expose metrics #258

Closed
lkysow opened this issue Sep 7, 2018 · 26 comments
Labels
feature New functionality/enhancement

Comments

@lkysow
Member

lkysow commented Sep 7, 2018

Via @mechastorm, they would like Atlantis to expose metrics around:

  • number of plans/applies
  • number of errors encountered
  • time when a plan ran successfully after an error was detected (that would be our MTTR - mean time to recover)
@psalaberria002
Contributor

Prometheus please

@majormoses
Contributor

Prometheus please

I'd like to focus on providing an endpoint /metrics that returns a JSON response. This allows it to be scraped and transformed by monitoring solutions, rather than requiring you to run Prometheus to get metrics out. In the long term, building in native support for Prometheus, Graphite, statsd, etc. might be nice, but to get the most bang for your buck I think the initial implementation should be inclusive rather than rely on a single common piece of tech. Just my $0.02.

@psalaberria002
Contributor

Any progress on this one?

@lkysow
Member Author

lkysow commented Nov 1, 2018

Nope!

@lkysow lkysow added the feature New functionality/enhancement label Apr 4, 2019
@kent-b
Contributor

kent-b commented Apr 9, 2019

Here's a basic RFC for this.
https://docs.google.com/document/d/1GwCvqEzQx0B-tEtq4T4H_LJ_7IddIP_ItmlM1zUTG2I/edit

@gwkunze

gwkunze commented May 8, 2019

https://openmetrics.io/ could be an option, although it's still in its infancy

@psalaberria002
Contributor

@lkysow How do you think metrics should be collected and exposed? Any preference?

I think we should use an existing library for collecting metrics (Prometheus, OpenMetrics in the future?, ...) and not reinvent the wheel. There are hundreds of Prometheus exporters, so you just need a sidecar to expose them in your preferred format or to send your metrics to a metrics store.

@xbglowx

xbglowx commented Sep 25, 2019

Prometheus please

I'd like to focus on providing an endpoint /metrics that returns a JSON response. This allows it to be scraped and transformed by monitoring solutions, rather than requiring you to run Prometheus to get metrics out. In the long term, building in native support for Prometheus, Graphite, statsd, etc. might be nice, but to get the most bang for your buck I think the initial implementation should be inclusive rather than rely on a single common piece of tech. Just my $0.02.

You could default to exposing as JSON and give the option (URL parameter) to change the format to something else, e.g. Prometheus. Consul and Nomad allow for this.

@psalaberria002
Contributor

psalaberria002 commented Sep 25, 2019

@xbglowx How do they do metric collection internally? Have they reimplemented Counters, Gauges, Histograms, etc?

Edit: Ok, they are using https://github.com/armon/go-metrics which could be an option. I am gonna give that a try.

@psalaberria002
Contributor

That library only supports Gauges and Counters. And personally I don't like that it tries to deal with all kinds of sinks. I don't think that logic should be built into Atlantis. Sidecar exporters solve the issue in a much cleaner manner.

@lkysow
Member Author

lkysow commented Nov 1, 2019

@caryyu please use the reactions on the post rather than adding comments.

@waltervargas

Datadog integration?

@cep21
Contributor

cep21 commented Aug 30, 2020

In lieu of metrics support, how are people currently monitoring their atlantis deploy to make sure it's healthy?

@mwarkentin
Contributor

Another option could be to log metrics in some structured format like EMF: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html

At least in AWS this would be easy to parse out into Cloudwatch metrics. Not sure if any other tools have added support for the spec.

@tewing

tewing commented Oct 13, 2020

It's going to be tough to get approval to use Atlantis without a prometheus metrics endpoint. I'm wondering how others are monitoring Atlantis uptime?

@majormoses
Contributor

majormoses commented Oct 14, 2020

Absolutely agree that we should add this, it's just not something that IMO ranks as the most important problem for Atlantis to solve at this moment. That is not to say my opinion is important 😉. If you or your org feel it is, this is OSS and someone (props to @psalaberria002 for taking a swing at it) can invest developer time or hire a contractor to build the feature. As Luke said, always vote with 👍/👎 on the main comment to show your support for/opposition to an issue, as that is what GitHub lets you sort on.

I know everyone (self included) loves data, but in the absence of data I can offer anecdotal advice on real usage. I have run Atlantis at multiple orgs for years and have had 0 problems from an uptime perspective. We ran it on fairly typical instances (something like a t or m medium/large instance class EC2 instance). If we have a lot going on then we see some elevated CPU (terraform), but every time terraform was the cause, and resources are released after terraform finishes executing. I have not observed any memory, file descriptor, or other resource leaks in a number of years. I can't say that about many projects that do offer such metrics 😆.

Standard resource monitoring and an HTTP health check have so far worked out pretty well for me. I mostly used CloudWatch on the (E|A)LB (which also offloaded TLS) and Sensu for your standard resources (disk, memory, cpu, network, etc), but those could be just about anything. I found it to be much more CPU bound, so if you really wanted to tune it I would stick with a c-class instance instead.

I think if I had to pick one metric to wish for, it would be the longest-running plan, to catch times when we have been rate limited (I am looking at you, GitHub). We do have plans to move our Atlantis instance into k8s next quarter; I will let you know what we end up changing, if anything.

Personal Plea/Rant to the Industry: in my experience there is no single monitoring system that covers everything and does it best. They all have their strengths and weaknesses. Saying someone can't use a solution because it is not supported by a specific monitoring product is ludicrous at an engineering organization. There are always options; while it might not be sexy, running a sidecar for something like Atlantis can work just fine for many use cases. I had to build monitoring for production Docker setups before there were projects like Prometheus, Docker monitoring APIs, docker exec, etc. We always found clever ways to meet the needs of our customers regardless of where their apps are. Eventually the solutions mature over time and we replace the clever hacks as they are no longer needed.

@nishkrishnan
Contributor

We have metrics support in our fork; however, it uses statsd in the form of github.com/lyft/gostats.

Here is the commit:
lyft/atlantis@37c200f

If there are enough likes on this, seeing as it's already implemented, I can just upstream it for others to build upon/use. If it helps, I can also write a tutorial on how to set up statsd with Atlantis. I know people were expressing their desire for Prometheus, but this is already done and used in production, so it could be a starting point at least.
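
For anyone wanting to try the statsd route before a tutorial exists, the wire protocol itself is tiny: a counter increment is just `name:value|c` sent over UDP. A minimal sketch (the metric name is invented; 8125 is the conventional statsd port):

```go
package main

import (
	"fmt"
	"net"
)

// statsdCounter formats a statsd counter-increment line.
func statsdCounter(name string, n int) string {
	return fmt.Sprintf("%s:%d|c", name, n)
}

func main() {
	// UDP is fire-and-forget: Dial and Write succeed even if no statsd
	// daemon is listening on the other end.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Fprint(conn, statsdCounter("atlantis.plans", 1))
	fmt.Println(statsdCounter("atlantis.plans", 1)) // atlantis.plans:1|c
}
```
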

@cep21
Contributor

cep21 commented Apr 2, 2021

Awesome work. Thanks @nishkrishnan

@haarchri

@nishkrishnan would be great to see your work here as PR ;)

@smitthakkar96

@nishkrishnan any plans to open a PR?

@nishkrishnan
Contributor

Yeah, will do. Sorry about that, I must have missed all this stuff.

@nishkrishnan
Contributor

#2147

@yoonsio
Contributor

yoonsio commented Apr 17, 2022

I have PR open to support Prometheus metrics: #2204

@nuno-silva

#2204 is now merged, so I believe this can be closed :) (thanks @yoonsio )

@ekhaydarov

Thanks for the work @yoonsio. However, in 0.19.8, trying to implement

  metrics:
    prometheus:
      endpoint: /metrics

we get an error of

Error: initializing server: parsing /etc/atlantis/repos.yaml file: yaml: unmarshal errors:
  line 87: field prometheus not found in type raw.Metrics

which is strange because in #2204 we can clearly see prometheus being added to metrics here

@nitrocode
Member

@ekhaydarov can you try 0.19.9 ?

Also, the whitespace in your yaml sample seems off. Just like policies and repos, metrics is also a root-level key. It should probably be documented as such.

type GlobalCfg struct {
    Repos      []Repo              `yaml:"repos" json:"repos"`
    Workflows  map[string]Workflow `yaml:"workflows" json:"workflows"`
    PolicySets PolicySets          `yaml:"policies" json:"policies"`
    Metrics    Metrics             `yaml:"metrics" json:"metrics"`
}

metrics:
  prometheus:
    endpoint: /metrics
