The "collector_duration_seconds_total" metric as histogram to keep duration of full collection. #317

vlamug · 2019-02-08T10:04:31Z

The "collector_duration_seconds_total" metric was added to record the duration of a full collection. We added it because we noticed some strange dips in our graphs (please see the attached screenshot). The cause is that collecting metrics sometimes takes so long that Prometheus cannot finish the scrape within the scrape timeout, which is 20s in our Prometheus configuration. The existing "collector_duration_seconds" metric does not show these long collections because it is a gauge: its value is overwritten by the next collection, and a scrape that times out never delivers it.

So the new metric, being a histogram, records the duration of every collection. It currently has the following buckets, in seconds:
0.5, 1, 2.5, 5, 10, 15, 20, 25, 30, 45, 60

If you think this set should be changed, please let me know. Thanks.
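
To illustrate the gauge-versus-histogram point above, here is a minimal, standard-library-only sketch. It is not the code from this PR and does not use the Prometheus client library; `Histogram`, `NewHistogram`, `Observe`, `CumulativeCount` and `SlowerThan` are hypothetical names chosen for the example. Only the bucket bounds and the 20s scrape timeout come from the description.

```go
package main

import (
	"fmt"
	"sort"
)

// DefaultBuckets are the bucket upper bounds (in seconds) listed in the PR description.
var DefaultBuckets = []float64{0.5, 1, 2.5, 5, 10, 15, 20, 25, 30, 45, 60}

// Histogram is a simplified stand-in for a Prometheus histogram (hypothetical type).
type Histogram struct {
	upperBounds []float64 // sorted bucket upper bounds
	counts      []uint64  // per-bucket counts; the last slot is the +Inf bucket
	count       uint64    // total number of observations
}

// NewHistogram copies and sorts the bounds and allocates one extra +Inf bucket.
func NewHistogram(bounds []float64) *Histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &Histogram{upperBounds: b, counts: make([]uint64, len(b)+1)}
}

// Observe records one collection duration in the first bucket whose bound is >= seconds.
func (h *Histogram) Observe(seconds float64) {
	i := sort.SearchFloat64s(h.upperBounds, seconds)
	h.counts[i]++
	h.count++
}

// CumulativeCount returns how many observations were <= le, like an `le` bucket.
func (h *Histogram) CumulativeCount(le float64) uint64 {
	var total uint64
	for i, bound := range h.upperBounds {
		if bound > le {
			break
		}
		total += h.counts[i]
	}
	return total
}

// SlowerThan returns how many observations exceeded the given bucket bound.
func (h *Histogram) SlowerThan(le float64) uint64 {
	return h.count - h.CumulativeCount(le)
}

func main() {
	h := NewHistogram(DefaultBuckets)
	var gauge float64 // a gauge only remembers the most recent value

	// Illustrative durations: the third collection (27s) exceeds a 20s scrape timeout.
	for _, d := range []float64{1.2, 3.0, 27.0, 0.8} {
		gauge = d
		h.Observe(d)
	}

	fmt.Printf("gauge (last collection only): %.1fs\n", gauge)
	fmt.Printf("collections slower than 20s:  %d\n", h.SlowerThan(20))
}
```

The gauge ends up holding only the last, fast collection, while the histogram still shows that one collection ran past the 20s bound, which is the information a later successful scrape would need.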


carlpett · 2019-02-16T08:32:58Z

Hi @vlamug,
Thanks for helping to improve wmi_exporter! Am I understanding you correctly that you mainly want to see, in later scrapes, how long previously timed-out scrapes took?
If you want to detect timeouts, you can already do so: Prometheus' automatic up metric goes to 0 if the scrape times out.

vlamug · 2019-02-18T08:46:34Z

Hello @carlpett. Thanks for the answer.

> Am I understanding you right that you mainly want to see in later scrapes how long time previously timed-out scrapes took?

Yes. Additionally, it can show us the reason for a failed scrape. Consider the image attached to the first message. If I see dips in the graph, I can check the collector_duration_seconds_total metric, and if the collection took longer than the scrape timeout in the Prometheus configuration (the scrape_timeout parameter), I can be sure that the scrape failed because collection took too long.
The up metric does not give us that information.

vlamug · 2019-02-18T08:48:26Z

So I want to see not only that scraping is failing right now, but also to find out why it failed in the past.

vlamug · 2019-03-06T06:37:04Z

@carlpett could you check my last message? Thanks.

carlpett · 2019-08-03T13:56:08Z

Sorry, it seems I accidentally silenced this thread. As part of #335, we're introducing timeouts for collectors, so you should be seeing far fewer total scrape failures from the Prometheus side. Additionally, it adds a collector_timeout metric so you can see which collectors took too long.
Would this cover your use case?
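
For readers unfamiliar with the idea, per-collector timeouts can be sketched roughly as follows. This is an illustration only, not the implementation from #335: the `Collector` type, `RunWithTimeout`, the collector names and the 100ms deadline are all assumptions made for the example. Each collector runs in its own goroutine, and any collector that has not finished when the deadline passes is reported as timed out while the others still return their results.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Collector is a hypothetical collector signature used only for this sketch.
type Collector func(ctx context.Context) error

// demoTimeout is an illustrative deadline, far shorter than a real scrape timeout.
const demoTimeout = 100 * time.Millisecond

// fastCollector finishes immediately.
func fastCollector(ctx context.Context) error { return nil }

// slowCollector would need 2s, so it is cut off by the deadline.
func slowCollector(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// RunWithTimeout runs all collectors concurrently and returns, per collector
// name, whether it failed to finish before the timeout expired.
func RunWithTimeout(collectors map[string]Collector, timeout time.Duration) map[string]bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// Buffered so collectors that finish late never block on send.
	done := make(chan string, len(collectors))
	for name, c := range collectors {
		go func(name string, c Collector) {
			_ = c(ctx)
			done <- name
		}(name, c)
	}

	// Assume every collector timed out until it proves otherwise.
	timedOut := make(map[string]bool, len(collectors))
	for name := range collectors {
		timedOut[name] = true
	}

	for remaining := len(collectors); remaining > 0; remaining-- {
		select {
		case name := <-done:
			// A result that arrives after the deadline still counts as a timeout.
			timedOut[name] = ctx.Err() != nil
		case <-ctx.Done():
			return timedOut
		}
	}
	return timedOut
}

func main() {
	result := RunWithTimeout(map[string]Collector{
		"cpu":     fastCollector, // illustrative names
		"service": slowCollector,
	}, demoTimeout)

	for name, timedOut := range result {
		fmt.Printf("collector=%s timed_out=%t\n", name, timedOut)
	}
}
```

A per-collector flag like this is the kind of value a collector_timeout metric could expose, so that a slow collector is identifiable without the whole scrape failing.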

vlamug · 2019-09-27T07:08:33Z

@carlpett, OK, it seems the added functionality is helpful. Thank you. I am closing this issue.

vlamug added 2 commits February 8, 2019 12:41

The metric "collector_duration_seconds_total" was added to observer full duration of collecting.

897a723

The fix of calculating total duration of collection metrics.

f33b946

vlamug closed this Sep 27, 2019
