The "collector_duration_seconds_total" metric as histogram to keep duration of full collection. #317

@vlamug vlamug commented Feb 8, 2019

This PR adds the "collector_duration_seconds_total" metric, a histogram that records the duration of a full collection. We added it because we noticed some strange dips in our graphs (please see the attached screenshot). The cause is that collecting the metrics takes so long that Prometheus cannot finish the scrape within the scrape timeout, which is 20s in our Prometheus configuration. The existing "collector_duration_seconds" metric does not reveal these long collections, because it is a gauge and only holds the latest value.

[Attached screenshot: graph showing the dips]

The new histogram metric records the duration of every collection, so long collections stay visible in later scrapes. The metric currently has the following buckets, in seconds:
0.5, 1, 2.5, 5, 10, 15, 20, 25, 30, 45, 60

If you think this set of buckets should be changed, please let me know. Thanks.
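To illustrate why a histogram keeps this information while a gauge does not, here is a minimal, standard-library-only sketch of cumulative bucket counting with the bucket bounds listed above. The `histogram` type and its `observe` method are hypothetical stand-ins for illustration; they are not the code in this PR and not the Prometheus client library API.

```go
package main

import "fmt"

// Upper bounds in seconds, copied from the bucket list in this PR description.
var buckets = []float64{0.5, 1, 2.5, 5, 10, 15, 20, 25, 30, 45, 60}

// histogram is a hypothetical stand-in for a Prometheus-style histogram.
// counts[i] is the number of observations <= buckets[i] (cumulative),
// and count is the number of all observations (the implicit +Inf bucket).
type histogram struct {
	counts []uint64
	count  uint64
	sum    float64
}

func newHistogram() *histogram {
	return &histogram{counts: make([]uint64, len(buckets))}
}

// observe records one collection duration, given in seconds.
func (h *histogram) observe(seconds float64) {
	for i, upper := range buckets {
		if seconds <= upper {
			h.counts[i]++
		}
	}
	h.count++
	h.sum += seconds
}

func main() {
	h := newHistogram()
	// Two fast collections and one that outlasts a 20s scrape timeout.
	for _, d := range []float64{0.8, 3.2, 27.0} {
		h.observe(d)
	}
	for i, upper := range buckets {
		fmt.Printf("le=%v: %d\n", upper, h.counts[i])
	}
	fmt.Printf("le=+Inf: %d\n", h.count)
}
```

Because the counters only ever increase, the 27s collection still shows up in the `le=30` bucket on the next successful scrape, whereas a gauge would by then hold only the latest duration.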

@carlpett
Collaborator

Hi @vlamug,
Thanks for helping to improve wmi_exporter! Am I understanding you right that you mainly want to see, in later scrapes, how long previously timed-out scrapes took?
If you only want to detect timeouts, you can already do that: Prometheus' automatic up metric goes to 0 when the scrape times out.

@vlamug
Author

vlamug commented Feb 18, 2019

Hello @carlpett. Thanks for the answer.

> Am I understanding you right that you mainly want to see, in later scrapes, how long previously timed-out scrapes took?

Yes. Additionally, it can show us the reason for a failed scrape. Consider the image attached to the first message. If I see dips in the graph, I can check the collector_duration_seconds_total metric. If the collection took longer than the scrape timeout in the Prometheus configuration (the scrape_timeout parameter), I can be sure that the scrape failed because the collection took too long.
The up metric does not give us that information.
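The check described here can be sketched as a difference between two successful scrapes of the histogram's cumulative counters. This is only an illustration: the `snapshot` type and `slowBetween` function are hypothetical names, and the sketch assumes that one bucket boundary equals the scrape timeout (20s in the setup described above).

```go
package main

import "fmt"

// snapshot is a hypothetical pair of cumulative counters read from the
// histogram at one successful scrape.
type snapshot struct {
	total         uint64 // all collections so far (the +Inf bucket)
	withinTimeout uint64 // collections that finished within the timeout (the le=20 bucket)
}

// slowBetween returns how many collections between two snapshots ran longer
// than the scrape timeout.
func slowBetween(before, after snapshot) uint64 {
	all := after.total - before.total
	fast := after.withinTimeout - before.withinTimeout
	return all - fast
}

func main() {
	// Assumed example values: the last scrape before a dip and the first one after it.
	before := snapshot{total: 100, withinTimeout: 100}
	after := snapshot{total: 104, withinTimeout: 102}
	fmt.Println("collections slower than the timeout:", slowBetween(before, after))
	// prints "collections slower than the timeout: 2"
}
```

A non-zero result across a dip points to slow collection as the cause, which the up metric alone cannot show.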

@vlamug
Author

vlamug commented Feb 18, 2019

So, I want to see not only that scraping is failing right now, but also to find out why it failed in the past.

@vlamug
Author

vlamug commented Mar 6, 2019

@carlpett could you check my last message? Thanks.

@carlpett
Collaborator

carlpett commented Aug 3, 2019

Sorry, it seems I accidentally silenced this thread. As part of #335, we're introducing timeouts for collectors, so you should be seeing far fewer total scrape failures from the Prometheus side. Additionally, it adds a collector_timeout metric so you can see which collectors took too long.
Would this cover your use case?

@vlamug
Author

vlamug commented Sep 27, 2019

@carlpett, OK, the added functionality looks helpful. Thank you. I am closing this issue.

@vlamug vlamug closed this Sep 27, 2019