Stats: Report config reload data #5680

Closed
4 tasks done
jordansissel opened this issue Jul 22, 2016 · 14 comments

Comments

@jordansissel
Contributor

jordansissel commented Jul 22, 2016

  • Report config reload success count
  • Config reload failure count
  • Time of most recent config reload success
  • Time of most recent config reload failure

Suggested by @widhalmt in NETWAYS/check_logstash#6
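As a rough illustration, the four metrics above could be held in one small object that records a count and a timestamp per outcome. This is a minimal Python sketch, not Logstash's actual metrics API; the `ReloadStats` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ReloadStats:
    """Hypothetical container for the four reload metrics listed above."""

    successes: int = 0
    failures: int = 0
    last_success_time: Optional[datetime] = None
    last_failure_time: Optional[datetime] = None

    @staticmethod
    def _now(now: Optional[datetime]) -> datetime:
        # Use the caller-supplied time if given, else the current UTC time.
        return now if now is not None else datetime.now(timezone.utc)

    def record_success(self, now: Optional[datetime] = None) -> None:
        # Count the successful reload and remember when it happened.
        self.successes += 1
        self.last_success_time = self._now(now)

    def record_failure(self, now: Optional[datetime] = None) -> None:
        # Count the failed reload and remember when it happened.
        self.failures += 1
        self.last_failure_time = self._now(now)


stats = ReloadStats()
stats.record_success()
stats.record_failure()
```

For these counters to be meaningful, the object has to live outside whatever part of the metrics tree is cleared when a config is reloaded; otherwise every successful reload would wipe its own record.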

@jordansissel
Contributor Author

For tracking reloads, we'll have to change the way the entire metrics tree is reset when a config is reloaded.

@acchen97
Contributor

acchen97 commented Aug 4, 2016

Thanks for tracking this. I can see this as a new resource type under /_node/stats/{x}.

We'll need to document this new behavior of metrics living across config reloads.

@suyograo
Contributor

@jordansissel assigning this to @jsvd

@acchen97
Contributor

@jsvd FYI added this as the last item under "Node Stats" here #5732

@jsvd
Member

jsvd commented Aug 29, 2016

what about:

/_node/stats/pipeline/reloads/successful - number of successful pipeline reloads
/_node/stats/pipeline/reloads/failed - number of failed pipeline reloads (whether from failing to fetch the configuration, failing to create the new pipeline, or failing to start it)
/_node/stats/pipeline/reloads/last_success_time - date of the last successful reload
/_node/stats/pipeline/reloads/last_failure_time - date of the last failed reload

[edit]

/_node/stats/pipeline/reloads/last_failure_message - exception message + backtrace(?) of the last failure to reload
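To make the proposal concrete, here is a minimal Python sketch of what a response body for these endpoints might look like, nesting the keys to mirror the URL path. The `reloads_payload` function, the ISO-8601 timestamp format, and the `include_backtrace` switch (for the "backtrace(?)" question) are all assumptions for illustration, not an existing Logstash API.

```python
import json
import traceback
from datetime import datetime, timezone
from typing import Optional


def reloads_payload(successful: int,
                    failed: int,
                    last_success_time: Optional[datetime],
                    last_failure_time: Optional[datetime],
                    last_error: Optional[BaseException] = None,
                    include_backtrace: bool = False) -> dict:
    """Build a hypothetical body for /_node/stats/pipeline/reloads."""

    def iso(t: Optional[datetime]) -> Optional[str]:
        # Render timestamps as ISO-8601 strings; None means "never happened".
        return t.isoformat() if t is not None else None

    message = None
    if last_error is not None:
        if include_backtrace:
            # Exception message plus the full backtrace.
            lines = traceback.format_exception(
                type(last_error), last_error, last_error.__traceback__)
        else:
            # Just "ExceptionType: message".
            lines = traceback.format_exception_only(type(last_error), last_error)
        message = "".join(lines).strip()

    return {
        "pipeline": {
            "reloads": {
                "successful": successful,
                "failed": failed,
                "last_success_time": iso(last_success_time),
                "last_failure_time": iso(last_failure_time),
                "last_failure_message": message,
            }
        }
    }


body = reloads_payload(2, 1,
                       datetime(2016, 8, 29, tzinfo=timezone.utc),
                       datetime(2016, 8, 30, tzinfo=timezone.utc),
                       ValueError("bad config"))
print(json.dumps(body, indent=2))
```

Keeping the backtrace optional is one way to answer the open question: the short message is cheap to expose, while the full trace can be large.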

@ph
Contributor

ph commented Aug 29, 2016

@jsvd Seems good to me. I see that you record the last_failure_time; I wonder if we should also keep track of what the last error was?

/_node/stats/pipeline/reloads/failed

This node could be more detailed; I see value in that once config management comes in.

@jsvd
Member

jsvd commented Aug 29, 2016

yes, I thought about that as well. I updated the issue description to add this metric. Can we have a kind of metric gauge which is a string?

@jsvd
Member

jsvd commented Aug 30, 2016

@ph the resource /_node/stats/pipeline/ is expected to list the pipeline ids in the future, correct? e.g. /_node/stats/pipeline/main/?

If so, we can't use /_node/stats/pipeline/reloads/, as that would look like the name of a pipeline.

Also, pipeline reload metrics could be done per pipeline id: a reload of the main pipeline would only increase the failed/successful counts for that pipeline id. Once we have two or more pipelines in the same instance, a global failed count is somewhat useless.

Implementing this means that the metric reset on reload must be even finer grained by only clearing the event counts/plugin stats for that pipeline id and not the whole pipeline resource.

Thoughts?
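The finer-grained reset described above can be sketched as a per-pipeline subtree where a reload clears only that pipeline's event and plugin stats while its reload counters survive. This is a minimal Python sketch; `MetricsTree`, `new_pipeline_node`, and the key names are hypothetical, and clearing stats only on a *successful* reload is an assumption (a failed reload presumably leaves the old pipeline running).

```python
from typing import Dict


def new_pipeline_node() -> dict:
    # Hypothetical per-pipeline subtree: per-run stats plus reload history.
    return {"events": {}, "plugins": {},
            "reloads": {"successes": 0, "failures": 0}}


class MetricsTree:
    """Hypothetical metrics store keyed by pipeline id."""

    def __init__(self) -> None:
        self.pipelines: Dict[str, dict] = {}

    def pipeline(self, pipeline_id: str) -> dict:
        # Create the subtree the first time a pipeline id is seen.
        if pipeline_id not in self.pipelines:
            self.pipelines[pipeline_id] = new_pipeline_node()
        return self.pipelines[pipeline_id]

    def record_reload(self, pipeline_id: str, success: bool) -> None:
        node = self.pipeline(pipeline_id)
        if success:
            # The new pipeline starts fresh: clear only this pipeline's
            # event counts and plugin stats, keep its reload history,
            # and leave every other pipeline untouched.
            node["events"].clear()
            node["plugins"].clear()
            node["reloads"]["successes"] += 1
        else:
            # Assumption: the old pipeline keeps running after a failed
            # reload, so its per-run stats are kept.
            node["reloads"]["failures"] += 1


tree = MetricsTree()
tree.pipeline("main")["events"]["in"] = 10
tree.pipeline("other")["events"]["in"] = 5
tree.record_reload("main", success=True)
```

With this layout a reload counter naturally lands under the right pipeline id, and no global reset of the whole pipeline resource is needed.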

@ph
Contributor

ph commented Aug 30, 2016

yes, I thought about that as well. I updated the issue description to add this metric. Can we have a kind of metric gauge which is a string?

Perfectly possible; a gauge can hold any type.

If so, we can't use /_node/stats/pipeline/reloads/, as that would look like the name of a pipeline.

Also, pipeline reload metrics could be done per pipeline id: a reload of the main pipeline would only increase the failed/successful counts for that pipeline id. Once we have two or more pipelines in the same instance, a global failed count is somewhat useless.
Implementing this means that the metric reset on reload must be even finer grained by only clearing the event counts/plugin stats for that pipeline id and not the whole pipeline resource.

This is a good point. I am not sure of the best approach here, for two reasons:

  • In theory, multiple pipelines can coexist, but this has never been the case other than the internal metrics pipeline, which doesn't gather metrics.
  • When multiple pipelines land, some of the API will possibly have to change as well.

I wonder if we should have an /agent namespace that holds these kinds of values independently of the number of pipelines.

@jsvd
Member

jsvd commented Aug 30, 2016

well, in that case we don't need the /agent thing; we just need to do a deeper reset and put the reload count under the correct pipeline_id

@ph
Contributor

ph commented Aug 30, 2016

agree

@acchen97
Contributor

As we only expose the main pipeline today, I think we can proceed with these stats under /_node/stats/pipeline/reloads, which is similar to the other events/plugin stats stuff. We can always correlate to more specific pipeline IDs later (beyond main).

@jsvd FYI, in case you haven't seen it, @jordansissel defined a metrics style guide that we should strive to adhere to. Also, for success/failure: the grok/date stats, for instance, use the terms "matches" and "failures". Maybe "successes" and "failures" make sense here?

@jsvd
Member

jsvd commented Aug 31, 2016

@acchen97 I wasn't aware of that, great feedback thanks!

@jsvd
Member

jsvd commented Sep 6, 2016

done in #5848

jsvd closed this as completed Sep 6, 2016
5 participants