Prometheus metrics endpoint #259

Closed
jpds opened this issue Mar 8, 2019 · 7 comments

Comments

jpds commented Mar 8, 2019

Could a /metrics endpoint be added to n-p-d so that tools like Prometheus can scrape its metrics and create alerts based on them?

Contributor

xueweiz commented May 17, 2019

@jpds
Hi Jone, what kind of Prometheus metrics do you have in mind?

I just shared a proposal about exporting node problems as logs and metrics. I think we could translate today's conditions & events into metrics.

For example, for a permanent problem (node condition), it looks like this today:
{ "type":"KernelDeadlock", "status":"True", "transition":"2012-10-31 15:50:13.793654 +0000 UTC", "reason":"AUFSUmountHung", "message":"[...]task umount.aufs:xxxx blocked for more than 180 seconds.[...]" }

And for a temporary problem (event), it looks like this:
{ "severity":"warn", "timestamp":"2012-10-31 15:50:13.793654 +0000 UTC", "reason":"OOMKilling", "message":"[...]Kill process 677 dockerd…[...]" }

I plan to translate them into counter metrics, basically counting how many times they have occurred on this node, like this:
problem_counter {"name": "KernelDeadlock", "reason": "AUFSUmountHung"} 1
problem_counter {"name": "OOMKilling"} 2

Would that be the kind of Prometheus metrics you had in mind? If you could take a look at the doc and share your opinion, that would also be very helpful. Thanks!
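
The counter translation proposed above can be sketched in a few lines of Python. This is only an illustration of the idea, not NPD's implementation: the function names `count_problems` and `render` are hypothetical, the input dictionaries are trimmed versions of the two examples in the comment, and the sketch assumes that a condition is counted only when its status is "True". The output uses the standard Prometheus text exposition syntax rather than the informal notation in the comment, and the metric names that NPD actually ships may differ.

```python
from collections import Counter


def count_problems(problems):
    """Count occurrences per (name, reason) pair.

    Assumption: conditions carry a "type" and a "status", events carry only
    a "reason". A condition is counted only when its status is "True".
    """
    counts = Counter()
    for problem in problems:
        if "type" in problem:
            if problem.get("status") != "True":
                continue  # healthy condition, nothing to count
            key = (problem["type"], problem["reason"])
        else:
            key = (problem["reason"], None)  # events have no separate reason label
        counts[key] += 1
    return counts


def render(counts):
    """Render the counts as Prometheus text exposition lines."""
    lines = []
    for (name, reason), value in sorted(
        counts.items(), key=lambda item: (item[0][0], item[0][1] or "")
    ):
        labels = f'name="{name}"'
        if reason is not None:
            labels += f',reason="{reason}"'
        lines.append(f"problem_counter{{{labels}}} {value}")
    return "\n".join(lines)


SAMPLE = [
    {"type": "KernelDeadlock", "status": "True", "reason": "AUFSUmountHung"},
    {"severity": "warn", "reason": "OOMKilling"},
    {"severity": "warn", "reason": "OOMKilling"},
]

print(render(count_problems(SAMPLE)))
```

With the sample input, one kernel deadlock condition and two OOM-kill events produce one line per distinct label set, matching the counts of 1 and 2 in the example above.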

@frittentheke

@jpds ... your wish just came true -> 23dc265

@xueweiz is there going to be a release anytime soon?

Contributor

xueweiz commented Jul 24, 2019

Hi @frittentheke @jpds, the implementation of NPD metrics mode is mainly tracked at #284.

It is almost complete. The only thing missing is #315, which I expect to finish within a few days.
I discussed this with @Random-Liu yesterday, and the current plan is to cut the v0.7.0 release after #284 is closed.

It would be good if you could help verify whether these changes suit your use cases. If not, we can make further improvements along the way (0.7.x).

Contributor

xueweiz commented Aug 26, 2019

The changes mentioned above have been released in v0.7.0. I think we can close this issue now, right?
@wangzhen127
@andyxning


amagura commented Dec 2, 2019

I'm using v0.7.0. However, neither NPD nor its metrics are showing up in Prometheus, even though NPD successfully starts its Prometheus exporter:

I1202 20:45:18.656590 1 node_problem_detector.go:60] K8s exporter started.
I1202 20:45:18.656762 1 node_problem_detector.go:64] Prometheus exporter started.
I1202 20:45:18.656777 1 log_monitor.go:107] Start log monitor
I1202 20:45:18.656849 1 log_monitor.go:107] Start log monitor
I1202 20:45:18.658475 1 log_watcher.go:80] Start watching journald
I1202 20:45:18.658497 1 problem_detector.go:67] Problem detector started
I1202 20:45:18.658868 1 log_monitor.go:224] Initialize condition generated: [{Type:KernelDeadlock Status:False Transition:2019-12-02 20:45:18.658765117 +0000 UTC m=+0.108182803 Reason:KernelHasNoDeadlock Message:kernel has no deadlock} {Type:ReadonlyFilesystem Status:False Transition:2019-12-02 20:45:18.658765217 +0000 UTC m=+0.108182903 Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only}]
I1202 20:45:18.660071 1 log_monitor.go:224] Initialize condition generated: []

Contributor

xueweiz commented Dec 2, 2019

Hi @amagura ,

Your logs show "Prometheus exporter started", so I think the Prometheus exporter is exporting metrics. Can you verify that by running curl localhost:20257/metrics? (If you are running NPD in a pod/container, please run the command inside that container.)

If you see some output, then NPD is working as intended. You might need to configure your Prometheus to scrape metrics from NPD's metrics endpoint.
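
For anyone who wants to script the curl check above, here is a minimal Python sketch. It assumes the endpoint is reachable at localhost:20257 as in the command above, and that the response uses the usual Prometheus text exposition format. The helper names `parse_metrics` and `fetch_metrics` are hypothetical, and the parser is deliberately simplified (it ignores optional timestamps and does not split labels into key/value pairs).

```python
import re
import urllib.request

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_LINE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)")


def parse_metrics(text):
    """Map each 'name{labels}' series to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # not a sample line this simplified parser understands
        name, labels, value = match.groups()
        samples[name + (labels or "")] = float(value)
    return samples


def fetch_metrics(url="http://localhost:20257/metrics", timeout=5):
    """Fetch the raw metrics text; the default URL is the one from the thread."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `parse_metrics(fetch_metrics())` on the node (or inside the NPD container) returns a non-empty dictionary when the exporter is serving metrics. An empty result or a connection error points at NPD, while a non-empty result means the remaining work is on the Prometheus scrape configuration.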


amagura commented Dec 2, 2019

"If you see some output, then NPD is working as intended. You might need to configure your Prometheus to scrape metrics from NPD's metrics endpoint."

That must be the issue: running curl printed out some metrics, so NPD must be working.

Thanks for getting back to me so quickly.
