How to hook up third-party daemons? #35

Closed
jfilak opened this issue Oct 24, 2016 · 10 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jfilak

jfilak commented Oct 24, 2016

I'm an ABRT developer and I would love to create a problem daemon reporting problems detected by ABRT to node-problem-detector.

ABRT's architecture is similar to node-problem-detector's - there are agents reporting detected problems to abrtd. An ABRT agent is either a tiny daemon watching logs (or systemd-journal) or a language error handler (Python sys.excepthook, Ruby at_exit callback, /proc/sys/kernel/core_pattern, Node.js uncaughtException event handler, Java JNI agent).
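
As an illustration of the "language error handler" kind of agent, here is a minimal Python sketch built on `sys.excepthook`. It is not ABRT's actual handler: `report_problem` and the `reported_problems` list are hypothetical names, and the in-memory list stands in for whatever channel a real agent would use to reach a daemon such as abrtd.

```python
import sys
import traceback

# Hypothetical in-memory sink (assumption for this sketch). A real agent
# would forward the record to a collecting daemon instead.
reported_problems = []


def report_problem(exc_type, exc_value, exc_tb):
    """Record an uncaught exception, then defer to the default handler."""
    reported_problems.append({
        "type": exc_type.__name__,
        "reason": str(exc_value),
        "backtrace": "".join(
            traceback.format_exception(exc_type, exc_value, exc_tb)
        ),
    })
    # Keep the interpreter's normal behaviour (traceback on stderr).
    sys.__excepthook__(exc_type, exc_value, exc_tb)


# From here on, any uncaught exception in this process goes through the hook.
sys.excepthook = report_problem
```

The hook only fires for exceptions that nothing else catches, so the application itself does not have to change.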

I've created a Docker image that is capable of detecting kernel oopses, vmcores and core files on a host:
https://github.com/jfilak/docker-abrt/tree/atomic_minimal

(It should be possible to detect uncaught [Python, Ruby, Java] exceptions in the future)

ABRT provides several ways of reporting the detected problems to users - e-mail, FTP|SCP upload, D-Bus signal, Bugzilla bug, micro-Report, systemd-journal catalog message - and it is trivial to add another report destination.

The Design Doc defines a "Problem Report Interface", but I've failed to find out how to register a new problem daemon with node-problem-detector or how to use the "Problem Report Interface" from a third-party daemon.

@derekwaynecarr
Member

/subscribed

@Random-Liu
Member

Random-Liu commented Oct 24, 2016

@jfilak Cool, we'd really like to integrate with third party daemons!

NPD was introduced in K8s 1.3 for several reasons:

  1. We had an urgent requirement for kernel problem detection. At that time we were suffering from kernel deadlocks, such as the unregister_netdevice kernel race (kubernetes#20096, "Mitigate impact of unregister_netdevice kernel race").
  2. Node problem detection is a necessity, but it is highly environment-dependent. We don't have enough knowledge or bandwidth to build a concrete solution for every environment, so we made it composable and want to open the door to community help and integration with third-party solutions.

> The Design Doc defines "Problem Report Interface" but I've failed to find out how to register a new problem daemon to node-problem-detector or how to use the "Problem Report Interface" from a third party daemon.

In the first version, we architecturally separated out the "problem daemon" and defined the "problem report interface", but the kernel monitor (the first "problem daemon") is still integrated in-process because it was the only daemon at the time.

We plan to support inter-process integration, and now seems to be the time. :)

@dchen1107
/cc @kubernetes/sig-node

@dchen1107
Member

dchen1107 commented Oct 24, 2016

@jfilak Thanks for your interest in integrating a new problem detector with k8s's generic NPD!

By design, NPD should make it easy to plug in or swap in different problem detector containers, and to aggregate and report all problems to the upstream layers and to users. Do you want to give a demo at one of our sig-node meetings?

@jfilak
Author

jfilak commented Oct 25, 2016

> Do you want to give a demo on one of our sig-node meeting?

Yes, I do. Thank you for the offer. However, I need some time to get familiar with Kubernetes and to polish the image. So far I've tested the image only on bare metal with Docker.

@Random-Liu
Member

@jfilak Thanks!
If you are interested, you can join our Slack at http://slack.kubernetes.io/ and find us in the #sig-node channel. :)

@adohe-zz
Contributor

adohe-zz commented Nov 7, 2016

/subscribed

@andyxning
Member

@jfilak I am working on porting the Nagios Plugin Interface to NPD. It uses stdout and the return status code to identify the status of a service or node. IIUC, ABRT cannot use stdout and a return status code to communicate with NPD, right?
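
For readers unfamiliar with the convention, here is a rough Python sketch of the stdout-plus-exit-code contract that Nagios plugins follow (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). It is not NPD code: `run_check` is a hypothetical helper, and the "plugin" in the usage example is just an inline Python one-liner standing in for a real check.

```python
import subprocess
import sys

# Exit-code convention used by Nagios plugins.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv, timeout=10.0):
    """Run a plugin command and return (state, first line of its stdout)."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as err:
        # A plugin that cannot be started or that hangs is treated as UNKNOWN.
        return "UNKNOWN", str(err)
    # Any exit code outside 0..3 is also mapped to UNKNOWN.
    state = NAGIOS_STATES.get(proc.returncode, "UNKNOWN")
    lines = proc.stdout.splitlines()
    message = lines[0] if lines else ""
    return state, message


if __name__ == "__main__":
    # Stand-in plugin: prints one status line and exits with 2 (CRITICAL).
    fake_plugin = [
        sys.executable,
        "-c",
        "import sys; print('DISK CRITICAL - 1% free'); sys.exit(2)",
    ]
    print(run_check(fake_plugin))
```

This model suits short-lived checks that are polled on a schedule; a long-running daemon that detects events asynchronously does not fit it as naturally.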

If so, how about using a localhost HTTP API that listens for reports from third-party daemon checkers?
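
To make the idea concrete, here is a minimal sketch of such a localhost listener in Python, using only the standard library. Everything specific in it is an assumption for illustration: the `/problems` path, the JSON body, the 204 reply, and the names `ReportHandler`, `start_listener` and `post_problem` are hypothetical and are not part of NPD.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # problems reported so far (hypothetical in-memory store)


class ReportHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # "/problems" is a made-up path for this sketch.
        if self.path != "/problems":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            problem = json.loads(self.rfile.read(length))
        except ValueError:
            self.send_error(400, "body must be JSON")
            return
        received.append(problem)
        self.send_response(204)  # accepted, nothing to return
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def start_listener(port=0):
    """Bind to 127.0.0.1 only; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), ReportHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_problem(port, problem):
    """What a third-party daemon would do to report one problem."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        body = json.dumps(problem).encode("utf-8")
        conn.request("POST", "/problems", body=body,
                     headers={"Content-Type": "application/json"})
        return conn.getresponse().status
    finally:
        conn.close()
```

A daemon would call something like `post_problem(port, {"source": "abrt", "reason": "KernelOops"})` whenever it detects a problem. Binding to 127.0.0.1 keeps the endpoint unreachable from other hosts.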

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 8, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

8 participants