-
Notifications
You must be signed in to change notification settings - Fork 716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prometheus Monitoring for TF Operator #1018
Prometheus Monitoring for TF Operator #1018
Conversation
Instructutions for deploying prometheus-operator with Helm Sample queries for getting monitoring metrics for: CPU Memory Network I/O Issue kubeflow#988
Hi @krishnadurai. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/hold wip |
- Is Leader for tf-operator pod - Documents keep-alive check Issue kubeflow#988
@@ -119,6 +128,7 @@ func Run(opt *options.ServerOption) error { | |||
|
|||
// Set leader election start function. | |||
run := func(<-chan struct{}) { | |||
isLeader.Set(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just set this metric once from the pod or keep sending this repeatedly at an interval?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since a pod lose the leader status, shouldn't it be periodic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The catch here is that even if it is periodic, the value for the gauge is still set as 1. The periodicity might only signify freshness of the metric which the metric up
is already providing us with (the success of scraping the /metric endpoint).
Should this metric only signify a pod's leadership?
How certain are we that isLeader.Set(0)
is always hit on crash? If its not certain, how can we reset this metric to 0 when it is not the leader?
cmd/tf-operator.v1/main.go
Outdated
@@ -31,6 +35,15 @@ func init() { | |||
log.AddHook(filenameHook) | |||
} | |||
|
|||
func startMonitoring() { | |||
go func() { | |||
monitoringPort := os.Getenv("MONITORING_CLIENT_PORT") //TODO (krishnadurai): remove with static port |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be removed with a hardcoded port number or an optional server flag.
Please suggest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, you can keep a server flag with defaults. https://github.com/kubeflow/tf-operator/blob/master/cmd/tf-operator.v1/app/options/options.go#L24
docs/monitoring/README.md
Outdated
@@ -0,0 +1,75 @@ | |||
# Prometheus Monitoring for TF operator | |||
|
|||
## Install Prometheus in your Kubernetes Cluster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be removed. Will add annotation configuration for Prometheus instead, as mentioned here:
https://cloud.google.com/solutions/white-box-app-monitoring-for-gke-with-prometheus
Travis tests have failedHey @krishnadurai, 1st Buildgo build -o tf-operator.v1 github.com/kubeflow/tf-operator/cmd/tf-operator.v1
gometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
2nd Buildgo build -o tf-operator.v1 github.com/kubeflow/tf-operator/cmd/tf-operator.v1
gometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
TravisBuddy Request Identifier: 4caaa8c0-816a-11e9-b9f1-63f75d62ea26 |
cmd/tf-operator.v1/main.go
Outdated
@@ -42,6 +55,8 @@ func main() { | |||
log.SetFormatter(&log.JSONFormatter{}) | |||
} | |||
|
|||
startMonitoring() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
go style not followed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll correct this.
/ok-to-test |
/hold |
- Jobs created - Jobs deleted - Jobs successful - Reference to container based GPU metrics through cAdvisor Updates packages for promauto and removes unneccessary ones
TODO:
|
Travis tests have failedHey @krishnadurai, 1st Buildgometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
2nd Buildgometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
TravisBuddy Request Identifier: 7d3f6250-8219-11e9-ace3-d7213f1d5b5a |
Code is now go formatted README modified to remove installation instructions Review comments addressed Issue kubeflow#988
/retest |
/hold cancel Is this PR ready to review? |
@gaocegege it is ready for review |
Travis tests have failedHey @krishnadurai, 1st Buildgometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
2nd Buildgometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
TravisBuddy Request Identifier: 61f10c20-838f-11e9-85b7-8d16a22b8596 |
/assign @richardsliu @gaocegege |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks great.
docs/monitoring/README.md
Outdated
|
||
Component Containers: | ||
* tf-operator | ||
* tf-master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be tf-chief, to match with the most recent TensorFlow semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making the change
tf-master changed to tf-chief Issue kubeflow#988
Welcome, I've addressed your comments. |
/retest |
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ScorpioCPH The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Metrics being targeted to track using Prometheus:
Not ready for review or merge
This change is
Approach:
The metrics -
CPU usage
GPU usage*
Memory usage
Network usage
I/O usage
Can be obtained by cAdvisor stats being captured in Prometheus.
For the metrics -
Is Leader check
Job creation
Job deletion
Jobs created per hour
Successful job completions
We can use go Prometheus instrumentation for reporting these metrics from the operator CRD component.
Implements issue #988