Prometheus Monitoring for TF Operator #1018
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,13 +15,17 @@ | |
package main | ||
|
||
import ( | ||
"os" | ||
"flag" | ||
"fmt" | ||
"net/http" | ||
|
||
"github.com/onrik/logrus/filename" | ||
log "github.com/sirupsen/logrus" | ||
|
||
"github.com/kubeflow/tf-operator/cmd/tf-operator.v1/app" | ||
"github.com/kubeflow/tf-operator/cmd/tf-operator.v1/app/options" | ||
"github.com/prometheus/client_golang/prometheus/promhttp" | ||
) | ||
|
||
func init() { | ||
|
@@ -31,6 +35,15 @@ func init() { | |
log.AddHook(filenameHook) | ||
} | ||
|
||
func startMonitoring() { | ||
go func() { | ||
monitoringPort := os.Getenv("MONITORING_CLIENT_PORT") //TODO (krishnadurai): remove with static port | ||
Review comment: To be removed with a hardcoded port number or an optional server flag.
Reply: For now, you can keep a server flag with defaults. https://github.com/kubeflow/tf-operator/blob/master/cmd/tf-operator.v1/app/options/options.go#L24
||
log.Infof("Setting up client for monitoring on port: %s", monitoringPort) | ||
http.Handle("/metrics", promhttp.Handler()) | ||
http.ListenAndServe(fmt.Sprintf(":%s", monitoringPort), nil) | ||
}() | ||
} | ||
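The suggestion in the review thread above (a server flag with a default instead of the `MONITORING_CLIENT_PORT` environment variable) could look roughly like the sketch below. The flag name `monitoring-port`, the default `8443`, and the helper `parseMonitoringPort` are assumptions made for illustration and are not taken from this PR. A stub handler stands in for `promhttp.Handler()` so the sketch needs only the standard library.

```go
package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"
	"os"
)

// parseMonitoringPort reads a hypothetical -monitoring-port flag.
// The flag name and the default of 8443 are assumptions, not from the PR.
func parseMonitoringPort(args []string) (int, error) {
	fs := flag.NewFlagSet("tf-operator", flag.ContinueOnError)
	port := fs.Int("monitoring-port", 8443, "port on which /metrics is served")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *port, nil
}

// startMonitoring serves handler on /metrics in a background goroutine
// and logs the error if the listener cannot start or stops.
func startMonitoring(port int, handler http.Handler) {
	mux := http.NewServeMux()
	mux.Handle("/metrics", handler)
	go func() {
		addr := fmt.Sprintf(":%d", port)
		log.Printf("serving metrics on %s", addr)
		if err := http.ListenAndServe(addr, mux); err != nil {
			log.Printf("monitoring server stopped: %v", err)
		}
	}()
}

func main() {
	port, err := parseMonitoringPort(os.Args[1:])
	if err != nil {
		log.Fatalf("invalid flags: %v", err)
	}
	// Stub handler; the PR passes promhttp.Handler() here instead.
	stub := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "# metrics would be served here")
	})
	startMonitoring(port, stub)
	// In the operator, app.Run(s) would block after this point.
}
```

Using a dedicated `ServeMux` keeps the metrics route off `http.DefaultServeMux`, and checking the return value of `ListenAndServe` surfaces a port conflict that the original goroutine would drop silently.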
|
||
func main() { | ||
s := options.NewServerOption() | ||
s.AddFlags(flag.CommandLine) | ||
|
@@ -42,6 +55,8 @@ func main() { | |
log.SetFormatter(&log.JSONFormatter{}) | ||
} | ||
|
||
startMonitoring() | ||
Review comment: Go style not followed.
Reply: I'll correct this.
||
|
||
if err := app.Run(s); err != nil { | ||
log.Fatalf("%v\n", err) | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
# Prometheus Monitoring for TF operator | ||
|
||
## Install Prometheus in your Kubernetes Cluster | ||
Review comment: To be removed. Will add annotation configuration for Prometheus instead, as mentioned here:
||
To install the chart with the release name `my-release`: | ||
|
||
```console | ||
$ helm install --name my-release stable/prometheus-operator | ||
``` | ||
|
||
See the chart's [installation instructions](https://github.com/helm/charts/blob/master/stable/prometheus-operator/README.md#installing-the-chart) for details. | ||
|
||
*Note*: The prometheus-operator [troubleshooting guide](https://github.com/coreos/prometheus-operator/blob/master/Documentation/troubleshooting.md) can help if your setup does not work. | ||
|
||
## Available Metrics | ||
|
||
Currently available metrics to monitor are listed below. | ||
|
||
### Metrics for Each Component Container for TF operator | ||
|
||
Component Containers: | ||
* tf-operator | ||
* tf-master | ||
Review comment: This should be tf-chief, to match the most recent TensorFlow semantics.
Reply: Making the change.
||
* tf-ps | ||
* tf-worker | ||
|
||
#### Each Container Reports on its: | ||
|
||
Run the following example queries in the Prometheus graph UI to visualize the metrics. | ||
|
||
*Note*: These metrics come from the [cAdvisor](https://github.com/google/cadvisor) kubelet integration, which reports to Prometheus through our prometheus-operator installation. The complete list of available metrics is on the `/metrics` page of your Prometheus web UI; you can use it to compose your own queries. | ||
|
||
**CPU usage** | ||
``` | ||
sum (rate (container_cpu_usage_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name) | ||
``` | ||
|
||
**GPU Usage** | ||
|
||
**Memory Usage** | ||
``` | ||
sum (container_memory_usage_bytes{pod_name=~"tfjob-name-.*"}) by (pod_name) | ||
``` | ||
|
||
**Network Usage** | ||
``` | ||
sum (rate (container_network_transmit_bytes_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name) | ||
``` | ||
|
||
**I/O Usage** | ||
``` | ||
sum (rate (container_fs_write_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name) | ||
``` | ||
|
||
**Keep-Alive check** | ||
``` | ||
up | ||
``` | ||
Prometheus maintains the `up` metric itself; see the [documentation](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series) for details. | ||
|
||
**Is Leader check** | ||
``` | ||
tf_operator_is_leader | ||
``` | ||
|
||
*Note*: In the example queries above, replace `tfjob-name` with the name of the TF Job you want to monitor. | ||
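If the queries are generated from code rather than typed into the UI, the substitution described in the note can be sketched as below. The helper name `cpuUsageQuery` is hypothetical, and the sketch assumes the job name contains no regular-expression metacharacters.

```go
package main

import "fmt"

// cpuUsageQuery fills a TF Job name into the CPU usage query from the
// README. Hypothetical helper; assumes jobName has no regex metacharacters.
func cpuUsageQuery(jobName string) string {
	return fmt.Sprintf(
		`sum (rate (container_cpu_usage_seconds_total{pod_name=~"%s-.*"}[1m])) by (pod_name)`,
		jobName,
	)
}

func main() {
	// "mnist" is an example job name, not one from the PR.
	fmt.Println(cpuUsageQuery("mnist"))
}
```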
|
||
### Report TFJob metrics: | ||
|
||
**Job Creation** | ||
|
||
**Job Deletion** | ||
|
||
**Jobs Created per Hour** | ||
|
||
**Successful Job Completions** |
Review comment: Should we just set this metric once from the pod or keep sending it repeatedly at an interval?
Reply: Since a pod can lose the leader status, shouldn't it be periodic?
Reply: The catch here is that even if it is periodic, the value of the gauge is still set to 1. The periodicity might only signify the freshness of the metric, which the `up` metric already provides (the success of scraping the /metrics endpoint). Should this metric only signify a pod's leadership?
How certain are we that `isLeader.Set(0)` is always hit on a crash? If it's not certain, how can we reset this metric to 0 when the pod is not the leader?
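One way to read the question above: the gauge only has to change on leadership transitions, and a crash is already covered because a dead process cannot be scraped, so `up` drops to 0 for that target. A minimal sketch of that idea follows. The callbacks `onStartedLeading` and `onStoppedLeading` and the type `leaderGauge` are hypothetical, and `leaderGauge` is a standard-library stand-in for the Prometheus client gauge, not the PR's implementation.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// leaderGauge is a stdlib stand-in for a Prometheus gauge (assumption:
// the PR uses a client_golang gauge named tf_operator_is_leader).
type leaderGauge struct{ value int32 }

func (g *leaderGauge) Set(v int32) { atomic.StoreInt32(&g.value, v) }
func (g *leaderGauge) Get() int32  { return atomic.LoadInt32(&g.value) }

// ServeHTTP writes the gauge as one line of Prometheus text exposition.
func (g *leaderGauge) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	fmt.Fprintf(w, "tf_operator_is_leader %d\n", g.Get())
}

// Hypothetical leader-election hooks: the gauge changes only on
// transitions, not on a timer.
func onStartedLeading(g *leaderGauge) { g.Set(1) }
func onStoppedLeading(g *leaderGauge) { g.Set(0) }

// scrape returns what a scraper would read from the handler right now.
func scrape(g *leaderGauge) string {
	rec := httptest.NewRecorder()
	g.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Body.String()
}

func main() {
	g := &leaderGauge{}
	onStartedLeading(g)
	fmt.Print(scrape(g))
	onStoppedLeading(g)
	fmt.Print(scrape(g))
}
```

Under this reading, a graceful hand-over reports 0 through the stopped-leading hook, while a crash never reaches `Set(0)` at all; the pod then shows up as a failed scrape rather than as a stale leader, provided queries combine the gauge with `up`.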