Prometheus Monitoring for TF Operator #1018

61 changes: 61 additions & 0 deletions Gopkg.lock


11 changes: 11 additions & 0 deletions cmd/tf-operator.v1/app/server.go
@@ -39,6 +39,8 @@ import (
election "k8s.io/client-go/tools/leaderelection"
"k8s.io/client-go/tools/leaderelection/resourcelock"
"k8s.io/client-go/tools/record"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
)

const (
@@ -55,6 +57,13 @@ var (

const RecommendedKubeConfigPathEnv = "KUBECONFIG"

var (
isLeader = promauto.NewGauge(prometheus.GaugeOpts{
Name: "tf_operator_is_leader",
Help: "Is this client the leader of this tf-operator client set?",
})
)

func Run(opt *options.ServerOption) error {
// Check if the -version flag was passed and, if so, print the version and exit.
if opt.PrintVersion {
@@ -119,6 +128,7 @@ func Run(opt *options.ServerOption) error {

// Set leader election start function.
run := func(<-chan struct{}) {
isLeader.Set(1)
Contributor Author

Should we just set this metric once from the pod or keep sending this repeatedly at an interval?

Member

Since a pod can lose its leader status, shouldn't this be set periodically?

Contributor Author

The catch is that even if it is set periodically, the gauge value is still 1. Periodic updates would only signal freshness of the metric, which the `up` metric already provides (it reflects whether scraping the /metrics endpoint succeeded).
Should this metric only signify a pod's leadership?
How certain are we that isLeader.Set(0) is always reached on a crash? If it is not certain, how can we reset this metric to 0 when the pod is no longer the leader?
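
One way to reason about the crash question, as a stdlib-only sketch rather than this PR's implementation: a gauge holds its last value in process memory, starts at 0 when the process starts, and is only reported while the process answers scrapes. The `leaderGauge` type and its `Expose` method below are hypothetical stand-ins for the promauto gauge and the /metrics handler.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// leaderGauge is a hypothetical stand-in for the promauto gauge in this PR.
// It is not part of client_golang.
type leaderGauge struct {
	v int64 // zero value means "not leader", like a freshly registered gauge
}

// Set records whether this process currently holds the leader lock.
func (g *leaderGauge) Set(leader bool) {
	if leader {
		atomic.StoreInt64(&g.v, 1)
		return
	}
	atomic.StoreInt64(&g.v, 0)
}

// Expose renders the current value in the Prometheus text exposition format.
func (g *leaderGauge) Expose() string {
	return fmt.Sprintf("# TYPE tf_operator_is_leader gauge\ntf_operator_is_leader %d\n",
		atomic.LoadInt64(&g.v))
}

func main() {
	var g leaderGauge
	fmt.Print(g.Expose()) // freshly started process, before any election
	g.Set(true)           // corresponds to OnStartedLeading
	fmt.Print(g.Expose())
	g.Set(false) // corresponds to OnStoppedLeading
	fmt.Print(g.Expose())
}
```

Under this model a crashed pod stops reporting the series entirely (and `up` drops to 0 for that target), while its replacement starts at 0 until it is elected. Because `log.Fatalf` exits the process right after `isLeader.Set(0)`, that final 0 is unlikely to be scraped.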

if err := tc.Run(opt.Threadiness, stopCh); err != nil {
log.Errorf("Failed to run the controller: %v", err)
}
@@ -157,6 +167,7 @@ func Run(opt *options.ServerOption) error {
Callbacks: election.LeaderCallbacks{
OnStartedLeading: run,
OnStoppedLeading: func() {
isLeader.Set(0)
log.Fatalf("leader election lost")
},
},
15 changes: 15 additions & 0 deletions cmd/tf-operator.v1/main.go
@@ -15,13 +15,17 @@
package main

import (
"flag"
"fmt"
"net/http"
"os"

"github.com/onrik/logrus/filename"
log "github.com/sirupsen/logrus"

"github.com/kubeflow/tf-operator/cmd/tf-operator.v1/app"
"github.com/kubeflow/tf-operator/cmd/tf-operator.v1/app/options"
"github.com/prometheus/client_golang/prometheus/promhttp"
)

func init() {
@@ -31,6 +35,15 @@ func init() {
log.AddHook(filenameHook)
}

func startMonitoring() {
go func() {
monitoringPort := os.Getenv("MONITORING_CLIENT_PORT") //TODO (krishnadurai): remove with static port
Contributor Author

This environment variable should be replaced with either a hardcoded port number or an optional server flag.
Suggestions welcome.
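
For the optional-flag route, a minimal sketch using only the standard library `flag` package. The flag name `-monitoring-port`, the default of 8443, and the helper `parseMonitoringAddr` are hypothetical and not taken from this PR.

```go
package main

import (
	"flag"
	"fmt"
)

// defaultMonitoringPort is a hypothetical default; this PR does not fix a value.
const defaultMonitoringPort = 8443

// parseMonitoringAddr parses a hypothetical -monitoring-port flag from args
// and returns a listen address such as ":8443".
func parseMonitoringAddr(args []string) (string, error) {
	fs := flag.NewFlagSet("tf-operator", flag.ContinueOnError)
	port := fs.Int("monitoring-port", defaultMonitoringPort, "port on which /metrics is served")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	// Reject values that cannot be a TCP port before trying to listen.
	if *port < 1 || *port > 65535 {
		return "", fmt.Errorf("monitoring port %d out of range", *port)
	}
	return fmt.Sprintf(":%d", *port), nil
}

func main() {
	addr, err := parseMonitoringAddr([]string{"-monitoring-port=9090"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("metrics would be served on", addr)
}
```

A flag keeps the port visible in the operator's `--help` output and gives it a default, which the environment variable does not.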

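log.Infof("Setting up client for monitoring on port: %s", monitoringPort)
http.Handle("/metrics", promhttp.Handler())
// ListenAndServe only returns on failure, so log the error instead of dropping it.
if err := http.ListenAndServe(fmt.Sprintf(":%s", monitoringPort), nil); err != nil {
log.Errorf("Monitoring endpoint failed: %v", err)
}
}()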
}
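
A sketch of the serving side, assuming a dedicated `http.ServeMux` instead of the default mux so that only `/metrics` is reachable on the monitoring port. `metricsHandler` is a hypothetical stand-in for `promhttp.Handler()` so the example runs with the standard library alone; `newMonitoringMux` and `scrape` are likewise illustrative names.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// metricsHandler is a hypothetical stand-in for promhttp.Handler().
// It serves a single fixed sample in the Prometheus text format.
func metricsHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		io.WriteString(w, "tf_operator_is_leader 0\n")
	})
}

// newMonitoringMux registers only /metrics, so handlers added elsewhere to
// http.DefaultServeMux are not exposed on the monitoring port.
func newMonitoringMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/metrics", metricsHandler())
	return mux
}

// scrape performs one in-memory GET against the mux, as a scraper would.
func scrape(path string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	newMonitoringMux().ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := scrape("/metrics")
	fmt.Println(code)
	fmt.Print(body)
}
```

In a real server the mux would be passed to `http.ListenAndServe(addr, newMonitoringMux())`, and the returned error logged.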

func main() {
s := options.NewServerOption()
s.AddFlags(flag.CommandLine)
@@ -42,6 +55,8 @@ func main() {
log.SetFormatter(&log.JSONFormatter{})
}

startMonitoring()
Member

This does not follow Go style.

Contributor Author

I'll correct this.


if err := app.Run(s); err != nil {
log.Fatalf("%v\n", err)
}
75 changes: 75 additions & 0 deletions docs/monitoring/README.md
@@ -0,0 +1,75 @@
# Prometheus Monitoring for TF operator

## Install Prometheus in your Kubernetes Cluster
Contributor Author

To be removed. I will add annotation-based configuration for Prometheus instead, as described here:
https://cloud.google.com/solutions/white-box-app-monitoring-for-gke-with-prometheus

To install the chart with the release name `my-release`:

```console
$ helm install --name my-release stable/prometheus-operator
```

See the [chart README](https://github.com/helm/charts/blob/master/stable/prometheus-operator/README.md#installing-the-chart) for detailed installation instructions.

*Note*: The prometheus-operator [troubleshooting guide](https://github.com/coreos/prometheus-operator/blob/master/Documentation/troubleshooting.md) can help if your setup does not work.

## Available Metrics

Currently available metrics to monitor are listed below.

### Metrics for Each Component Container for TF operator

Component Containers:
* tf-operator
* tf-master
Contributor

This should be tf-chief, to match the most recent TensorFlow semantics.

Contributor Author

Making the change.

* tf-ps
* tf-worker

#### Each Container Reports on its:

Use the Prometheus graph UI to run the following example queries and visualize the metrics.

*Note*: These metrics come from the [cAdvisor](https://github.com/google/cadvisor) kubelet integration, which reports to Prometheus through our prometheus-operator installation. A complete list of available metrics is shown on the `/metrics` page of your Prometheus web UI; you can use it to compose your own queries.

**CPU usage**
```
sum (rate (container_cpu_usage_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
```

**GPU Usage**

**Memory Usage**
```
sum (container_memory_usage_bytes{pod_name=~"tfjob-name-.*"}) by (pod_name)
```

**Network Usage**
```
sum (rate (container_network_transmit_bytes_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
```

**I/O Usage**
```
sum (rate (container_fs_write_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)
```

**Keep-Alive check**
```
up
```
Prometheus maintains the `up` metric itself for every scrape target; it is described in the documentation [here](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series).

**Is Leader check**
```
tf_operator_is_leader
```

*Note*: In the example queries above, replace `tfjob-name` with the name of the TFJob you want to monitor.
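
The substitution can also be done programmatically. The sketch below builds the CPU usage query for a given job name; the helper `cpuQuery` is a hypothetical name, and the `pod_name` label is taken from the queries in this README.

```go
package main

import (
	"fmt"
	"regexp"
)

// cpuQuery returns the CPU usage query from this README for one TFJob.
// The function name is illustrative and not part of tf-operator.
func cpuQuery(jobName string) string {
	// QuoteMeta escapes regex metacharacters so the job name matches literally.
	selector := regexp.QuoteMeta(jobName) + "-.*"
	// %q emits a double-quoted string using Go escaping rules.
	return fmt.Sprintf(
		`sum (rate (container_cpu_usage_seconds_total{pod_name=~%q}[1m])) by (pod_name)`,
		selector,
	)
}

func main() {
	fmt.Println(cpuQuery("mnist"))
}
```

Escaping the job name matters mainly for names that contain dots, which would otherwise match any character in the pod name regex.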

### Report TFJob metrics:

**Job Creation**

**Job Deletion**

**Jobs Created per Hour**

**Successful Job Completions**
20 changes: 20 additions & 0 deletions vendor/github.com/beorn7/perks/LICENSE
