Fix 1340 prometheus counters #1375
Conversation
pulling latest changes from kubeflow/tf-operator to deepak-muley/tf-operator
TODO: 1. Decide whether we should rename the tf_operator-specific counters (is backward compatibility needed?) 2. Update the counters in all other places
TODO: Need to find out whether other files like job.go are needed
Also, to open up the discussion of how the counters should be named (metric name vs. label), here are the current thoughts.

Currently we have the following (option 1, per-framework metric names):
training_operator_mxjobs_deleted_total{job_namespace="ns"}
training_operator_mxjobs_successful_total{job_namespace="ns"}
training_operator_mxjobs_failed_total{job_namespace="ns"}
training_operator_mxjobs_restarted_total{job_namespace="ns"}

The current suggestion (option 2, framework as a label) is as follows:
training_operator_jobs_deleted_total{job_namespace="ns", framework="mxnet"}
training_operator_jobs_successful_total{job_namespace="ns", framework="mxnet"}
training_operator_jobs_failed_total{job_namespace="ns", framework="mxnet"}
training_operator_jobs_restarted_total{job_namespace="ns", framework="mxnet"}

A third option is a single metric with the event type as a label:
training_operator_jobs_total{job_namespace="ns", framework="tensorflow", type="created"}
training_operator_jobs_total{job_namespace="ns", framework="tensorflow", type="deleted"}
training_operator_jobs_total{job_namespace="ns", framework="pytorch", type="created"}
training_operator_jobs_total{job_namespace="ns", framework="pytorch", type="deleted"}
training_operator_jobs_total{job_namespace="ns", framework="xgboost", type="created"}
training_operator_jobs_total{job_namespace="ns", framework="xgboost", type="deleted"}

Any concerns with the third naming convention? |
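For concreteness, here is a minimal client_golang sketch of what option 2 could look like. This is illustrative only: the variable names and the metrics package layout are assumptions, not necessarily what this PR implements.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Option 2: one CounterVec per event type, with the framework carried as a
// label. Under option 1 each framework would instead get its own metric name
// (e.g. training_operator_mxjobs_created_total); under option 3 all of these
// would collapse into a single training_operator_jobs_total CounterVec with
// an extra "type" label.
var (
	jobsCreatedCount = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "training_operator_jobs_created_total",
		Help: "Counts number of jobs created",
	}, []string{"job_namespace", "framework"})

	jobsDeletedCount = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "training_operator_jobs_deleted_total",
		Help: "Counts number of jobs deleted",
	}, []string{"job_namespace", "framework"})

	jobsSuccessfulCount = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "training_operator_jobs_successful_total",
		Help: "Counts number of jobs successful",
	}, []string{"job_namespace", "framework"})

	jobsFailedCount = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "training_operator_jobs_failed_total",
		Help: "Counts number of jobs failed",
	}, []string{"job_namespace", "framework"})

	jobsRestartedCount = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "training_operator_jobs_restarted_total",
		Help: "Counts number of jobs restarted",
	}, []string{"job_namespace", "framework"})
)
```

One practical upside of options 2 and 3 is that a dashboard can aggregate or filter across frameworks with a single query over one metric name, instead of needing a separate query per framework-specific metric.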
Hi @deepak-muley. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
To me, option 2 looks better. With option 1 ("training_operator_mxjobs_created_total"), it might be hard to build dashboards.
ref: https://prometheus.io/docs/instrumenting/writing_exporters/ |
Following are the different counters added. Each one carries job_namespace and framework labels, where framework is one of mxnet, pytorch, tensorflow, or xgboost (see the sketch after this list for where they get incremented):

training_operator_jobs_created_total{job_namespace="ns", framework="<framework>"}
training_operator_jobs_deleted_total{job_namespace="ns", framework="<framework>"}
training_operator_jobs_successful_total{job_namespace="ns", framework="<framework>"}
training_operator_jobs_failed_total{job_namespace="ns", framework="<framework>"}
training_operator_jobs_restarted_total{job_namespace="ns", framework="<framework>"}
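As a rough sketch of how these would be bumped, continuing the declarations sketched earlier (the helper names are hypothetical; the actual call sites live in the per-framework controllers):

```go
// Thin wrappers the per-framework controllers could call at the matching
// lifecycle points. Helper names are illustrative, not necessarily the ones
// introduced in this PR.

// CreatedJobsCounterInc would be called when a new job object is observed.
func CreatedJobsCounterInc(namespace, framework string) {
	jobsCreatedCount.WithLabelValues(namespace, framework).Inc()
}

// DeletedJobsCounterInc would be called when a job object is deleted.
func DeletedJobsCounterInc(namespace, framework string) {
	jobsDeletedCount.WithLabelValues(namespace, framework).Inc()
}

// SuccessfulJobsCounterInc would be called when a job reaches a Succeeded condition.
func SuccessfulJobsCounterInc(namespace, framework string) {
	jobsSuccessfulCount.WithLabelValues(namespace, framework).Inc()
}

// FailedJobsCounterInc would be called when a job reaches a Failed condition.
func FailedJobsCounterInc(namespace, framework string) {
	jobsFailedCount.WithLabelValues(namespace, framework).Inc()
}

// RestartedJobsCounterInc would be called when a job restart is observed.
func RestartedJobsCounterInc(namespace, framework string) {
	jobsRestartedCount.WithLabelValues(namespace, framework).Inc()
}
```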
/test kubeflow-tf-operator-presubmit |
Per https://prometheus.io/docs/prometheus/latest/querying/basics/, the docs don't clearly call out any issue with using more labels. |
Option 2 looks more intuitive to me. How about starting with Option 2 for now? |
Yes, I have currently pushed the changes with option 2. |
/retest |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Jeffwan. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. |
LGTM. Leaving it to @johnugeorge to double-check and approve. |
Can you delete the previous tfjob Prometheus counters (tfJobsSuccessCount, tfJobsFailureCount, tfJobsCreatedCount, tfJobsRestartCount)? Any reason for keeping them? |
They will go away when we remove those files. They are not registered |
Related : #1367 |
Thanks @deepak-muley |
Due to an error in merging, PR #1365 picked up a few changes from master in its history. Hence, to keep the branch clean, I have created this PR. Sorry for the inconvenience.
Ref #1340
@Jeffwan @andreyvelich @johnugeorge
Testing done:
# HELP training_operator_jobs_created_total Counts number of jobs created
# TYPE training_operator_jobs_created_total counter
training_operator_jobs_created_total{framework="tensorflow",job_namespace="test-tf-operator"} 2
# HELP training_operator_jobs_successful_total Counts number of jobs successful
# TYPE training_operator_jobs_successful_total counter
training_operator_jobs_successful_total{framework="tensorflow",job_namespace="test-tf-operator"} 1
Note: somehow I was expecting the deleted-jobs count to go up, but it did not. Needs more debugging. @Jeffwan @andreyvelich
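To narrow down the missing deleted-jobs count, one option is a small unit test around the increment path using client_golang's testutil, continuing the sketch above (DeletedJobsCounterInc and jobsDeletedCount are the hypothetical names from that sketch). If this passes, the counter itself is fine and the question becomes whether the delete handler ever reaches the increment call.

```go
package metrics

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestDeletedJobsCounterInc(t *testing.T) {
	// Read the current value of the child counter for this label pair.
	before := testutil.ToFloat64(jobsDeletedCount.WithLabelValues("test-tf-operator", "tensorflow"))

	// Exercise the helper directly, bypassing the controller's delete handler.
	DeletedJobsCounterInc("test-tf-operator", "tensorflow")

	after := testutil.ToFloat64(jobsDeletedCount.WithLabelValues("test-tf-operator", "tensorflow"))
	if after != before+1 {
		t.Fatalf("expected deleted counter to increase by 1, got %v -> %v", before, after)
	}
}
```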