Fix 1340 prometheus counters #1375
Conversation
pulling latest changes from kubeflow/tf-operator to deepak-muley/tf-operator
TODO: 1. Decide whether we should rename the tf_operator-specific counters (is backward compatibility needed?) 2. Update the counters in all other places
TODO: Need to find out whether other files like job.go are needed
Also, to open up the discussion of how the counters should be named (metric name vs. label), here are the current thoughts.

Currently we have the following (option 1, per-framework metric names):
training_operator_mxjobs_deleted_total{job_namespace="ns"}
training_operator_mxjobs_successful_total{job_namespace="ns"}
training_operator_mxjobs_failed_total{job_namespace="ns"}
training_operator_mxjobs_restarted_total{job_namespace="ns"}

The current suggestion (option 2, framework as a label) is as follows:
training_operator_jobs_deleted_total{job_namespace="ns", framework="mxnet"}
training_operator_jobs_successful_total{job_namespace="ns", framework="mxnet"}
training_operator_jobs_failed_total{job_namespace="ns", framework="mxnet"}
training_operator_jobs_restarted_total{job_namespace="ns", framework="mxnet"}

A third option is a single metric with the event type as a label:
training_operator_jobs_total{job_namespace="ns", framework="tensorflow", type="created"}
training_operator_jobs_total{job_namespace="ns", framework="tensorflow", type="deleted"}
training_operator_jobs_total{job_namespace="ns", framework="pytorch", type="created"}
training_operator_jobs_total{job_namespace="ns", framework="pytorch", type="deleted"}
training_operator_jobs_total{job_namespace="ns", framework="xgboost", type="created"}
training_operator_jobs_total{job_namespace="ns", framework="xgboost", type="deleted"}

Any concerns with the third naming convention? |
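For concreteness, here is a minimal client_golang sketch of what option 2 could look like. This is illustrative only: the variable names and the metrics package layout are assumptions, not necessarily what this PR implements.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Option 2: one CounterVec per event type, with the framework carried as a
// label. Under option 1 each framework would instead get its own metric name
// (e.g. training_operator_mxjobs_created_total); under option 3 all of these
// would collapse into a single training_operator_jobs_total CounterVec with
// an extra "type" label.
var (
	jobsCreatedCount = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "training_operator_jobs_created_total",
		Help: "Counts number of jobs created",
	}, []string{"job_namespace", "framework"})

	jobsDeletedCount = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "training_operator_jobs_deleted_total",
		Help: "Counts number of jobs deleted",
	}, []string{"job_namespace", "framework"})

	jobsSuccessfulCount = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "training_operator_jobs_successful_total",
		Help: "Counts number of jobs successful",
	}, []string{"job_namespace", "framework"})

	jobsFailedCount = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "training_operator_jobs_failed_total",
		Help: "Counts number of jobs failed",
	}, []string{"job_namespace", "framework"})

	jobsRestartedCount = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "training_operator_jobs_restarted_total",
		Help: "Counts number of jobs restarted",
	}, []string{"job_namespace", "framework"})
)
```

One practical upside of options 2 and 3 is that a dashboard can aggregate or filter across frameworks with a single query over one metric name, instead of needing a separate query per framework-specific metric.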
Hi @deepak-muley. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
To me, option 2 looks better. With option 1 ("training_operator_mxjobs_created_total"), it might be hard to build dashboards.
ref: https://prometheus.io/docs/instrumenting/writing_exporters/ |
Following are the different counters added. Each one carries job_namespace and framework labels, where framework is one of mxnet, pytorch, tensorflow, or xgboost (see the sketch after this list for where they get incremented):

training_operator_jobs_created_total{job_namespace="ns", framework="<framework>"}
training_operator_jobs_deleted_total{job_namespace="ns", framework="<framework>"}
training_operator_jobs_successful_total{job_namespace="ns", framework="<framework>"}
training_operator_jobs_failed_total{job_namespace="ns", framework="<framework>"}
training_operator_jobs_restarted_total{job_namespace="ns", framework="<framework>"}
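As a rough sketch of how these would be bumped, continuing the declarations sketched earlier (the helper names are hypothetical; the actual call sites live in the per-framework controllers):

```go
// Thin wrappers the per-framework controllers could call at the matching
// lifecycle points. Helper names are illustrative, not necessarily the ones
// introduced in this PR.

// CreatedJobsCounterInc would be called when a new job object is observed.
func CreatedJobsCounterInc(namespace, framework string) {
	jobsCreatedCount.WithLabelValues(namespace, framework).Inc()
}

// DeletedJobsCounterInc would be called when a job object is deleted.
func DeletedJobsCounterInc(namespace, framework string) {
	jobsDeletedCount.WithLabelValues(namespace, framework).Inc()
}

// SuccessfulJobsCounterInc would be called when a job reaches a Succeeded condition.
func SuccessfulJobsCounterInc(namespace, framework string) {
	jobsSuccessfulCount.WithLabelValues(namespace, framework).Inc()
}

// FailedJobsCounterInc would be called when a job reaches a Failed condition.
func FailedJobsCounterInc(namespace, framework string) {
	jobsFailedCount.WithLabelValues(namespace, framework).Inc()
}

// RestartedJobsCounterInc would be called when a job restart is observed.
func RestartedJobsCounterInc(namespace, framework string) {
	jobsRestartedCount.WithLabelValues(namespace, framework).Inc()
}
```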
/test kubeflow-tf-operator-presubmit |
Per https://prometheus.io/docs/prometheus/latest/querying/basics/, the docs don't clearly call out any issue with using more labels. |
Option 2 looks more intuitive to me. How about starting with Option 2 for now? |
Yes, I have currently pushed the changes with option 2. |
/retest |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Jeffwan. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. |
LGTM. Leaving it to @johnugeorge to double-check and approve. |
Can you delete the previous tfjob Prometheus counters (tfJobsSuccessCount, tfJobsFailureCount, tfJobsCreatedCount, tfJobsRestartCount)? Any reason for keeping them? |
They will go away when we remove those files. They are not registered |
Related : #1367 |
Thanks @deepak-muley |
Due to an error in merging, PR #1365 picked up a few changes from master in its history. Hence, to keep the branch clean, I have created this PR. Sorry for the inconvenience.
Ref #1340
@Jeffwan @andreyvelich @johnugeorge
Testing done:
# HELP training_operator_jobs_created_total Counts number of jobs created
# TYPE training_operator_jobs_created_total counter
training_operator_jobs_created_total{framework="tensorflow",job_namespace="test-tf-operator"} 2
# HELP training_operator_jobs_successful_total Counts number of jobs successful
# TYPE training_operator_jobs_successful_total counter
training_operator_jobs_successful_total{framework="tensorflow",job_namespace="test-tf-operator"} 1
Note: somehow I was expecting the deleted-jobs count to go up, but it did not. Needs more debugging. @Jeffwan @andreyvelich
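To narrow down the missing deleted-jobs count, one option is a small unit test around the increment path using client_golang's testutil, continuing the sketch above (DeletedJobsCounterInc and jobsDeletedCount are the hypothetical names from that sketch). If this passes, the counter itself is fine and the question becomes whether the delete handler ever reaches the increment call.

```go
package metrics

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestDeletedJobsCounterInc(t *testing.T) {
	// Read the current value of the child counter for this label pair.
	before := testutil.ToFloat64(jobsDeletedCount.WithLabelValues("test-tf-operator", "tensorflow"))

	// Exercise the helper directly, bypassing the controller's delete handler.
	DeletedJobsCounterInc("test-tf-operator", "tensorflow")

	after := testutil.ToFloat64(jobsDeletedCount.WithLabelValues("test-tf-operator", "tensorflow"))
	if after != before+1 {
		t.Fatalf("expected deleted counter to increase by 1, got %v -> %v", before, after)
	}
}
```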