-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
jobs: add job metrics per-type to track success, failure, and cancel #61482
Conversation
b3e3997
to
a5df1c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer that this instead parallel the metrics as the life-cycle actually moves as opposed to the methods that get called. This will deal with retries and I also think that it's a good idea for us to couple these counter with the other extremely similar counters we already have. See
cockroach/pkg/jobs/registry.go
Line 1160 in c0d8340
jm := r.metrics.JobMetrics[jobType] |
Reviewed 1 of 6 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt)
a5df1c8
to
c6a3537
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is what you want either. This is now adding metrics which are not telemetry counters (at least, I'm pretty sure they aren't).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt)
8f57090
to
ff75594
Compare
ff75594
to
ca37efc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @fqazi)
pkg/sql/sqltelemetry/schema.go, line 155 at r2 (raw file):
Quoted 4 lines of code…
func SchemaJobSuccessCounter(jobName string) telemetry.Counter { jobName = strings.ToLower(strings.Replace(jobName, " ", "_", -1)) return telemetry.GetCounter(fmt.Sprintf("sql.schema.job.%s.successful", jobName)) }
It's a slightly pre-mature optimization but I'd prefer if you just statically allocate all of these counters rather than generating these strings and doing it lazily.
I'm also not sure that they need to be in this package. I think you can happily import telemetry
in the jobs
package and just stick the vars there.
Good idea, I'll move them into the job package and allocate them once. I originally tried the telemetry package but ran into a circular dependency issue with the protobuf. |
ca37efc
to
e02c24a
Compare
f6f4fbc
to
a81afb0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 8 files at r2, 1 of 3 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @fqazi)
pkg/jobs/metrics.go, line 168 at r3 (raw file):
Successful: telemetry.GetCounterOnce(fmt.Sprintf("sql.schema.job.%s.successful", jobName)), Failed: telemetry.GetCounterOnce(fmt.Sprintf("sql.schema.job.%s.failed", jobName)), Canceled: telemetry.GetCounterOnce(fmt.Sprintf("sql.schema.job.%s.canceled", jobName)),
I don't think that the schema
string makes a lot of sense here. Jobs are used in a lot of contexts outside of schema changes. I don't even know if sql
makes sense. What if the prefix was just job.%s.<event>
I'll simplify it down to that format |
a81afb0
to
e882c74
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
bors r+ |
Build failed (retrying...): |
Fixes: cockroachdb#59711 Previously, there were only over all counters tracking how many jobs were completed, cancelled, or failed. This was inadequate because it didn't make it easy to tell in aggregate what job types they were. To address this, this patch will add counters for different job types for tracking success, failure, and cancellation. Release justification: Low risk change only adding a metric inside the crdb_internal.feature_usage table Release note: None
e882c74
to
1b39a7f
Compare
Canceled. |
bors r+ |
Build succeeded: |
@fqazi Can you confirm these are the names of the new telemetry counters?
|
They are the following @vy-ton
|
cc @mwang1026, you might be interested in these success, failure, and cancel telemetry counters for all job types |
Fixes: #59711
Previously, there were only over all counters tracking how many
jobs were completed, cancelled, or failed. This was inadequate
because it didn't make it easy to tell in aggregate what job
types they were. To address this, this patch will add counters
for different job types for tracking success, failure, and
cancellation.
Release justification: Low risk change only adding a metric inside
the crdb_internal.feature_usage table
Release note: None