Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jobs: add job metrics per-type to track success, failure, and cancel #61482

Merged
merged 1 commit into from
Mar 5, 2021

Conversation

fqazi
Copy link
Collaborator

@fqazi fqazi commented Mar 4, 2021

Fixes: #59711

Previously, there were only over all counters tracking how many
jobs were completed, cancelled, or failed. This was inadequate
because it didn't make it easy to tell in aggregate what job
types they were. To address this, this patch will add counters
for different job types for tracking success, failure, and
cancellation.

Release justification: Low risk change only adding a metric inside
the crdb_internal.feature_usage table
Release note: None

@fqazi fqazi requested review from a team and dt and removed request for a team March 4, 2021 16:56
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@fqazi fqazi force-pushed the addTelemetryForJobs branch 2 times, most recently from b3e3997 to a5df1c8 Compare March 4, 2021 17:17
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer that this instead parallel the metrics as the life-cycle actually moves as opposed to the methods that get called. This will deal with retries and I also think that it's a good idea for us to couple these counter with the other extremely similar counters we already have. See

jm := r.metrics.JobMetrics[jobType]

Reviewed 1 of 6 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt)

@fqazi fqazi force-pushed the addTelemetryForJobs branch from a5df1c8 to c6a3537 Compare March 4, 2021 21:57
@fqazi fqazi requested a review from ajwerner March 4, 2021 21:57
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is what you want either. This is now adding metrics which are not telemetry counters (at least, I'm pretty sure they aren't).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt)

@fqazi fqazi force-pushed the addTelemetryForJobs branch 2 times, most recently from 8f57090 to ff75594 Compare March 5, 2021 06:13
@fqazi fqazi requested a review from ajwerner March 5, 2021 06:13
@fqazi fqazi force-pushed the addTelemetryForJobs branch from ff75594 to ca37efc Compare March 5, 2021 14:14
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @fqazi)


pkg/sql/sqltelemetry/schema.go, line 155 at r2 (raw file):

Quoted 4 lines of code…
func SchemaJobSuccessCounter(jobName string) telemetry.Counter {
	jobName = strings.ToLower(strings.Replace(jobName, " ", "_", -1))
	return telemetry.GetCounter(fmt.Sprintf("sql.schema.job.%s.successful", jobName))
}

It's a slightly pre-mature optimization but I'd prefer if you just statically allocate all of these counters rather than generating these strings and doing it lazily.

I'm also not sure that they need to be in this package. I think you can happily import telemetry in the jobs package and just stick the vars there.

@fqazi
Copy link
Collaborator Author

fqazi commented Mar 5, 2021

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @fqazi)

pkg/sql/sqltelemetry/schema.go, line 155 at r2 (raw file):

Quoted 4 lines of code…
It's a slightly pre-mature optimization but I'd prefer if you just statically allocate all of these counters rather than generating these strings and doing it lazily.

I'm also not sure that they need to be in this package. I think you can happily import telemetry in the jobs package and just stick the vars there.

Good idea, I'll move them into the job package and allocate them once. I originally tried the telemetry package but ran into a circular dependency issue with the protobuf.

@fqazi fqazi force-pushed the addTelemetryForJobs branch from ca37efc to e02c24a Compare March 5, 2021 15:14
@fqazi fqazi requested a review from ajwerner March 5, 2021 15:14
@fqazi fqazi force-pushed the addTelemetryForJobs branch 3 times, most recently from f6f4fbc to a81afb0 Compare March 5, 2021 15:16
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 8 files at r2, 1 of 3 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @fqazi)


pkg/jobs/metrics.go, line 168 at r3 (raw file):

		Successful: telemetry.GetCounterOnce(fmt.Sprintf("sql.schema.job.%s.successful", jobName)),
		Failed:     telemetry.GetCounterOnce(fmt.Sprintf("sql.schema.job.%s.failed", jobName)),
		Canceled:   telemetry.GetCounterOnce(fmt.Sprintf("sql.schema.job.%s.canceled", jobName)),

I don't think that the schema string makes a lot of sense here. Jobs are used in a lot of contexts outside of schema changes. I don't even know if sql makes sense. What if the prefix was just job.%s.<event>

@fqazi
Copy link
Collaborator Author

fqazi commented Mar 5, 2021

Reviewed 1 of 8 files at r2, 1 of 3 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @fqazi)

pkg/jobs/metrics.go, line 168 at r3 (raw file):

		Successful: telemetry.GetCounterOnce(fmt.Sprintf("sql.schema.job.%s.successful", jobName)),
		Failed:     telemetry.GetCounterOnce(fmt.Sprintf("sql.schema.job.%s.failed", jobName)),
		Canceled:   telemetry.GetCounterOnce(fmt.Sprintf("sql.schema.job.%s.canceled", jobName)),

I don't think that the schema string makes a lot of sense here. Jobs are used in a lot of contexts outside of schema changes. I don't even know if sql makes sense. What if the prefix was just job.%s.<event>

I'll simplify it down to that format

@fqazi fqazi force-pushed the addTelemetryForJobs branch from a81afb0 to e882c74 Compare March 5, 2021 15:54
@fqazi fqazi requested a review from ajwerner March 5, 2021 15:55
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fqazi fqazi removed the request for review from dt March 5, 2021 16:11
@fqazi
Copy link
Collaborator Author

fqazi commented Mar 5, 2021

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 5, 2021

Build failed (retrying...):

Fixes: cockroachdb#59711

Previously, there were only over all counters tracking how many
jobs were completed, cancelled, or failed. This was inadequate
because it didn't make it easy to tell in aggregate what job
types they were. To address this, this patch will add counters
for different job types for tracking success, failure, and
cancellation.

Release justification: Low risk change only adding a metric inside
the crdb_internal.feature_usage table
Release note: None
@fqazi fqazi force-pushed the addTelemetryForJobs branch from e882c74 to 1b39a7f Compare March 5, 2021 17:21
@craig
Copy link
Contributor

craig bot commented Mar 5, 2021

Canceled.

@fqazi
Copy link
Collaborator Author

fqazi commented Mar 5, 2021

bors r+

@craig craig bot merged commit 2264789 into cockroachdb:master Mar 5, 2021
@craig
Copy link
Contributor

craig bot commented Mar 5, 2021

Build succeeded:

@vy-ton
Copy link
Contributor

vy-ton commented Mar 5, 2021

@fqazi Can you confirm these are the names of the new telemetry counters?

job.<job type>.[successful | failed | canceled]
<job type> can be: 
backup
restore
schema_change
import
changefeed
create_stats
schema_change_gc
type_schema_change
stream_ingestion
migration

@fqazi
Copy link
Collaborator Author

fqazi commented Mar 5, 2021

They are the following @vy-ton

<job type > is the internal mapping:
backup               
restore              
schema_change        
import               
changefeed           
create_stats         
auto_create_stats    
schema_change_gc     
typedesc_schema_change
stream_ingestion     
new_schema_change    
migration            

@vy-ton
Copy link
Contributor

vy-ton commented Mar 5, 2021

cc @mwang1026, you might be interested in these success, failure, and cancel telemetry counters for all job types

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: Add telemetry counters for succeeded, failed, canceled jobs
4 participants