-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: introduce cron job metrics #2256
Conversation
@@ -260,20 +270,24 @@ func (s *Service) killOldJobs(ctx context.Context) (time.Duration, error) { | |||
pattern, err := cron.Parse(stale.Schedule) | |||
if err != nil { | |||
logger.Errorf(err, "Could not kill stale cron job %q because schedule could not be parsed: %q", stale.Key, stale.Schedule) | |||
observability.Cron.JobFailed(ctx, stale.Key, stale.DeploymentKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of bundling all the failures to kill a job with failures of the actual job itself, could we refactor this to the following?
- JobRun with attribute
.succeeded
- JobKilled with attribute
.succeeded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea and would like to take it a little further. Removing the jobFailures
counter does result in a loss of visibility that can be avoided if the ftl.status.succeeded
attribute can take on a more generalized semantic. Changing from success/fail to "outcome" would help. Outcome is a string representing the canonical set of outcomes for a given domain. So in the metrics where the ftl.status.succeed
is used today; outcome would have the values of succeeded
or failed
.
For cron jobs I would like outcomes covering the following: success, failed to start, execution failure, and killed. This would allow me to remove the jobsKilled
metric and segment latency by outcome
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
func (m *CronMetrics) JobExecutionStarted(ctx context.Context, job model.CronJobKey, deployment model.DeploymentKey) { | ||
m.jobsActive.Add(ctx, 1, metric.WithAttributes( | ||
attribute.String(observability.ModuleNameAttribute, job.Payload.Module), | ||
attribute.String(cronJobRefAttribute, job.String()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just confirming - is job.String()
the ref to the cron job declaration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the module qualified verb name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh in that case should we rename the attribute key to match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
attribute.String(observability.ModuleNameAttribute, job.Payload.Module), | ||
attribute.String(cronJobRefAttribute, job.String()), | ||
attribute.String(observability.RunnerDeploymentKeyAttribute, deployment.String()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these attributes are identical to the ones above for jobsActive
, would you mind deduplicating them into an attrs
array? In fact, since they're duplicated across all these functions, could we make a helper function to construct attrs
given job, deployment
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
* moving dup cron metric logic into helper functions * broadening shared succeeded attribute to an outcome status attribute and applying the change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is looking really nice :D
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about reprising the pattern you came up with ages ago and adding a helper function here that just returns attribute.String(observability.OutcomeStatusNameAttribute, observability.SuccessOrFailureStatus(succeeded))
? Since so much of the code that uses this now is basically just duplicating that line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm, applied
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
micro-nit: could we rename SuccessOrFailureStatus
to SuccessOrFailureStatusAttr
since that's what it returns?
|
||
meter := otel.Meter(deploymentMeterName) | ||
|
||
counter := fmt.Sprintf("%s.job.failures", cronMeterName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/failures/completed/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😅 thanks for spotting that, updated
func (m *CronMetrics) JobFailedStart(ctx context.Context, job model.CronJob) { | ||
completionAttributes := cronAttributes(job, optional.Some(cronJobFailedStartStatus)) | ||
|
||
elapsed := time.Since(job.NextExecution) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use timeSinceMS
here and then you don't need the additional .Milliseconds()
call below. Same below for every time.Since
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, applied
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Niiiice can't wait to see this in DD :D
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
micro-nit: could we rename SuccessOrFailureStatus
to SuccessOrFailureStatusAttr
since that's what it returns?
No description provided.