Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: introduce cron job metrics #2256

Merged
merged 12 commits into from
Aug 7, 2024
Merged

feat: introduce cron job metrics #2256

merged 12 commits into from
Aug 7, 2024

Conversation

jonathanj-square
Copy link
Contributor

No description provided.

@jonathanj-square jonathanj-square requested review from a team and wesbillman and removed request for a team August 5, 2024 18:47
@ftl-robot ftl-robot mentioned this pull request Aug 5, 2024
@@ -260,20 +270,24 @@ func (s *Service) killOldJobs(ctx context.Context) (time.Duration, error) {
pattern, err := cron.Parse(stale.Schedule)
if err != nil {
logger.Errorf(err, "Could not kill stale cron job %q because schedule could not be parsed: %q", stale.Key, stale.Schedule)
observability.Cron.JobFailed(ctx, stale.Key, stale.DeploymentKey)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of bundling all the failures to kill a job with failures of the actual job itself, could we refactor this to the following?

  • JobRun with attribute .succeeded
  • JobKilled with attribute .succeeded

Copy link
Contributor Author

@jonathanj-square jonathanj-square Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea and would like to take it a little further. Removing the jobFailures counter does result in a loss of visibility that can be avoided if the ftl.status.succeeded attribute can take on a more generalized semantic. Changing from success/fail to "outcome" would help. Outcome is a string representing the canonical set of outcomes for a given domain. So in the metrics where the ftl.status.succeed is used today; outcome would have the values of succeeded or failed.

For cron jobs I would like outcomes covering the following: success, failed to start, execution failure, and killed. This would allow me to remove the jobsKilled metric and segment latency by outcome

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

func (m *CronMetrics) JobExecutionStarted(ctx context.Context, job model.CronJobKey, deployment model.DeploymentKey) {
m.jobsActive.Add(ctx, 1, metric.WithAttributes(
attribute.String(observability.ModuleNameAttribute, job.Payload.Module),
attribute.String(cronJobRefAttribute, job.String()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming - is job.String() the ref to the cron job declaration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the module qualified verb name

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh in that case should we rename the attribute key to match?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines 86 to 88
attribute.String(observability.ModuleNameAttribute, job.Payload.Module),
attribute.String(cronJobRefAttribute, job.String()),
attribute.String(observability.RunnerDeploymentKeyAttribute, deployment.String()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these attributes are identical to the ones above for jobsActive, would you mind deduplicating them into an attrs array? In fact, since they're duplicated across all these functions, could we make a helper function to construct attrs given job, deployment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link
Contributor

@deniseli deniseli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is looking really nice :D

)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about reprising the pattern you came up with ages ago and adding a helper function here that just returns attribute.String(observability.OutcomeStatusNameAttribute, observability.SuccessOrFailureStatus(succeeded))? Since so much of the code that uses this now is basically just duplicating that line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm, applied

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

micro-nit: could we rename SuccessOrFailureStatus to SuccessOrFailureStatusAttr since that's what it returns?


meter := otel.Meter(deploymentMeterName)

counter := fmt.Sprintf("%s.job.failures", cronMeterName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/failures/completed/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😅 thanks for spotting that, updated

func (m *CronMetrics) JobFailedStart(ctx context.Context, job model.CronJob) {
completionAttributes := cronAttributes(job, optional.Some(cronJobFailedStartStatus))

elapsed := time.Since(job.NextExecution)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use timeSinceMS here and then you don't need the additional .Milliseconds() call below. Same below for every time.Since

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, applied

Copy link
Contributor

@deniseli deniseli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Niiiice can't wait to see this in DD :D

)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

micro-nit: could we rename SuccessOrFailureStatus to SuccessOrFailureStatusAttr since that's what it returns?

@jonathanj-square jonathanj-square added this pull request to the merge queue Aug 7, 2024
Merged via the queue into main with commit 005f605 Aug 7, 2024
17 checks passed
@jonathanj-square jonathanj-square deleted the jonathanj/otel/cron branch August 7, 2024 19:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants