jobs: make the registry logging less chatty #89064

knz · 2022-09-30T09:25:36Z

The "toggling idleness" log message was the 4th most voluminous log event source in CC, logged 4x more frequently than the next most voluminous event source.

This commit makes it less verbose.

Release note: None

miretskiy · 2022-09-30T11:07:18Z

pkg/jobs/registry.go

@@ -1202,7 +1203,7 @@ func (r *Registry) stepThroughStateMachine(
 ) error {
 	payload := job.Payload()
 	jobType := payload.Type()
-	log.Infof(ctx, "%s job %d: stepping through state %s with error: %+v", jobType, job.ID(), status, jobErr)
+	log.VEventf(ctx, 2, "%s job %d: stepping through state %s with error: %+v", jobType, job.ID(), status, jobErr)


Should we keep Info level if the error is non-nil?

I don't know. Can you tell me?

dt · 2022-09-30T10:16:28Z

pkg/jobs/registry.go

@@ -1202,7 +1203,7 @@ func (r *Registry) stepThroughStateMachine(
 ) error {
 	payload := job.Payload()
 	jobType := payload.Type()
-	log.Infof(ctx, "%s job %d: stepping through state %s with error: %+v", jobType, job.ID(), status, jobErr)


Can we keep this one? It is among the highest-utility log lines for finding out whether a cluster ran a big, destabilizing job and, if so, when it ran, finished, failed, etc. This one feels worth keeping.

We could make it slightly quieter by conditionally skipping it if the job is of type auto create stats, but I hesitate to even do that, since we have seen those stats jobs implicated in overload investigations or in blocking server drain/shutdown, which was debugged via this line, IIRC.

Is there mileage to be gained with a different behavior when jobErr == nil?

While failing with an error is the most useful, knowing when a job started and resumed is often very useful too, as is knowing when non-error jobs succeeded. This comes up if we're asked to investigate "why did this backup take 6h instead of 3min" or "why did all my decommissioning block for 3 hours until 0437?"

The other log lines here happen at a constant rate (every time we look for expired jobs) or an unbounded number of times per job (switching to and from idle), but this one is mostly a fixed number of lines per job (e.g. create, run, failing, failed), and those are all pretty vital when doing forensics after the fact to figure out what was going on when.

I 100% agree that this is an incredibly useful log line.

stevendanna

All but the stepping through state log message changes seem unobjectionable to me. I've left a few comments on the others but none of those are blocking.

stevendanna · 2022-09-30T11:33:52Z

pkg/jobs/registry.go

 		const stmt = `DELETE FROM system.jobs WHERE id = ANY($1)`
 		var nDeleted int
 		if nDeleted, err = r.ex.Exec(
 			ctx, "gc-jobs", nil /* txn */, stmt, toDelete,
 		); err != nil {
+			log.Infof(ctx, "error cleaning up %d jobs: %v", len(toDelete.Array), err)


Should this be Warningf or Errorf?

stevendanna · 2022-09-30T11:39:51Z

pkg/jobs/registry.go

@@ -1202,7 +1203,7 @@ func (r *Registry) stepThroughStateMachine(
 ) error {
 	payload := job.Payload()
 	jobType := payload.Type()
-	log.Infof(ctx, "%s job %d: stepping through state %s with error: %+v", jobType, job.ID(), status, jobErr)


As dt mentions, in terms of debugging this is probably the most useful log message in this package. Aside from just the error, it is kinda the only thing that lets you track where a job was running over time when looking at a long-running job that may have been adopted multiple times.

I'd prefer we keep this one at Info if we can as there is no way to get this information after the fact.

> We could make it slightly quieter by conditionally skipping it if the job is of type auto create stats

If we went this route, perhaps rather than switching on type, we have a resumer method that returns a bool that controls whether that job type gets verbose logging by default. Kinda like we have for ForceRealSpans.
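
A minimal sketch of that idea, assuming the opt-out is expressed as an optional interface checked with a type assertion. The names `Resumer`, `quietStateLogging`, and `logStateAtInfo` are hypothetical and simplified for illustration; whether this mirrors how ForceRealSpans is wired up is also an assumption.

```go
package main

import "fmt"

// Resumer is a pared-down, hypothetical stand-in for a job resumer.
type Resumer interface {
	Resume() error
}

// quietStateLogging is a hypothetical optional interface. A resumer that
// implements it and returns true asks for its state transitions to be
// logged only at raised verbosity.
type quietStateLogging interface {
	QuietStateLogging() bool
}

// logStateAtInfo reports whether state transitions for r should be logged
// at Info. Resumers that do not implement the optional interface keep the
// default (Info).
func logStateAtInfo(r Resumer) bool {
	if q, ok := r.(quietStateLogging); ok {
		return !q.QuietStateLogging()
	}
	return true
}

// backupResumer does not opt out, so it keeps Info-level logging.
type backupResumer struct{}

func (backupResumer) Resume() error { return nil }

// autoStatsResumer opts into quieter logging.
type autoStatsResumer struct{}

func (autoStatsResumer) Resume() error           { return nil }
func (autoStatsResumer) QuietStateLogging() bool { return true }

func main() {
	fmt.Println("backup at Info:", logStateAtInfo(backupResumer{}))
	fmt.Println("auto stats at Info:", logStateAtInfo(autoStatsResumer{}))
}
```

Keeping the decision on the resumer means the registry does not need a growing switch on job type, and new job types get Info-level logging unless they opt out.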

stevendanna · 2022-09-30T11:42:04Z

pkg/jobs/registry.go

 			return false, 0, errors.Wrap(err, "deleting old jobs")
 		}
-		log.Infof(ctx, "cleaned up %d expired job records", nDeleted)
+		log.VEventf(ctx, 2, "cleaned up %d expired job records", nDeleted)


I suppose we could log this at Info if nDeleted > 0. But I'll leave that up to you, I don't think we use these for post-mortem analysis very often.
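
As a rough illustration of that suggestion, the level could depend on whether anything was actually deleted. The helper `gcLogAtInfo` below is hypothetical, and the printed level tags only stand in for real log calls.

```go
package main

import "fmt"

// gcLogAtInfo reports whether the "cleaned up" message should be emitted at
// Info: only when at least one expired job record was deleted.
func gcLogAtInfo(nDeleted int) bool {
	return nDeleted > 0
}

func main() {
	for _, n := range []int{0, 3} {
		level := "verbose(2)"
		if gcLogAtInfo(n) {
			level = "info"
		}
		fmt.Printf("[%s] cleaned up %d expired job records\n", level, n)
	}
}
```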

knz · 2022-09-30T12:27:53Z

I've reverted the "stepping through state" change. RFAL.

knz · 2022-09-30T15:29:48Z

Thanks folks.

bors r=dt,stevendanna

craig · 2022-09-30T17:32:55Z

Build succeeded:

Bazel Essential CI (Cockroach)

blathers-crl · 2022-09-30T17:33:25Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from 0b8409a to blathers/backport-release-21.2-89064: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

knz added backport-21.2.x labels Sep 30, 2022

knz requested a review from a team as a code owner September 30, 2022 09:25

miretskiy reviewed Sep 30, 2022

miretskiy approved these changes Sep 30, 2022

dt reviewed Sep 30, 2022

knz requested review from stevendanna and ajwerner September 30, 2022 11:16

stevendanna reviewed Sep 30, 2022

knz force-pushed the 20220930-jobs branch from e5c73b8 to 1ba9922 Compare September 30, 2022 12:27

jobs: make the registry logging less chatty

0b8409a

knz force-pushed the 20220930-jobs branch from 1ba9922 to 0b8409a Compare September 30, 2022 12:28

stevendanna approved these changes Sep 30, 2022

dt approved these changes Sep 30, 2022

craig bot merged commit aaca5ce into cockroachdb:master Sep 30, 2022

This was referenced Sep 30, 2022

release-22.1: jobs: make the registry logging less chatty #89104

Closed

release-22.2: jobs: make the registry logging less chatty #89105

Closed

knz deleted the 20220930-jobs branch September 30, 2022 20:33
