-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
jobs: make the registry logging less chatty #89064
Conversation
pkg/jobs/registry.go
Outdated
@@ -1202,7 +1203,7 @@ func (r *Registry) stepThroughStateMachine( | |||
) error { | |||
payload := job.Payload() | |||
jobType := payload.Type() | |||
log.Infof(ctx, "%s job %d: stepping through state %s with error: %+v", jobType, job.ID(), status, jobErr) | |||
log.VEventf(ctx, 2, "%s job %d: stepping through state %s with error: %+v", jobType, job.ID(), status, jobErr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we keep info level if err non nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. Can you tell me?
pkg/jobs/registry.go
Outdated
@@ -1202,7 +1203,7 @@ func (r *Registry) stepThroughStateMachine( | |||
) error { | |||
payload := job.Payload() | |||
jobType := payload.Type() | |||
log.Infof(ctx, "%s job %d: stepping through state %s with error: %+v", jobType, job.ID(), status, jobErr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep this one? it is among the highest utility log lines related to finding out if a cluster ran a big, destabilizing job and when it did, when it finished, failed, etc. this one feels worth keeping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make it slightly quieter by conditionally skipping it if the job is of type auto create stats, but i hesitate to even do that since we have seen those stats jobs implicated in overload investigations or blocking server drain/shutdown which was debugged via this line iirc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there mileage to be gained with a different behavior when jobErr == nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As dt mentions, in terms of debugging this is probably the most useful log message in this package. Aside from just the error, it is kinda the only thing that lets you track where a job was running over time when looking at a long-running job that may have been adopted multiple times.
I'd prefer we keep this one at Info if we can as there is no way to get this information after the fact.
We could make it slightly quieter by conditionally skipping it if the job is of type auto create stats
If we went this route, perhaps rather than switching on type, we have a resumer method that returns a bool that controls whether that job type gets verbose logging by default. Kinda like we have for ForceRealSpans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While failing with err is the most useful, knowing when it started and resumed are often very useful too, as is knowing when non-err jobs succeeded. This comes up if we're being asked to investigate "why did this backup take 6h instead of 3min" or "why did my all decommissioning block for 3 hours until 0437?"
The other log lines here happen at a constant rate (every time we look for expired jobs) or an unbounded number of times per job (which too and from idle), but this one here is mostly a fixed number of lines per job (eg create, run, failing, failed) and those are all pretty vital when doing forensics after the fact to figure out what was going on when
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I 100% agree that this is an incredibly useful log line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All but the stepping through state
log message changes seem unobjectionable to me. I've left a few comments on the others but none of those are blocking.
pkg/jobs/registry.go
Outdated
const stmt = `DELETE FROM system.jobs WHERE id = ANY($1)` | ||
var nDeleted int | ||
if nDeleted, err = r.ex.Exec( | ||
ctx, "gc-jobs", nil /* txn */, stmt, toDelete, | ||
); err != nil { | ||
log.Infof(ctx, "error cleaning up %d jobs: %v", len(toDelete.Array), err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be Warningf or Errorf?
pkg/jobs/registry.go
Outdated
@@ -1202,7 +1203,7 @@ func (r *Registry) stepThroughStateMachine( | |||
) error { | |||
payload := job.Payload() | |||
jobType := payload.Type() | |||
log.Infof(ctx, "%s job %d: stepping through state %s with error: %+v", jobType, job.ID(), status, jobErr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As dt mentions, in terms of debugging this is probably the most useful log message in this package. Aside from just the error, it is kinda the only thing that lets you track where a job was running over time when looking at a long-running job that may have been adopted multiple times.
I'd prefer we keep this one at Info if we can as there is no way to get this information after the fact.
We could make it slightly quieter by conditionally skipping it if the job is of type auto create stats
If we went this route, perhaps rather than switching on type, we have a resumer method that returns a bool that controls whether that job type gets verbose logging by default. Kinda like we have for ForceRealSpans.
pkg/jobs/registry.go
Outdated
return false, 0, errors.Wrap(err, "deleting old jobs") | ||
} | ||
log.Infof(ctx, "cleaned up %d expired job records", nDeleted) | ||
log.VEventf(ctx, 2, "cleaned up %d expired job records", nDeleted) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we could log this at Info if nDeleted > 0. But I'll leave that up to you, I don't think we use these for post-mortem analysis very often.
I've reverted the "stepping through state" change. RFAL. |
The "toggling idleness" log message was the 4th most voluminous log event source in CC, logged 4x more frequently than the first next event source in volume. This commit makes it less verbose. Release note: None
Thanks folk. bors r=dt,stevendanna |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 0b8409a to blathers/backport-release-21.2-89064: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
The "toggling idleness" log message was the 4th most voluminous log event source in CC, logged 4x more frequently than the first next event source in volume.
This commit makes it less verbose.
Release note: None