
jobs: make the registry logging less chatty #89064

Merged
merged 1 commit on Sep 30, 2022

Conversation

Contributor

@knz commented Sep 30, 2022

The "toggling idleness" log message was the 4th most voluminous log event source in CC, logged 4x more frequently than the first next event source in volume.

This commit makes it less verbose.

Release note: None

@knz requested a review from a team as a code owner September 30, 2022 09:25

@@ -1202,7 +1203,7 @@ func (r *Registry) stepThroughStateMachine(
 ) error {
 	payload := job.Payload()
 	jobType := payload.Type()
-	log.Infof(ctx, "%s job %d: stepping through state %s with error: %+v", jobType, job.ID(), status, jobErr)
+	log.VEventf(ctx, 2, "%s job %d: stepping through state %s with error: %+v", jobType, job.ID(), status, jobErr)
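
For context on what the change does: as I understand the log package, Infof always writes at INFO severity, while VEventf at level 2 only writes to the log when the verbosity setting is at least 2 (it still records the event to an active trace span). A minimal sketch of the practical difference, not the actual implementation:

	// Sketch (assumption, simplified): VEventf(ctx, 2, ...) is roughly
	// equivalent to gating the message on verbosity level 2.
	if log.V(2) {
		log.Infof(ctx, "%s job %d: stepping through state %s with error: %+v",
			jobType, job.ID(), status, jobErr)
	}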
Contributor

Should we keep Info level if err is non-nil?

Contributor Author

I don't know. Can you tell me?
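
For illustration, a minimal sketch of the option being discussed here (hypothetical; not what was merged):

	// Hypothetical: keep INFO when the job carries an error, otherwise
	// gate the routine transition message on verbosity.
	if jobErr != nil {
		log.Infof(ctx, "%s job %d: stepping through state %s with error: %+v",
			jobType, job.ID(), status, jobErr)
	} else {
		log.VEventf(ctx, 2, "%s job %d: stepping through state %s",
			jobType, job.ID(), status)
	}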

Member

Can we keep this one? It is among the highest-utility log lines for finding out whether a cluster ran a big, destabilizing job and, when it did, when it finished, failed, etc. This one feels worth keeping.

Member

We could make it slightly quieter by conditionally skipping it if the job is of type auto create stats, but I hesitate to even do that since we have seen those stats jobs implicated in overload investigations or blocking server drain/shutdown, which was debugged via this line IIRC.

Contributor Author

Is there mileage to be gained from a different behavior when jobErr == nil?

Collaborator

As dt mentions, in terms of debugging this is probably the most useful log message in this package. Aside from just the error, it is kinda the only thing that lets you track where a job was running over time when looking at a long-running job that may have been adopted multiple times.

I'd prefer we keep this one at Info if we can as there is no way to get this information after the fact.

> We could make it slightly quieter by conditionally skipping it if the job is of type auto create stats

If we went this route, perhaps rather than switching on type, we have a resumer method that returns a bool that controls whether that job type gets verbose logging by default. Kinda like we have for ForceRealSpans.
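
A rough sketch of that idea (the interface name, method, and wiring below are made up for illustration; only ForceRealSpans is an existing precedent):

	// Hypothetical optional interface a Resumer could implement to opt its
	// job type into quieter state-transition logging.
	type quietLogger interface {
		DefaultsToQuietLogging() bool
	}

	// Sketch of how the registry might consult it in stepThroughStateMachine.
	quiet := false
	if ql, ok := resumer.(quietLogger); ok {
		quiet = ql.DefaultsToQuietLogging()
	}
	if quiet {
		log.VEventf(ctx, 2, "%s job %d: stepping through state %s with error: %+v",
			jobType, job.ID(), status, jobErr)
	} else {
		log.Infof(ctx, "%s job %d: stepping through state %s with error: %+v",
			jobType, job.ID(), status, jobErr)
	}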

Member

While failing with an error is the most useful, knowing when a job started and resumed is often very useful too, as is knowing when non-erroring jobs succeeded. This comes up when we're asked to investigate "why did this backup take 6h instead of 3min" or "why did all my decommissioning block for 3 hours until 04:37?"

The other log lines here happen at a constant rate (every time we look for expired jobs) or an unbounded number of times per job (switching to and from idle), but this one is mostly a fixed number of lines per job (e.g. create, run, failing, failed), and those are all pretty vital when doing forensics after the fact to figure out what was going on and when.

Contributor

I 100% agree that this is an incredibly useful log line.

@knz requested review from stevendanna and ajwerner September 30, 2022 11:16
Collaborator

@stevendanna left a comment

All but the "stepping through state" log message change seem unobjectionable to me. I've left a few comments on the others, but none of those are blocking.

	const stmt = `DELETE FROM system.jobs WHERE id = ANY($1)`
	var nDeleted int
	if nDeleted, err = r.ex.Exec(
		ctx, "gc-jobs", nil /* txn */, stmt, toDelete,
	); err != nil {
		log.Infof(ctx, "error cleaning up %d jobs: %v", len(toDelete.Array), err)
Collaborator

Should this be Warningf or Errorf?
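
For reference, the Warningf variant being suggested would just change the severity of the same message (a sketch, not part of the change):

		// Hypothetical: report cleanup failures at WARNING instead of INFO.
		log.Warningf(ctx, "error cleaning up %d jobs: %v", len(toDelete.Array), err)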


		return false, 0, errors.Wrap(err, "deleting old jobs")
	}
-	log.Infof(ctx, "cleaned up %d expired job records", nDeleted)
+	log.VEventf(ctx, 2, "cleaned up %d expired job records", nDeleted)
Collaborator

I suppose we could log this at Info if nDeleted > 0. But I'll leave that up to you, I don't think we use these for post-mortem analysis very often.
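
A minimal sketch of that suggestion, reusing the variables from the snippet above (hypothetical, not part of the change):

	// Hypothetical: only surface the cleanup message at INFO when rows were deleted.
	if nDeleted > 0 {
		log.Infof(ctx, "cleaned up %d expired job records", nDeleted)
	} else {
		log.VEventf(ctx, 2, "cleaned up %d expired job records", nDeleted)
	}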

Contributor Author

@knz commented Sep 30, 2022

I've reverted the "stepping through state" change. RFAL.

The "toggling idleness" log message was the 4th most voluminous log
event source in CC, logged 4x more frequently than the first next
event source in volume.

This commit makes it less verbose.

Release note: None
@knz
Copy link
Contributor Author

knz commented Sep 30, 2022

Thanks folks.

bors r=dt,stevendanna

Contributor

craig bot commented Sep 30, 2022

Build succeeded:


blathers-crl bot commented Sep 30, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 0b8409a to blathers/backport-release-21.2-89064: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@knz deleted the 20220930-jobs branch September 30, 2022 20:33