Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distsql,jobs: propagate 'job' log tag and pprof label to remote nodes #91630

Merged
merged 2 commits into from
Nov 22, 2022

Conversation

lidorcarmel
Copy link
Contributor

@lidorcarmel lidorcarmel commented Nov 9, 2022

Previously, the job log tags and the recently added job pprof labels were only used locally and not sent to remote nodes. Which means that, for example, pprof did not show the job ids of jobs that consume cpu but were started on another node.

This PR propagates the job tag to the remote nodes, and there it adds that tag to be used when logging. It also adds the same tag as a pprof label on the remote node. With this change we no longer need to set the tags that were added in #77397, therefore those are reverted here.

Note that this PR changes the job log tag from job=<job id> to job=<job type> id=<job id>, to be the same as the pprof label.

See this gist for a before/after example https://gist.github.com/lidorcarmel/2c147b76d814e29682cccbdcadb5d7bb.

And here is pprof graph example with those tags:

pprof_with_job_tags

There is another commit in this PR which is fixing a small bug in the recently added pprofutil package.

Epic: None

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@lidorcarmel lidorcarmel force-pushed the lidor_distributed_pprof_labels branch 6 times, most recently from 3cd7006 to a83a4fd Compare November 12, 2022 04:09
@lidorcarmel lidorcarmel marked this pull request as ready for review November 13, 2022 05:22
@lidorcarmel lidorcarmel requested review from a team as code owners November 13, 2022 05:22
@lidorcarmel lidorcarmel requested review from HonoreDB and removed request for a team November 13, 2022 05:22
@lidorcarmel lidorcarmel force-pushed the lidor_distributed_pprof_labels branch from a83a4fd to 92bf199 Compare November 13, 2022 05:23
@lidorcarmel
Copy link
Contributor Author

cc @tbg FYI.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we split up the first commit into a separate PR to be backported to 22.2? Or do we plan to backport the whole change there?

Reviewed 4 of 4 files at r1, 22 of 23 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @HonoreDB, and @lidorcarmel)


-- commits line 18 at r2:
nit: s/tagand/tag and/.


pkg/sql/distsql_running.go line 410 at r2 (raw file):

		return ""
	}
	setupReq := execinfrapb.SetupFlowRequest{

nit: could you please add the TODO for me here:

	// TODO(yuzefovich): avoid populating some fields of the SetupFlowRequest
	// for local plans.

pkg/sql/execinfra/server_config.go line 304 at r2 (raw file):

	// VerifyContext, when non-nil, is called by the execinfrapb.DistSQLServer
	// when responding to SetupFlow RPCs.
	VerifyContext func(ctx context.Context)

Is it possible to expand SetupFlowCb to do what VerifyContext does (or in the opposite direction)? I think the current single usage of SetupFlowCb would be ok if it was called right before RunFlow.

@lidorcarmel lidorcarmel force-pushed the lidor_distributed_pprof_labels branch from 92bf199 to 389ca86 Compare November 17, 2022 04:47
Copy link
Contributor Author

@lidorcarmel lidorcarmel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for asking! we just discussed it this morning, we probably want to backport everything to 22.2.x. And the pprofutil code touched by the first commit is all new code, so no need to backport to earlier releases.

even if we don't backport, I guess I'm not too worried about backporting the small pprofutil commit because the other 2 usages of SetProfilerLabelsFromCtxTags always add a tag first and should succeed.

but with the current pr, in mixed version the new<->old servers will not propagate log tags, so I'll change this PR to still send the job ID but add it as a tag only if it's not there already. cc @dt

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @HonoreDB, and @yuzefovich)


-- commits line 18 at r2:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/tagand/tag and/.

done


pkg/sql/distsql_running.go line 410 at r2 (raw file):

// TODO(yuzefovich): avoid populating some fields of the SetupFlowRequest
// for local plans.

of course..


pkg/sql/execinfra/server_config.go line 304 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Is it possible to expand SetupFlowCb to do what VerifyContext does (or in the opposite direction)? I think the current single usage of SetupFlowCb would be ok if it was called right before RunFlow.

unfortunately if we call that callback after setting up the flow the test will fail, because we see an inflight flow here:

if n := remoteNode.NumRemoteRunningFlows(); n != 0 {

if there's an easy way around it I can try.
otherwise I can s/VerifyContext/PostSetupFlowCb/

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @HonoreDB, and @lidorcarmel)


pkg/sql/execinfra/server_config.go line 304 at r2 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

unfortunately if we call that callback after setting up the flow the test will fail, because we see an inflight flow here:

if n := remoteNode.NumRemoteRunningFlows(); n != 0 {

if there's an easy way around it I can try.
otherwise I can s/VerifyContext/PostSetupFlowCb/

Hm, I'd like to not introduce another testing knob if we can. I just pushed another commit to your branch that seems to work (I stressed it for a couple of minutes locally). Let me know what you think of that change.

Copy link
Contributor Author

@lidorcarmel lidorcarmel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @HonoreDB, and @yuzefovich)


pkg/sql/execinfra/server_config.go line 304 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Hm, I'd like to not introduce another testing knob if we can. I just pushed another commit to your branch that seems to work (I stressed it for a couple of minutes locally). Let me know what you think of that change.

ah, I see why what I did didn't work, I didn't do the cleanup. your commit lgtm, thanks!

@lidorcarmel lidorcarmel force-pushed the lidor_distributed_pprof_labels branch 2 times, most recently from b62a0e3 to 41c872a Compare November 18, 2022 23:04
Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: it might good to get a sign off from someone else too.

Reviewed 4 of 5 files at r4, 16 of 16 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @HonoreDB)

@@ -37,8 +37,9 @@ import "cloud/cloudpb/external_storage.proto";
// descriptor in the database, and doesn't emit any rows nor support
// any post-processing.
message BackfillerSpec {
// TODO(lidor): this is not needed in 23.2, the new way to send the job tag is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: rather than "in 23.2", I'd say "when interoperability with 22.2 is dropped" or something, which may or may not be 23.2 but we'd be trying to predict the future if we specify which version will drop backwards compat with 22.2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

@lidorcarmel lidorcarmel force-pushed the lidor_distributed_pprof_labels branch from 41c872a to 5fe124c Compare November 21, 2022 18:11
If we don't have log tags in the context we return invalid context and undo
func and then the caller might crash.

Return an empty func and the same context.

Add a couple of basic tests.

Epic: None

Release note: None
@lidorcarmel lidorcarmel force-pushed the lidor_distributed_pprof_labels branch from 5fe124c to 676afd0 Compare November 21, 2022 18:20
Previously, the recently added job log tag and pprof label
were only used locally and not sent to remote nodes. Which means that,
for example, pprof did not show the job ids of jobs that consume cpu
but were started on another node.

This PR propagates the job tag to the remote nodes, and there it
adds that tag to be used when logging. It also adds the same tag as
a pprof label on the remote node. With this change we no longer need
to set the tags that were added in cockroachdb#77397, but we cannot remove those
yet because we want the old job tags to still be sent and applied
during upgrades. Once 23.1 is out, we can remove the old implementation
from 23.2.

Note that this PR changes the job log tag from `job=<job id>` to
`job=<job type> id=<job id>`, to be the same as the pprof label.

Epic: None

Release note: None
@lidorcarmel lidorcarmel force-pushed the lidor_distributed_pprof_labels branch from 676afd0 to 1a7ec75 Compare November 21, 2022 22:51
@lidorcarmel
Copy link
Contributor Author

@yuzefovich PTAL.
TestSetupFlowAfterDrain failed because we double Cleanup(), so I reverted the small change in RunFlow() so that it will do the cleanup as it did before (only if the error is stop.ErrUnavailable). And I added a Cleanup when the SetupFlowCb is set.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, makes sense, LGTM

Reviewed 16 of 16 files at r6, 11 of 11 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru and @HonoreDB)

@lidorcarmel
Copy link
Contributor Author

Thanks!
bors r+

@craig
Copy link
Contributor

craig bot commented Nov 22, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Nov 22, 2022

Build succeeded:

@craig craig bot merged commit 59de8e8 into cockroachdb:master Nov 22, 2022
lidorcarmel added a commit to lidorcarmel/cockroach that referenced this pull request Nov 29, 2022
A recently added test (PR cockroachdb#91630) fails in 22.2 in stress or race.

The test is running a backup on a 3 node cluster, and then verifies a remote
flow was running with the right pprof labels and log tags. The problem is
that if all flows are local we fail, which can happen with a 3 node cluster.

Instead, this fix is changing the test cluster to have 4 nodes which will
force at lease one flow to run remotely.

Release note: None

Epic: None
lidorcarmel added a commit to lidorcarmel/cockroach that referenced this pull request Dec 1, 2022
A recently added test (PR cockroachdb#91630) fails in 22.2 in stress or race.

The test is running a backup on a 3 node cluster, and then verifies a remote
flow was running with the right pprof labels and log tags. The problem is
that if all flows are local we fail, which can happen with a 3 node cluster.

Instead, this fix is changing the test cluster to have 4 nodes which will
force at lease one flow to run remotely.

Release note: None

Epic: None
craig bot pushed a commit that referenced this pull request Dec 2, 2022
92696: backupccl: fix a recently added flaky test r=lidorcarmel a=lidorcarmel

A recently added test (PR #91630) fails in 22.2 in stress or race.

The test is running a backup on a 3 node cluster, and then verifies a remote flow was running with the right pprof labels and log tags. The problem is that if all flows are local we fail, which can happen with a 3 node cluster.

Instead, this fix is changing the test cluster to have 4 nodes which will force at lease one flow to run remotely.

Release note: None

Epic: None

Co-authored-by: Lidor Carmel <[email protected]>
lidorcarmel added a commit to lidorcarmel/cockroach that referenced this pull request Dec 2, 2022
A recently added test (PR cockroachdb#91630) fails in 22.2 in stress or race.

The test is running a backup on a 3 node cluster, and then verifies a remote
flow was running with the right pprof labels and log tags. The problem is
that if all flows are local we fail, which can happen with a 3 node cluster.

Instead, this fix is changing the test cluster to have 4 nodes which will
force at lease one flow to run remotely.

Release note: None

Epic: None
@lidorcarmel lidorcarmel deleted the lidor_distributed_pprof_labels branch January 24, 2023 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants