
backup: use locality filter to restrict processes to locality #99089

Merged
merged 2 commits into cockroachdb:master from backup-in-loc on Mar 23, 2023

Conversation

@dt dt (Member) commented Mar 21, 2023

This extends the previous locality filter option, which pinned the backup job to a specific
locality, to also pin the processors that back up the actual row data, allowing a backup job
to be assigned to only the instances running in a particular locality.

Those instances will attempt to read from the closest replica they can. A potential
optimization, however, would be to pick the instance closest to a replica that also matches
the locality filter, if there is one: currently the planner will likely just pick the
filter-matching instance closest to the leaseholder, even though there may be an instance
closer to a usable replica that is not the leaseholder. To make use of such replicas, a
replica oracle that ranks replicas matching the filter would need to be passed when
locality-filtered placement is in use; this optimization is left to a follow-up.

Release note (sql change): The nodes involved in the execution of backups, including processing of row data and job coordination, can now be controlled using an execution-locality filter.
Epic: CRDB-9547.
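For illustration, here is a minimal sketch of how a client might invoke the feature this PR completes, assuming the option is spelled EXECUTION LOCALITY (the PR text only calls it an "execution-locality filter"); the connection string, database name, backup URI, and locality tier are all hypothetical:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical connection string; adjust host/port/user for your cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Restrict both the coordinating job and the row-data processors to
	// instances whose locality matches region=us-east1. The URI and tier
	// values here are examples only.
	if _, err := db.Exec(
		`BACKUP DATABASE defaultdb INTO 'nodelocal://1/backup' WITH EXECUTION LOCALITY = 'region=us-east1'`,
	); err != nil {
		log.Fatal(err)
	}
}
```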

@dt dt requested review from benbardin and msbutler March 21, 2023 04:25
@dt dt requested review from a team as code owners March 21, 2023 04:25
@dt dt requested a review from mgartner March 21, 2023 04:25
@cockroach-teamcity (Member) commented:

This change is Reviewable

Review threads (resolved):
pkg/sql/distsql_physical_planner.go (outdated)
pkg/sql/distsql_physical_planner.go
pkg/sql/distsql_physical_planner.go
pkg/sql/distsql_plan_bulk.go
pkg/ccl/backupccl/backup_job.go (outdated)
@benbardin (Collaborator) commented:
Approved for backupccl

@yuzefovich yuzefovich (Member) left a comment

Are there tests we could add to ensure the planning is as expected (I guess we should be able to use EXPLAIN (DISTSQL))? It might be annoying to set up the cluster with all the localities though, and I can't easily recall an example of where we do that.

Reviewed 1 of 2 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @benbardin, @dt, @mgartner, and @msbutler)

@dt dt force-pushed the backup-in-loc branch 4 times, most recently from e6c9387 to d714663 on March 22, 2023 12:47
@dt dt (Member, Author) commented Mar 22, 2023

Are there tests we could add

Yep, as mentioned in the PR text, I was still working on tests. They're in now, along with the rebase, so this is RFAL.

In the (now) first commit, the existing tests of PartitionSpans are extended, including a mocked SQL instances store, to verify filtered planning. This was helpful since it also caught the fact that maybeReassignToGatewaySQLInstance could violate a filter (now fixed).

The second commit, integrating filtering into BACKUP, now includes an end-to-end test using a BACKUP to nodelocal://0 (where 0 means "my own node ID"), in which the test can inspect which nodes end up with files on them to determine where the processors ran and verify they ran where expected.
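A rough, self-contained sketch of that verification idea (the real test lives in pkg/ccl/backupccl/backup_test.go; the per-node directory layout and node count here are hypothetical): walk each node's external IO directory and report whether any backup files landed there.

```go
package main

import (
	"fmt"
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

// hasBackupFiles reports whether a node's external IO directory contains
// any regular files, i.e. whether a backup processor wrote output there.
// The directory layout is a stand-in for the test's per-node dirs.
func hasBackupFiles(externalIODir string) (bool, error) {
	found := false
	err := filepath.WalkDir(externalIODir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() {
			found = true
		}
		return nil
	})
	return found, err
}

func main() {
	for node := 1; node <= 3; node++ {
		// Hypothetical per-node external IO dirs, e.g. /tmp/node1, /tmp/node2.
		dir := fmt.Sprintf("/tmp/node%d", node)
		if _, err := os.Stat(dir); err != nil {
			continue // node dir absent; skip in this sketch
		}
		got, err := hasBackupFiles(dir)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("node %d wrote backup files: %t\n", node, got)
	}
}
```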

@dt dt (Member, Author) commented Mar 23, 2023

@yuzefovich I don't think we can EXPLAIN DISTSQL a job, so I think the test in the second commit is as close as we can get to an e2e test?

@yuzefovich yuzefovich (Member) left a comment

@yuzefovich I don't think we can EXPLAIN DISTSQL a job, so I think the test in the second commit is as close as we can get to an e2e test?

Yeah, I think you're right.

I only have nits, so :lgtm: but it'd be good to get an approval from someone from DR too.

Reviewed 1 of 5 files at r5, 6 of 6 files at r6, 3 of 3 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @benbardin, @dt, @mgartner, and @msbutler)


pkg/sql/distsql_physical_planner.go line 1426 at r6 (raw file):

	// If we were able to determine the locality information for at least some
	// instances, use the region-aware resolver.

nit: s/region/locality/g.


pkg/sql/distsql_physical_planner.go line 1455 at r6 (raw file):

			// No instances had any locality tiers in common with the node locality so
			// just return the a gateway if it is eligible. If it isn't, just pick a

nit: "the a gateway".


pkg/sql/distsql_plan_bulk.go line 51 at r6 (raw file):

}

// setupAllNodesPlanningSystem creates a planCtx and returns all nodes available

nit: mention the locality filter.


pkg/sql/distsql_plan_bulk.go line 88 at r6 (raw file):

}

// setupAllNodesPlanningTenant creates a planCtx and returns all nodes available

nit: ditto.


pkg/ccl/backupccl/backup_test.go line 372 at r7 (raw file):

	defer cleanupFn()

	// locationToDir converts backup URIs based on localFoo to the temporary

nit: s/locationToDir/dirOf/g.


pkg/ccl/backupccl/backup_test.go line 530 at r7 (raw file):

	// Job exec relocation will return an error while it waits for resumption on
	// a matching node, but ths makes for a slow test so just send the job stmt to

nit: s/ths/this/g.


pkg/sql/distsql_physical_planner_test.go line 1455 at r6 (raw file):

}

type mockAddressResolver map[base.SQLInstanceID]roachpb.Locality

nit: add "interface assertion" (i.e. something like var _ AddressResolver = mockAddressResolver{}).
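For readers unfamiliar with the pattern, a minimal self-contained sketch of such a compile-time interface assertion; localityResolver here is a hypothetical stand-in for the PR's actual AddressResolver interface, whose methods differ:

```go
package main

// localityResolver stands in for the PR's AddressResolver interface; this
// hypothetical one-method version is for illustration only.
type localityResolver interface {
	Locality(id int) string
}

// mockAddressResolver fakes per-instance localities, keyed by instance ID,
// mirroring the map-based mock in the test.
type mockAddressResolver map[int]string

// Locality returns the stubbed locality for the given instance ID.
func (m mockAddressResolver) Locality(id int) string { return m[id] }

// Compile-time interface assertion, as the review suggests: if the mock
// stops satisfying the interface, the build breaks here rather than at a
// distant use site.
var _ localityResolver = mockAddressResolver{}

func main() {}
```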

This adds a locality filter to PlanningCtx, set via an argument to the
SetupAllNodesPlanning variant that allows controlling the replica oracle.
When set, it restricts the instances on which planning can occur to only
those which match the locality filter.

This option automatically triggers the use of instance-based planning.

Currently this option is only available via SetupAllNodesPlanning but
future work could expose it to all physical planning via a session var.

Release note: none.
Epic: none.
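As a rough illustration of the restriction this commit describes (hypothetical types; not the PR's actual PlanningCtx code), planning would keep only instances whose locality tiers contain every tier in the filter:

```go
package main

import "fmt"

// tier is a single locality key=value pair, e.g. region=us-east1.
type tier struct{ key, value string }

// instance pairs an instance ID with its locality tiers.
type instance struct {
	id    int
	tiers []tier
}

// matches reports whether every tier in the filter appears in loc.
func matches(loc, filter []tier) bool {
	for _, f := range filter {
		found := false
		for _, t := range loc {
			if t == f {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// filterInstances restricts planning to instances matching the filter,
// mirroring the PlanningCtx locality filter the commit message describes.
func filterInstances(all []instance, filter []tier) []instance {
	var out []instance
	for _, in := range all {
		if matches(in.tiers, filter) {
			out = append(out, in)
		}
	}
	return out
}

func main() {
	all := []instance{
		{1, []tier{{"region", "us-east1"}}},
		{2, []tier{{"region", "us-west1"}}},
	}
	fmt.Println(filterInstances(all, []tier{{"region", "us-east1"}}))
}
```

Note that an instance with locality region=us-east1,az=b would match the filter region=us-east1: the filter's tiers only need to be a subset of the instance's tiers.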
@dt dt (Member, Author) left a comment

TFTRs!

good to get an approval from someone from DR too.

Indeed, Ben signed off earlier so all set there.

Thanks again for all the help/guidance in figuring this out.

bors r=yuzefovich,benbardin

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @benbardin, @mgartner, and @msbutler)


pkg/sql/distsql_physical_planner.go line 1565 at r2 (raw file):

Previously, benbardin (Ben Bardin) wrote…

I love the new name!

Done.

I just passed it anyway so there is one fewer call site to update when we add generalized query locality filters.


pkg/sql/distsql_physical_planner.go line 1426 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/region/locality/g.

Done.


pkg/sql/distsql_physical_planner.go line 1455 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: "the a gateway".

Done.


pkg/sql/distsql_plan_bulk.go line 46 at r2 (raw file):

Previously, benbardin (Ben Bardin) wrote…

👍

Done.


pkg/sql/distsql_plan_bulk.go line 51 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: mention the locality filter.

Done.


pkg/sql/distsql_plan_bulk.go line 88 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: ditto.

Done.


pkg/ccl/backupccl/backup_test.go line 372 at r7 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/locationToDir/dirOf/g.

Done.


pkg/ccl/backupccl/backup_test.go line 530 at r7 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/ths/this/g.

Done.


pkg/sql/distsql_physical_planner_test.go line 1455 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: add "interface assertion" (i.e. something like var _ AddressResolver = mockAddressResolver{}).

Done.

@dt dt (Member, Author) left a comment

bors r-

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @benbardin, @mgartner, and @msbutler)

@craig craig bot (Contributor) commented Mar 23, 2023

Canceled.

@dt dt force-pushed the backup-in-loc branch from 0ca768b to 2eeae74 on March 23, 2023 03:04
@dt dt (Member, Author) left a comment

Whoops, overlooked that the git push had failed. Let's try that again now.

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @benbardin, @mgartner, @msbutler, and @yuzefovich)

@craig craig bot (Contributor) commented Mar 23, 2023

Build succeeded.

@craig craig bot merged commit 5493fdf into cockroachdb:master Mar 23, 2023