Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: grant/revoke on a large number of objects can create a lot of jobs #123414

Merged
merged 1 commit into from
May 13, 2024

Conversation

Dedej-Bergin
Copy link
Contributor

Previously we would create multiple jobs for granting on multiple tables and types within one transaction. This caused a performance slowdown, these code changes skip the making of multiple jobs and just execute the grants in a batch.

Fixes:#117643
Release note(performance improvement): multiple grants on tables and types within one transaction now run faster.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@Dedej-Bergin Dedej-Bergin force-pushed the combine-jobs branch 4 times, most recently from 1cf0d02 to 0d3571c Compare May 2, 2024 19:15
@Dedej-Bergin Dedej-Bergin marked this pull request as ready for review May 2, 2024 19:56
@Dedej-Bergin Dedej-Bergin requested a review from a team as a code owner May 2, 2024 19:56
Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, just a few suggestions

Reviewed 3 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Dedej-Bergin)


-- commits line 10 at r1:
*large number of (scale makes a difference)


pkg/sql/schema_changer_test.go line 7850 at r1 (raw file):

		// wait for channel before starting the sleep
		<-messages
		time.Sleep(2 * time.Second)

This will work okayish. We could also confirm this by having the lease manager block this committing as an extra source of truth.

i.e. We would set the cluster setting: sql.catalog.descriptor_lease_duration to a low value like 15-30 seconds. Then have this transaction run queries like "SELECT 1" until it hits an error. What will happen is that once the lease duration has passed any other transactions is not allowed to use the old versions of the PROMO_CODES descriptor.


pkg/sql/schema_changer_test.go line 7862 at r1 (raw file):

	// post to channel to start sleep timer in go routine
	messages <- struct{}{}

Another option is to just do

close(messages)

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Dedej-Bergin and @fqazi)


-- commits line 10 at r1:

Previously, fqazi (Faizan Qazi) wrote…

*large number of (scale makes a difference)

since this change is a performance improvement, we should include a microbenchmark that shows that improvement. (or if this improves an existing one, then show the results.)

to see an example of a microbenchmark so you can write a similar one, take a look at:

func BenchmarkEnsureAllTables(b *testing.B) {

then you can compare the old and new performance by using:

N=10 BENCHTIMEOUT=24h PKG=./pkg/sql/ BENCHES=BenchmarkYourNameHere ./scripts/bench 'old-commit' 'new-commit'

(if you're adding a new benchmark, it's helpful to put it in its own commit before your other code changes, so you can use the above script.)


pkg/sql/grant_revoke.go line 429 at r1 (raw file):

			}
		case *schemadesc.Mutable:
			if err := p.writeSchemaDescChange(

nit: should this also be updated to use the batch?


pkg/sql/grant_revoke.go line 448 at r1 (raw file):

			}
		case *funcdesc.Mutable:
			if err := p.writeFuncSchemaChange(ctx, d); err != nil {

nit: should this also be updated to use the batch?


pkg/sql/schema_changer_test.go line 7825 at r1 (raw file):

}

func TestGrantAndSelectsWithinTransaction(t *testing.T) {

nit: could you add a comment to the test that explains what it is testing?


pkg/sql/schema_changer_test.go line 7850 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

This will work okayish. We could also confirm this by having the lease manager block this committing as an extra source of truth.

i.e. We would set the cluster setting: sql.catalog.descriptor_lease_duration to a low value like 15-30 seconds. Then have this transaction run queries like "SELECT 1" until it hits an error. What will happen is that once the lease duration has passed any other transactions is not allowed to use the old versions of the PROMO_CODES descriptor.

why is a 2 second sleep necessary?


pkg/sql/schema_changer_test.go line 7869 at r1 (raw file):

	err = group.Wait()
	require.True(t, duration > 1500 * time.Millisecond)

why do we want this 1500ms assertion? (this could use a comment)

Copy link
Contributor Author

@Dedej-Bergin Dedej-Bergin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi and @rafiss)


-- commits line 10 at r1:

Previously, rafiss (Rafi Shamim) wrote…

since this change is a performance improvement, we should include a microbenchmark that shows that improvement. (or if this improves an existing one, then show the results.)

to see an example of a microbenchmark so you can write a similar one, take a look at:

func BenchmarkEnsureAllTables(b *testing.B) {

then you can compare the old and new performance by using:

N=10 BENCHTIMEOUT=24h PKG=./pkg/sql/ BENCHES=BenchmarkYourNameHere ./scripts/bench 'old-commit' 'new-commit'

(if you're adding a new benchmark, it's helpful to put it in its own commit before your other code changes, so you can use the above script.)

Thanks, I changed multiple grants to a Multiple/large amounts of grants.


pkg/sql/grant_revoke.go line 429 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: should this also be updated to use the batch?

I think the problem was mainly happening on tables and types, so it was decided not to update schema since the issue is not common there, but if you think that it would be good to use batch for schema grants than I could add that change in.

Let me know if you still think we should use batch for schema grants?


pkg/sql/grant_revoke.go line 448 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: should this also be updated to use the batch?

I think the problem was mainly happening on tables and types, so it was decided not to update functions since the issue is not common there, but if you think that it would be good to use batch for function grants than I could add that change in.

Let me know if you still think we should use batch for function grants?


pkg/sql/schema_changer_test.go line 7825 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: could you add a comment to the test that explains what it is testing?

Definitely, just added it thanks Rafi!


pkg/sql/schema_changer_test.go line 7850 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

why is a 2 second sleep necessary?

The sleep is in there because with a sleep on the transaction1 commit we end up holding transaction2, which is a functionality that we want to test/protect from someone breaking this commit/wait process in the future. For the duration of the sleep being 2 seconds, I could lower it if that would be better. I went ahead and added a comment for this sleep.

Also at the moment I'm looking into modifying the test to use sql.catalog.descriptor_lease_duration


pkg/sql/schema_changer_test.go line 7862 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Another option is to just do

close(messages)

Sounds good I went ahead and made this change


pkg/sql/schema_changer_test.go line 7869 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

why do we want this 1500ms assertion? (this could use a comment)

Thanks Rafi, I went ahead and added a comment for this. Here we are verifying that transaction2 did hold, it should hold for up to 2 seconds, so we check that the duration is at least 1.5 seconds (1500ms). If you want I could reduce the wait time here.

@Dedej-Bergin Dedej-Bergin force-pushed the combine-jobs branch 2 times, most recently from 93476e0 to f227b7f Compare May 3, 2024 20:52
@Dedej-Bergin
Copy link
Contributor Author

-- commits line 10 at r1:

Previously, Dedej-Bergin (Bergin Dedej) wrote…

Thanks, I changed multiple grants to a Multiple/large amounts of grants.

Thanks Rafi! I just now modified the benchmark_expectations file with the new improved benchmarks, I will look into BenchmarkEnsureAllTables as well

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Dedej-Bergin and @fqazi)


pkg/sql/schema_changer_test.go line 7850 at r1 (raw file):

Previously, Dedej-Bergin (Bergin Dedej) wrote…

The sleep is in there because with a sleep on the transaction1 commit we end up holding transaction2, which is a functionality that we want to test/protect from someone breaking this commit/wait process in the future. For the duration of the sleep being 2 seconds, I could lower it if that would be better. I went ahead and added a comment for this sleep.

Also at the moment I'm looking into modifying the test to use sql.catalog.descriptor_lease_duration

replied to this in the comment below


pkg/sql/schema_changer_test.go line 7869 at r1 (raw file):

Previously, Dedej-Bergin (Bergin Dedej) wrote…

Thanks Rafi, I went ahead and added a comment for this. Here we are verifying that transaction2 did hold, it should hold for up to 2 seconds, so we check that the duration is at least 1.5 seconds (1500ms). If you want I could reduce the wait time here.

ah ok that's what i was wondering! if the sleep was just there to check that one txn finished before the other, it would actually be better to just create a boolean variable, like txn1Committed or use a channel. we'd have to think a bit about how to make sure it's set atomically in order to make the test not be flaky. we can talk through ideas next week.

the main reason is that it's always best to avoid adding sleeps in tests. relying on sleeps makes tests flaky usually.

@Dedej-Bergin Dedej-Bergin force-pushed the combine-jobs branch 2 times, most recently from aaf09f5 to 5cd49d7 Compare May 7, 2024 13:59
@Dedej-Bergin Dedej-Bergin requested a review from a team as a code owner May 7, 2024 13:59
@Dedej-Bergin Dedej-Bergin requested review from yuzefovich and removed request for a team May 7, 2024 13:59
@Dedej-Bergin
Copy link
Contributor Author

-- commits line 10 at r1:

Previously, Dedej-Bergin (Bergin Dedej) wrote…

Thanks Rafi! I just now modified the benchmark_expectations file with the new improved benchmarks, I will look into BenchmarkEnsureAllTables as well

I just added to microbenchmark tests for grants on types and grants on tables, thanks Rafi!

@Dedej-Bergin
Copy link
Contributor Author

pkg/sql/schema_changer_test.go line 7850 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

replied to this in the comment below

We got rid of the sleeps and channel, and are now using 2 go routines and a concurrent safe boolean to get this test to provide the same coverage with a runtime in milliseconds. Thanks for your help!

@Dedej-Bergin
Copy link
Contributor Author

pkg/sql/schema_changer_test.go line 7869 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

ah ok that's what i was wondering! if the sleep was just there to check that one txn finished before the other, it would actually be better to just create a boolean variable, like txn1Committed or use a channel. we'd have to think a bit about how to make sure it's set atomically in order to make the test not be flaky. we can talk through ideas next week.

the main reason is that it's always best to avoid adding sleeps in tests. relying on sleeps makes tests flaky usually.

We got rid of the sleeps and channel, and are now using 2 go routines and a concurrent safe boolean to get this test to provide the same coverage with a runtime in milliseconds. Thanks for your help!

Copy link
Contributor Author

@Dedej-Bergin Dedej-Bergin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi, @rafiss, and @yuzefovich)


-- commits line 10 at r1:

Previously, Dedej-Bergin (Bergin Dedej) wrote…

I just added to microbenchmark tests for grants on types and grants on tables, thanks Rafi!

The benchmarks show that it's running a lot faster now! Grants on 1000 tables is 2,500% faster!

**

Benchmark Results with new Changes:

BenchmarkGrantTables/numTables=10

BenchmarkGrantTables/numTables=10-10               48   25188102 ns/op

BenchmarkGrantTables/numTables=100

BenchmarkGrantTables/numTables=100-10              6 195067576 ns/op

BenchmarkGrantTables/numTables=1000

BenchmarkGrantTables/numTables=1000-10             1 1828938834 ns/op

BenchmarkGrantTypes/numTypes=10

BenchmarkGrantTypes/numTypes=10-10               32   36117533 ns/op

BenchmarkGrantTypes/numTypes=100

BenchmarkGrantTypes/numTypes=100-10              4 328903364 ns/op

BenchmarkGrantTypes/numTypes=1000

BenchmarkGrantTypes/numTypes=1000-10             1 3449303583 ns/op

Benchmark Results without new Changes:

BenchmarkGrantTables/numTables=10

BenchmarkGrantTables/numTables=10-10               9 111688116 ns/op

BenchmarkGrantTables/numTables=100

BenchmarkGrantTables/numTables=100-10              2 720130208 ns/op

BenchmarkGrantTables/numTables=1000

BenchmarkGrantTables/numTables=1000-10             1 51262939792 ns/op

BenchmarkGrantTypes/numTypes=10

BenchmarkGrantTypes/numTypes=10-10               8 129941745 ns/op

BenchmarkGrantTypes/numTypes=100

BenchmarkGrantTypes/numTypes=100-10              1 1030206208 ns/op

BenchmarkGrantTypes/numTypes=1000

BenchmarkGrantTypes/numTypes=1000-10             1 10553151833 ns/op

**

@yuzefovich yuzefovich requested a review from rafiss May 7, 2024 16:08
Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Dedej-Bergin, @fqazi, and @rafiss)


pkg/sql/stats/automatic_stats_test.go line 312 at r3 (raw file):

}

func BenchmarkGrantTables(b *testing.B) {

nit: is there a reason these new benchmarks are added to stats package? They seem orthogonal to the table stats feature.

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi and @yuzefovich)


pkg/sql/stats/automatic_stats_test.go line 312 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: is there a reason these new benchmarks are added to stats package? They seem orthogonal to the table stats feature.

a better place for the test would be pkg/sql/grant_revoke_test.go

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Dedej-Bergin, @fqazi, and @yuzefovich)


pkg/sql/grant_revoke.go line 429 at r1 (raw file):

Previously, Dedej-Bergin (Bergin Dedej) wrote…

I think the problem was mainly happening on tables and types, so it was decided not to update schema since the issue is not common there, but if you think that it would be good to use batch for schema grants than I could add that change in.

Let me know if you still think we should use batch for schema grants?

i do think it would be nice to keep all the different code paths consistent. unless there's a reason we can't use the batch here (like it would break an existing test), i think we should use it in all the branches.


pkg/sql/grant_revoke.go line 448 at r1 (raw file):

Previously, Dedej-Bergin (Bergin Dedej) wrote…

I think the problem was mainly happening on tables and types, so it was decided not to update functions since the issue is not common there, but if you think that it would be good to use batch for function grants than I could add that change in.

Let me know if you still think we should use batch for function grants?

see above comment

Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Dedej-Bergin, @rafiss, and @yuzefovich)


pkg/sql/grant_revoke.go line 395 at r7 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

this is unrelated to your PR, but as a separate followup, i wonder if we should have this !d.Dropped() check for the other cases also. seems like it could prevent some unneeded work. wdyt @fqazi ?

Yeah, probably better to have the dropped check in other cases too.

Copy link
Contributor Author

@Dedej-Bergin Dedej-Bergin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi, @rafiss, and @yuzefovich)


pkg/sql/schema_changer_test.go line 7832 at r7 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

since it looks like we do need this test to be timing related, i think it would be better to skip this under expensive configurations. there's a helper function called skip.UnderDuress(t, "reason"), which will skip the test if the race or deadlock detector is on, or if it's being run under stress. those are all situations in which we want to avoid slow tests. can you use that helper function, note the reason for the skip when it's called?

Just added it, thanks Rafi!


pkg/sql/schema_changer_test.go line 7871 at r7 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

we could add something here to make sure it's the correct error, like this:

	require.ErrorContains(t, err, "RETRY_COMMIT_DEADLINE_EXCEEDED")

I see, this is very nice, just added it, thanks Rafi!

Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @yuzefovich)


pkg/sql/schema_changer_test.go line 7834 at r7 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

this is more of a question for @fqazi -- i thought this lease duration no longer was relevant with session based leasing?

also, is this going to be reliable? i thought the actual duration would get jittered?

Yeah the duration still applies on session based in terms of when old versions get evicted (i.e. transaction using the old version are retried due to the deadline). Even with jitter the intent was to speed up the test and confirm we wait for any dependent txn to be kicked.

@rafiss rafiss requested review from fqazi and yuzefovich May 9, 2024 16:56
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi and @yuzefovich)


pkg/sql/schema_changer_test.go line 7834 at r7 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Yeah the duration still applies on session based in terms of when old versions get evicted (i.e. transaction using the old version are retried due to the deadline). Even with jitter the intent was to speed up the test and confirm we wait for any dependent txn to be kicked.

i see thanks

so if the setting is set to 15s, and there is jitter, does that mean the test could be flaky since it only checks: require.Greaterf(t, duration, 15*time.Second)

Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Dedej-Bergin, @rafiss, and @yuzefovich)


pkg/sql/schema_changer_test.go line 7834 at r7 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i see thanks

so if the setting is set to 15s, and there is jitter, does that mean the test could be flaky since it only checks: require.Greaterf(t, duration, 15*time.Second)

Yeah, it could be flaky. Lets just drop the timing angle all together, since the deadline error confirms that the deadline came into play. @Dedej-Bergin

@Dedej-Bergin Dedej-Bergin force-pushed the combine-jobs branch 2 times, most recently from e6648bb to 945389e Compare May 9, 2024 21:02
Copy link
Contributor Author

@Dedej-Bergin Dedej-Bergin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi, @rafiss, and @yuzefovich)


pkg/sql/grant_revoke.go line 395 at r7 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Yeah, probably better to have the dropped check in other cases too.

Sounds good, just added it, thanks!


pkg/sql/schema_changer_test.go line 7834 at r7 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Yeah, it could be flaky. Lets just drop the timing angle all together, since the deadline error confirms that the deadline came into play. @Dedej-Bergin

Sounds good, just removed the timing related code, thanks!

@rafiss rafiss requested a review from fqazi May 9, 2024 21:11
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi and @yuzefovich)


pkg/sql/stats/automatic_stats_test.go line 312 at r3 (raw file):

Previously, Dedej-Bergin (Bergin Dedej) wrote…

sounds good I went ahead and moved it to grant_revoke_test.go, thanks Rafi!

lgtm! great work on taking care of all the followups and making a big perf improvement

Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 4 files at r7, 1 of 2 files at r8, 2 of 2 files at r9, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rafiss and @yuzefovich)

@rafiss
Copy link
Collaborator

rafiss commented May 10, 2024

there's a linting failure:

=== NAME  TestLint/TestMissingLeakTest
    gen-lint_test.go:430: 
        sql/schema_changer_test.go: TestLeaseTimeoutWithConcurrentTransactions: missing defer leaktest.AfterTest

to resolve it, move the skip so it's after the defers

	skip.UnderDuress(t, "slow test")
	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)

Previously we would create multiple jobs for granting on multiple
tables and types within one transaction.  This caused a performance
slowdown, these code changes skip the making of multiple jobs and
just execute the grants in a batch.

Fixes: cockroachdb#117643
Release note (performance improvement): Multiple/large amounts of grants on
tables and types within one transaction now run faster.
@Dedej-Bergin
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented May 13, 2024

Build failed:

@Dedej-Bergin
Copy link
Contributor Author

bors r+

@craig craig bot merged commit f51ccd1 into cockroachdb:master May 13, 2024
21 of 22 checks passed
@renatolabs
Copy link
Contributor

There was another lint failure introduced in this PR that only shows up in Bazel Extended CI (see internal discussion).

Dedej-Bergin added a commit to Dedej-Bergin/cockroach that referenced this pull request May 14, 2024
Methods that returned an err were not checked in PR cockroachdb#123414 which is
causing an failure with our lint test, this PR contains those err checks.

Informs: cockroachdb#123414

Release note: None
craig bot pushed a commit that referenced this pull request May 14, 2024
124128: sql/schema_changer_test.go not checking err, fix for linting issue r=Dedej-Bergin a=Dedej-Bergin

Methods that returned an err were not checked in PR #123414 which is causing a failure with our lint test, this PR contains those err checks.

Informs: #123414

Release note: None

Co-authored-by: Bergin Dedej <[email protected]>
@Dedej-Bergin Dedej-Bergin deleted the combine-jobs branch May 14, 2024 15:18
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this pull request May 15, 2024
Methods that returned an err were not checked in PR cockroachdb#123414 which is
causing an failure with our lint test, this PR contains those err checks.

Informs: cockroachdb#123414

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants