-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: grant/revoke on a large number of objects can create a lot of jobs #123414
Conversation
1cf0d02
to
0d3571c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, just a few suggestions
Reviewed 3 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Dedej-Bergin)
-- commits
line 10 at r1:
*large number of (scale makes a difference)
pkg/sql/schema_changer_test.go
line 7850 at r1 (raw file):
// wait for channel before starting the sleep <-messages time.Sleep(2 * time.Second)
This will work okayish. We could also confirm this by having the lease manager block this committing as an extra source of truth.
i.e. We would set the cluster setting: sql.catalog.descriptor_lease_duration
to a low value like 15-30 seconds. Then have this transaction run queries like "SELECT 1" until it hits an error. What will happen is that once the lease duration has passed any other transactions is not allowed to use the old versions of the PROMO_CODES descriptor.
pkg/sql/schema_changer_test.go
line 7862 at r1 (raw file):
// post to channel to start sleep timer in go routine messages <- struct{}{}
Another option is to just do
close(messages)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Dedej-Bergin and @fqazi)
Previously, fqazi (Faizan Qazi) wrote…
*large number of (scale makes a difference)
since this change is a performance improvement, we should include a microbenchmark that shows that improvement. (or if this improves an existing one, then show the results.)
to see an example of a microbenchmark so you can write a similar one, take a look at:
func BenchmarkEnsureAllTables(b *testing.B) { |
then you can compare the old and new performance by using:
N=10 BENCHTIMEOUT=24h PKG=./pkg/sql/ BENCHES=BenchmarkYourNameHere ./scripts/bench 'old-commit' 'new-commit'
(if you're adding a new benchmark, it's helpful to put it in its own commit before your other code changes, so you can use the above script.)
pkg/sql/grant_revoke.go
line 429 at r1 (raw file):
} case *schemadesc.Mutable: if err := p.writeSchemaDescChange(
nit: should this also be updated to use the batch?
pkg/sql/grant_revoke.go
line 448 at r1 (raw file):
} case *funcdesc.Mutable: if err := p.writeFuncSchemaChange(ctx, d); err != nil {
nit: should this also be updated to use the batch?
pkg/sql/schema_changer_test.go
line 7825 at r1 (raw file):
} func TestGrantAndSelectsWithinTransaction(t *testing.T) {
nit: could you add a comment to the test that explains what it is testing?
pkg/sql/schema_changer_test.go
line 7850 at r1 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
This will work okayish. We could also confirm this by having the lease manager block this committing as an extra source of truth.
i.e. We would set the cluster setting:
sql.catalog.descriptor_lease_duration
to a low value like 15-30 seconds. Then have this transaction run queries like "SELECT 1" until it hits an error. What will happen is that once the lease duration has passed any other transactions is not allowed to use the old versions of the PROMO_CODES descriptor.
why is a 2 second sleep necessary?
pkg/sql/schema_changer_test.go
line 7869 at r1 (raw file):
err = group.Wait() require.True(t, duration > 1500 * time.Millisecond)
why do we want this 1500ms assertion? (this could use a comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @fqazi and @rafiss)
Previously, rafiss (Rafi Shamim) wrote…
since this change is a performance improvement, we should include a microbenchmark that shows that improvement. (or if this improves an existing one, then show the results.)
to see an example of a microbenchmark so you can write a similar one, take a look at:
func BenchmarkEnsureAllTables(b *testing.B) { then you can compare the old and new performance by using:
N=10 BENCHTIMEOUT=24h PKG=./pkg/sql/ BENCHES=BenchmarkYourNameHere ./scripts/bench 'old-commit' 'new-commit'
(if you're adding a new benchmark, it's helpful to put it in its own commit before your other code changes, so you can use the above script.)
Thanks, I changed multiple grants to a Multiple/large amounts of grants.
pkg/sql/grant_revoke.go
line 429 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
nit: should this also be updated to use the batch?
I think the problem was mainly happening on tables and types, so it was decided not to update schema since the issue is not common there, but if you think that it would be good to use batch for schema grants than I could add that change in.
Let me know if you still think we should use batch for schema grants?
pkg/sql/grant_revoke.go
line 448 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
nit: should this also be updated to use the batch?
I think the problem was mainly happening on tables and types, so it was decided not to update functions since the issue is not common there, but if you think that it would be good to use batch for function grants than I could add that change in.
Let me know if you still think we should use batch for function grants?
pkg/sql/schema_changer_test.go
line 7825 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
nit: could you add a comment to the test that explains what it is testing?
Definitely, just added it thanks Rafi!
pkg/sql/schema_changer_test.go
line 7850 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
why is a 2 second sleep necessary?
The sleep is in there because with a sleep on the transaction1 commit we end up holding transaction2, which is a functionality that we want to test/protect from someone breaking this commit/wait process in the future. For the duration of the sleep being 2 seconds, I could lower it if that would be better. I went ahead and added a comment for this sleep.
Also at the moment I'm looking into modifying the test to use sql.catalog.descriptor_lease_duration
pkg/sql/schema_changer_test.go
line 7862 at r1 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Another option is to just do
close(messages)
Sounds good I went ahead and made this change
pkg/sql/schema_changer_test.go
line 7869 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
why do we want this 1500ms assertion? (this could use a comment)
Thanks Rafi, I went ahead and added a comment for this. Here we are verifying that transaction2 did hold, it should hold for up to 2 seconds, so we check that the duration is at least 1.5 seconds (1500ms). If you want I could reduce the wait time here.
93476e0
to
f227b7f
Compare
Previously, Dedej-Bergin (Bergin Dedej) wrote…
Thanks Rafi! I just now modified the benchmark_expectations file with the new improved benchmarks, I will look into BenchmarkEnsureAllTables as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Dedej-Bergin and @fqazi)
pkg/sql/schema_changer_test.go
line 7850 at r1 (raw file):
Previously, Dedej-Bergin (Bergin Dedej) wrote…
The sleep is in there because with a sleep on the transaction1 commit we end up holding transaction2, which is a functionality that we want to test/protect from someone breaking this commit/wait process in the future. For the duration of the sleep being 2 seconds, I could lower it if that would be better. I went ahead and added a comment for this sleep.
Also at the moment I'm looking into modifying the test to use
sql.catalog.descriptor_lease_duration
replied to this in the comment below
pkg/sql/schema_changer_test.go
line 7869 at r1 (raw file):
Previously, Dedej-Bergin (Bergin Dedej) wrote…
Thanks Rafi, I went ahead and added a comment for this. Here we are verifying that transaction2 did hold, it should hold for up to 2 seconds, so we check that the duration is at least 1.5 seconds (1500ms). If you want I could reduce the wait time here.
ah ok that's what i was wondering! if the sleep
was just there to check that one txn finished before the other, it would actually be better to just create a boolean variable, like txn1Committed
or use a channel. we'd have to think a bit about how to make sure it's set atomically in order to make the test not be flaky. we can talk through ideas next week.
the main reason is that it's always best to avoid adding sleeps in tests. relying on sleeps makes tests flaky usually.
aaf09f5
to
5cd49d7
Compare
Previously, Dedej-Bergin (Bergin Dedej) wrote…
I just added to microbenchmark tests for grants on types and grants on tables, thanks Rafi! |
Previously, rafiss (Rafi Shamim) wrote…
We got rid of the sleeps and channel, and are now using 2 go routines and a concurrent safe boolean to get this test to provide the same coverage with a runtime in milliseconds. Thanks for your help! |
Previously, rafiss (Rafi Shamim) wrote…
We got rid of the sleeps and channel, and are now using 2 go routines and a concurrent safe boolean to get this test to provide the same coverage with a runtime in milliseconds. Thanks for your help! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @fqazi, @rafiss, and @yuzefovich)
Previously, Dedej-Bergin (Bergin Dedej) wrote…
I just added to microbenchmark tests for grants on types and grants on tables, thanks Rafi!
The benchmarks show that it's running a lot faster now! Grants on 1000 tables is 2,500% faster!
**
Benchmark Results with new Changes:
BenchmarkGrantTables/numTables=10
BenchmarkGrantTables/numTables=10-10 48 25188102 ns/op
BenchmarkGrantTables/numTables=100
BenchmarkGrantTables/numTables=100-10 6 195067576 ns/op
BenchmarkGrantTables/numTables=1000
BenchmarkGrantTables/numTables=1000-10 1 1828938834 ns/op
BenchmarkGrantTypes/numTypes=10
BenchmarkGrantTypes/numTypes=10-10 32 36117533 ns/op
BenchmarkGrantTypes/numTypes=100
BenchmarkGrantTypes/numTypes=100-10 4 328903364 ns/op
BenchmarkGrantTypes/numTypes=1000
BenchmarkGrantTypes/numTypes=1000-10 1 3449303583 ns/op
Benchmark Results without new Changes:
BenchmarkGrantTables/numTables=10
BenchmarkGrantTables/numTables=10-10 9 111688116 ns/op
BenchmarkGrantTables/numTables=100
BenchmarkGrantTables/numTables=100-10 2 720130208 ns/op
BenchmarkGrantTables/numTables=1000
BenchmarkGrantTables/numTables=1000-10 1 51262939792 ns/op
BenchmarkGrantTypes/numTypes=10
BenchmarkGrantTypes/numTypes=10-10 8 129941745 ns/op
BenchmarkGrantTypes/numTypes=100
BenchmarkGrantTypes/numTypes=100-10 1 1030206208 ns/op
BenchmarkGrantTypes/numTypes=1000
BenchmarkGrantTypes/numTypes=1000-10 1 10553151833 ns/op
**
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Dedej-Bergin, @fqazi, and @rafiss)
pkg/sql/stats/automatic_stats_test.go
line 312 at r3 (raw file):
} func BenchmarkGrantTables(b *testing.B) {
nit: is there a reason these new benchmarks are added to stats
package? They seem orthogonal to the table stats feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @fqazi and @yuzefovich)
pkg/sql/stats/automatic_stats_test.go
line 312 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: is there a reason these new benchmarks are added to
stats
package? They seem orthogonal to the table stats feature.
a better place for the test would be pkg/sql/grant_revoke_test.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Dedej-Bergin, @fqazi, and @yuzefovich)
pkg/sql/grant_revoke.go
line 429 at r1 (raw file):
Previously, Dedej-Bergin (Bergin Dedej) wrote…
I think the problem was mainly happening on tables and types, so it was decided not to update schema since the issue is not common there, but if you think that it would be good to use batch for schema grants than I could add that change in.
Let me know if you still think we should use batch for schema grants?
i do think it would be nice to keep all the different code paths consistent. unless there's a reason we can't use the batch here (like it would break an existing test), i think we should use it in all the branches.
pkg/sql/grant_revoke.go
line 448 at r1 (raw file):
Previously, Dedej-Bergin (Bergin Dedej) wrote…
I think the problem was mainly happening on tables and types, so it was decided not to update functions since the issue is not common there, but if you think that it would be good to use batch for function grants than I could add that change in.
Let me know if you still think we should use batch for function grants?
see above comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Dedej-Bergin, @rafiss, and @yuzefovich)
pkg/sql/grant_revoke.go
line 395 at r7 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
this is unrelated to your PR, but as a separate followup, i wonder if we should have this
!d.Dropped()
check for the other cases also. seems like it could prevent some unneeded work. wdyt @fqazi ?
Yeah, probably better to have the dropped check in other cases too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @fqazi, @rafiss, and @yuzefovich)
pkg/sql/schema_changer_test.go
line 7832 at r7 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
since it looks like we do need this test to be timing related, i think it would be better to skip this under expensive configurations. there's a helper function called
skip.UnderDuress(t, "reason")
, which will skip the test if the race or deadlock detector is on, or if it's being run under stress. those are all situations in which we want to avoid slow tests. can you use that helper function, note the reason for the skip when it's called?
Just added it, thanks Rafi!
pkg/sql/schema_changer_test.go
line 7871 at r7 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
we could add something here to make sure it's the correct error, like this:
require.ErrorContains(t, err, "RETRY_COMMIT_DEADLINE_EXCEEDED")
I see, this is very nice, just added it, thanks Rafi!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @yuzefovich)
pkg/sql/schema_changer_test.go
line 7834 at r7 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
this is more of a question for @fqazi -- i thought this lease duration no longer was relevant with session based leasing?
also, is this going to be reliable? i thought the actual duration would get jittered?
Yeah the duration still applies on session based in terms of when old versions get evicted (i.e. transaction using the old version are retried due to the deadline). Even with jitter the intent was to speed up the test and confirm we wait for any dependent txn to be kicked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @fqazi and @yuzefovich)
pkg/sql/schema_changer_test.go
line 7834 at r7 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Yeah the duration still applies on session based in terms of when old versions get evicted (i.e. transaction using the old version are retried due to the deadline). Even with jitter the intent was to speed up the test and confirm we wait for any dependent txn to be kicked.
i see thanks
so if the setting is set to 15s, and there is jitter, does that mean the test could be flaky since it only checks: require.Greaterf(t, duration, 15*time.Second)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Dedej-Bergin, @rafiss, and @yuzefovich)
pkg/sql/schema_changer_test.go
line 7834 at r7 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i see thanks
so if the setting is set to 15s, and there is jitter, does that mean the test could be flaky since it only checks:
require.Greaterf(t, duration, 15*time.Second)
Yeah, it could be flaky. Lets just drop the timing angle all together, since the deadline error confirms that the deadline came into play. @Dedej-Bergin
e6648bb
to
945389e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @fqazi, @rafiss, and @yuzefovich)
pkg/sql/grant_revoke.go
line 395 at r7 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Yeah, probably better to have the dropped check in other cases too.
Sounds good, just added it, thanks!
pkg/sql/schema_changer_test.go
line 7834 at r7 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Yeah, it could be flaky. Lets just drop the timing angle all together, since the deadline error confirms that the deadline came into play. @Dedej-Bergin
Sounds good, just removed the timing related code, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @fqazi and @yuzefovich)
pkg/sql/stats/automatic_stats_test.go
line 312 at r3 (raw file):
Previously, Dedej-Bergin (Bergin Dedej) wrote…
sounds good I went ahead and moved it to grant_revoke_test.go, thanks Rafi!
lgtm! great work on taking care of all the followups and making a big perf improvement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 4 files at r7, 1 of 2 files at r8, 2 of 2 files at r9, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rafiss and @yuzefovich)
there's a linting failure:
to resolve it, move the
|
Previously we would create multiple jobs for granting on multiple tables and types within one transaction. This caused a performance slowdown, these code changes skip the making of multiple jobs and just execute the grants in a batch. Fixes: cockroachdb#117643 Release note (performance improvement): Multiple/large amounts of grants on tables and types within one transaction now run faster.
ca7ae0f
to
25edb96
Compare
bors r+ |
Build failed: |
bors r+ |
There was another lint failure introduced in this PR that only shows up in |
Methods that returned an err were not checked in PR cockroachdb#123414 which is causing an failure with our lint test, this PR contains those err checks. Informs: cockroachdb#123414 Release note: None
124128: sql/schema_changer_test.go not checking err, fix for linting issue r=Dedej-Bergin a=Dedej-Bergin Methods that returned an err were not checked in PR #123414 which is causing a failure with our lint test, this PR contains those err checks. Informs: #123414 Release note: None Co-authored-by: Bergin Dedej <[email protected]>
Methods that returned an err were not checked in PR cockroachdb#123414 which is causing an failure with our lint test, this PR contains those err checks. Informs: cockroachdb#123414 Release note: None
Previously we would create multiple jobs for granting on multiple tables and types within one transaction. This caused a performance slowdown, these code changes skip the making of multiple jobs and just execute the grants in a batch.
Fixes:#117643
Release note(performance improvement): multiple grants on tables and types within one transaction now run faster.