-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: initialize sql instance during instance provider start #83358
Conversation
10f1d74
to
5285611
Compare
5285611
to
35d13bb
Compare
pkg/server/server_sql.go
Outdated
// If the node id is already populated, we only need to create a placeholder | ||
// instance provider without initializing the instance, since this is not a | ||
// SQL pod server. | ||
_, isNotSQLPod := cfg.nodeIDContainer.OptionalNodeID() | ||
cfg.sqlInstanceProvider = instanceprovider.New( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first instinct was to ask whether we can we just make this field nil in a kv server? From what I can tell, all of the code-paths which use the instanceProvider are defensive.
On some level, I don't love that because in the future somebody might not be defensive. How about as an alternative, we make an implementation of sqlinstance.Provider which just returns errors from all of its methods and then for the kv server we instantiate that fake implementation, otherwise, we instantiate the usual one.
That way we don't need to introduce this skeleton concept and we'll uncover bugs in the future more readily if one of these things which might have been marked as a skeleton makes its way somewhere it shouldn't be and instead of returning nothing, it returns a clear error.
if err := p.initAndWait(ctx); err != nil { | ||
//nolint:returnerrcheck | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also type this to defeat the linter:
// InitAndWaitForTest explicitly calls initAndWait for testing purposes.
func (p *provider) InitAndWaitForTest(ctx context.Context) {
_ = p.initAndWait(ctx)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix. Code LGTM. Andrew's comment on potentially making a skeleton instance provider that returns error which can be used for the storage server seems like a good idea.
Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rharding6373 and @rhu713)
35d13bb
to
38014d5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs! I'm currently working on fixing a timeout that occurs in TestTenantUnauthenticatedAccess
, but have incorporated the other suggestions so far.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rhu713, and @rimadeodhar)
pkg/server/server_sql.go
line 412 at r1 (raw file):
Previously, ajwerner wrote…
My first instinct was to ask whether we can we just make this field nil in a kv server? From what I can tell, all of the code-paths which use the instanceProvider are defensive.
On some level, I don't love that because in the future somebody might not be defensive. How about as an alternative, we make an implementation of sqlinstance.Provider which just returns errors from all of its methods and then for the kv server we instantiate that fake implementation, otherwise, we instantiate the usual one.
That way we don't need to introduce this skeleton concept and we'll uncover bugs in the future more readily if one of these things which might have been marked as a skeleton makes its way somewhere it shouldn't be and instead of returning nothing, it returns a clear error.
That's a good suggestion. We still need a Reader to implement the AddressResolver interface for SQL liveness heartbeat loops, so instead of throwing errors for every function it's still a bare bones implementation. This can probably be disentangled in the future, but it's out of scope for this PR.
pkg/sql/sqlinstance/instanceprovider/test_helpers.go
line 52 at r1 (raw file):
Previously, ajwerner wrote…
You can also type this to defeat the linter:
// InitAndWaitForTest explicitly calls initAndWait for testing purposes. func (p *provider) InitAndWaitForTest(ctx context.Context) { _ = p.initAndWait(ctx) }
Thanks for the tip!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rharding6373, @rhu713, and @rimadeodhar)
pkg/server/server_sql.go
line 412 at r1 (raw file):
We still need a Reader to implement the AddressResolver interface for SQL liveness heartbeat loops
Shouldn't the address resolver for the kv-node be gossip-backed?
38014d5
to
2b887f0
Compare
// placeholderProvider implements the sqlinstance.Provider interface as a | ||
// placeholder for an instance provider, when an instance provider must be | ||
// instantiated for a non-SQL instance. It starts a Reader to provide the | ||
// AddressResolver interface, but otherwise throws unsupported errors. | ||
type placeholderProvider struct { | ||
*instancestorage.Reader | ||
} | ||
|
||
func NewPlaceholder() sqlinstance.Provider { | ||
return &placeholderProvider{} | ||
} | ||
|
||
// Start implements the sqlinstance.Provider interface. | ||
func (p *placeholderProvider) Start(ctx context.Context) error { | ||
return nil | ||
} | ||
|
||
// Instance implements the sqlinstance.Provider interface. | ||
func (p *placeholderProvider) Instance( | ||
ctx context.Context, | ||
) (_ base.SQLInstanceID, _ sqlliveness.SessionID, err error) { | ||
return base.SQLInstanceID(0), "", sqlinstance.NotASQLInstanceError | ||
} | ||
|
||
// GetInstance implements the AddressResolver interface. | ||
func (p *placeholderProvider) GetInstance(context.Context, base.SQLInstanceID) (sqlinstance.InstanceInfo, error) { | ||
return sqlinstance.InstanceInfo{}, sqlinstance.NotASQLInstanceError | ||
} | ||
|
||
// GetAllInstances implements the AddressResolver interface. | ||
func (p *placeholderProvider) GetAllInstances(context.Context) ([]sqlinstance.InstanceInfo, error) { | ||
return nil, sqlinstance.NotASQLInstanceError | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see just defining this in the sqlinstance
package. It doesn't know anything about implementation details that this package knows about.
pkg/server/server_sql.go
Outdated
func maybeCheckTenantAccess(ctx context.Context, codec keys.SQLCodec, db *kv.DB) error { | ||
if codec.ForSystemTenant() { | ||
// Skip check for system tenant and return early. | ||
return nil | ||
} | ||
// Perform a simple read to the tenant in order to verify that this instance | ||
// will be authorized to access the tenant keyspace. | ||
err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error { | ||
start := codec.TenantPrefix() | ||
end := start.PrefixEnd() | ||
if _, err := txn.Scan(ctx, start, end, 1); err != nil { | ||
return err | ||
} | ||
return nil | ||
}) | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof this hurts in the context of #79043
Is the deal that the Init
just hangs because the retry gets stuck restarting internally? I think we can do some more plumbing to get that error propagated without needing to wait for the extra round-trip here.
FWIW, I'd expect authorization errors to propagate to the rangefeed via this code here:
cockroach/pkg/sql/sqlinstance/instancestorage/instancereader.go
Lines 143 to 145 in f7bb118
// This is a hack around the fact that we do not get properly structured | |
// errors out of gRPC. See #56208. | |
strings.Contains(err.Error(), "rpc error: code = Unauthenticated") { |
2b887f0
to
2c3afe4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rharding6373, @rhu713, and @rimadeodhar)
pkg/server/server_sql.go
line 1142 at r3 (raw file):
Previously, ajwerner wrote…
Oof this hurts in the context of #79043
Is the deal that the
Init
just hangs because the retry gets stuck restarting internally? I think we can do some more plumbing to get that error propagated without needing to wait for the extra round-trip here.FWIW, I'd expect authorization errors to propagate to the rangefeed via this code here:
cockroach/pkg/sql/sqlinstance/instancestorage/instancereader.go
Lines 143 to 145 in f7bb118
// This is a hack around the fact that we do not get properly structured // errors out of gRPC. See #56208. strings.Contains(err.Error(), "rpc error: code = Unauthenticated") {
I don't think having a read here just to check if the tenant has permissions is ideal, no...
The hanging actually occurs when creating a session for initializing the instance (instanceprovider.go:l196 in this PR), which is to say it's waiting for the sql liveness provider to create a session after it's started. The problem is that we also get an unauthenticated error when trying to create the session, and then we retry an infinite number of times. Before this PR we started the instance reader first, so it would propagate the error before being stuck in this loop was a problem.
Maybe the solution here is to back out of the heartbeat loop if there are authentication errors?
2c3afe4
to
9074065
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rhu713, and @rimadeodhar)
pkg/server/server_sql.go
line 412 at r1 (raw file):
Previously, ajwerner wrote…
We still need a Reader to implement the AddressResolver interface for SQL liveness heartbeat loops
Shouldn't the address resolver for the kv-node be gossip-backed?
I take that back. Can't reproduce the error I had any more.
pkg/server/server_sql.go
line 1142 at r3 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
I don't think having a read here just to check if the tenant has permissions is ideal, no...
The hanging actually occurs when creating a session for initializing the instance (instanceprovider.go:l196 in this PR), which is to say it's waiting for the sql liveness provider to create a session after it's started. The problem is that we also get an unauthenticated error when trying to create the session, and then we retry an infinite number of times. Before this PR we started the instance reader first, so it would propagate the error before being stuck in this loop was a problem.
Maybe the solution here is to back out of the heartbeat loop if there are authentication errors?
I made some changes in slinstance.go instead so we exit the loop on an unrecoverable error like authentication issues. I'm not sure if there's a corner case that could cause an issue with this implementation. Let me know what you think.
pkg/sql/sqlinstance/instanceprovider/instanceprovider.go
line 70 at r3 (raw file):
Previously, ajwerner wrote…
I can see just defining this in the
sqlinstance
package. It doesn't know anything about implementation details that this package knows about.
Done.
The MT team is going to find another owner to shepherd this PR through since I'm going on leave and may not have a chance to look at it again -- I know this bug is still causing CI pain. There's still one more unresolved test failure caused by this change afaict, |
@rharding6373 On the TestTenantStreaming, does #83697 cover the "unexpected memory allocated when the test ends" issue, or are you seeing a different issue? Might be worth it to keep TestTenantStreaming skipped if this change otherwise fixes the other issues we are seeing, then I can dig into what is happening on the TenantStreaming side. |
9074065
to
41855f7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ajwerner does this PR have your blessing?
Reviewed 1 of 6 files at r1, 1 of 4 files at r2, 1 of 4 files at r4, 4 of 4 files at r5, 2 of 2 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rharding6373, and @rhu713)
pkg/ccl/streamingccl/streamingest/stream_ingestion_job_test.go
line 82 at r6 (raw file):
defer leaktest.AfterTest(t)() defer log.Scope(t).Close(t) skip.WithIssue(t, 83697)
I made the change here for TestTenantStreaming
to point to the new issue Steven linked above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose. If I were being nit-picky, it'd be nice to have a test which exercises the change, but I'd rather we merge this to stop the flakes.
Reviewed 1 of 6 files at r1, 1 of 4 files at r2, 2 of 4 files at r5, 2 of 2 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rharding6373 and @rhu713)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 195 at r6 (raw file):
// Unauthenticated errors are unrecoverable, we should break instead of // retrying. if strings.Contains(err.Error(), "rpc error: code = Unauthenticated") {
cockroach/pkg/util/grpcutil/grpc_util.go
Line 120 in 3fc6cc5
func IsAuthError(err error) bool { |
pkg/sql/sqlliveness/slinstance/slinstance.go
line 258 at r6 (raw file):
newSession, err := l.createSession(ctx) if err != nil { l.mu.Lock()
these days we do these in a closure like:
func() {
l.mu.Lock()
defer l.mu.Unlock()
l.startErr = err
// There was an unrecoverable error when trying to create the session.
// Notify all calls to Session that the session failed.
close(l.mu.blockCh)
}
Code quote:
l.mu.Lock()
l.startErr = err
// There was an unrecoverable error when trying to create the session.
// Notify all calls to Session that the session failed.
close(l.mu.blockCh)
l.mu.Unlock()
pkg/sql/sqlliveness/slinstance/slinstance.go
line 359 at r6 (raw file):
l.mu.Lock() err := l.startErr l.mu.Unlock()
same comment about the closure
Code quote:
err := l.startErr
l.mu.Unlock()
eb90a47
to
6c4b454
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I agree with you on merging this to stop flakes, will merge once the CI is green.
Reviewed 1 of 1 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rharding6373, and @rhu713)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 195 at r6 (raw file):
Previously, ajwerner wrote…
?cockroach/pkg/util/grpcutil/grpc_util.go
Line 120 in 3fc6cc5
func IsAuthError(err error) bool {
Done.
pkg/sql/sqlliveness/slinstance/slinstance.go
line 258 at r6 (raw file):
Previously, ajwerner wrote…
these days we do these in a closure like:
func() { l.mu.Lock() defer l.mu.Unlock() l.startErr = err // There was an unrecoverable error when trying to create the session. // Notify all calls to Session that the session failed. close(l.mu.blockCh) }
Done.
pkg/sql/sqlliveness/slinstance/slinstance.go
line 359 at r6 (raw file):
Previously, ajwerner wrote…
same comment about the closure
Done.
Before this change, there was a race condition where the instance provider and the instance reader would start before the instance provider created a SQL instance, potentially causing the reader to not cache the instance before initialization was complete. This is a problem in multi-tenant environments, where we may not be able to plan queries if the reader does not know of any SQL instances. This change moves sql instance initialization into the instance provider's `Start()` function before starting the reader, so that the instance is guaranteed to be visible to the reader on its first rangefeed scan of the `system.sql_instances` table. Release note: None
6c4b454
to
71e1af4
Compare
Quite a few tests in |
Looks like the race tests passed, so merging. bors r=ajwerner |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r10, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rharding6373, and @rhu713)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 195 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Done.
@ajwerner I just stumbled on this piece of code:
cockroach/pkg/sql/sqlinstance/instancestorage/instancereader.go
Lines 142 to 145 in 9c5472e
if grpcutil.IsAuthError(err) || | |
// This is a hack around the fact that we do not get properly structured | |
// errors out of gRPC. See #56208. | |
strings.Contains(err.Error(), "rpc error: code = Unauthenticated") { |
do you think it's worth doing the same here? Looks like
err
here comes from slstorage.Insert
which might issue a gRPC call.
Build failed (retrying...): |
Build succeeded: |
Before this change, there was a race condition where the instance
provider and the instance reader would start before the instance
provider created a SQL instance, potentially causing the reader to not
cache the instance before initialization was complete. This is
a problem in multi-tenant environments, where we may not be able to plan
queries if the reader does not know of any SQL instances.
This change moves sql instance initialization into the instance
provider's
Start()
function before starting the reader, so that theinstance is guaranteed to be visible to the reader on its first
rangefeed scan of the
system.sql_instances
table.Fixes: #82706
Fixes: #81807
Fixes: #81567
Release note: None