
sql: initialize sql instance during instance provider start #83358

Merged
merged 1 commit into cockroachdb:master from mt_no_sql_instances
Jul 6, 2022

Conversation

rharding6373
Collaborator

Before this change, there was a race condition where the instance
provider and the instance reader would start before the instance
provider created a SQL instance, potentially causing the reader to not
cache the instance before initialization was complete. This is
a problem in multi-tenant environments, where we may not be able to plan
queries if the reader does not know of any SQL instances.

This change moves SQL instance initialization into the instance
provider's Start() function before starting the reader, so that the
instance is guaranteed to be visible to the reader on its first
rangefeed scan of the system.sql_instances table.

Fixes: #82706
Fixes: #81807
Fixes: #81567

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@rharding6373 rharding6373 force-pushed the mt_no_sql_instances branch 2 times, most recently from 10f1d74 to 5285611 on June 27, 2022 18:36
@rharding6373 rharding6373 marked this pull request as ready for review June 27, 2022 19:36
@rharding6373 rharding6373 requested a review from a team as a code owner June 27, 2022 19:36
@rharding6373 rharding6373 force-pushed the mt_no_sql_instances branch from 5285611 to 35d13bb on June 27, 2022 19:43
@rharding6373 rharding6373 requested review from a team and rhu713 June 27, 2022 19:43
// If the node id is already populated, we only need to create a placeholder
// instance provider without initializing the instance, since this is not a
// SQL pod server.
_, isNotSQLPod := cfg.nodeIDContainer.OptionalNodeID()
cfg.sqlInstanceProvider = instanceprovider.New(
@ajwerner
Contributor

My first instinct was to ask whether we can just make this field nil in a kv server? From what I can tell, all of the code-paths which use the instanceProvider are defensive.

On some level, I don't love that because in the future somebody might not be defensive. How about as an alternative, we make an implementation of sqlinstance.Provider which just returns errors from all of its methods and then for the kv server we instantiate that fake implementation, otherwise, we instantiate the usual one.

That way we don't need to introduce this skeleton concept and we'll uncover bugs in the future more readily if one of these things which might have been marked as a skeleton makes its way somewhere it shouldn't be and instead of returning nothing, it returns a clear error.
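A self-contained sketch of this idea (the interface and error names are hypothetical, pared down from sqlinstance.Provider):

```go
package main

import (
	"errors"
	"fmt"
)

// Provider is a pared-down stand-in for sqlinstance.Provider.
type Provider interface {
	Instance() (int, error)
}

var errNotASQLInstance = errors.New("not a SQL instance")

// errorProvider returns a clear error from every method, so a
// misrouted call on a kv server surfaces as an obvious bug instead
// of silently returning nothing.
type errorProvider struct{}

func (errorProvider) Instance() (int, error) {
	return 0, errNotASQLInstance
}

func main() {
	var p Provider = errorProvider{}
	_, err := p.Instance()
	fmt.Println(err)
}
```

The kv server would be handed an errorProvider, while SQL pods get the real implementation.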

Comment on lines 49 to 52
if err := p.initAndWait(ctx); err != nil {
    //nolint:returnerrcheck
    return
}
@ajwerner
Contributor

You can also type this to defeat the linter:

// InitAndWaitForTest explicitly calls initAndWait for testing purposes.
func (p *provider) InitAndWaitForTest(ctx context.Context) {
    _ = p.initAndWait(ctx)
}

Collaborator

@rimadeodhar rimadeodhar left a comment

Thanks for the fix. Code LGTM. Andrew's comment about potentially making a skeleton instance provider that returns errors, which can be used for the storage server, seems like a good idea.

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rharding6373 and @rhu713)

@rharding6373 rharding6373 force-pushed the mt_no_sql_instances branch from 35d13bb to 38014d5 on June 29, 2022 20:19
Collaborator Author

@rharding6373 rharding6373 left a comment

TFTRs! I'm currently working on fixing a timeout that occurs in TestTenantUnauthenticatedAccess, but have incorporated the other suggestions so far.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rhu713, and @rimadeodhar)


pkg/server/server_sql.go line 412 at r1 (raw file):

Previously, ajwerner wrote…

My first instinct was to ask whether we can just make this field nil in a kv server? From what I can tell, all of the code-paths which use the instanceProvider are defensive.

On some level, I don't love that because in the future somebody might not be defensive. How about as an alternative, we make an implementation of sqlinstance.Provider which just returns errors from all of its methods and then for the kv server we instantiate that fake implementation, otherwise, we instantiate the usual one.

That way we don't need to introduce this skeleton concept and we'll uncover bugs in the future more readily if one of these things which might have been marked as a skeleton makes its way somewhere it shouldn't be and instead of returning nothing, it returns a clear error.

That's a good suggestion. We still need a Reader to implement the AddressResolver interface for SQL liveness heartbeat loops, so instead of throwing errors for every function it's still a bare bones implementation. This can probably be disentangled in the future, but it's out of scope for this PR.


pkg/sql/sqlinstance/instanceprovider/test_helpers.go line 52 at r1 (raw file):

Previously, ajwerner wrote…

You can also type this to defeat the linter:

// InitAndWaitForTest explicitly calls initAndWait for testing purposes.
func (p *provider) InitAndWaitForTest(ctx context.Context) {
    _ = p.initAndWait(ctx)
}

Thanks for the tip!

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rharding6373, @rhu713, and @rimadeodhar)


pkg/server/server_sql.go line 412 at r1 (raw file):

We still need a Reader to implement the AddressResolver interface for SQL liveness heartbeat loops

Shouldn't the address resolver for the kv-node be gossip-backed?

@rharding6373 rharding6373 force-pushed the mt_no_sql_instances branch from 38014d5 to 2b887f0 on June 30, 2022 20:01
Comment on lines 38 to 71
// placeholderProvider implements the sqlinstance.Provider interface as a
// placeholder for an instance provider, when an instance provider must be
// instantiated for a non-SQL instance. It starts a Reader to provide the
// AddressResolver interface, but otherwise throws unsupported errors.
type placeholderProvider struct {
    *instancestorage.Reader
}

func NewPlaceholder() sqlinstance.Provider {
    return &placeholderProvider{}
}

// Start implements the sqlinstance.Provider interface.
func (p *placeholderProvider) Start(ctx context.Context) error {
    return nil
}

// Instance implements the sqlinstance.Provider interface.
func (p *placeholderProvider) Instance(
    ctx context.Context,
) (_ base.SQLInstanceID, _ sqlliveness.SessionID, err error) {
    return base.SQLInstanceID(0), "", sqlinstance.NotASQLInstanceError
}

// GetInstance implements the AddressResolver interface.
func (p *placeholderProvider) GetInstance(context.Context, base.SQLInstanceID) (sqlinstance.InstanceInfo, error) {
    return sqlinstance.InstanceInfo{}, sqlinstance.NotASQLInstanceError
}

// GetAllInstances implements the AddressResolver interface.
func (p *placeholderProvider) GetAllInstances(context.Context) ([]sqlinstance.InstanceInfo, error) {
    return nil, sqlinstance.NotASQLInstanceError
}
@ajwerner
Contributor

I can see just defining this in the sqlinstance package. It doesn't know anything about implementation details that this package knows about.

Comment on lines 1126 to 1135
func maybeCheckTenantAccess(ctx context.Context, codec keys.SQLCodec, db *kv.DB) error {
    if codec.ForSystemTenant() {
        // Skip check for system tenant and return early.
        return nil
    }
    // Perform a simple read to the tenant in order to verify that this instance
    // will be authorized to access the tenant keyspace.
    err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
        start := codec.TenantPrefix()
        end := start.PrefixEnd()
        if _, err := txn.Scan(ctx, start, end, 1); err != nil {
            return err
        }
        return nil
    })
    return err
}
@ajwerner
Contributor

Oof this hurts in the context of #79043

Is the deal that the Init just hangs because the retry gets stuck restarting internally? I think we can do some more plumbing to get that error propagated without needing to wait for the extra round-trip here.

FWIW, I'd expect authorization errors to propagate to the rangefeed via this code here:

// This is a hack around the fact that we do not get properly structured
// errors out of gRPC. See #56208.
strings.Contains(err.Error(), "rpc error: code = Unauthenticated") {

@rharding6373 rharding6373 force-pushed the mt_no_sql_instances branch from 2b887f0 to 2c3afe4 on June 30, 2022 20:33
Collaborator Author

@rharding6373 rharding6373 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rharding6373, @rhu713, and @rimadeodhar)


pkg/server/server_sql.go line 1142 at r3 (raw file):

Previously, ajwerner wrote…

Oof this hurts in the context of #79043

Is the deal that the Init just hangs because the retry gets stuck restarting internally? I think we can do some more plumbing to get that error propagated without needing to wait for the extra round-trip here.

FWIW, I'd expect authorization errors to propagate to the rangefeed via this code here:

// This is a hack around the fact that we do not get properly structured
// errors out of gRPC. See #56208.
strings.Contains(err.Error(), "rpc error: code = Unauthenticated") {

I don't think having a read here just to check if the tenant has permissions is ideal, no...

The hanging actually occurs when creating a session for initializing the instance (instanceprovider.go:l196 in this PR), which is to say it's waiting for the SQL liveness provider to create a session after it has started. The problem is that we also get an unauthenticated error when trying to create the session, and then we retry indefinitely. Before this PR we started the instance reader first, so the error would propagate before getting stuck in this retry loop became a problem.

Maybe the solution here is to back out of the heartbeat loop if there are authentication errors?

@rharding6373 rharding6373 force-pushed the mt_no_sql_instances branch from 2c3afe4 to 9074065 on June 30, 2022 21:59
Collaborator Author

@rharding6373 rharding6373 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rhu713, and @rimadeodhar)


pkg/server/server_sql.go line 412 at r1 (raw file):

Previously, ajwerner wrote…

We still need a Reader to implement the AddressResolver interface for SQL liveness heartbeat loops

Shouldn't the address resolver for the kv-node be gossip-backed?

I take that back. Can't reproduce the error I had any more.


pkg/server/server_sql.go line 1142 at r3 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

I don't think having a read here just to check if the tenant has permissions is ideal, no...

The hanging actually occurs when creating a session for initializing the instance (instanceprovider.go:l196 in this PR), which is to say it's waiting for the SQL liveness provider to create a session after it has started. The problem is that we also get an unauthenticated error when trying to create the session, and then we retry indefinitely. Before this PR we started the instance reader first, so the error would propagate before getting stuck in this retry loop became a problem.

Maybe the solution here is to back out of the heartbeat loop if there are authentication errors?

I made some changes in slinstance.go instead so we exit the loop on an unrecoverable error like authentication issues. I'm not sure if there's a corner case that could cause an issue with this implementation. Let me know what you think.
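The exit-on-unrecoverable-error behavior described here can be sketched as follows (a hedged, self-contained illustration; the helper names are hypothetical and the real change lives in slinstance.go):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errUnauthenticated = errors.New("rpc error: code = Unauthenticated")

// isUnrecoverable mirrors the string check discussed in the thread:
// an Unauthenticated gRPC error will never succeed on retry.
func isUnrecoverable(err error) bool {
	return strings.Contains(err.Error(), "rpc error: code = Unauthenticated")
}

// createSessionWithRetry retries transient failures but bails out on
// unrecoverable ones instead of spinning forever in the heartbeat loop.
func createSessionWithRetry(create func() error, maxRetries int) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = create(); err == nil {
			return nil
		}
		if isUnrecoverable(err) {
			return err // do not retry authentication failures
		}
	}
	return err
}

func main() {
	attempts := 0
	err := createSessionWithRetry(func() error {
		attempts++
		return errUnauthenticated
	}, 100)
	fmt.Println(attempts, err != nil)
}
```

With this shape, an auth failure surfaces after one attempt rather than hanging server startup.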


pkg/sql/sqlinstance/instanceprovider/instanceprovider.go line 70 at r3 (raw file):

Previously, ajwerner wrote…

I can see just defining this in the sqlinstance package. It doesn't know anything about implementation details that this package knows about.

Done.

@rharding6373
Collaborator Author

The MT team is going to find another owner to shepherd this PR through since I'm going on leave and may not have a chance to look at it again -- I know this bug is still causing CI pain.

There's still one more unresolved test failure caused by this change afaict, TestTenantStreaming. Although it appears that both source and dest tenants initialize without issues in the test, streaming between them never progresses and there is unexpected memory allocated when the test ends.

@stevendanna
Collaborator

@rharding6373 On the TestTenantStreaming, does #83697 cover the "unexpected memory allocated when the test ends" issue, or are you seeing a different issue?

Might be worth it to keep TestTenantStreaming skipped if this change otherwise fixes the other issues we are seeing, then I can dig into what is happening on the TenantStreaming side.

@yuzefovich yuzefovich force-pushed the mt_no_sql_instances branch from 9074065 to 41855f7 on July 5, 2022 21:45
Member

@yuzefovich yuzefovich left a comment

@ajwerner does this PR have your blessing?

Reviewed 1 of 6 files at r1, 1 of 4 files at r2, 1 of 4 files at r4, 4 of 4 files at r5, 2 of 2 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rharding6373, and @rhu713)


pkg/ccl/streamingccl/streamingest/stream_ingestion_job_test.go line 82 at r6 (raw file):

	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)
	skip.WithIssue(t, 83697)

I made the change here for TestTenantStreaming to point to the new issue Steven linked above.

Contributor

@ajwerner ajwerner left a comment

I suppose. If I were being nit-picky, it'd be nice to have a test which exercises the change, but I'd rather we merge this to stop the flakes.

Reviewed 1 of 6 files at r1, 1 of 4 files at r2, 2 of 4 files at r5, 2 of 2 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rharding6373 and @rhu713)


pkg/sql/sqlliveness/slinstance/slinstance.go line 195 at r6 (raw file):

			// Unauthenticated errors are unrecoverable, we should break instead of
			// retrying.
			if strings.Contains(err.Error(), "rpc error: code = Unauthenticated") {

func IsAuthError(err error) bool {
?


pkg/sql/sqlliveness/slinstance/slinstance.go line 258 at r6 (raw file):

				newSession, err := l.createSession(ctx)
				if err != nil {
					l.mu.Lock()

these days we do these in a closure like:

func() {
    l.mu.Lock()
    defer l.mu.Unlock()
    l.startErr = err
    // There was an unrecoverable error when trying to create the session.
    // Notify all calls to Session that the session failed.
    close(l.mu.blockCh)
}()

Code quote:

					l.mu.Lock()
					l.startErr = err
					// There was an unrecoverable error when trying to create the session.
					// Notify all calls to Session that the session failed.
					close(l.mu.blockCh)
					l.mu.Unlock()

pkg/sql/sqlliveness/slinstance/slinstance.go line 359 at r6 (raw file):

			l.mu.Lock()
			err := l.startErr
			l.mu.Unlock()

same comment about the closure

Code quote:

			err := l.startErr
			l.mu.Unlock()

@yuzefovich yuzefovich force-pushed the mt_no_sql_instances branch 3 times, most recently from eb90a47 to 6c4b454 on July 5, 2022 22:00
Member

@yuzefovich yuzefovich left a comment

Thanks! I agree with you on merging this to stop flakes, will merge once the CI is green.

Reviewed 1 of 1 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rharding6373, and @rhu713)


pkg/sql/sqlliveness/slinstance/slinstance.go line 195 at r6 (raw file):

Previously, ajwerner wrote…

func IsAuthError(err error) bool {
?

Done.


pkg/sql/sqlliveness/slinstance/slinstance.go line 258 at r6 (raw file):

Previously, ajwerner wrote…

these days we do these in a closure like:

func() {
    l.mu.Lock()
    defer l.mu.Unlock()
    l.startErr = err
    // There was an unrecoverable error when trying to create the session.
    // Notify all calls to Session that the session failed.
    close(l.mu.blockCh)
}()

Done.


pkg/sql/sqlliveness/slinstance/slinstance.go line 359 at r6 (raw file):

Previously, ajwerner wrote…

same comment about the closure

Done.

Before this change, there was a race condition where the instance
provider and the instance reader would start before the instance
provider created a SQL instance, potentially causing the reader to not
cache the instance before initialization was complete. This is
a problem in multi-tenant environments, where we may not be able to plan
queries if the reader does not know of any SQL instances.

This change moves SQL instance initialization into the instance
provider's `Start()` function before starting the reader, so that the
instance is guaranteed to be visible to the reader on its first
rangefeed scan of the `system.sql_instances` table.

Release note: None
@yuzefovich yuzefovich force-pushed the mt_no_sql_instances branch from 6c4b454 to 71e1af4 on July 5, 2022 23:13
@yuzefovich
Member

Quite a few tests in the streamingest package fail under race on the CI run, but they also fail on master, before this patch, so I filed #83867 and added a skip for those. Let's see what CI says now.

@yuzefovich
Member

Looks like the race tests passed, so merging.

bors r=ajwerner

Member

@yuzefovich yuzefovich left a comment

Reviewed 3 of 3 files at r10, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rharding6373, and @rhu713)


pkg/sql/sqlliveness/slinstance/slinstance.go line 195 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Done.

@ajwerner I just stumbled on this piece of code:

if grpcutil.IsAuthError(err) ||
    // This is a hack around the fact that we do not get properly structured
    // errors out of gRPC. See #56208.
    strings.Contains(err.Error(), "rpc error: code = Unauthenticated") {

do you think it's worth doing the same here? Looks like err here comes from slstorage.Insert which might issue a gRPC call.

@craig
Contributor

craig bot commented Jul 6, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Jul 6, 2022

Build succeeded:

@craig craig bot merged commit 90b15cf into cockroachdb:master Jul 6, 2022
@shermanCRL shermanCRL added the A-tenant-streaming Including cluster streaming label Jul 29, 2022
@shermanCRL shermanCRL added this to the 22.2 milestone Jul 31, 2022