Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: Reject SQL Pods with an invalid tenant id. #65683

Merged
merged 1 commit into from
Jun 1, 2021

Conversation

rimadeodhar
Copy link
Collaborator

This commit fixes an issue where attempting to start a SQL Pod
through cockroach mt start-sql command returns an error if the
tenant id has not been created before. Previously, attempting to
start a SQL Pod with a non existent tenant id would crash. Even
worse, it would poison the tenant id causing future
crdb_internal.create_sql_tenant calls to fail.
With this fix, the cockroach mt start-sql command returns an error
for non existent tenant ids and future calls to
crdb_internal.create_sql_tenant are successful.

Resolves: #64963

Release note (bug fix): The cockroach mt start-sql command with a
non existent tenant id returns an error. Previously, it would
crash and poison the tenant id for future usage.

@rimadeodhar rimadeodhar requested review from knz and ajwerner May 25, 2021 20:16
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@rimadeodhar rimadeodhar requested a review from a team May 25, 2021 21:15
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @rimadeodhar)


pkg/server/server_sql.go, line 763 at r1 (raw file):

	// Tenant has been bootstrapped successfully
	return nil
}

How about we put this in server/tenant.go? Also, can you explain in a sentence that the heuristic: Check if the system database entry in the namespace table exists inside the tenant and that is used because it is part of the bootstrap data for a tenant keyspace.


pkg/server/server_sql.go, line 788 at r1 (raw file):

Quoted 7 lines of code…
	// Confirm tenant exists prior to initialization. This is a sanity
	// check for the dev environment to ensure that a tenant has been
	// successfully created before attempting to initialize a SQL
	// server for it.
	if err := checkTenantExists(ctx, s.execCfg.Codec, s.execCfg.DB); err != nil {
		return err
	}

how do you feel about putting this inside the above tenant block?

Copy link
Collaborator Author

@rimadeodhar rimadeodhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)


pkg/server/server_sql.go, line 763 at r1 (raw file):

Previously, ajwerner wrote…

How about we put this in server/tenant.go? Also, can you explain in a sentence that the heuristic: Check if the system database entry in the namespace table exists inside the tenant and that is used because it is part of the bootstrap data for a tenant keyspace.

I can move this function to server/tenant.go and add a comment. I put it in here since for now, its mainly getting used within server_sql code. But don't feel very strongly about this.


pkg/server/server_sql.go, line 788 at r1 (raw file):

Previously, ajwerner wrote…
	// Confirm tenant exists prior to initialization. This is a sanity
	// check for the dev environment to ensure that a tenant has been
	// successfully created before attempting to initialize a SQL
	// server for it.
	if err := checkTenantExists(ctx, s.execCfg.Codec, s.execCfg.DB); err != nil {
		return err
	}

how do you feel about putting this inside the above tenant block?

Do you mean putting this function call within StartTenant in tenant.go? If so, my initial thought was to put the call there. However, during testing, I realized that I need the s.tenantConnect.Start call to have completed so I would have to put the tenant Exists check after the call to preStart. That seemed inefficient as there seemed to be lot of functionality within preStart thats's unnecessary if the tenant hasn't been created and poisons the tenant id for future use. So I made the call to the checkTenantExists function within preStart,

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @rimadeodhar)


pkg/server/server_sql.go, line 752 at r1 (raw file):

// has been bootstrapped for the tenant. This is not a complete check and is only sufficient
// to be used in the dev environment.
func checkTenantExists(ctx context.Context, codec keys.SQLCodec, db *kv.DB) error {

Can you rename this to maybeCheckTenantExists and then return early if codec.ForSystemTenant()?


pkg/server/server_sql.go, line 761 at r1 (raw file):

		return errors.New("System DB uninitialized, check if tenant is non existent")
	}
	// Tenant has been bootstrapped successfully

nit: period on sentences


pkg/server/server_sql.go, line 763 at r1 (raw file):

Previously, rimadeodhar (Rima Deodhar) wrote…

I can move this function to server/tenant.go and add a comment. I put it in here since for now, its mainly getting used within server_sql code. But don't feel very strongly about this.

I don't care much about where you put the function, but the commentary would be appreciated.


pkg/server/server_sql.go, line 788 at r1 (raw file):

Previously, rimadeodhar (Rima Deodhar) wrote…

Do you mean putting this function call within StartTenant in tenant.go? If so, my initial thought was to put the call there. However, during testing, I realized that I need the s.tenantConnect.Start call to have completed so I would have to put the tenant Exists check after the call to preStart. That seemed inefficient as there seemed to be lot of functionality within preStart thats's unnecessary if the tenant hasn't been created and poisons the tenant id for future use. So I made the call to the checkTenantExists function within preStart,

I was thinking just up one line to be under the tenantConnect conditional. I'm realizing that doesn't work for the use case we're considering because for testing we don't use a tenantConnect.

Copy link
Collaborator Author

@rimadeodhar rimadeodhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @knz, and @rimadeodhar)


pkg/server/server_sql.go, line 752 at r1 (raw file):

Previously, ajwerner wrote…

Can you rename this to maybeCheckTenantExists and then return early if codec.ForSystemTenant()?

Sure, will do.


pkg/server/server_sql.go, line 763 at r1 (raw file):

Previously, ajwerner wrote…

I don't care much about where you put the function, but the commentary would be appreciated.

Okay, I'll add the additional comments.

@rimadeodhar rimadeodhar force-pushed the check_tenant_exists branch from eadc7f7 to 0ad0476 Compare May 26, 2021 21:21
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, last nice thing would be a test to show this works. It's probably as simple as adding a flag in TestTenantArgs to skip the check whether it exists and then also setting Existing and calling StartTenant on the TestServer.

rowCount, err := ts.InternalExecutor().(*sql.InternalExecutor).Exec(
ctx, "testserver-check-tenant-active", nil,
"SELECT 1 FROM system.tenants WHERE id=$1 AND active=true",
params.TenantID.ToUint64(),
)
if err != nil {
return nil, err
}
if rowCount == 0 {
return nil, errors.New("not found")
}

Reviewed 1 of 2 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @rimadeodhar)


pkg/server/server_sql.go, line 754 at r2 (raw file):

func maybeCheckTenantExists(ctx context.Context, codec keys.SQLCodec, db *kv.DB) error {
	if codec.ForSystemTenant() {
		// Skip check for system tenant and return early

nit: period

Copy link
Collaborator Author

@rimadeodhar rimadeodhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in addition to local run test, I did try previously to add a unit test for this method. I used the same logic that you mentioned here and used a non existent tenant id. Oddly enough, the maybeCheckTenantExists as implemented above passes. While debugging it, I wasn't able to figure out whether the test environment mocks the db state somehow and so the system db shows up as existing. I'm trying to figure out why the check passes in the unit test environment.
The code definitely works as expected while running locally and fails for non existent tenants with the following error:

❯ cockroach mt start-sql --insecure --tenant-id 678 --http-addr 127.0.0.1:8081 --kv-addrs 127.0.0.1:26257 --sql-addr 127.0.0.1:36257 --log-dir=logs/mt-678 --logtostderr=INFO                  
                                                                                                                                                                                               
Flag --log-dir has been deprecated, use --log instead to specify 'file-defaults: {dir: ...}'                                                                                                   
Flag --logtostderr has been deprecated, use --log instead to specify 'sinks: {stderr: {filter: ...}}'.                                                                                         
W210525 18:20:37.680600 1 1@cli/start.go:1010  [-] 1  ALL SECURITY CONTROLS HAVE BEEN DISABLED!                                                                                                
W210525 18:20:37.680600 1 1@cli/start.go:1010  [-] 1 +                                                                                                                                         
W210525 18:20:37.680600 1 1@cli/start.go:1010  [-] 1 +This mode is intended for non-production testing only.                                                                                   
W210525 18:20:37.680600 1 1@cli/start.go:1010  [-] 1 +                                                                                                                                         
W210525 18:20:37.680600 1 1@cli/start.go:1010  [-] 1 +In this mode:                                                                                                                            
W210525 18:20:37.680600 1 1@cli/start.go:1010  [-] 1 +- Your cluster is open to any client that can access any of your IP addresses.                                                           
W210525 18:20:37.680600 1 1@cli/start.go:1010  [-] 1 +- Intruders with access to your machine or network can observe client-server traffic.                                                    
W210525 18:20:37.680600 1 1@cli/start.go:1010  [-] 1 +- Intruders can log in without password and read or write any data in the cluster.                                                       
W210525 18:20:37.680600 1 1@cli/start.go:1010  [-] 1 +- Intruders can consume all your server's resources and cause unavailability.                                                            
I210525 18:20:37.681510 1 1@cli/start.go:1020  [-] 2  To start a secure server without mandating TLS for clients,                                                                              
I210525 18:20:37.681510 1 1@cli/start.go:1020  [-] 2 +consider --accept-sql-without-tls instead. For other options, see:                                                                       
I210525 18:20:37.681510 1 1@cli/start.go:1020  [-] 2 +                                                                                                                                         
I210525 18:20:37.681510 1 1@cli/start.go:1020  [-] 2 +- https://go.crdb.dev/issue-v/53404/v21.2                                                                                                
I210525 18:20:37.681510 1 1@cli/start.go:1020  [-] 2 +- https://www.cockroachlabs.com/docs/v21.2/secure-a-cluster.html                                                                         
W210525 18:20:37.681592 1 1@cli/start.go:976  [-] 3  Using the default setting for --cache (128 MiB).                                                                                          
W210525 18:20:37.681592 1 1@cli/start.go:976  [-] 3 +  A significantly larger value is usually needed for good performance.                                                                    
W210525 18:20:37.681592 1 1@cli/start.go:976  [-] 3 +  If you have a dedicated server a reasonable setting is --cache=.25 (8.0 GiB).                                                           
I210525 18:20:37.681640 1 1@cli/start.go:1049  [-] 4  CockroachDB CCL v21.2.0-alpha.00000000-754-g5980f9c353-dirty (x86_64-apple-darwin20.4.0, built 2021/05/25 18:19:32, go1.15.11)           
I210525 18:20:37.683532 1 1@cli/start.go:1129  [-] 5  GEOS loaded from directory lib                                                                                                           
I210525 18:20:37.692857 1 3@vendor/github.com/cockroachdb/pebble/version_set.go:156  [pebble] 6  [JOB 1] MANIFEST created 000001                                                               
I210525 18:20:37.693265 1 3@vendor/github.com/cockroachdb/pebble/open.go:305  [pebble] 7  [JOB 1] WAL created 000002                                                                           
I210525 18:20:37.693671 40 3@vendor/github.com/cockroachdb/pebble/table_stats.go:118  [pebble] 8  [JOB 2] all initial table stats loaded                                                       
I210525 18:20:37.705098 1 server/goroutinedumper/goroutinedumper.go:120  [-] 9  writing goroutine dumps to /Users/rdeodhar/go/src/github.com/cockroachdb/cockroach/logs/mt-678/goroutine_dump  
I210525 18:20:37.705316 1 server/heapprofiler/heapprofiler.go:49  [-] 10  writing go heap profiles to /Users/rdeodhar/go/src/github.com/cockroachdb/cockroach/logs/mt-678/heap_profiler at leas
t every 1h0m0s
I210525 18:20:37.705347 1 server/heapprofiler/cgoprofiler.go:53  [-] 11  to enable jmalloc profiling: "export MALLOC_CONF=prof:true" or "ln -s prof:true /etc/malloc.conf"
I210525 18:20:37.705369 1 server/heapprofiler/statsprofiler.go:54  [-] 12  writing memory stats to /Users/rdeodhar/go/src/github.com/cockroachdb/cockroach/logs/mt-678/heap_profiler at last ev
ery 1h0m0s
W210525 18:20:37.710999 127 ccl/kvccl/kvtenantccl/connector.go:159  [sql,tenant-connector] 13  error consuming GossipSubscription RPC: rpc error: code = Canceled desc = context canceled
ERROR: System DB uninitialized, check if tenant is non existent
Failed running "mt start-sql"

So there is definitely something in the test framework which leads this tenant existence check to pass even for a non existent tenant.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @rimadeodhar)

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you push the test so we can get to the bottom of it?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @rimadeodhar)

Copy link
Collaborator Author

@rimadeodhar rimadeodhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nvm, I think I figured out whats going on. I'll publish a new iteration with the unit test :).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @rimadeodhar)

Copy link
Collaborator Author

@rimadeodhar rimadeodhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for completeness, let me explain the reason. The issue I was facing was because I was using a tenant id without a security certificate. So it would block at the tenantConnect.Start and fail. Using the serverutil predefined tenant ids:

func TestTenantID() roachpb.TenantID {
worked as expected.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @rimadeodhar)

@rimadeodhar rimadeodhar force-pushed the check_tenant_exists branch from 0ad0476 to bc43a30 Compare May 27, 2021 18:50
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 2 files at r2, 4 of 4 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz and @rimadeodhar)


pkg/server/server_sql.go, line 763 at r3 (raw file):

	}
	if result.Value == nil || result.ValueInt() != keys.SystemDatabaseID {
		return errors.New("System DB uninitialized, check if tenant is non existent")

nit: please do not use capital letters to start errors (see https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/1676804870/Error+concepts+and+handling+in+CockroachDB#Error-messages%2C-hints-and-codes)


pkg/server/server_sql.go, line 766 at r3 (raw file):

	}
	// Tenant has been confirmed to be bootstrapped successfully
	// as system database, which is a part of the bootstrap data for

nit: "the system database"

@rimadeodhar rimadeodhar force-pushed the check_tenant_exists branch from bc43a30 to ad13981 Compare June 1, 2021 18:42
@rimadeodhar rimadeodhar requested review from a team and adityamaru and removed request for a team June 1, 2021 18:42
This commit fixes an issue where attempting to start a SQL Pod
through cockroach mt start-sql command returns an error if the
tenant id has not been created before. Previously, attempting to
start a SQL Pod with a non existent tenant id would crash. Even
worse, it would poison the tenant id causing future
crdb_internal.create_sql_tenant calls to fail.
With this fix, the cockroach mt start-sql command returns an error
for non existent tenant ids and future calls to
crdb_internal.create_sql_tenant are successful.
Resolves: cockroachdb#64963
Release note (bug fix): The cockroach mt start-sql command with a
non existent tenant id returns an error. Previously, it would
crash and poison the tenant id for future usage.
@rimadeodhar rimadeodhar force-pushed the check_tenant_exists branch from ad13981 to 57cad09 Compare June 1, 2021 19:12
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong:

Reviewed 5 of 5 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru and @knz)

@rimadeodhar
Copy link
Collaborator Author

TFTR

bors r=ajwerner

@craig
Copy link
Contributor

craig bot commented Jun 1, 2021

Build succeeded:

@craig craig bot merged commit 94e48ae into cockroachdb:master Jun 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kvserver: starting a SQL pod for an invalid tenant ID causes a panic and poisons that tenant ID
3 participants