Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: don't distribute migration queries #44102

Merged
merged 2 commits into from
Jan 23, 2020

Conversation

andreimatei
Copy link
Contributor

This patch inhibits DistSQL distribution for the queries that the
migrations run. This was prompted by #44101, which is causing a
distributed query done soon after a node startup to sometimes fail.

I've considered more bluntly disabling distribution for any query for a
short period of time after the node starts up, but I went with the more
targeted change to migrations because I think it's a bad idea for
migrations to use query distribution even outside of #44101 -
distributed queries are more fragile than local execution in general
(for example, because of DistSender retries). And migrations can't
tolerate any flakiness.

Fixes #43957
Fixes #44005
Touches #44101

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@solongordon solongordon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Thank you for digging into this issue!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @knz, and @solongordon)

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)


pkg/server/server.go, line 1603 at r1 (raw file):

		mmKnobs = *migrationManagerTestingKnobs.(*sqlmigrations.MigrationManagerTestingKnobs)
	}
	migrationsExecutor := sql.NewSessionBoundInternalExecutor(

I wonder if it's not desirable instead to temporarily override the ExecCfg.InternalExecutor for the duration of the migrations. I am asking this because each of the SQL queries executed by a migration can itself internally issue some other SQL via the server-wide InternalExecutor and that could get accidentally distributed.

The alternative would be to temporarily set the server-wide InternalExecutor to nil to guarantee a failure until the migrations complete? I'm not sure.


pkg/server/server.go, line 1606 at r1 (raw file):

		ctx,
		&sessiondata.SessionData{
			User: security.NodeUser,

Out of curiosity, why did you choose NodeUser for this? This is highly unusual, nowhere else do we use SQL with that user. At the very least explain in a comment.

@andreimatei andreimatei force-pushed the sql.migrations-no-distsql branch from 4a7bd94 to a15461d Compare January 17, 2020 14:38
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto and @knz)


pkg/server/server.go, line 1603 at r1 (raw file):

Previously, knz (kena) wrote…

I wonder if it's not desirable instead to temporarily override the ExecCfg.InternalExecutor for the duration of the migrations. I am asking this because each of the SQL queries executed by a migration can itself internally issue some other SQL via the server-wide InternalExecutor and that could get accidentally distributed.

The alternative would be to temporarily set the server-wide InternalExecutor to nil to guarantee a failure until the migrations complete? I'm not sure.

I'm reticent to muck with Server.InternalExecutor in order to make migrations safer. Simply setting it to nil for a while would be too ugly for my taste and also I don't think it would even work because I'm pretty sure we've started a bunch of background work that needs it by now - notice that these migrations are run pretty late.
There are ways to prohibit the distribution for all queries during server startup, and even for a while after startup, regardless of what executor is used. Given #44101, that's not too unreasonable (in fact, I even coded it up), but I backed out cause I'd rather do a better fix for #44101 which is a bug beyond DistSQL.
For migrations in particular, it seems to me that we generally need to make them more resilient to failures of other nodes by retrying them for a while. Even without DistSQL distribution, they are currently plenty vulnerable to node failure.
And then generally, we should also make DistSQL more resilient to node failure - in particular we should be resilient to failures of scheduling a flow on a remote node by moving that flow locally. This wouldn't cover #44101, but if #44101 were fixed, it'd go a long way towards making distributed queries as resilient (or, as fragile) as non-distributed ones. At that point I think there'd be little reason to inhibit distribution for migration queries, and so I'd rather not deal with it more than I already have if I'm not forced too.

Having said all this, I also find it hard to imagine that a migration would run a SQL query that in turn would use the executor internally to run another query which other query ends up being distributed. The uses of the internal executor from within SQL queries that I found are pretty narrow, and mostly around crdb_internal tables which are not distributed. But if you open my eyes, I mind change my mind.


pkg/server/server.go, line 1606 at r1 (raw file):

Previously, knz (kena) wrote…

Out of curiosity, why did you choose NodeUser for this? This is highly unusual, nowhere else do we use SQL with that user. At the very least explain in a comment.

Added a comment. It's because some migrations already were running under it for permissions. Now all are.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @asubiotto, and @knz)


pkg/server/server.go, line 1603 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm reticent to muck with Server.InternalExecutor in order to make migrations safer. Simply setting it to nil for a while would be too ugly for my taste and also I don't think it would even work because I'm pretty sure we've started a bunch of background work that needs it by now - notice that these migrations are run pretty late.
There are ways to prohibit the distribution for all queries during server startup, and even for a while after startup, regardless of what executor is used. Given #44101, that's not too unreasonable (in fact, I even coded it up), but I backed out cause I'd rather do a better fix for #44101 which is a bug beyond DistSQL.
For migrations in particular, it seems to me that we generally need to make them more resilient to failures of other nodes by retrying them for a while. Even without DistSQL distribution, they are currently plenty vulnerable to node failure.
And then generally, we should also make DistSQL more resilient to node failure - in particular we should be resilient to failures of scheduling a flow on a remote node by moving that flow locally. This wouldn't cover #44101, but if #44101 were fixed, it'd go a long way towards making distributed queries as resilient (or, as fragile) as non-distributed ones. At that point I think there'd be little reason to inhibit distribution for migration queries, and so I'd rather not deal with it more than I already have if I'm not forced too.

Having said all this, I also find it hard to imagine that a migration would run a SQL query that in turn would use the executor internally to run another query which other query ends up being distributed. The uses of the internal executor from within SQL queries that I found are pretty narrow, and mostly around crdb_internal tables which are not distributed. But if you open my eyes, I mind change my mind.

I'll take that explanation. But that deserves a clarifying comment:

// Although SQL queries may also use the server-wide InternalExecutor, we are assuming that the uses of the internal executor
// from within SQL queries run by migrations only use crdb_internal tables and do not get distributed by the query planner.
//
// TODO(andrei): Revisit this when db-wide FK and PK migrations are introduced that require distributed processors.

pkg/server/server.go, line 1606 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Added a comment. It's because some migrations already were running under it for permissions. Now all are.

Woah no that doesn't seem right. If you create a db or table with NodeUser it will have its SQL privileges set only for node and then neither root nor anything else can access it.

node really should not be used by default.

@andreimatei andreimatei force-pushed the sql.migrations-no-distsql branch 2 times, most recently from 3e23927 to c0af79e Compare January 18, 2020 00:11
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prepended a commit reworking the internal executor. PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto and @knz)


pkg/server/server.go, line 1603 at r1 (raw file):

Previously, knz (kena) wrote…

I'll take that explanation. But that deserves a clarifying comment:

// Although SQL queries may also use the server-wide InternalExecutor, we are assuming that the uses of the internal executor
// from within SQL queries run by migrations only use crdb_internal tables and do not get distributed by the query planner.
//
// TODO(andrei): Revisit this when db-wide FK and PK migrations are introduced that require distributed processors.

I've considered it Rafa, but I don't think any of this would make much sense to a reader. I'd leave it as is...


pkg/server/server.go, line 1606 at r1 (raw file):

Previously, knz (kena) wrote…

Woah no that doesn't seem right. If you create a db or table with NodeUser it will have its SQL privileges set only for node and then neither root nor anything else can access it.

node really should not be used by default.

As discussed, node didn't seem to cause problems, but I did get sniped. Through some maximal intrusion, we can now bind the DistSQLMode variable, but let the user be controlled by the migrations - and so most run as root again.
https://www.youtube.com/watch?v=AbSehcT19u0

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First commit looks generally good, but please also create a separate PR with just that commit so we can see what TC thinks about it.

2nd and 3rd commit LGTM and good to go once the first commit is cleared to go too.

Reviewed 15 of 15 files at r2, 4 of 4 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @asubiotto, and @knz)


pkg/sql/conn_executor.go, line 478 at r2 (raw file):

// populateMinimalSessionData populates sd with some minimal values needed for
// not crashing.

Expand the comment to mention that the function is called on already-populated SessionData objects coming from a SQL client, so it must not override fields already set.


pkg/sql/virtual_schema.go, line 252 at r2 (raw file):

		} else {
			if !e.validWithNoDatabaseContext {
				debug.PrintStack()

1st commit: remove this


pkg/sql/logictest/testdata/logic_test/event_log, line 368 at r4 (raw file):

ORDER BY "timestamp"
----
0  1  {"SettingName":"diagnostics.reporting.enabled","Value":"true","User":"node"}

I think you need to revert this now

@andreimatei andreimatei force-pushed the sql.migrations-no-distsql branch from c0af79e to bbc3a34 Compare January 21, 2020 19:24
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first commit is now #44170

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto and @knz)


pkg/sql/conn_executor.go, line 478 at r2 (raw file):

Previously, knz (kena) wrote…

Expand the comment to mention that the function is called on already-populated SessionData objects coming from a SQL client, so it must not override fields already set.

done


pkg/sql/virtual_schema.go, line 252 at r2 (raw file):

Previously, knz (kena) wrote…

1st commit: remove this

:S
done


pkg/sql/logictest/testdata/logic_test/event_log, line 368 at r4 (raw file):

Previously, knz (kena) wrote…

I think you need to revert this now

done

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending resolution of the remaining error in #44170.

Reviewed 3 of 8 files at r5, 3 of 3 files at r7.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto and @knz)

It wasn't necessary. Run it regularly, as root.

Release note: None
This patch inhibits DistSQL distribution for the queries that the
migrations run. This was prompted by cockroachdb#44101, which is causing a
distributed query done soon after a node startup to sometimes fail.

I've considered more bluntly disabling distribution for any query for a
short period of time after the node starts up, but I went with the more
targeted change to migrations because I think it's a bad idea for
migrations to use query distribution even outside of cockroachdb#44101 -
distributed queries are more fragile than local execution in general
(for example, because of DistSender retries). And migrations can't
tolerate any flakiness.

Fixes cockroachdb#43957
Fixes cockroachdb#44005
Touches cockroachdb#44101

Release note: None
@andreimatei andreimatei force-pushed the sql.migrations-no-distsql branch from bbc3a34 to e12735f Compare January 23, 2020 19:45
@andreimatei
Copy link
Contributor Author

bors r+

craig bot pushed a commit that referenced this pull request Jan 23, 2020
44102: sql: don't distribute migration queries r=andreimatei a=andreimatei

This patch inhibits DistSQL distribution for the queries that the
migrations run. This was prompted by #44101, which is causing a
distributed query done soon after a node startup to sometimes fail.

I've considered more bluntly disabling distribution for any query for a
short period of time after the node starts up, but I went with the more
targeted change to migrations because I think it's a bad idea for
migrations to use query distribution even outside of #44101 -
distributed queries are more fragile than local execution in general
(for example, because of DistSender retries). And migrations can't
tolerate any flakiness.

Fixes #43957
Fixes #44005
Touches #44101

Co-authored-by: Andrei Matei <[email protected]>
@craig
Copy link
Contributor

craig bot commented Jan 23, 2020

Build succeeded

@craig craig bot merged commit e12735f into cockroachdb:master Jan 23, 2020
@andreimatei andreimatei deleted the sql.migrations-no-distsql branch January 29, 2020 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: acceptance/version-upgrade failed roachtest: acceptance/version-upgrade failed
4 participants