-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: don't distribute migration queries #44102
sql: don't distribute migration queries #44102
Conversation
34d85e2
to
4a7bd94
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for digging into this issue!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @knz, and @solongordon)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 7 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)
pkg/server/server.go, line 1603 at r1 (raw file):
mmKnobs = *migrationManagerTestingKnobs.(*sqlmigrations.MigrationManagerTestingKnobs) } migrationsExecutor := sql.NewSessionBoundInternalExecutor(
I wonder if it's not desirable instead to temporarily override the ExecCfg.InternalExecutor for the duration of the migrations. I am asking this because each of the SQL queries executed by a migration can itself internally issue some other SQL via the server-wide InternalExecutor and that could get accidentally distributed.
The alternative would be to temporarily set the server-wide InternalExecutor to nil
to guarantee a failure until the migrations complete? I'm not sure.
pkg/server/server.go, line 1606 at r1 (raw file):
ctx, &sessiondata.SessionData{ User: security.NodeUser,
Out of curiosity, why did you choose NodeUser for this? This is highly unusual, nowhere else do we use SQL with that user. At the very least explain in a comment.
4a7bd94
to
a15461d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto and @knz)
pkg/server/server.go, line 1603 at r1 (raw file):
Previously, knz (kena) wrote…
I wonder if it's not desirable instead to temporarily override the ExecCfg.InternalExecutor for the duration of the migrations. I am asking this because each of the SQL queries executed by a migration can itself internally issue some other SQL via the server-wide InternalExecutor and that could get accidentally distributed.
The alternative would be to temporarily set the server-wide InternalExecutor to
nil
to guarantee a failure until the migrations complete? I'm not sure.
I'm reticent to muck with Server.InternalExecutor
in order to make migrations safer. Simply setting it to nil for a while would be too ugly for my taste and also I don't think it would even work because I'm pretty sure we've started a bunch of background work that needs it by now - notice that these migrations are run pretty late.
There are ways to prohibit the distribution for all queries during server startup, and even for a while after startup, regardless of what executor is used. Given #44101, that's not too unreasonable (in fact, I even coded it up), but I backed out cause I'd rather do a better fix for #44101 which is a bug beyond DistSQL.
For migrations in particular, it seems to me that we generally need to make them more resilient to failures of other nodes by retrying them for a while. Even without DistSQL distribution, they are currently plenty vulnerable to node failure.
And then generally, we should also make DistSQL more resilient to node failure - in particular we should be resilient to failures of scheduling a flow on a remote node by moving that flow locally. This wouldn't cover #44101, but if #44101 were fixed, it'd go a long way towards making distributed queries as resilient (or, as fragile) as non-distributed ones. At that point I think there'd be little reason to inhibit distribution for migration queries, and so I'd rather not deal with it more than I already have if I'm not forced too.
Having said all this, I also find it hard to imagine that a migration would run a SQL query that in turn would use the executor internally to run another query which other query ends up being distributed. The uses of the internal executor from within SQL queries that I found are pretty narrow, and mostly around crdb_internal tables which are not distributed. But if you open my eyes, I mind change my mind.
pkg/server/server.go, line 1606 at r1 (raw file):
Previously, knz (kena) wrote…
Out of curiosity, why did you choose NodeUser for this? This is highly unusual, nowhere else do we use SQL with that user. At the very least explain in a comment.
Added a comment. It's because some migrations already were running under it for permissions. Now all are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @asubiotto, and @knz)
pkg/server/server.go, line 1603 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I'm reticent to muck with
Server.InternalExecutor
in order to make migrations safer. Simply setting it to nil for a while would be too ugly for my taste and also I don't think it would even work because I'm pretty sure we've started a bunch of background work that needs it by now - notice that these migrations are run pretty late.
There are ways to prohibit the distribution for all queries during server startup, and even for a while after startup, regardless of what executor is used. Given #44101, that's not too unreasonable (in fact, I even coded it up), but I backed out cause I'd rather do a better fix for #44101 which is a bug beyond DistSQL.
For migrations in particular, it seems to me that we generally need to make them more resilient to failures of other nodes by retrying them for a while. Even without DistSQL distribution, they are currently plenty vulnerable to node failure.
And then generally, we should also make DistSQL more resilient to node failure - in particular we should be resilient to failures of scheduling a flow on a remote node by moving that flow locally. This wouldn't cover #44101, but if #44101 were fixed, it'd go a long way towards making distributed queries as resilient (or, as fragile) as non-distributed ones. At that point I think there'd be little reason to inhibit distribution for migration queries, and so I'd rather not deal with it more than I already have if I'm not forced too.Having said all this, I also find it hard to imagine that a migration would run a SQL query that in turn would use the executor internally to run another query which other query ends up being distributed. The uses of the internal executor from within SQL queries that I found are pretty narrow, and mostly around crdb_internal tables which are not distributed. But if you open my eyes, I mind change my mind.
I'll take that explanation. But that deserves a clarifying comment:
// Although SQL queries may also use the server-wide InternalExecutor, we are assuming that the uses of the internal executor
// from within SQL queries run by migrations only use crdb_internal tables and do not get distributed by the query planner.
//
// TODO(andrei): Revisit this when db-wide FK and PK migrations are introduced that require distributed processors.
pkg/server/server.go, line 1606 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Added a comment. It's because some migrations already were running under it for permissions. Now all are.
Woah no that doesn't seem right. If you create a db or table with NodeUser it will have its SQL privileges set only for node
and then neither root
nor anything else can access it.
node
really should not be used by default.
3e23927
to
c0af79e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prepended a commit reworking the internal executor. PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto and @knz)
pkg/server/server.go, line 1603 at r1 (raw file):
Previously, knz (kena) wrote…
I'll take that explanation. But that deserves a clarifying comment:
// Although SQL queries may also use the server-wide InternalExecutor, we are assuming that the uses of the internal executor // from within SQL queries run by migrations only use crdb_internal tables and do not get distributed by the query planner. // // TODO(andrei): Revisit this when db-wide FK and PK migrations are introduced that require distributed processors.
I've considered it Rafa, but I don't think any of this would make much sense to a reader. I'd leave it as is...
pkg/server/server.go, line 1606 at r1 (raw file):
Previously, knz (kena) wrote…
Woah no that doesn't seem right. If you create a db or table with NodeUser it will have its SQL privileges set only for
node
and then neitherroot
nor anything else can access it.
node
really should not be used by default.
As discussed, node
didn't seem to cause problems, but I did get sniped. Through some maximal intrusion, we can now bind the DistSQLMode
variable, but let the user be controlled by the migrations - and so most run as root
again.
https://www.youtube.com/watch?v=AbSehcT19u0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First commit looks generally good, but please also create a separate PR with just that commit so we can see what TC thinks about it.
2nd and 3rd commit LGTM and good to go once the first commit is cleared to go too.
Reviewed 15 of 15 files at r2, 4 of 4 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @asubiotto, and @knz)
pkg/sql/conn_executor.go, line 478 at r2 (raw file):
// populateMinimalSessionData populates sd with some minimal values needed for // not crashing.
Expand the comment to mention that the function is called on already-populated SessionData objects coming from a SQL client, so it must not override fields already set.
pkg/sql/virtual_schema.go, line 252 at r2 (raw file):
} else { if !e.validWithNoDatabaseContext { debug.PrintStack()
1st commit: remove this
pkg/sql/logictest/testdata/logic_test/event_log, line 368 at r4 (raw file):
ORDER BY "timestamp" ---- 0 1 {"SettingName":"diagnostics.reporting.enabled","Value":"true","User":"node"}
I think you need to revert this now
c0af79e
to
bbc3a34
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first commit is now #44170
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto and @knz)
pkg/sql/conn_executor.go, line 478 at r2 (raw file):
Previously, knz (kena) wrote…
Expand the comment to mention that the function is called on already-populated SessionData objects coming from a SQL client, so it must not override fields already set.
done
pkg/sql/virtual_schema.go, line 252 at r2 (raw file):
Previously, knz (kena) wrote…
1st commit: remove this
:S
done
pkg/sql/logictest/testdata/logic_test/event_log, line 368 at r4 (raw file):
Previously, knz (kena) wrote…
I think you need to revert this now
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM pending resolution of the remaining error in #44170.
Reviewed 3 of 8 files at r5, 3 of 3 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto and @knz)
It wasn't necessary. Run it regularly, as root. Release note: None
This patch inhibits DistSQL distribution for the queries that the migrations run. This was prompted by cockroachdb#44101, which is causing a distributed query done soon after a node startup to sometimes fail. I've considered more bluntly disabling distribution for any query for a short period of time after the node starts up, but I went with the more targeted change to migrations because I think it's a bad idea for migrations to use query distribution even outside of cockroachdb#44101 - distributed queries are more fragile than local execution in general (for example, because of DistSender retries). And migrations can't tolerate any flakiness. Fixes cockroachdb#43957 Fixes cockroachdb#44005 Touches cockroachdb#44101 Release note: None
bbc3a34
to
e12735f
Compare
bors r+ |
44102: sql: don't distribute migration queries r=andreimatei a=andreimatei This patch inhibits DistSQL distribution for the queries that the migrations run. This was prompted by #44101, which is causing a distributed query done soon after a node startup to sometimes fail. I've considered more bluntly disabling distribution for any query for a short period of time after the node starts up, but I went with the more targeted change to migrations because I think it's a bad idea for migrations to use query distribution even outside of #44101 - distributed queries are more fragile than local execution in general (for example, because of DistSender retries). And migrations can't tolerate any flakiness. Fixes #43957 Fixes #44005 Touches #44101 Co-authored-by: Andrei Matei <[email protected]>
Build succeeded |
This patch inhibits DistSQL distribution for the queries that the
migrations run. This was prompted by #44101, which is causing a
distributed query done soon after a node startup to sometimes fail.
I've considered more bluntly disabling distribution for any query for a
short period of time after the node starts up, but I went with the more
targeted change to migrations because I think it's a bad idea for
migrations to use query distribution even outside of #44101 -
distributed queries are more fragile than local execution in general
(for example, because of DistSender retries). And migrations can't
tolerate any flakiness.
Fixes #43957
Fixes #44005
Touches #44101