sql: PartitionSpan should only use healthy nodes in mixed-process mode #111337
Conversation
Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rhu713 and @stevendanna)
-- commits
line 26 at r1:
It's not a concern because we don't plan to bump DistSQL version anymore (#98550).
pkg/sql/distsql_physical_planner.go
line 1402 at r1 (raw file):
ctx context.Context, planCtx *PlanningCtx,
) func(nodeID roachpb.NodeID) base.SQLInstanceID {
	allhealthy, err := dsp.sqlAddressResolver.GetAllInstances(ctx)
nit: s/allhealthy/allHealthy/
pkg/sql/distsql_physical_planner.go
line 1405 at r1 (raw file):
if err != nil {
	log.Warningf(ctx, "could not get all instances: %v", err)
	return func(_ roachpb.NodeID) base.SQLInstanceID {
nit: we already have dsp.alwaysUseGateway defined for this.
pkg/sql/distsql_physical_planner.go
line 1409 at r1 (raw file):
	}
}
healthyNodes := make(map[base.SQLInstanceID]struct{})
nit: we could pre-size the map with len(allhealthy)
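Taken together, the three nits above amount to something like the following sketch of the hunk. This is an inference, not the final patch: it assumes dsp.alwaysUseGateway has the same func(roachpb.NodeID) base.SQLInstanceID shape as the closure it replaces, and that GetAllInstances returns a slice whose elements carry an InstanceID field.

```go
// Sketch of the quoted hunk with the nits applied (names inferred).
allHealthy, err := dsp.sqlAddressResolver.GetAllInstances(ctx)
if err != nil {
	log.Warningf(ctx, "could not get all instances: %v", err)
	// Reuse the existing helper rather than defining a new closure.
	return dsp.alwaysUseGateway
}
// Pre-sizing with len(allHealthy) avoids rehashing while the map fills.
healthyNodes := make(map[base.SQLInstanceID]struct{}, len(allHealthy))
for _, instance := range allHealthy {
	healthyNodes[instance.InstanceID] = struct{}{}
}
```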
Also, do we want to backport this to 23.1?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rhu713 and @stevendanna)
Force-pushed from df55b39 to 5a503e4.
Force-pushed from 5a503e4 to 44fac37.
bors r=yuzefovich
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from 44fac37 to blathers/backport-release-23.1-111337: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
You may need to manually resolve merge conflicts with the backport tool.
Backport to branch 23.1.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Previously, when running in mixed-process mode, the DistSQLPlanner's PartitionSpans method would assume that it could directly assign a given span to the SQLInstanceID that matches the NodeID of whatever replica the current replica oracle returned, without regard to whether the SQL instance was available.
This is different from the system tenant code paths, which proactively check node health, and from the non-mixed-process MT code paths, which use an eventually consistent view of healthy nodes.
As a result, operations that use PartitionSpans, such as BACKUP, could fail when a node was down.
Here, we make the mixed-process case work more like the separate-process case, in which we only use nodes returned by the instance reader. This list should eventually exclude any down nodes. A sketch of the rule follows.
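The sketch below is minimal and self-contained, with illustrative types rather than CockroachDB's actual API: the resolver maps a node to its co-located SQL instance only when the instance reader's eventually consistent view lists that instance as live, and otherwise falls back to the gateway.

```go
// Illustrative-only types and names; not the actual patch.
package main

import "fmt"

type NodeID int32
type SQLInstanceID int32

// resolver returns the instance to plan a span on, given the NodeID the
// replica oracle picked and the set of live SQL instances reported by
// the instance reader.
func resolver(live map[SQLInstanceID]struct{}, gateway SQLInstanceID) func(NodeID) SQLInstanceID {
	return func(n NodeID) SQLInstanceID {
		// In mixed-process mode the SQL instance ID matches the KV node ID.
		if _, ok := live[SQLInstanceID(n)]; ok {
			return SQLInstanceID(n)
		}
		// Node is down (or not yet visible): plan on the gateway instead.
		return gateway
	}
}

func main() {
	live := map[SQLInstanceID]struct{}{1: {}, 3: {}}
	r := resolver(live, 1)
	fmt.Println(r(3)) // 3: live, span planned on its own node
	fmt.Println(r(2)) // 1: node 2 absent from the live set, gateway used
}
```

Falling back to the gateway keeps the assignment total: every span gets an owner even while the live-instance view lags behind actual node liveness.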
An alternative (or perhaps an addition) would be to allow MT planning to do direct status checks more similar to how they are done for the system tenant.
When reading this code, I also noted that we don't do DistSQL version compatibility checks like we do in the SystemTenant case. I am not sure of the impact of that.
Finally, this also adds another error to our list of non-permanent errors. Namely, if we fail to find a SQL instance, we don't treat that as permanent.
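A hedged sketch of that classification follows; the helper name and the surrounding cases are assumptions, and only the "failing to find an instance is retryable" idea comes from this PR (errors here is github.com/cockroachdb/errors).

```go
// isPermanentBulkJobError is a hypothetical stand-in for the real
// classification helper.
func isPermanentBulkJobError(err error) bool {
	// ... other non-permanent cases elided ...
	// The instance list is eventually consistent, so a missing SQL
	// instance may reappear (or be replaced) on retry.
	if errors.Is(err, sqlinstance.NonExistentInstanceError) {
		return false
	}
	return true
}
```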
Fixes #111319
Release note (bug fix): When using a private preview of physical cluster replication, in some circumstances the source cluster would be unable to take backups when a source cluster node was unavailable.