-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rebuild SrvVSchema during InitTablet. #6276
Conversation
We used to rebuild SrvVSchema in all cells as part of running GetOrCreateShard from any cell. That caused VSchema problems to spread instantly to all cells at once, so we changed EnsureVSchema to only rebuild SrvVSchema in the local cell. We believed that at least one tablet in a given cell would still end up rebuilding SrvVschema in the cell since they all call GetOrCreateShard at startup. However, I didn't realize that GetOrCreateShard short-circuits if the shard already exists, so it was skipping EnsureVSchema entirely. This moves the rebuild of SrvVSchema to be an explicit step in InitTablet alongside the rebuild of the keyspace graph. This keeps GetOrCreateShard as a task that only needs to be done by one tablet globally because it affects global topo. The rebuilding of per-cell records should be done by a tablet in that cell. Signed-off-by: Anthony Yeh <[email protected]>
switch { | ||
case err == nil: | ||
// Check if vschema was rebuilt after the initial creation of the keyspace. | ||
if _, keyspaceExists := srvVSchema.GetKeyspaces()[*initKeyspace]; !keyspaceExists { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this also need a RebuildKeyspaceGraph
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand. Are you saying if a keyspace is missing from SrvVSchema, it implies that RebuildKeyspace has not run for that keyspace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that SrvKeyspace and SrvVSchema were completely independent. I'll double check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're probably right. I think they are independent.
But I also think that a missing SrvKeyspace is going to be a problem. And now that I think about it, there are some workflows where this has led to odd errors.
However, I think that's an orthogonal issue, and it's better to resolve that separately.
switch { | ||
case err == nil: | ||
// Check if vschema was rebuilt after the initial creation of the keyspace. | ||
if _, keyspaceExists := srvVSchema.GetKeyspaces()[*initKeyspace]; !keyspaceExists { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're probably right. I think they are independent.
But I also think that a missing SrvKeyspace is going to be a problem. And now that I think about it, there are some workflows where this has led to odd errors.
However, I think that's an orthogonal issue, and it's better to resolve that separately.
We used to rebuild SrvVSchema in all cells as part of running GetOrCreateShard from any cell. That caused VSchema problems to spread instantly to all cells at once, so we changed EnsureVSchema to only rebuild SrvVSchema in the local cell (#5930).
We believed that at least one tablet in a given cell would still end up rebuilding SrvVschema in the cell since they all call GetOrCreateShard at startup. However, I didn't realize that GetOrCreateShard
short-circuits if the shard already exists, so it was skipping EnsureVSchema entirely. As a result, only one cell (corresponding to the one tablet that won the race) actually had its SrvVSchema rebuilt.
This moves the rebuild of SrvVSchema to be an explicit step in InitTablet alongside the rebuild of the keyspace graph. This keeps GetOrCreateShard as a task that only needs to be done by one tablet globally because it affects global topo. The rebuilding of per-cell records should be done by a tablet in that cell.