kv: in v20.2, bootstrapping multiple stores can result in duplicate store IDs #61218

Closed
3 of 4 tasks
nvanbenschoten opened this issue Feb 27, 2021 · 5 comments
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-postmortem Originated from a Postmortem action item. S-1 High impact: many users impacted, serious risk of high unavailability or data loss

Comments

@nvanbenschoten
Member

nvanbenschoten commented Feb 27, 2021

See https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1614352496110700 for context.

We've found that in v20.2.5, a race between two bootstrapping nodes can result in store ID re-use. This is because the following code mistakenly assumes that the store ID allocated during the join RPC handshake and the additional store IDs allocated by the call to allocateStoreIDs are contiguous.

// Allocate IDs for this node's additional stores; the first store's ID was
// already assigned during the join RPC handshake.
startID, err := allocateStoreIDs(ctx, n.Descriptor.NodeID, storeIDAlloc, n.storeCfg.DB)
if firstStoreID == 0 {
	firstStoreID = startID
}

// Later, when assigning each subsequent store's ident: assumes the next ID
// simply follows the previous one.
sIdent.StoreID++

This assumption breaks when multiple nodes are bootstrapped at the same time. In that case, another node's join RPC allocation can land between this node's join RPC store ID and the block returned by allocateStoreIDs, so the two are no longer contiguous.
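To make the interleaving concrete, here is a minimal, self-contained sketch (hypothetical names; a plain counter stands in for the cluster-wide store ID sequence, not the real allocateStoreIDs) of two nodes bootstrapping at once:

package main

import (
	"fmt"
	"sync/atomic"
)

// A shared counter standing in for the cluster-wide store ID sequence that
// both the join RPC and allocateStoreIDs draw from.
var storeIDSeq int64

// allocate hands out n consecutive IDs and returns the first of them.
func allocate(n int64) int64 {
	return atomic.AddInt64(&storeIDSeq, n) - n + 1
}

func main() {
	aFirst := allocate(1) // node A's join RPC: store ID 1
	bFirst := allocate(1) // node B's join RPC interleaves: store ID 2
	aStart := allocate(2) // node A's allocateStoreIDs for two extra stores: 3 and 4

	// The buggy logic increments from the join-RPC ID instead of using aStart,
	// so node A's second store is assigned ID 2, colliding with node B.
	aSecondBuggy := aFirst + 1
	fmt.Println(aFirst, bFirst, aStart, aSecondBuggy) // 1 2 3 2
}

With the intended behavior, node A's second store would take aStart (3) rather than aFirst+1.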

There are a few action items here.

  • Fix the issue on release-20.2

First and foremost, we need to fix the bug and backport that fix to release-20.2 as soon as possible, ideally in time for v20.2.6. A strawman fix is:

diff --git a/pkg/server/node.go b/pkg/server/node.go
index 6add213574..c4d165a794 100644
--- a/pkg/server/node.go
+++ b/pkg/server/node.go
@@ -610,7 +610,11 @@ func (n *Node) bootstrapStores(
                                log.Warningf(ctx, "error doing initial gossiping: %s", err)
                        }
-                       sIdent.StoreID++
+                       if sIdent.StoreID == firstStoreID {
+                               sIdent.StoreID = startID
+                       } else {
+                               sIdent.StoreID++
+                       }
                }
        }

We probably should also not be calling allocateStoreIDs at all if storeIDAlloc is 0, or at least make it a no-op.
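For illustration, a standalone sketch of that shape (hypothetical helper names; the allocator here is a local stub so the example runs on its own, not the real allocateStoreIDs) might look like:

package main

import "fmt"

// A local stub allocator; it hands out n consecutive IDs and returns the
// first of them.
var next = 1

func allocateStoreIDs(n int) int {
	first := next
	next += n
	return first
}

// assignStoreIDs returns the IDs for all of a node's stores: the first one
// came from the join RPC, the rest (if any) from a single block allocation.
func assignStoreIDs(firstStoreID, storeIDAlloc int) []int {
	ids := []int{firstStoreID}
	// Skip the allocator round-trip entirely when there are no extra stores.
	if storeIDAlloc == 0 {
		return ids
	}
	startID := allocateStoreIDs(storeIDAlloc)
	for i := 0; i < storeIDAlloc; i++ {
		// Count up from startID, never from firstStoreID, since the two
		// blocks need not be contiguous.
		ids = append(ids, startID+i)
	}
	return ids
}

func main() {
	fmt.Println(assignStoreIDs(allocateStoreIDs(1), 0)) // [1]
	fmt.Println(assignStoreIDs(allocateStoreIDs(1), 2)) // [2 3 4]
}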

  • Unit test this scenario

We should be able to create this situation in a unit test, which will then allow us to confirm that we have properly fixed the problem.
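Sketching what such a test could assert (a standalone simulation of the allocation scheme, not the actual bootstrap path or the CockroachDB test harness; names are hypothetical):

package bootstrap_test

import (
	"sync"
	"testing"
)

// TestNoDuplicateStoreIDs simulates several multi-store "nodes" bootstrapping
// concurrently: each takes its first store ID from a shared allocator (the
// join RPC) and a contiguous block for its remaining stores, then we assert
// that no store ID was handed out twice.
func TestNoDuplicateStoreIDs(t *testing.T) {
	var mu sync.Mutex
	next := 1
	allocate := func(n int) int {
		mu.Lock()
		defer mu.Unlock()
		first := next
		next += n
		return first
	}

	const nodes, extraStores = 8, 3
	ids := make(chan int, nodes*(extraStores+1))

	var wg sync.WaitGroup
	for i := 0; i < nodes; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			firstStoreID := allocate(1)      // join RPC
			startID := allocate(extraStores) // allocateStoreIDs
			ids <- firstStoreID
			for j := 0; j < extraStores; j++ {
				ids <- startID + j // count from startID, not firstStoreID+1
			}
		}()
	}
	wg.Wait()
	close(ids)

	seen := make(map[int]bool)
	for id := range ids {
		if seen[id] {
			t.Fatalf("store ID %d allocated twice", id)
		}
		seen[id] = true
	}
}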

  • Confirm that this bug is not present before or after v20.2

It doesn't look like it is, but we should confirm.

  • Improve multi-store testing

In the past week, customers have hit this issue and #60545. We clearly aren't testing multi-store deployments enough, or we would have found them before our customers did. We now have the ability to run multi-store roachprod clusters, so we have the tools to improve our testing story here.

@nvanbenschoten added the C-bug, S-1, and A-kv labels Feb 27, 2021
@nvanbenschoten
Member Author

Am I reading #56272 and #56271 correctly in that we knew about this issue back in November?

@irfansharif
Contributor

Yes, it looks like we didn't backport because I mistakenly assumed we had (inadvertent) protection against it: #56271 (comment). I think what I was talking about there was a different double store ID allocation bug around the same code paths.

@nvanbenschoten
Member Author

I see, makes sense if we thought things were ok on 20.2. Do you have thoughts on how we'll want to patch this? Should we do the least invasive thing for release-20.2 since things are so much cleaner on master already?

irfansharif added a commit to irfansharif/cockroach that referenced this issue Mar 1, 2021
Touches cockroachdb#61218. It was possible for us to re-use store IDs in
multi-store setups when nodes were started up concurrently.
This was due to an incorrect code assumption (courtesy of yours truly)
that the firstStoreID retrieved through the join rpc was contiguous with
the store IDs allocated by the node for extra engines. This is not the
case; it's possible for concurrent node startups to allocate store IDs
for auxiliary stores in a manner that interleaves between firstStoreID and the
remaining ones.

Our naive implementation of simply incrementing from firstStoreID failed
to account for this. See TestMultiStoreIDAlloc for a reliable repro of
this race.

Release note (bug fix): Fix a bug with multi-store nodes where
concurrent node startups could result in re-use of store IDs. This could
manifest in many different ways. One example is replica thrashing due to
the store ID collision.
@irfansharif
Contributor

Yup, let's carry this over in #61262. I'm planning to add the same test to master as well.

irfansharif added a commit to irfansharif/cockroach that referenced this issue Mar 1, 2021
Touches cockroachdb#61218. TestMultiStoreIDAlloc validates that we don't
accidentally re-use or skip-over allocated store IDs in multi-store
setups.

Release justification: Non-production code changes
Release note: None
@irfansharif
Contributor

@nvanbenschoten, I'm going to close this issue out. Agreed that more multi-store testing seems prudent, both in our roachtests + unit tests. We'd have caught this earlier if we had.

craig bot pushed a commit that referenced this issue Mar 2, 2021
61264: server: introduce a test for store ID re-use r=irfansharif a=irfansharif

Touches #61218. TestMultiStoreIDAlloc validates that we don't
accidentally re-use or skip-over allocated store IDs in multi-store
setups.

Release justification: Non-production code changes
Release note: None

61355: vendor: bump pebble to 6eee4f6b45b1 r=sumeerbhola a=sumeerbhola

6eee4f6b db: cleanup Iterator.valid behavior
91c29daa Merge pull request #1072 from cockroachdb/ssd/frsize
7ea0bd98 db: remove Iterator.SetBounds optimization for unchanging bounds
6d790ae6 vfs: Use frsize in GetFreeSpace
68984a55 db: amend BenchmarkRangeDelIterate
560af6e1 db: fix bug in seek noop optimization and add more testing
959663f8 db: optimize SeekGE/SeekLT for the noop case

Release justification: Fix for high-severity bug in existing
Pebble functionality.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
@lunevalex added the O-postmortem label Mar 4, 2021