[dnm,wip] server: reproduce double-allocation of store IDs #56271

irfansharif · 2020-11-03T23:37:26Z

We seem to be doing a few unsound things with how we allocate store IDs
(courtesy of yours truly). The changes made to a recently added test
stressing our multi-store behaviour demonstrate the bug.

--- FAIL: TestAddNewStoresToExistingNodes (38.44s)
    multi_store_test.go:155: expected the 6th store to have storeID s6, found s5

Release note: None

cockroach-teamcity · 2020-11-03T23:37:33Z

This change is


tbg · 2020-11-04T14:28:05Z

Are you sure there's anything wrong here? I took a look but didn't find anything unusual. The firstStoreID stuff is not super idiomatic, but it seems to do the right thing. I think the reason the adjusted test fails is because ParallelStart is true, and so the storeIDs are allocated concurrently. This doesn't mean that they're handed out twice, but it does mean that it's not clear that the newly added stores on n1 get ids 4 and 5 (those might go to stores on, say, n2).
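To make the concurrency point concrete, here is a minimal sketch, not CockroachDB's actual allocator. It assumes an in-memory atomic counter standing in for the cluster-wide store ID allocator; `allocateConcurrently` and the node numbering are hypothetical names used only for illustration. Each simulated node grabs one ID at the same time: every ID is handed out exactly once, but which node ends up with which ID can differ from run to run.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// allocateConcurrently simulates numNodes nodes each requesting one store ID
// from a shared counter at the same time. The counter is a hypothetical
// stand-in for the cluster-wide allocator; start is the last ID already in use.
func allocateConcurrently(numNodes int, start int64) map[int]int64 {
	counter := start
	var mu sync.Mutex
	var wg sync.WaitGroup
	assigned := make(map[int]int64, numNodes)
	for n := 1; n <= numNodes; n++ {
		wg.Add(1)
		go func(node int) {
			defer wg.Done()
			// The atomic increment guarantees uniqueness, not ordering.
			id := atomic.AddInt64(&counter, 1)
			mu.Lock()
			assigned[node] = id
			mu.Unlock()
		}(n)
	}
	wg.Wait()
	return assigned
}

func main() {
	// s1..s3 already exist; three nodes each add one store concurrently.
	assigned := allocateConcurrently(3, 3)
	seen := map[int64]bool{}
	for node, id := range assigned {
		if seen[id] {
			panic("store ID handed out twice")
		}
		seen[id] = true
		fmt.Printf("n%d got s%d\n", node, id)
	}
}
```

A test that asserts "the new stores on n1 get s4 and s5" can fail under this model even though no ID is ever duplicated.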

irfansharif · 2020-11-04T14:40:56Z

Perhaps the test is wrong, but with manually added logging here I did see double allocation of store IDs across different nodes.

irfansharif · 2020-11-04T14:43:55Z

> but it does mean that it's not clear that the newly added stores on n1 get ids 4 and 5 (those might go to stores on, say, n2).

In the changes made to the test above, I sort all store IDs (regardless of origin) and make sure that for N stores we have store IDs s1 through sN. When there are duplicate store IDs in our sorted list we'll have a particular node ID appear twice, so I think what I mentioned above still holds.
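The check described above can be sketched roughly as follows. This is an illustration rather than the code in multi_store_test.go: `checkContiguous` is a hypothetical helper, and store IDs are modelled as plain integers. Because the IDs are sorted regardless of which node owns them, the check does not depend on allocation order, only on every ID from s1 through sN appearing exactly once.

```go
package main

import (
	"fmt"
	"sort"
)

// checkContiguous sorts a copy of the given store IDs and verifies that the
// i-th smallest ID equals i (1-based). It returns an error describing the
// first position that does not match, which is what a duplicate produces.
func checkContiguous(storeIDs []int) error {
	sorted := append([]int(nil), storeIDs...)
	sort.Ints(sorted)
	for i, got := range sorted {
		if want := i + 1; got != want {
			return fmt.Errorf("expected store #%d to have storeID s%d, found s%d", want, want, got)
		}
	}
	return nil
}

func main() {
	// Unique IDs collected in arbitrary order from different nodes.
	fmt.Println(checkContiguous([]int{3, 1, 2, 5, 4, 6}))
	// Six stores, but s5 was handed out twice and s6 never appears.
	fmt.Println(checkContiguous([]int{1, 2, 3, 4, 5, 5}))
}
```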

irfansharif · 2020-11-04T16:11:43Z

Our saving grace thus far seems to be this:

cockroach/pkg/server/node.go

Lines 600 to 605 in b03fb74

    
    // We lied, we don't have a firstStoreID; we'll need to allocate for
    // that too.
    //
    // TODO(irfansharif): We get here if we're falling back to
    // gossip-based connectivity. This can be removed in 21.1.
    storeIDAlloc++

I suppose I only ran into the bug in trying to remove it in #56263. This branch alone (not rebased on top of #56263) works just fine, albeit inadvertently. Hm. Definitely want to clean up here first then.

...when joining an existing cluster.

This diff adds some sanity around how we bootstrap stores for nodes when we're informed by the existing cluster what the first store ID should be. Previously we were bootstrapping the first store asynchronously, and that is not what we want.

We first observed the implications of doing so in cockroachdb#56263, which attempted to remove the use of gossip in cluster/node ID distribution. There we noticed that our somewhat haphazard structure around initialization of stores could lead to doubly allocating store IDs (cockroachdb#56272). We were inadvertently safeguarded against this, as described in cockroachdb#56271, but this structure is still pretty confusing and needed cleanup.

Now the new store initialization structure is the following:

```
- In the init code:
  - If we're being bootstrapped:
    - Initialize all the stores
  - If we're joining an existing cluster:
    - Only initialize first store, leave remaining stores to start up code
      later
- Later, when initializing additional new stores:
  - Allocate len(auxiliary engines) store IDs, and initialize them
    asynchronously.
```

This lets us avoid threading in the first store ID, and always rely on the KV increment operation to tell us what the store ID should be for the first additional store. We update TestAddNewStoresToExistingNodes to test allocation behaviour with more than just two stores.

Eventually we could simplify the init code to only initialize the first store when we're bootstrapping (there's a longstanding TODO from Andrei to that effect), but it's not strictly needed.

This PR unblocks cockroachdb#56263.

Release note: None

56299: server: initialize first store within the init server r=irfansharif a=irfansharif

...when joining an existing cluster.

This diff adds some sanity around how we bootstrap stores for nodes when we're informed by the existing cluster what the first store ID should be. Previously we were bootstrapping the first store asynchronously, and that is not what we want.

We first observed the implications of doing so in #56263, which attempted to remove the use of gossip in cluster/node ID distribution. There we noticed that our somewhat haphazard structure around initialization of stores could lead to doubly allocating store IDs (#56272). We were inadvertently safeguarded against this, as described in #56271, but this structure is still pretty confusing and needed cleanup.

Now the new store initialization structure is the following:

```
- In the init code:
  - If we're being bootstrapped:
    - Initialize all the stores
  - If we're joining an existing cluster:
    - Only initialize first store, leave remaining stores to start up code
      later
- Later, when initializing additional new stores:
  - Allocate len(auxiliary engines) store IDs, and initialize them
    asynchronously.
```

This lets us avoid threading in the first store ID, and always rely on the KV increment operation to tell us what the store ID should be for the first additional store. We update TestAddNewStoresToExistingNodes to test allocation behaviour with more than just two stores.

Eventually we could simplify the init code to only initialize the first store when we're bootstrapping (there's a longstanding TODO from Andrei to that effect), but it's not strictly needed.

This PR unblocks #56263.

Release note: None

Co-authored-by: irfan sharif
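The "allocate len(auxiliary engines) store IDs" step in the commit message above can be pictured with a small sketch. It is an assumption-laden illustration, not the code from #56299: `idGenerator` and `allocateBlock` are hypothetical names, and an in-memory atomic counter stands in for the KV increment operation. The idea is that a single increment by N reserves N consecutive IDs, and the first ID of the block is derived from the value the increment returns rather than being threaded in from elsewhere.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// idGenerator is a hypothetical stand-in for the KV increment on the
// cluster's store ID counter. last holds the highest ID handed out so far.
type idGenerator struct {
	last int64
}

// allocateBlock atomically reserves n consecutive IDs and returns the first
// one. The increment's return value is the last ID of the reserved block.
func (g *idGenerator) allocateBlock(n int64) int64 {
	newLast := atomic.AddInt64(&g.last, n)
	return newLast - n + 1
}

func main() {
	gen := &idGenerator{last: 3} // assume s1..s3 already exist in the cluster
	auxiliaryEngines := []string{"engine-a", "engine-b"}

	// One increment covers every additional store on this node.
	first := gen.allocateBlock(int64(len(auxiliaryEngines)))
	for i, eng := range auxiliaryEngines {
		fmt.Printf("%s -> s%d\n", eng, first+int64(i))
	}
}
```

Since each caller derives its block solely from its own increment result, two nodes allocating at the same time receive disjoint ranges.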


irfansharif · 2020-11-06T05:09:43Z

Cleaned up across #56299 and #56302.

irfansharif added the do-not-merge label (bors won't merge a PR with this label) Nov 3, 2020

irfansharif mentioned this pull request Nov 3, 2020

server: double allocation of store IDs in multi-store setups #56272

Closed

irfansharif force-pushed the 201103.store-alloc-bug branch from 16d1d3d to 3c6e0c5 November 4, 2020 00:58

irfansharif mentioned this pull request Nov 4, 2020

server: initialize first store within the init server #56299

Merged

irfansharif closed this Nov 6, 2020

irfansharif deleted the 201103.store-alloc-bug branch November 6, 2020 05:09

nvanbenschoten mentioned this pull request Mar 1, 2021

kv: in v20.2, bootstrapping multiple stores can result in duplicate store IDs #61218

Closed

