-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dnm,wip] server: reproduce double-allocation of store IDs #56271
Conversation
We seem to be doing a few unsound things with how we allocate store IDs (courtesy of yours truly). The changes made to a recently added test stressing our multi-store behaviour demonstrates the bug. --- FAIL: TestAddNewStoresToExistingNodes (38.44s) multi_store_test.go:155: expected the 6th store to have storeID s6, found s5 Release note: None
16d1d3d
to
3c6e0c5
Compare
Are you sure there's anything wrong here? I took a look but didn't find anything unusual. The |
Perhaps the test is wrong, but with manually added logging here I did see double allocation of store IDs across different nodes. |
In the changes made to the test above, I sort all store IDs (regardless of origin) and make sure we for N stores we have store IDs s1 through sN. When there are duplicate store IDs in our sorted list we'll have a particular node ID appear twice, so I think what I mentioned above still holds. |
Our saving grace thus far seems to be this: Lines 600 to 605 in b03fb74
I suppose I only ran into the bug in trying to remove it in #56263. This branch alone (not rebased on top of #56263) works just fine, albeit inadvertently. Hm. Definitely want to clean up here first then. |
...when joining an existing cluster. This diff adds some sanity around how we bootstrap stores for nodes when we're informed by the existing cluster what the first store ID should be. Previously we were bootstrapping the first store asynchronously, and that is not what we want. We first observed the implications of doing so in cockroachdb#56263, which attempted to remove the use of gossip in cluster/node ID distribution. There we noticed that our somewhat haphazard structure around initialization of stores could lead to doubly allocating store IDs (cockroachdb#56272). We were inadvertently safeguarded against this, as described in cockroachdb#56271, but this structure is still pretty confusing and needed cleanup. Now the new store initialization structure is the following: ``` - In the init code: - If we're being bootstrapped: - Initialize all the stores - If we're joining an existing cluster: - Only initialize first store, leave remaining stores to start up code later - Later, when initializing additional new stores: - Allocate len(auxiliary engines) store IDs, and initialize them asynchronously. ``` This lets us avoid threading in the first store ID, and always rely on the KV increment operation to tell us what the store ID should be for the first additional store. We update TestAddNewStoresToExistingNodes to test allocation behaviour with more than just two stores. Eventually we could simplify the init code to only initialize the first store when we're bootstrapping (there's a longstanding TODO from Andrei to that effect), but it's not strictly needed. This PR unblocks cockroachdb#56263. Release note: None
56299: server: initialize first store within the init server r=irfansharif a=irfansharif ...when joining an existing cluster. This diff adds some sanity around how we bootstrap stores for nodes when we're informed by the existing cluster what the first store ID should be. Previously we were bootstrapping the first store asynchronously, and that is not what we want. We first observed the implications of doing so in #56263, which attempted to remove the use of gossip in cluster/node ID distribution. There we noticed that our somewhat haphazard structure around initialization of stores could lead to doubly allocating store IDs (#56272). We were inadvertently safeguarded against this, as described in #56271, but this structure is still pretty confusing and needed cleanup. Now the new store initialization structure is the following: ``` - In the init code: - If we're being bootstrapped: - Initialize all the stores - If we're joining an existing cluster: - Only initialize first store, leave remaining stores to start up code later - Later, when initializing additional new stores: - Allocate len(auxiliary engines) store IDs, and initialize them asynchronously. ``` This lets us avoid threading in the first store ID, and always rely on the KV increment operation to tell us what the store ID should be for the first additional store. We update TestAddNewStoresToExistingNodes to test allocation behaviour with more than just two stores. Eventually we could simplify the init code to only initialize the first store when we're bootstrapping (there's a longstanding TODO from Andrei to that effect), but it's not strictly needed. This PR unblocks #56263. Release note: None Co-authored-by: irfan sharif <[email protected]>
...when joining an existing cluster. This diff adds some sanity around how we bootstrap stores for nodes when we're informed by the existing cluster what the first store ID should be. Previously we were bootstrapping the first store asynchronously, and that is not what we want. We first observed the implications of doing so in cockroachdb#56263, which attempted to remove the use of gossip in cluster/node ID distribution. There we noticed that our somewhat haphazard structure around initialization of stores could lead to doubly allocating store IDs (cockroachdb#56272). We were inadvertently safeguarded against this, as described in cockroachdb#56271, but this structure is still pretty confusing and needed cleanup. Now the new store initialization structure is the following: ``` - In the init code: - If we're being bootstrapped: - Initialize all the stores - If we're joining an existing cluster: - Only initialize first store, leave remaining stores to start up code later - Later, when initializing additional new stores: - Allocate len(auxiliary engines) store IDs, and initialize them asynchronously. ``` This lets us avoid threading in the first store ID, and always rely on the KV increment operation to tell us what the store ID should be for the first additional store. We update TestAddNewStoresToExistingNodes to test allocation behaviour with more than just two stores. Eventually we could simplify the init code to only initialize the first store when we're bootstrapping (there's a longstanding TODO from Andrei to that effect), but it's not strictly needed. This PR unblocks cockroachdb#56263. Release note: None
We seem to be doing a few unsound things with how we allocate store IDs
(courtesy of yours truly). The changes made to a recently added test
stressing our multi-store behaviour demonstrates the bug.
Release note: None