-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[error while unmarshalling cockroach.kv.kvpb.LockConflictError] roachtest: backup-restore/mixed-version failed #113271
Comments
|
looks like nodes 1, 2 and 4 were running binary release-23.2 and 3 was running release-23.1. We planned the backup on n1 but then disabled job adoption for all nodes except for 3. So n3 would have setup the distsql plan and coordinated the execution of the backup. |
The logs have:
before the backup fails terminally |
I'm going to rope in KV incase any recent changes/backports come to mind. I found #109107 and #108190 as changes in seemingly related logic but these have been merged for a while cc/ @nvanbenschoten |
This probably is related to #109107, but I don't understand why. We did set up the |
When I get a chance (hopefully today, on call), I'll test this out by running a |
I can reproduce this: roachprod create nathan-113271 -n3
roachprod stage nathan-113271 release v23.1.11
roachprod start nathan-113271
roachprod stop nathan-113271:2-3
roachprod stage nathan-113271:2-3 cockroach
roachprod start nathan-113271:2-3 --sequential=false
roachprod sql nathan-113271:1
roachprod sql nathan-113271:2
# either shell
create table t(i int primary key);
insert into t values (1);
select lease_holder from [show ranges from table t with details];
alter table t scatter;
# 23.2 shell
begin; select * from t for update;
# 23.1 shell
begin; select * from t for update nowait; |
I think we have to revert 350dc60. Looking back through the git history, I think this may have been known in 097e1b7 but then later forgotten. |
Do we really need to replace the generated code? or can we just add our own extra call to |
This and I'm worried that this is just the first of multiple subtle problems. I thought proto message renaming with the |
113646: kv: split LockConflictError, revive WriteIntentError over wire r=nvanbenschoten a=nvanbenschoten Fixes #113271. This commit resolves the backwards incompatibility introduced by 350dc60 when `WriteIntentError` was renamed to `LockConflictError`. This rename broke mixed-version compatibility, because error details in `kvpb.Error` are packaged into an `errorspb.EncodedError`, which internally uses a `protobuf/types.Any`. `protobuf/types.Any` encodes the error's name as a string, relying on the receiving node having a matching type in order to decode the error. Without this, we saw the following logs on v23.1 nodes. ``` error while unmarshalling error: ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked ``` As a result, error handling for requests that used `WaitPolicy_Error` was broken. This commit resolves the issue by re-introducing `WriteIntentError` over the wire, so that v23.1 and v23.2 nodes still use the same name to refer to the same error. It does so without reverting 350dc60 and losing the naming improvement in most of the code by splitting `LockConflictError` into its two distinct roles. `LockConflictError` remains in the kvserver to communicate locking conflicts between batch evaluation and concurrency handling. However, the smaller role of communicating locking conflicts to clients that use a `WaitPolicy_Error`, a lock timeout, or a maximum wait-queue length is split into a "new" error called `WriteIntentError`. Splitting these errors was a cleanup we wanted to do anyway, so this commit just does it now to fix the bug. The unfortunate naming of `WriteIntentError` is a battle that we can fight another day. While this commit doesn't introduce any new tests, we have sufficient testing of the two uses of `WriteIntentError` for single-version clusters in the unit tests. For mixed-version clusters, we have the `backup-restore/mixed-version` roachtest, which caught the bug and exercises backup's use of `WriteIntentError`. The remaining place where this broke mixed-version compatibility was `SELECT FOR UPDATE NOWAIT`. We should add mixed-version testing for all `SELECT FOR UPDATE` variants. In the meantime, I have manually verified that the following script works on a mixed-version cluster: ``` roachprod create nathan-113271 -n3 roachprod stage nathan-113271 release v23.1.11 roachprod start nathan-113271 roachprod stop nathan-113271:2-3 roachprod put nathan-113271:2-3 cockroach # with this commit roachprod start nathan-113271:2-3 --sequential=false roachprod sql nathan-113271:1 roachprod sql nathan-113271:2 -- either shell create table t(i int primary key); insert into t values (1); select lease_holder from [show ranges from table t with details]; alter table t scatter; -- 23.2 shell begin; select * from t for update; -- 23.1 shell begin; select * from t for update nowait; -- if broken: ERROR: conflicting locks on /Table/104/1/1/0 [reason=wait_policy] -- if fixed: ERROR: could not obtain lock on row (i)=(1) in t@t_pkey -- same thing but in opposite direction, with 23.1 leaseholder and 23.2 gateway ``` Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
Fixes #113271. This commit resolves the backwards incompatibility introduced by 350dc60 when `WriteIntentError` was renamed to `LockConflictError`. This rename broke mixed-version compatibility, because error details in `kvpb.Error` are packaged into an `errorspb.EncodedError`, which internally uses a `protobuf/types.Any`. `protobuf/types.Any` encodes the error's name as a string, relying on the receiving node having a matching type in order to decode the error. Without this, we saw the following logs on v23.1 nodes. ``` error while unmarshalling error: ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked ``` As a result, error handling for requests that used `WaitPolicy_Error` was broken. This commit resolves the issue by re-introducing `WriteIntentError` over the wire, so that v23.1 and v23.2 nodes still use the same name to refer to the same error. It does so without reverting 350dc60 and losing the naming improvement in most of the code by splitting `LockConflictError` into its two distinct roles. `LockConflictError` remains in the kvserver to communicate locking conflicts between batch evaluation and concurrency handling. However, the smaller role of communicating locking conflicts to clients that use a `WaitPolicy_Error`, a lock timeout, or a maximum wait-queue length is split into a "new" error called `WriteIntentError`. Splitting these errors was a cleanup we wanted to do anyway, so this commit just does it now to fix the bug. The unfortunate naming of `WriteIntentError` is a battle that we can fight another day. While this commit doesn't introduce any new tests, we have sufficient testing of the two uses of `WriteIntentError` for single-version clusters in the unit tests. For mixed-version clusters, we have the `backup-restore/mixed-version` roachtest, which caught the bug and exercises backup's use of `WriteIntentError`. The remaining place where this broke mixed-version compatibility was `SELECT FOR UPDATE NOWAIT`. We should add mixed-version testing for all `SELECT FOR UPDATE` variants. In the meantime, I have manually verified that the following script works on a mixed-version cluster: ``` roachprod create nathan-113271 -n3 roachprod stage nathan-113271 release v23.1.11 roachprod start nathan-113271 roachprod stop nathan-113271:2-3 roachprod put nathan-113271:2-3 cockroach # with this commit roachprod start nathan-113271:2-3 --sequential=false roachprod sql nathan-113271:1 roachprod sql nathan-113271:2 -- either shell create table t(i int primary key); insert into t values (1); select lease_holder from [show ranges from table t with details]; alter table t scatter; -- 23.2 shell begin; select * from t for update; -- 23.1 shell begin; select * from t for update nowait; -- if broken: ERROR: conflicting locks on /Table/104/1/1/0 [reason=wait_policy] -- if fixed: ERROR: could not obtain lock on row (i)=(1) in t@t_pkey -- same thing but in opposite direction, with 23.1 leaseholder and 23.2 gateway ``` Release note: None
roachtest.backup-restore/mixed-version failed with artifacts on release-23.2 @ 6aa847e0e6f7eb6ab26dd9b150f165ccae9dd2c6:
Parameters:
ROACHTEST_arch=amd64
,ROACHTEST_cloud=gce
,ROACHTEST_cpu=4
,ROACHTEST_encrypted=false
,ROACHTEST_fs=ext4
,ROACHTEST_localSSD=true
,ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
This test on roachdash | Improve this report!
Jira issue: CRDB-32841
The text was updated successfully, but these errors were encountered: