[error while unmarshalling cockroach.kv.kvpb.LockConflictError] roachtest: backup-restore/mixed-version failed #113271

Closed
cockroach-teamcity opened this issue Oct 29, 2023 · 10 comments · Fixed by #113646
Assignees
Labels
A-kv-transactions Relating to MVCC and the transactional model. branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team X-noreuse Prevent automatic commenting from CI test failures
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Oct 29, 2023

roachtest.backup-restore/mixed-version failed with artifacts on release-23.2 @ 6aa847e0e6f7eb6ab26dd9b150f165ccae9dd2c6:

(mixedversion.go:538).Run: mixed-version test failure while running step 20 (run "plan and run backups"): error waiting for job to finish: job 912597787176796161 failed with error: failed to run backup: exporting 837 ranges: exporting /Table/182/1/29{1-2}: conflicting locks on /Table/182/1/291/0 [reason=wait_policy]
test artifacts and logs in: /artifacts/backup-restore/mixed-version/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

/cc @cockroachdb/disaster-recovery


Jira issue: CRDB-32841

@cockroach-teamcity cockroach-teamcity added branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Oct 29, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Oct 29, 2023
@adityamaru
Contributor

exporting /Table/182/1/29{1-2}: conflicting locks on /Table/182/1/291/0 [reason=wait_policy]

I haven't seen this failure before.

@adityamaru adityamaru self-assigned this Oct 30, 2023
@adityamaru
Contributor

Looks like nodes 1, 2, and 4 were running the release-23.2 binary and node 3 was running release-23.1. We planned the backup on n1 but then disabled job adoption for all nodes except n3, so n3 would have set up the DistSQL plan and coordinated the execution of the backup.

@adityamaru
Contributor

The logs have:

W231029 09:52:47.281828 5909 errors/errbase/decode.go:44 ⋮ [-] 265  error while unmarshalling error: ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked in›
W231029 09:52:47.281828 5909 errors/errbase/decode.go:44 ⋮ [-] 265 +(1) ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked in›
W231029 09:52:47.281828 5909 errors/errbase/decode.go:44 ⋮ [-] 265 +Error types: (1) *errors.errorString
W231029 09:52:47.281911 5909 errors/errbase/decode.go:44 ⋮ [-] 266  error while unmarshalling error: ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked in›
W231029 09:52:47.281911 5909 errors/errbase/decode.go:44 ⋮ [-] 266 +(1) ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked in›
W231029 09:52:47.281911 5909 errors/errbase/decode.go:44 ⋮ [-] 266 +Error types: (1) *errors.errorString
W231029 09:52:47.281930 5909 errors/errbase/decode.go:44 ⋮ [-] 267  error while unmarshalling error: ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked in›
W231029 09:52:47.281930 5909 errors/errbase/decode.go:44 ⋮ [-] 267 +(1) ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked in›
W231029 09:52:47.281930 5909 errors/errbase/decode.go:44 ⋮ [-] 267 +Error types: (1) *errors.errorString
W231029 09:52:47.281950 5909 errors/errbase/decode.go:44 ⋮ [-] 268  error while unmarshalling error: ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked in›
W231029 09:52:47.281950 5909 errors/errbase/decode.go:44 ⋮ [-] 268 +(1) ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked in›
W231029 09:52:47.281950 5909 errors/errbase/decode.go:44 ⋮ [-] 268 +Error types: (1) *errors.errorString
W231029 09:52:47.281971 5909 errors/errbase/decode.go:44 ⋮ [-] 269  error while unmarshalling error: ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked in›
W231029 09:52:47.281971 5909 errors/errbase/decode.go:44 ⋮ [-] 269 +(1) ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked in›
W231029 09:52:47.281971 5909 errors/errbase/decode.go:44 ⋮ [-] 269 +Error types: (1) *errors.errorString
W231029 09:52:47.281994 5909 errors/errbase/decode.go:44 ⋮ [-] 270  error while unmarshalling error: ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked in›
W231029 09:52:47.281994 5909 errors/errbase/decode.go:44 ⋮ [-] 270 +(1) ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked in›
W231029 09:52:47.281994 5909 errors/errbase/decode.go:44 ⋮ [-] 270 +Error types: (1) *errors.errorString
W231029 09:52:47.282012 5909 errors/errbase/decode.go:44 ⋮ [-] 271  error while unmarshalling error: ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked in›
W231029 09:52:47.282012 5909 errors/errbase/decode.go:44 ⋮ [-] 271 +(1) ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked in›
W231029 09:52:47.282012 5909 errors/errbase/decode.go:44 ⋮ [-] 271 +Error types: (1) *errors.errorString

before the backup fails terminally

@adityamaru
Contributor

I'm going to rope in KV in case any recent changes/backports come to mind. I found #109107 and #108190 as changes in seemingly related logic, but these have been merged for a while. cc @nvanbenschoten

@adityamaru adityamaru added the T-kv KV Team label Oct 30, 2023
@nvanbenschoten
Member

This probably is related to #109107, but I don't understand why. We did set up the cockroachdb/errors type migration in pkg/kv/kvpb/errors.go.

@nvanbenschoten
Member

nvanbenschoten commented Oct 30, 2023

When I get a chance (hopefully today, on call), I'll test this out by running a SELECT FOR UPDATE NOWAIT from a v23.1 gateway in a mixed version cluster where the leaseholder is on a master node. That should exercise this error encoding/decoding behavior in a mixed-version cluster.

@nvanbenschoten
Member

I can reproduce this:

roachprod create nathan-113271 -n3
roachprod stage  nathan-113271 release v23.1.11
roachprod start  nathan-113271
roachprod stop   nathan-113271:2-3
roachprod stage  nathan-113271:2-3 cockroach
roachprod start  nathan-113271:2-3 --sequential=false

roachprod sql nathan-113271:1
roachprod sql nathan-113271:2

# either shell
create table t(i int primary key);
insert into t values (1);
select lease_holder from [show ranges from table t with details];
alter table t scatter;

# 23.2 shell
begin; select * from t for update;

# 23.1 shell
begin; select * from t for update nowait;

@nvanbenschoten
Member

I think we have to revert 350dc60. errors.RegisterTypeMigration is only part of the story for renames of error protobufs. These protobuf messages are also stuffed inside of a protobuf/types.Any in errorspb.EncodedErrorDetails.FullDetails. types.Any uses a string (TypeUrl) to associate encoded protos with their runtime type, passing through proto.MessageType. Unless we want to replace some generated proto code with a manual call to proto.RegisterType, I think it's safer if we just undo the rename, which is unfortunate.

Looking back through the git history, I think this may have been known in 097e1b7 but then later forgotten.
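
To make the failure mode above concrete, here is a minimal sketch of a name-keyed type lookup. It is a toy model and not the protobuf or cockroachdb/errors code: `registry`, `anyMsg`, and `decode` are hypothetical names, and the old message name string is assumed for illustration (only the `LockConflictError` name appears in the logs).

```go
package main

import (
	"fmt"
)

// registry is a toy stand-in for a protobuf type registry: it maps a
// fully-qualified message name to a constructor for that message.
type registry map[string]func() any

// anyMsg is a toy stand-in for types.Any: a type name plus opaque bytes.
type anyMsg struct {
	TypeURL string
	Value   []byte
}

// Placeholder message types (hypothetical, no fields needed here).
type writeIntentError struct{}

// decode resolves the concrete type by name. If the receiving binary has
// no type registered under that name, decoding fails.
func decode(r registry, a anyMsg) (any, error) {
	newMsg, ok := r[a.TypeURL]
	if !ok {
		return nil, fmt.Errorf("any: message type %q isn't linked in", a.TypeURL)
	}
	return newMsg(), nil
}

func main() {
	// An older binary only knows the pre-rename name (assumed string).
	oldNode := registry{
		"cockroach.kv.kvpb.WriteIntentError": func() any { return &writeIntentError{} },
	}
	// A newer binary sends the message under its renamed type name.
	sent := anyMsg{TypeURL: "cockroach.kv.kvpb.LockConflictError"}
	if _, err := decode(oldNode, sent); err != nil {
		fmt.Println("decode failed:", err)
	}
}
```

In this model, a migration that maps old error type names to new ones does not help, because the lookup that fails is the one keyed by the name carried inside the `Any`-like wrapper.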

@adityamaru adityamaru added the X-noreuse Prevent automatic commenting from CI test failures label Oct 30, 2023
@adityamaru adityamaru changed the title roachtest: backup-restore/mixed-version failed [error while unmarshalling cockroach.kv.kvpb.LockConflictError] roachtest: backup-restore/mixed-version failed Oct 30, 2023
@dt
Member

dt commented Oct 30, 2023

Do we really need to replace the generated code? or can we just add our own extra call to RegisterType() with the additional mapping? or are we worried about who wins the revProtoTypes entry?

@nvanbenschoten
Member

or are we worried about who wins the revProtoTypes entry?

This, and I'm worried that this is just the first of multiple subtle problems. I thought proto message renaming with the cockroachdb/errors library was a well-trodden path. That doesn't appear to be the case.

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-transactions Relating to MVCC and the transactional model. labels Nov 2, 2023
craig bot pushed a commit that referenced this issue Nov 3, 2023
113646: kv: split LockConflictError, revive WriteIntentError over wire r=nvanbenschoten a=nvanbenschoten

Fixes #113271.

This commit resolves the backwards incompatibility introduced by 350dc60 when `WriteIntentError` was renamed to `LockConflictError`. This rename broke mixed-version compatibility, because error details in `kvpb.Error` are packaged into an `errorspb.EncodedError`, which internally uses a `protobuf/types.Any`. `protobuf/types.Any` encodes the error's name as a string, relying on the receiving node having a matching type in order to decode the error.

Without this, we saw the following logs on v23.1 nodes.
```
error while unmarshalling error: ‹any: message type "cockroach.kv.kvpb.LockConflictError" isn't linked
```
As a result, error handling for requests that used `WaitPolicy_Error` was broken.

This commit resolves the issue by re-introducing `WriteIntentError` over the wire, so that v23.1 and v23.2 nodes still use the same name to refer to the same error. It does so without reverting 350dc60 and losing the naming improvement in most of the code by splitting `LockConflictError` into its two distinct roles. `LockConflictError` remains in the kvserver to communicate locking conflicts between batch evaluation and concurrency handling. However, the smaller role of communicating locking conflicts to clients that use a `WaitPolicy_Error`, a lock timeout, or a maximum wait-queue length is split into a "new" error called `WriteIntentError`. Splitting these errors was a cleanup we wanted to do anyway, so this commit just does it now to fix the bug. The unfortunate naming of `WriteIntentError` is a battle that we can fight another day.

While this commit doesn't introduce any new tests, we have sufficient testing of the two uses of `WriteIntentError` for single-version clusters in the unit tests. For mixed-version clusters, we have the `backup-restore/mixed-version` roachtest, which caught the bug and exercises backup's use of `WriteIntentError`.

The remaining place where this broke mixed-version compatibility was `SELECT FOR UPDATE NOWAIT`. We should add mixed-version testing for all `SELECT FOR UPDATE` variants. In the meantime, I have manually verified that the following script works on a mixed-version cluster:
```
roachprod create nathan-113271 -n3
roachprod stage  nathan-113271 release v23.1.11
roachprod start  nathan-113271
roachprod stop   nathan-113271:2-3
roachprod put    nathan-113271:2-3 cockroach # with this commit
roachprod start  nathan-113271:2-3 --sequential=false

roachprod sql nathan-113271:1
roachprod sql nathan-113271:2

-- either shell
create table t(i int primary key);
insert into t values (1);
select lease_holder from [show ranges from table t with details];
alter table t scatter;

-- 23.2 shell
begin; select * from t for update;

-- 23.1 shell
begin; select * from t for update nowait;

-- if broken:
ERROR: conflicting locks on /Table/104/1/1/0 [reason=wait_policy]
-- if fixed:
ERROR: could not obtain lock on row (i)=(1) in t@t_pkey

-- same thing but in opposite direction, with 23.1 leaseholder and 23.2 gateway
```

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in f294b36 Nov 3, 2023
blathers-crl bot pushed a commit that referenced this issue Nov 3, 2023
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024