kvnemesis: TestKVNemesisSingleNode: committed deleteRangeUsingTombstone non-atomic timestamps #104865

Closed · tbg opened this issue Jun 14, 2023 · 12 comments · Fixed by #104867
Labels: branch-master (Failures and bugs on the master branch) · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) · T-kv (KV Team)

tbg (Member) commented Jun 14, 2023

Describe the problem

committed deleteRangeUsingTombstone non-atomic timestamps: 
[d][/Table/100/"0000000000000000",/Table/100/"6ba66e4d0c6ea25e"):1686739361.925656994,1-><nil>@s27
[d][/Table/100/"6ba66e4d0c6ea25e",/Table/100/"6ba66e4d0c6ea25f"):1686739361.925656994,1-><nil>@s27
[d][/Table/100/"6ba66e4d0c6ea25f",/Table/100/"6d1492740d73eb41"):1686739361.925656994,1-><nil>@s27
[d][/Table/100/"6d1492740d73eb41",/Table/100/"7b206bdfe9a5b7e6"):1686739361.925656994,1-><nil>@s27
[d][/Table/100/"7b206bdfe9a5b7e6",/Table/100/"fe12d6f8c75f604b"):1686739362.069868437,0-><nil>@s27
[d][/Table/100/"fe12d6f8c75f604b",/Table/100/"ffffffffffffffff"):1686739362.069246287,0-><nil>@s27
[s]/Table/100/"{0000000000000000"-ffffffffffffffff"}:
  {gap:[
    <min>, 1686739361.781522812,0),
    [1686739361.787564641,0, 1686739361.791728790,0),
    [1686739362.069868437,1, <max>),
    [1686739362.069246287,0, 1686739362.069868437,1)
  }->[]

Seen on a non-batched bors build [1] for #104658 on top of 3572499 (#104685 changes only comments).

Artifacts: 1f42cf5be2fc021646bf9b2daf5eaef3.zip

Jira issue: CRDB-28758

Footnotes

  1. https://teamcity.cockroachdb.com/viewLog.html?buildId=10528742&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_BazelUnitTests

tbg added the C-bug, release-blocker, and T-kv-replication labels Jun 14, 2023

blathers-crl bot commented Jun 14, 2023

cc @cockroachdb/replication


blathers-crl bot commented Jun 14, 2023

Hi @tbg, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

tbg added the branch-master and T-kv labels and removed the T-kv-replication label Jun 14, 2023
tbg added the T-kv-replication label and removed the T-kv label Jun 14, 2023

blathers-crl bot commented Jun 14, 2023

cc @cockroachdb/replication

erikgrinaker (Contributor) commented:

deleteRangeUsingTombstone isn't guaranteed to be atomic across ranges, though. We don't have ranged intents, so when a request spans range boundaries, each range is evaluated independently.
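
For illustration, here is a minimal sketch of that behavior; the helper name, keys, and split point are hypothetical, and it assumes the kv client API used elsewhere in this thread:

package example

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// deleteAllUsingTombstone is a hypothetical helper. If a range boundary falls
// inside [a, z), DistSender splits this non-transactional request into one
// DeleteRange(UseRangeTombstone) batch per range. There is no ranged intent
// tying the batches together, so each one evaluates and commits at its own
// timestamp, which is exactly the non-atomicity described above.
func deleteAllUsingTombstone(ctx context.Context, db *kv.DB) error {
	return db.DelRangeUsingTombstone(ctx, roachpb.Key("a"), roachpb.Key("z"))
}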

tbg (Member, Author) commented Jun 14, 2023

I did something to make this work when I wrote kvnemesis v2 (by preventing these requests from splitting across ranges); maybe that somehow broke. I'll take a quick look.

tbg (Member, Author) commented Jun 14, 2023

Yeah we just prevent range-spanning here:

KVClient: &kvcoord.ClientTestingKnobs{
	// Don't let DistSender split DeleteRangeUsingTombstone across range boundaries.
	// This does happen in real CRDB, but leads to separate atomic subunits, which
	// would add complexity to kvnemesis that isn't worth it. Instead, the operation
	// generator for the most part tries to avoid range-spanning requests, and the
	// ones that do end up happening get a hard error.
	OnRangeSpanningNonTxnalBatch: func(ba *kvpb.BatchRequest) *kvpb.Error {
		for _, req := range ba.Requests {
			if req.GetInner().Method() != kvpb.DeleteRange {
				continue
			}
			if req.GetDeleteRange().UseRangeTombstone {
				return kvpb.NewError(errDelRangeUsingTombstoneStraddlesRangeBoundary)
			}
		}
		return nil
	},

and we don't ever use it in a txn:

// DeleteRangeUsingTombstone does not support transactions.
config.Ops.ClosureTxn.TxnClientOps.DeleteRangeUsingTombstone = 0

I'm not sure what's going on here, but kvnemesis was changed a few times recently by @nvanbenschoten, most recently in #104356.

tbg (Member, Author) commented Jun 14, 2023

Sure looks like it is using DelRange(..., tombstone=true) in a txn:

[screenshot of the generated operations]

tbg (Member, Author) commented Jun 14, 2023

Ah no, DelRangeUsingTombstone is what we're looking for. I can confirm it was only ever carried out as individual operations, never inside a txn.

erikgrinaker (Contributor) commented:

Ok, I'll have a look in a bit.

erikgrinaker self-assigned this Jun 14, 2023
tbg (Member, Author) commented Jun 14, 2023

The error shows that this DeleteRangeUsingTombstone (DRUT)

db0.DelRangeUsingTombstone(ctx, tk(0), tk(18446744073709551615) /* @s27 */) // @1686739362.069868437,0 <nil>
  // ^-- trace in: /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/kvnemesis2422980857/w0_step9_trace

created six observed fragments in the rangefeed1. These fragments all ought to be at the same timestamp. They are not: the first four fragments match, then we get two more fragments with unique timestamps. The official execution timestamp ends up being the second to last unique timestamp (1686739362.069868437,0), which is also the largest of the bunch.
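
For reference, a hedged sketch of the invariant that fired here (not the actual validator code; see footnote [1] for that): every fragment observed for one committed DeleteRangeUsingTombstone must carry a single timestamp.

package example

import (
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// checkAtomicTimestamps is a hypothetical stand-in for the validator check:
// all MVCC fragments attributed to one committed DeleteRangeUsingTombstone
// must share a timestamp; otherwise the operation was not atomic.
func checkAtomicTimestamps(fragments []hlc.Timestamp) error {
	for i := 1; i < len(fragments); i++ {
		if fragments[i] != fragments[0] {
			return fmt.Errorf("non-atomic timestamps: %s vs %s", fragments[0], fragments[i])
		}
	}
	return nil
}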

That timestamp is interesting because we also committed a SNAPSHOT txn right then:

  db0.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
    txn.SetIsoLevel(isolation.Snapshot)
    txn.DelRange(ctx, tk(420403431433673189), tk(10570627412767213890), true /* @s23 */) // @1686739362.069868437,1 <nil>
    txn.Del(ctx, tk(12162221775285847948) /* @s24 */) // @1686739362.069868437,1 <nil>
    b := &kv.Batch{}
    b.GetForUpdate(tk(9723121657551650044)) // (<nil>, <nil>)
    b.GetForUpdate(tk(16369197900219584719)) // (<nil>, <nil>)
    txn.CommitInBatch(ctx, b) // @1686739362.069868437,1 <nil>
    return nil
  }) // @1686739362.069868437,1 <nil>
  // ^-- txnpb:("unnamed" meta={id=96d89cc1 key=/Table/100/"05d592b188f599e5" iso=Snapshot pri=0.00053365 epo=0 ts=1686739362.069868437,1 min=1686739362.049259529,0 seq=3} lock=true stat=COMMITTED rts=1686739362.069868437,1 wto=false gul=1686739362.059259529,0 int=3)
  // ^-- trace in: /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/kvnemesis2422980857/w4_step8_trace

Note the txn timestamp is the DRUT's timestamp plus one logical tick. Perhaps this txn and the DRUT interacted somehow in a way that facilitated this issue.
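
As a small worked example (a fragment only, assuming the hlc package is imported; the values are the ones quoted above), the txn's commit timestamp is exactly Next() of the DRUT's timestamp:

// 1686739362.069868437,0 is the DRUT's execution timestamp.
drutTS := hlc.Timestamp{WallTime: 1686739362069868437, Logical: 0}
// Next() bumps the logical component to 1686739362.069868437,1, i.e. the
// SNAPSHOT txn's commit timestamp.
txnTS := drutTS.Next()
_ = txnTS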

However, the last DRUT fragment (timestamp 1686739362.0692...) is not such an easy match for anything. That timestamp doesn't show up anywhere else.

I'll leave the hard part of the analysis to you, @erikgrinaker, but happy to be a rubber 🦆.

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/6369846e12471585b1063e9bc7d19597428f53aa/pkg/kv/kvnemesis/validator.go#L481-L556

tbg (Member, Author) commented Jun 14, 2023

Looking at the trace, it does seem like DistSender split up the request!

     0.484ms      0.025ms                event:kv/kvserver/store_send.go:154 [n1,s1] executing DeleteRange [/Table/100/"0000000000000000",/Table/100/"7b206bdfe9a5b7e6")
     0.258ms      0.025ms            event:kv/kvserver/store_send.go:154 [n1,s1] executing DeleteRange [/Table/100/"7b206bdfe9a5b7e6",/Table/100/"ffffffffffffffff")

w0_step9_trace.txt

So this is likely a bug in the suppression of precisely that splitting, i.e. a test-only problem.

tbg (Member, Author) commented Jun 14, 2023

I think the interceptor that prevents the range-spanning request accidentally got removed here: https://github.com/cockroachdb/cockroach/pull/103963/files#r1229494377

@wenyihu6 could you send a fix? Thanks!

tbg assigned wenyihu6 and unassigned erikgrinaker Jun 14, 2023
tbg added the T-kv label and removed the release-blocker and T-kv-replication labels Jun 14, 2023
craig bot pushed a commit that referenced this issue Jun 14, 2023
104658: kvserver: attempt to explain ProposalData lifecycle r=tbg a=tbg

Epic: CRDB-25287
Release note: None


104867: kvnemesis: add TestingKnobs.OnRangeSpanningNonTxnalBatch back r=tbg a=wenyihu6

#103963 accidentally removed a testing knob, which caused #104865.
This commit adds the testing knob back.

Fixes: #104865
Release note: none

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Wenyi <[email protected]>
craig bot closed this as completed in 5dd8029 Jun 14, 2023