kv: txn with inconsistent read & write timestamps #45586

Closed
danhhz opened this issue Mar 2, 2020 · 23 comments
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

danhhz commented Mar 2, 2020

#45542 adds support to kvnemesis for verifying that a txn's reads are consistent with its writes. The new check appears to be catching a read that was valid only during a window entirely before the timestamp the txn ended up committing at. I haven't fully confirmed this, but the issue seems to go away if I disable the use of CommitInBatch in kvnemesis.

Found by kvnemesis.


danhhz commented Mar 2, 2020

Here's a log from a run that hits this issue as of ce6b98a438cfa316527e8b0357c987f6d3308d7e (timestamps are abbreviated for readability):

There is one txn that writes 3cb8b431->v-11 and reads 872aa197, getting v-18. According to client.Txn.CommitTimestamp, it committed at 736.783, which matches what we got from RangeFeed for 3cb8b431->v-11 and also the 3cb8b431->v-11 entry in KV (the test scans the latest values as its last step).

872aa197->v-18 was written in a txn that committed at 736.779, and this timestamp matches what we got from RangeFeed.

872aa197 was overwritten with v-19 in a txn that committed at 736.781. This timestamp matches the one we got from RangeFeed and the one in the KV scan.

This means the window in which a read of v-18 is valid is [736.779,736.781), but the txn that read it committed at 736.783, which falls after this window. It should have read v-19 instead. This is a serializability violation.
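The window check above can be sketched as follows. This is a minimal illustration of the reasoning, not kvnemesis's actual verification code; the `validWindow` type and the integer timestamp encoding (736.779 written as 736779) are invented for the example.

```go
package main

import "fmt"

// validWindow is the half-open interval [start, end) during which a
// particular value was the visible version of a key: it opens when the
// value is written and closes when the value is overwritten.
type validWindow struct {
	start, end int64 // abbreviated timestamps, e.g. 736779 for 736.779
}

// contains reports whether a txn committing at ts could have legally
// read the value guarded by this window.
func (w validWindow) contains(ts int64) bool {
	return ts >= w.start && ts < w.end
}

func main() {
	// 872aa197->v-18 was written at 736.779 and overwritten at 736.781.
	v18 := validWindow{start: 736779, end: 736781}

	// The txn that read v-18 committed at 736.783.
	commitTS := int64(736783)

	if !v18.contains(commitTS) {
		fmt.Println("serializability violation: read of v-18 outside [736.779,736.781)")
	}
}
```

Run against the timestamps in the log above, the commit timestamp falls outside the window, which is exactly the inconsistency the kvnemesis check flags.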

Reproduces in minutes with roachprod-stress:

make roachprod-stress PKG=./pkg/kv/kvnemesis TESTS=TestKVNemesisSingleNode TESTTIMEOUT=60s TESTFLAGS='-v -show-logs' CLUSTER=dan-stress
=== RUN   TestKVNemesisSingleNode
I200228 22:25:36.517183 66 base/addr_validation.go:296  [n?] server certificate addresses: IP=127.0.0.1,::1; DNS=localhost,*.local; CN=node
I200228 22:25:36.517270 66 base/addr_validation.go:342  [n?] web UI certificate addresses: IP=127.0.0.1,::1; DNS=localhost,*.local; CN=node
W200228 22:25:36.526858 66 server/status/runtime.go:308  [n?] Could not parse build timestamp: parsing time "" as "2006/01/02 15:04:05": cannot parse "" as "2006"
I200228 22:25:36.550327 66 server/server.go:1025  [n?] monitoring forward clock jumps based on server.clock.forward_jump_check_enabled
I200228 22:25:36.580218 66 server/config.go:570  [n?] 1 storage engine initialized
I200228 22:25:36.580242 66 server/config.go:573  [n?] RocksDB cache size: 128 MiB
I200228 22:25:36.580250 66 server/config.go:573  [n?] store 0: in-memory, size 0 B
I200228 22:25:36.580446 66 server/server.go:1480  [n?] no stores bootstrapped and --join flag specified, awaiting init command or join with an already initialized node.
I200228 22:25:36.580490 66 server/server.go:1490  [n?] **** add additional nodes by specifying --join=127.0.0.1:37085
I200228 22:25:36.586490 66 server/node.go:323  [n?] **** cluster 065ddae7-e51d-4e3d-ba95-b595e77630df has been created
I200228 22:25:36.586767 66 gossip/gossip.go:395  [n1] NodeDescriptor set to node_id:1 address: attrs:<> locality: tiers: > ServerVersion: build_tag:"v20.1.0-beta.1-471-g669f68f-dirty" started_at:1582928736586580669 cluster_name:"" sql_address:
W200228 22:25:36.591143 202 storage/store.go:1538  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200228 22:25:36.591245 66 server/node.go:427  [n1] initialized store [n1,s1]: disk (capacity=512 MiB, available=512 MiB, used=0 B, logicalBytes=16 KiB), ranges=32, leases=0, queries=0.00, writes=0.00, bytesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=166.00 pMax=13271.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00 pMax=0.00}
I200228 22:25:36.591374 66 storage/stores.go:231  [n1] read 0 node addresses from persistent storage
I200228 22:25:36.591477 66 server/node.go:644  [n1] connecting to gossip network to verify cluster ID...
I200228 22:25:36.591506 66 server/node.go:664  [n1] node connected via gossip and verified as part of cluster "065ddae7-e51d-4e3d-ba95-b595e77630df"
I200228 22:25:36.591561 66 server/node.go:511  [n1] node=1: started with [=] engine(s) and attributes []
I200228 22:25:36.591618 66 server/server.go:2172  [n1] Could not start goroutine dumper worker due to: directory to store dumps could not be determined
I200228 22:25:36.591731 66 server/server.go:1610  [n1] starting https server at 127.0.0.1:43561 (use: 127.0.0.1:43561)
I200228 22:25:36.591766 66 server/server.go:1615  [n1] starting postgres server at 127.0.0.1:41117 (use: 127.0.0.1:41117)
I200228 22:25:36.591787 66 server/server.go:1617  [n1] starting grpc server at 127.0.0.1:37085
I200228 22:25:36.591808 66 server/server.go:1618  [n1] advertising CockroachDB node at 127.0.0.1:37085
W200228 22:25:36.646461 202 storage/store.go:1538  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200228 22:25:36.666708 330 sql/event_log.go:132  [n1,intExec=optInToDiagnosticsStatReporting] Event: "set_cluster_setting", target: 0, info: {SettingName:diagnostics.reporting.enabled Value:true User:root}
I200228 22:25:36.672887 257 sql/event_log.go:132  [n1,intExec=set-setting] Event: "set_cluster_setting", target: 0, info: {SettingName:version Value:19.2-14 User:root}
I200228 22:25:36.689822 733 sql/event_log.go:132  [n1,intExec=initializeClusterSecret] Event: "set_cluster_setting", target: 0, info: {SettingName:cluster.secret Value:92c7c58f-8c5d-4f64-b998-186471886923 User:root}
I200228 22:25:36.696688 788 sql/event_log.go:132  [n1,intExec=create-default-db] Event: "create_database", target: 50, info: {DatabaseName:defaultdb Statement:CREATE DATABASE IF NOT EXISTS defaultdb User:root}
I200228 22:25:36.700642 753 sql/event_log.go:132  [n1,intExec=create-default-db] Event: "create_database", target: 51, info: {DatabaseName:postgres Statement:CREATE DATABASE IF NOT EXISTS postgres User:root}
I200228 22:25:36.725784 66 server/server.go:1696  [n1] done ensuring all necessary migrations have run
I200228 22:25:36.725843 66 server/server.go:1977  [n1] serving sql connections
I200228 22:25:36.726484 66 testutils/testcluster/testcluster.go:760  WaitForFullReplication took: 3.874µs
I200228 22:25:36.730006 1049 sql/event_log.go:132  [n1] Event: "node_join", target: 1, info: {Descriptor:{NodeID:1 Address:127.0.0.1:37085 Attrs: Locality:region=test,dc=dc1 ServerVersion:19.2-14 BuildTag:v20.1.0-beta.1-471-g669f68f-dirty StartedAt:1582928736586580669 LocalityAddress:[] ClusterName: SQLAddress:127.0.0.1:41117} ClusterID:065ddae7-e51d-4e3d-ba95-b595e77630df StartedAt:1582928736586580669 LastUp:1582928736586580669}
I200228 22:25:36.737803 1047 server/server_update.go:55  [n1] no need to upgrade, cluster already at the newest version
I200228 22:25:36.760304 1028 sql/event_log.go:132  [n1,client=127.0.0.1:45322,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.rangefeed.enabled Value:true User:root}
I200228 22:25:36.761708 66 kv/kvnemesis/watcher.go:134  watcher waiting for 1582928736.761683056,0
I200228 22:25:36.762846 1152 storage/replica_rangefeed.go:608  [n1,s1,r32/1:/{Table/36-Max}] RangeFeed closed timestamp is empty
I200228 22:25:36.764947 1028 sql/event_log.go:132  [n1,client=127.0.0.1:45322,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.closed_timestamp.target_duration Value:00:00:00.001 User:root}
I200228 22:25:36.768663 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.767300681,0 lagging by 1.340201ms
I200228 22:25:36.768920 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.767595811,0 lagging by 1.314487ms
I200228 22:25:36.769346 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.767879049,0 lagging by 1.454447ms
I200228 22:25:36.769641 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.768288577,0 lagging by 1.341444ms
I200228 22:25:36.770070 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.768589993,0 lagging by 1.470179ms
I200228 22:25:36.770412 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.769032434,0 lagging by 1.369582ms
I200228 22:25:36.770436 1148 kv/kvnemesis/watcher.go:207  watcher notifying 1582928736.761683056,0
I200228 22:25:36.770680 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.769337201,0 lagging by 1.320038ms
I200228 22:25:36.770940 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.769615731,0 lagging by 1.315854ms
I200228 22:25:36.771212 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.769900422,0 lagging by 1.303958ms
I200228 22:25:36.771510 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.770179178,0 lagging by 1.321315ms
I200228 22:25:36.771780 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.770468271,0 lagging by 1.304565ms
I200228 22:25:36.772055 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.770747130,0 lagging by 1.300883ms
I200228 22:25:36.772360 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.771026206,0 lagging by 1.326647ms
I200228 22:25:36.772632 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.771328264,0 lagging by 1.295833ms
I200228 22:25:36.772933 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.771588382,0 lagging by 1.332202ms
I200228 22:25:36.773490 1028 sql/event_log.go:132  [n1,client=127.0.0.1:45322,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.closed_timestamp.target_duration Value:DEFAULT User:root}
I200228 22:25:36.775754 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.774939169,0
  4 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  4 OP    txn.Put(ctx, /Table/50/"7b2e13b1", v-0) // nil
  4 OP    txn.Get(ctx, /Table/50/"34d481fb") // (nil, nil)
  4 OP    b := &Batch{}
  4 OP    b.Get(ctx, /Table/50/"0a98d198") // (nil, nil)
  4 OP    b.Get(ctx, /Table/50/"27db36ad") // (nil, nil)
  4 OP    txn.CommitInBatch(ctx, b) // nil
  4 OP    return nil
  4 OP  }) // nil 1582928736.774939693,0
  after: 1582928736.775712344,0
I200228 22:25:36.776643 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.775779571,0
  4 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  4 OP    txn.Put(ctx, /Table/50/"7b2e13b1", v-1) // nil
  4 OP    txn.Put(ctx, /Table/50/"7b2e13b1", v-2) // nil
  4 OP    txn.Get(ctx, /Table/50/"7b2e13b1") // ("v-2", nil)
  4 OP    return errors.New("rollback")
  4 OP  }) // rollback
  after: 1582928736.776616416,0
I200228 22:25:36.776678 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.776660776,0
  4 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  4 OP    return nil
  4 OP  }) // nil 1582928736.776661202,0
  after: 1582928736.776668834,0
I200228 22:25:36.777273 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.776694928,0
  4 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  4 OP    txn.Put(ctx, /Table/50/"45258014", v-3) // nil
  4 OP    txn.Get(ctx, /Table/50/"45258014") // ("v-3", nil)
  4 OP    return nil
  4 OP  }) // nil 1582928736.776695227,0
  after: 1582928736.777244737,0
I200228 22:25:36.777528 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.777292691,0
  4 OP  db0.Put(ctx, /Table/50/"7b2e13b1", v-4) // nil
  after: 1582928736.777508139,0
I200228 22:25:36.777648 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.777546885,0
  4 OP  db0.AdminMerge(ctx, /Table/50/"9564ef5c") // storage/replica_command.go:776: merge failed: cannot merge final range
  after: 1582928736.777625582,0
I200228 22:25:36.777680 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.777661271,0
  4 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  4 OP    b := &Batch{}
  4 OP    txn.CommitInBatch(ctx, b) // nil
  4 OP    return nil
  4 OP  }) // nil 1582928736.777661683,0
  after: 1582928736.777670176,0
I200228 22:25:36.777774 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.777689466,0
  4 OP  db0.Get(ctx, /Table/50/"04831939") // (nil, nil)
  after: 1582928736.777758808,0
I200228 22:25:36.777806 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.777788114,0
  4 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  4 OP    b := &Batch{}
  4 OP    txn.CommitInBatch(ctx, b) // nil
  4 OP    return nil
  4 OP  }) // nil 1582928736.777788512,0
  after: 1582928736.777796808,0
I200228 22:25:36.777902 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.777819587,0
  4 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  4 OP    b := &Batch{}
  4 OP    b.Get(ctx, /Table/50/"90025020") // (nil, nil)
  4 OP    txn.CommitInBatch(ctx, b) // nil
  4 OP    return nil
  4 OP  }) // nil 1582928736.777819887,0
  after: 1582928736.777887041,0
I200228 22:25:36.777979 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.777915347,0
  4 OP  db0.Get(ctx, /Table/50/"96bf0183") // (nil, nil)
  after: 1582928736.777965718,0
I200228 22:25:36.778154 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.777998726,0
  4 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  4 OP    txn.Get(ctx, /Table/50/"45258014") // ("v-3", nil)
  4 OP    b := &Batch{}
  4 OP    b.Get(ctx, /Table/50/"5f5cb3a8") // (nil, nil)
  4 OP    b.Get(ctx, /Table/50/"95ccf533") // (nil, nil)
  4 OP    b.Get(ctx, /Table/50/"7b2e13b1") // ("v-4", nil)
  4 OP    txn.CommitInBatch(ctx, b) // nil
  4 OP    return nil
  4 OP  }) // nil 1582928736.777999088,0
  after: 1582928736.778122892,0
I200228 22:25:36.778359 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"7b2e13b1" 1582928736.774939693,0 -> /BYTES/v-0 (prev /)
I200228 22:25:36.778407 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"45258014" 1582928736.776695227,0 -> /BYTES/v-3 (prev /)
I200228 22:25:36.778432 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"7b2e13b1" 1582928736.777318540,0 -> /BYTES/v-4 (prev /BYTES/v-0)
I200228 22:25:36.778832 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.778173756,0
  4 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  4 OP    txn.Put(ctx, /Table/50/"7b2e13b1", v-5) // nil
  4 OP    txn.Put(ctx, /Table/50/"872aa197", v-6) // nil
  4 OP    return errors.New("rollback")
  4 OP  }) // rollback
  after: 1582928736.778806599,0
I200228 22:25:36.778936 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.778861181,0
  4 OP  db0.AdminMerge(ctx, /Table/50/"2dcdefcb") // storage/replica_command.go:776: merge failed: cannot merge final range
  after: 1582928736.778923402,0
I200228 22:25:36.778973 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.778955717,0
  4 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  4 OP    b := &Batch{}
  4 OP    txn.CommitInBatch(ctx, b) // nil
  4 OP    return nil
  4 OP  }) // nil 1582928736.778956096,0
  after: 1582928736.778963259,0
I200228 22:25:36.779430 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"7b2e13b1" 1582928736.778571270,0 -> /BYTES/v-7 (prev /BYTES/v-4)
I200228 22:25:36.779485 1089 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.778507464,0
  1 OP  db0.Put(ctx, /Table/50/"7b2e13b1", v-7) // nil
  after: 1582928736.779473364,0
I200228 22:25:36.779440 1088 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.778614415,0
  0 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  0 OP    txn.Put(ctx, /Table/50/"9db33549", v-8) // nil
  0 OP    txn.Get(ctx, /Table/50/"45258014") // ("v-3", nil)
  0 OP    return nil
  0 OP  }) // nil 1582928736.778614912,0
  after: 1582928736.779418598,0
I200228 22:25:36.781035 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.778998790,0
  4 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  4 OP    txn.Put(ctx, /Table/50/"45258014", v-9) // nil
  4 OP    txn.Get(ctx, /Table/50/"872aa197") // (nil, nil)
  4 OP    txn.Put(ctx, /Table/50/"3cb8b431", v-10) // nil
  4 OP    b := &Batch{}
  4 OP    txn.CommitInBatch(ctx, b) // nil
  4 OP    return nil
  4 OP  }) // nil 1582928736.778999096,0
  after: 1582928736.781005095,0
I200228 22:25:36.781240 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"9db33549" 1582928736.778614912,0 -> /BYTES/v-8 (prev /)
I200228 22:25:36.782345 1299 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.779542124,0
  3 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  3 OP    txn.Put(ctx, /Table/50/"e5192bc8", v-17) // nil
  3 OP    txn.Put(ctx, /Table/50/"872aa197", v-18) // nil
  3 OP    return nil
  3 OP  }) // nil 1582928736.779542484,0
  after: 1582928736.782318683,0
I200228 22:25:36.782441 1299 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.782370805,0
  3 OP  db0.AdminMerge(ctx, /Table/50/"69e0c8cf") // storage/replica_command.go:776: merge failed: cannot merge final range
  after: 1582928736.782428397,0
I200228 22:25:36.782614 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"3cb8b431" 1582928736.778999096,0 -> /BYTES/v-10 (prev /)
I200228 22:25:36.782656 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"45258014" 1582928736.778999096,0 -> /BYTES/v-9 (prev /BYTES/v-3)
I200228 22:25:36.783234 1089 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.779520628,0
  1 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  1 OP    txn.Put(ctx, /Table/50/"c9ee039a", v-12) // nil
  1 OP    txn.Put(ctx, /Table/50/"131be9e4", v-13) // nil
  1 OP    txn.Put(ctx, /Table/50/"131be9e4", v-14) // nil
  1 OP    b := &Batch{}
  1 OP    b.Put(ctx, /Table/50/"28645076", v-15) // nil
  1 OP    b.Get(ctx, /Table/50/"872aa197") // (nil, nil)
  1 OP    b.Put(ctx, /Table/50/"3cb8b431", v-16) // nil
  1 OP    txn.CommitInBatch(ctx, b) // nil
  1 OP    return nil
  1 OP  }) // nil 1582928736.779521145,0
  after: 1582928736.783191745,0
I200228 22:25:36.783935 1089 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.783260581,0
  1 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  1 OP    txn.Get(ctx, /Table/50/"2a87e3a8") // (nil, nil)
  1 OP    txn.Put(ctx, /Table/50/"3cb8b431", v-21) // nil
  1 OP    b := &Batch{}
  1 OP    b.Get(ctx, /Table/50/"9db33549") // ("v-8", nil)
  1 OP    b.Get(ctx, /Table/50/"57d1a821") // (nil, nil)
  1 OP    txn.CommitInBatch(ctx, b) // nil
  1 OP    return nil
  1 OP  }) // nil 1582928736.783261063,0
  after: 1582928736.783899491,0
I200228 22:25:36.784030 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"131be9e4" 1582928736.779521145,0 -> /BYTES/v-14 (prev /)
I200228 22:25:36.784065 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"28645076" 1582928736.779521145,0 -> /BYTES/v-15 (prev /)
I200228 22:25:36.784086 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"3cb8b431" 1582928736.779521145,0 -> /BYTES/v-16 (prev /BYTES/v-10)
I200228 22:25:36.784170 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"c9ee039a" 1582928736.779521145,0 -> /BYTES/v-12 (prev /)
I200228 22:25:36.784194 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"3cb8b431" 1582928736.783261063,0 -> /BYTES/v-21 (prev /BYTES/v-16)
I200228 22:25:36.784503 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"872aa197" 1582928736.779542484,0 -> /BYTES/v-18 (prev /)
I200228 22:25:36.784554 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"e5192bc8" 1582928736.779542484,0 -> /BYTES/v-17 (prev /)
I200228 22:25:36.790137 1088 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.779870860,0
  0 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  0 OP    txn.Get(ctx, /Table/50/"e2374831") // (nil, nil)
  0 OP    txn.Get(ctx, /Table/50/"e5192bc8") // ("v-17", nil)
  0 OP    return errors.New("rollback")
  0 OP  }) // rollback
  after: 1582928736.790110591,0
I200228 22:25:36.790874 1299 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.782487298,0
  3 OP  db0.Put(ctx, /Table/50/"e5192bc8", v-20) // nil
  after: 1582928736.790852675,0
I200228 22:25:36.791101 1298 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.779218875,0
  2 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  2 OP    b := &Batch{}
  2 OP    b.Put(ctx, /Table/50/"3cb8b431", v-11) // nil
  2 OP    b.Get(ctx, /Table/50/"872aa197") // ("v-18", nil)
  2 OP    b.Get(ctx, /Table/50/"81e0cbc0") // (nil, nil)
  2 OP    txn.CommitInBatch(ctx, b) // nil
  2 OP    return nil
  2 OP  }) // nil 1582928736.783261063,1
  after: 1582928736.791068621,0
I200228 22:25:36.791212 1298 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.791129624,0
  2 OP  db0.Get(ctx, /Table/50/"603ced73") // (nil, nil)
  after: 1582928736.791198076,0
I200228 22:25:36.791263 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"3cb8b431" 1582928736.783261063,1 -> /BYTES/v-11 (prev /BYTES/v-21)
I200228 22:25:36.791313 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"e5192bc8" 1582928736.782504859,0 -> /BYTES/v-20 (prev /BYTES/v-17)
I200228 22:25:36.791390 1089 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.783952442,0
  1 OP  db0.Get(ctx, /Table/50/"e5192bc8") // ("v-20", nil)
  after: 1582928736.791368983,0
I200228 22:25:36.791422 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"a57481b1" 1582928736.790170558,0 -> /BYTES/v-22 (prev /)
I200228 22:25:36.791485 1300 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.781062882,0
  4 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  4 OP    txn.Get(ctx, /Table/50/"25d76e13") // (nil, nil)
  4 OP    txn.Put(ctx, /Table/50/"872aa197", v-19) // nil
  4 OP    return nil
  4 OP  }) // nil 1582928736.781063302,0
  after: 1582928736.791449751,0
I200228 22:25:36.791771 1299 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.790909909,0
  3 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  3 OP    txn.Get(ctx, /Table/50/"a57481b1") // ("v-22", nil)
  3 OP    txn.Put(ctx, /Table/50/"c61c1493", v-23) // nil
  3 OP    b := &Batch{}
  3 OP    b.Put(ctx, /Table/50/"4f40f9fe", v-24) // nil
  3 OP    b.Put(ctx, /Table/50/"45258014", v-25) // nil
  3 OP    b.Put(ctx, /Table/50/"9db33549", v-26) // nil
  3 OP    txn.CommitInBatch(ctx, b) // nil
  3 OP    return nil
  3 OP  }) // nil 1582928736.790910461,0
  after: 1582928736.791725299,0
I200228 22:25:36.791939 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"872aa197" 1582928736.781063302,0 -> /BYTES/v-19 (prev /BYTES/v-18)
I200228 22:25:36.792077 1088 kv/kvnemesis/kvnemesis.go:65
  before: 1582928736.790170038,0
  0 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  0 OP    txn.Get(ctx, /Table/50/"f890cec8") // (nil, nil)
  0 OP    b := &Batch{}
  0 OP    b.Get(ctx, /Table/50/"07781c98") // (nil, nil)
  0 OP    b.Put(ctx, /Table/50/"a57481b1", v-22) // nil
  0 OP    txn.CommitInBatch(ctx, b) // nil
  0 OP    return nil
  0 OP  }) // nil 1582928736.790170558,0
  after: 1582928736.790878479,0
I200228 22:25:36.792101 66 kv/kvnemesis/watcher.go:134  watcher waiting for 1582928736.791725299,0
I200228 22:25:36.792372 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"45258014" 1582928736.790910461,0 -> /BYTES/v-25 (prev /BYTES/v-9)
I200228 22:25:36.792530 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"4f40f9fe" 1582928736.790910461,0 -> /BYTES/v-24 (prev /)
I200228 22:25:36.792558 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"9db33549" 1582928736.790910461,0 -> /BYTES/v-26 (prev /BYTES/v-8)
I200228 22:25:36.792673 1148 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"c61c1493" 1582928736.790910461,0 -> /BYTES/v-23 (prev /)
I200228 22:25:36.795127 1028 sql/event_log.go:132  [n1,client=127.0.0.1:45322,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.closed_timestamp.target_duration Value:00:00:00.001 User:root}
I200228 22:25:36.796661 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.771880485,0 lagging by 24.769339ms
I200228 22:25:36.796943 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.795582419,0 lagging by 1.342785ms
I200228 22:25:36.796981 1148 kv/kvnemesis/watcher.go:207  watcher notifying 1582928736.791725299,0
I200228 22:25:36.797201 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.795860079,0 lagging by 1.330671ms
I200228 22:25:36.797461 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.796147657,0 lagging by 1.303048ms
I200228 22:25:36.797713 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.796401138,0 lagging by 1.301896ms
I200228 22:25:36.798029 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.796659853,0 lagging by 1.360986ms
I200228 22:25:36.798300 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.796992948,0 lagging by 1.293998ms
I200228 22:25:36.799668 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.797256937,0 lagging by 2.398627ms
I200228 22:25:36.799905 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.798585909,0 lagging by 1.307073ms
I200228 22:25:36.800065 1028 sql/event_log.go:132  [n1,client=127.0.0.1:45322,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.closed_timestamp.target_duration Value:DEFAULT User:root}
I200228 22:25:36.800640 1148 kv/kvnemesis/watcher.go:201  watcher reached frontier 1582928736.798836304,0 lagging by 1.789708ms
I200228 22:25:36.801210 66 kv/kvnemesis/kvnemesis.go:89  reproduction steps:
g := ctxgroup.WithContext(ctx)
g.GoCtx(func(ctx context.Context) error {
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Put(ctx, /Table/50/"9db33549", v-8) // nil
    txn.Get(ctx, /Table/50/"45258014") // ("v-3", nil)
    return nil
  }) // nil 1582928736.778614912,0
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Get(ctx, /Table/50/"e2374831") // (nil, nil)
    txn.Get(ctx, /Table/50/"e5192bc8") // ("v-17", nil)
    return errors.New("rollback")
  }) // rollback
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Get(ctx, /Table/50/"f890cec8") // (nil, nil)
    b := &Batch{}
    b.Get(ctx, /Table/50/"07781c98") // (nil, nil)
    b.Put(ctx, /Table/50/"a57481b1", v-22) // nil
    txn.CommitInBatch(ctx, b) // nil
    return nil
  }) // nil 1582928736.790170558,0
  return nil
})
g.GoCtx(func(ctx context.Context) error {
  db0.Put(ctx, /Table/50/"7b2e13b1", v-7) // nil
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Put(ctx, /Table/50/"c9ee039a", v-12) // nil
    txn.Put(ctx, /Table/50/"131be9e4", v-13) // nil
    txn.Put(ctx, /Table/50/"131be9e4", v-14) // nil
    b := &Batch{}
    b.Put(ctx, /Table/50/"28645076", v-15) // nil
    b.Get(ctx, /Table/50/"872aa197") // (nil, nil)
    b.Put(ctx, /Table/50/"3cb8b431", v-16) // nil
    txn.CommitInBatch(ctx, b) // nil
    return nil
  }) // nil 1582928736.779521145,0
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Get(ctx, /Table/50/"2a87e3a8") // (nil, nil)
    txn.Put(ctx, /Table/50/"3cb8b431", v-21) // nil
    b := &Batch{}
    b.Get(ctx, /Table/50/"9db33549") // ("v-8", nil)
    b.Get(ctx, /Table/50/"57d1a821") // (nil, nil)
    txn.CommitInBatch(ctx, b) // nil
    return nil
  }) // nil 1582928736.783261063,0
  db0.Get(ctx, /Table/50/"e5192bc8") // ("v-20", nil)
  return nil
})
g.GoCtx(func(ctx context.Context) error {
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    b := &Batch{}
    b.Put(ctx, /Table/50/"3cb8b431", v-11) // nil
    b.Get(ctx, /Table/50/"872aa197") // ("v-18", nil)
    b.Get(ctx, /Table/50/"81e0cbc0") // (nil, nil)
    txn.CommitInBatch(ctx, b) // nil
    return nil
  }) // nil 1582928736.783261063,1
  db0.Get(ctx, /Table/50/"603ced73") // (nil, nil)
  return nil
})
g.GoCtx(func(ctx context.Context) error {
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Put(ctx, /Table/50/"e5192bc8", v-17) // nil
    txn.Put(ctx, /Table/50/"872aa197", v-18) // nil
    return nil
  }) // nil 1582928736.779542484,0
  db0.AdminMerge(ctx, /Table/50/"69e0c8cf") // storage/replica_command.go:776: merge failed: cannot merge final range
  db0.Put(ctx, /Table/50/"e5192bc8", v-20) // nil
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Get(ctx, /Table/50/"a57481b1") // ("v-22", nil)
    txn.Put(ctx, /Table/50/"c61c1493", v-23) // nil
    b := &Batch{}
    b.Put(ctx, /Table/50/"4f40f9fe", v-24) // nil
    b.Put(ctx, /Table/50/"45258014", v-25) // nil
    b.Put(ctx, /Table/50/"9db33549", v-26) // nil
    txn.CommitInBatch(ctx, b) // nil
    return nil
  }) // nil 1582928736.790910461,0
  return nil
})
g.GoCtx(func(ctx context.Context) error {
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Put(ctx, /Table/50/"7b2e13b1", v-0) // nil
    txn.Get(ctx, /Table/50/"34d481fb") // (nil, nil)
    b := &Batch{}
    b.Get(ctx, /Table/50/"0a98d198") // (nil, nil)
    b.Get(ctx, /Table/50/"27db36ad") // (nil, nil)
    txn.CommitInBatch(ctx, b) // nil
    return nil
  }) // nil 1582928736.774939693,0
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Put(ctx, /Table/50/"7b2e13b1", v-1) // nil
    txn.Put(ctx, /Table/50/"7b2e13b1", v-2) // nil
    txn.Get(ctx, /Table/50/"7b2e13b1") // ("v-2", nil)
    return errors.New("rollback")
  }) // rollback
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    return nil
  }) // nil 1582928736.776661202,0
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Put(ctx, /Table/50/"45258014", v-3) // nil
    txn.Get(ctx, /Table/50/"45258014") // ("v-3", nil)
    return nil
  }) // nil 1582928736.776695227,0
  db0.Put(ctx, /Table/50/"7b2e13b1", v-4) // nil
  db0.AdminMerge(ctx, /Table/50/"9564ef5c") // storage/replica_command.go:776: merge failed: cannot merge final range
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    b := &Batch{}
    txn.CommitInBatch(ctx, b) // nil
    return nil
  }) // nil 1582928736.777661683,0
  db0.Get(ctx, /Table/50/"04831939") // (nil, nil)
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    b := &Batch{}
    txn.CommitInBatch(ctx, b) // nil
    return nil
  }) // nil 1582928736.777788512,0
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    b := &Batch{}
    b.Get(ctx, /Table/50/"90025020") // (nil, nil)
    txn.CommitInBatch(ctx, b) // nil
    return nil
  }) // nil 1582928736.777819887,0
  db0.Get(ctx, /Table/50/"96bf0183") // (nil, nil)
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Get(ctx, /Table/50/"45258014") // ("v-3", nil)
    b := &Batch{}
    b.Get(ctx, /Table/50/"5f5cb3a8") // (nil, nil)
    b.Get(ctx, /Table/50/"95ccf533") // (nil, nil)
    b.Get(ctx, /Table/50/"7b2e13b1") // ("v-4", nil)
    txn.CommitInBatch(ctx, b) // nil
    return nil
  }) // nil 1582928736.777999088,0
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Put(ctx, /Table/50/"7b2e13b1", v-5) // nil
    txn.Put(ctx, /Table/50/"872aa197", v-6) // nil
    return errors.New("rollback")
  }) // rollback
  db0.AdminMerge(ctx, /Table/50/"2dcdefcb") // storage/replica_command.go:776: merge failed: cannot merge final range
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    b := &Batch{}
    txn.CommitInBatch(ctx, b) // nil
    return nil
  }) // nil 1582928736.778956096,0
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Put(ctx, /Table/50/"45258014", v-9) // nil
    txn.Get(ctx, /Table/50/"872aa197") // (nil, nil)
    txn.Put(ctx, /Table/50/"3cb8b431", v-10) // nil
    b := &Batch{}
    txn.CommitInBatch(ctx, b) // nil
    return nil
  }) // nil 1582928736.778999096,0
  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Get(ctx, /Table/50/"25d76e13") // (nil, nil)
    txn.Put(ctx, /Table/50/"872aa197", v-19) // nil
    return nil
  }) // nil 1582928736.781063302,0
  return nil
})
g.Wait()
I200228 22:25:36.801385 66 kv/kvnemesis/kvnemesis.go:90  kvs (recorded from rangefeed):
  /Table/50/"131be9e4" 1582928736.779521145,0 -> /BYTES/v-14
  /Table/50/"28645076" 1582928736.779521145,0 -> /BYTES/v-15
  /Table/50/"3cb8b431" 1582928736.783261063,1 -> /BYTES/v-11
  /Table/50/"3cb8b431" 1582928736.783261063,0 -> /BYTES/v-21
  /Table/50/"3cb8b431" 1582928736.779521145,0 -> /BYTES/v-16
  /Table/50/"3cb8b431" 1582928736.778999096,0 -> /BYTES/v-10
  /Table/50/"45258014" 1582928736.790910461,0 -> /BYTES/v-25
  /Table/50/"45258014" 1582928736.778999096,0 -> /BYTES/v-9
  /Table/50/"45258014" 1582928736.776695227,0 -> /BYTES/v-3
  /Table/50/"4f40f9fe" 1582928736.790910461,0 -> /BYTES/v-24
  /Table/50/"7b2e13b1" 1582928736.778571270,0 -> /BYTES/v-7
  /Table/50/"7b2e13b1" 1582928736.777318540,0 -> /BYTES/v-4
  /Table/50/"7b2e13b1" 1582928736.774939693,0 -> /BYTES/v-0
  /Table/50/"872aa197" 1582928736.781063302,0 -> /BYTES/v-19
  /Table/50/"872aa197" 1582928736.779542484,0 -> /BYTES/v-18
  /Table/50/"9db33549" 1582928736.790910461,0 -> /BYTES/v-26
  /Table/50/"9db33549" 1582928736.778614912,0 -> /BYTES/v-8
  /Table/50/"a57481b1" 1582928736.790170558,0 -> /BYTES/v-22
  /Table/50/"c61c1493" 1582928736.790910461,0 -> /BYTES/v-23
  /Table/50/"c9ee039a" 1582928736.779521145,0 -> /BYTES/v-12
  /Table/50/"e5192bc8" 1582928736.782504859,0 -> /BYTES/v-20
  /Table/50/"e5192bc8" 1582928736.779542484,0 -> /BYTES/v-17
I200228 22:25:36.801598 66 kv/kvnemesis/kvnemesis.go:101  kvs (scan of latest values according to crdb):
  /Table/50/"131be9e4" 1582928736.779521145,0 -> /BYTES/v-14
  /Table/50/"28645076" 1582928736.779521145,0 -> /BYTES/v-15
  /Table/50/"3cb8b431" 1582928736.783261063,1 -> /BYTES/v-11
  /Table/50/"45258014" 1582928736.790910461,0 -> /BYTES/v-25
  /Table/50/"4f40f9fe" 1582928736.790910461,0 -> /BYTES/v-24
  /Table/50/"7b2e13b1" 1582928736.778571270,0 -> /BYTES/v-7
  /Table/50/"872aa197" 1582928736.781063302,0 -> /BYTES/v-19
  /Table/50/"9db33549" 1582928736.790910461,0 -> /BYTES/v-26
  /Table/50/"a57481b1" 1582928736.790170558,0 -> /BYTES/v-22
  /Table/50/"c61c1493" 1582928736.790910461,0 -> /BYTES/v-23
  /Table/50/"c9ee039a" 1582928736.779521145,0 -> /BYTES/v-12
  /Table/50/"e5192bc8" 1582928736.782504859,0 -> /BYTES/v-20
I200228 22:25:36.801770 66 util/stop/stopper.go:539  quiescing
I200228 22:25:36.801798 1451 util/stop/stopper.go:539  quiescing
W200228 22:25:36.802041 500 storage/intentresolver/intent_resolver.go:833  failed to gc transaction record: could not GC completed transaction anchored at /Table/SystemConfigSpan/Start: node unavailable; try another peer
--- FAIL: TestKVNemesisSingleNode (0.30s)
    kvnemesis_test.go:48: failure:
        error with attached stack trace:
            github.com/cockroachdb/cockroach/pkg/kv/kvnemesis.(*validator).checkCommittedTxn
            	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/validator.go:374
            github.com/cockroachdb/cockroach/pkg/kv/kvnemesis.(*validator).checkAtomic
            	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/validator.go:293
            github.com/cockroachdb/cockroach/pkg/kv/kvnemesis.(*validator).processOp
            	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/validator.go:279
            github.com/cockroachdb/cockroach/pkg/kv/kvnemesis.Validate
            	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/validator.go:72
            github.com/cockroachdb/cockroach/pkg/kv/kvnemesis.RunNemesis
            	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/kvnemesis.go:86
            github.com/cockroachdb/cockroach/pkg/kv/kvnemesis.TestKVNemesisSingleNode
            	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/kvnemesis_test.go:44
            testing.tRunner
            	/usr/local/go/src/testing/testing.go:909
            runtime.goexit
            	/usr/local/go/src/runtime/asm_amd64.s:1357
          - error with embedded safe details: %s: %s
            -- arg 1: 
            -- arg 2: 
          - committed txn non-atomic timestamps: [w]/Table/50/"3cb8b431":1582928736.783261063,1->v-11 [r]/Table/50/"872aa197":[1582928736.779542484,0,1582928736.781063302,0)->v-18 [r]/Table/50/"81e0cbc0":[0,0,)->
FAIL
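For orientation, the atomicity invariant that fails above can be sketched as a toy version in Go. This is a simplified stand-in for the kvnemesis validator, not its real API: `Timestamp`, `readWindow`, and `checkCommittedTxn` are made-up names, and timestamps are reduced to bare wall-time nanos (the real `hlc.Timestamp` also has a logical component).

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for CockroachDB's hlc.Timestamp
// (wall time only; the real type also carries a logical component).
type Timestamp int64

// readWindow is the interval during which an observed read value was the
// latest value for its key, as reconstructed from the rangefeed: [start, end).
type readWindow struct {
	start, end Timestamp
}

// contains reports whether ts falls inside the half-open window. An end of
// 0 means the value was never overwritten, i.e. the window is unbounded.
func (w readWindow) contains(ts Timestamp) bool {
	return ts >= w.start && (w.end == 0 || ts < w.end)
}

// checkCommittedTxn is a toy version of the atomicity check: a committed
// txn is consistent only if there is a single timestamp (its commit
// timestamp) at which every read's observed value was the latest value.
func checkCommittedTxn(commitTS Timestamp, reads []readWindow) error {
	for _, r := range reads {
		if !r.contains(commitTS) {
			return fmt.Errorf("non-atomic: read window [%d,%d) excludes commit ts %d",
				r.start, r.end, commitTS)
		}
	}
	return nil
}

func main() {
	// The failure in this issue, with abbreviated wall times: the txn wrote
	// 3cb8b431->v-11 at ...783261063,1 but read 872aa197->v-18, which was
	// only the latest value during [...779542484, ...781063302).
	err := checkCommittedTxn(783261063, []readWindow{{start: 779542484, end: 781063302}})
	fmt.Println(err)
}
```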

@nvanbenschoten
Member

Previous discussion in #45542 (comment).

This certainly looks real. Dan mentioned that this didn't occur when we ran without CommitInBatch operations, so I'm hopeful that like #44315, this is only possible with reads in the same batch as an EndTxn, which don't issue in practice today. It certainly wouldn't surprise me if that was the case.

I'm looking forward to digging in deeper, though I won't be able to get to this until at least next week.

@nvanbenschoten
Member

@OwenQian do you mind posting an update about your progress here when you get a chance? Are you able to reproduce this?

@OwenQian

OwenQian commented Mar 30, 2020

I was able to reproduce the bug two weeks ago, when Dan's branch was at ce6b98a, but recently the failure doesn't seem to reproduce.

I looked through the PR for #45586, and it seems the configs were changed to disable CommitInBatch while this issue exists, by setting the probability of GetExisting and GetMissing to 0. So while trying to reproduce the bug, I commented out those lines in NewDefaultConfig() in pkg/kv/kvnemesis/generator.go.

I've been running

make roachprod-stress PKG=./pkg/kv/kvnemesis TESTS=TestKVNemesisSingleNode TESTTIMEOUT=60s TESTFLAGS='-v -show-logs' CLUSTER=owenq-stress

on dce269d4c2 without any reproduction success.

When the above does run into a failure, it seems to always be this error instead:

I200327 20:52:03.649491 1085 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"778572df" 1585342323.543796807,0 -> /BYTES/v-16 (prev /<empty>)
I200327 20:52:03.649541 1085 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"e5d8f188" 1585342323.543796807,0 -> /BYTES/v-15 (prev /BYTES/v-1)
I200327 20:52:03.649881 1085 kv/kvnemesis/watcher.go:165  rangefeed Put /Table/50/"e4802da4" 1585342323.649024530,0 -> /BYTES/v-20 (prev /<empty>)
I200327 20:52:03.649928 1277 kv/kvnemesis/kvnemesis.go:65  
  before: 1585342323.649024053,0
  1 OP  db0.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
  1 OP    txn.Put(ctx, /Table/50/"e4802da4", v-20) // nil
  1 OP    txn.Get(ctx, /Table/50/"35e4c5e0") // ("v-6", nil)
  1 OP    b := &Batch{}
  1 OP    b.Get(ctx, /Table/50/"35e4c5e0") // ("v-6", nil)
  1 OP    txn.CommitInBatch(ctx, b) // nil
  1 OP    return nil
  1 OP  }) // nil txnpb:("unnamed" meta={id=a8e399f3 key=/Table/50/"e4802da4" pri=0.02717604 epo=0 ts=1585342323.649024530,0 min=1585342323.649024530,0 seq=2} lock=true stat=COMMITTED rts=1585342323.649024530,0 wto=false max=1585342324.149024530,0)
  after: 1585342323.649879753,0
F200327 20:52:03.655363 1278 kv/kvserver/concurrency/concurrency_manager.go:261  [n1,s1,r32/1:/{Table/36-Max}] discovered non-conflicting lock

stresslog.txt

@nvanbenschoten
Member

Thanks for the update. The error you're seeing is fixed by #46625. Is the issue that you're hitting this bug too quickly to be able to run for long enough, or do you suspect that it was actually fixed? If you have reason to believe that it was already fixed then it would be good to determine what fixed it.

@OwenQian

I think it's most likely that the concurrency manager bug is preempting this bug; I don't currently have any evidence that suggests that the bug has fixed itself. I'll apply #46625 and see if the bug becomes reproducible.

@OwenQian

After upgrading to Catalina, I've been running into some issues trying to get make roachprod-stress to even run.

Running it with #45542 checked out, along with cherry-picking #46625 (to fix the non-conflicting lock concurrency_manager error) and #46673 (to fix an Xcode versioning issue), runs into a CMake error that seems identical to the CI failures being discussed in the dev-inf Slack channel, but I'm not sure if they're related.

CMake Error: The source directory "" does not exist.
Specify --help for usage, or press the help button on the CMake GUI.
Makefile:2147: recipe for target 'cmake_check_build_system' failed
make[1]: *** [cmake_check_build_system] Error 1
Makefile:738: recipe for target '/go/native/x86_64-pc-linux-gnu/libroach/libroachccl.a' failed
make: *** [/go/native/x86_64-pc-linux-gnu/libroach/libroachccl.a] Error 2
make: *** Waiting for unfinished jobs....

However, between #45542 and master, something seems to have fixed this CMake error, as roachprod-stress does work on the master branch as of commit

2020-03-31 8a50e9f474 Merge #46731 #46735 #46738 #46748

@OwenQian

OwenQian commented Mar 31, 2020

This does not seem to be reproducing easily on the master branch. After 13731 runs (8m30s), a run of the test timed out. I've seen this happen several times: a run of the stress test times out before the inconsistent timestamp bug appears. I've bumped the timeout up from 60s to 90s to see if that makes a difference.

Does it seem like I'm doing anything wrong in the reproduction steps? If not, it seems more plausible that this bug has been silently resolved.

@danhhz when you first discovered this bug, was it reproducing reliably?

@OwenQian

Even with the timeout bumped to 90s, it's still timing out rather than reproducing the bug, with 20083 runs completed over 12m27s.

@nvanbenschoten
Member

I suspect that the timeout you're seeing is due to #46639. I was seeing the same thing last week. Until we fix that more generally, you'll probably either want to disable Range merges (if that doesn't prevent the bug from reproducing) or apply the following patch:

diff --git a/pkg/kv/kvserver/replica_command.go b/pkg/kv/kvserver/replica_command.go
index ddba3d5b4d..880a2ca2ec 100644
--- a/pkg/kv/kvserver/replica_command.go
+++ b/pkg/kv/kvserver/replica_command.go
@@ -778,6 +778,7 @@ func (r *Replica) AdminMerge(
                        }
                        return reply, nil
                }
+               time.Sleep(jitteredInterval(10 * time.Millisecond))
        }
 }

It seems like even with the timeout bumped to 90s, it's still timing out rather than reproducing the bug with 20083 runs completed over 12m27s.

How frequently were you seeing this reproduce before?

@OwenQian

OwenQian commented Mar 31, 2020

How frequently were you seeing this reproduce before?

Previously it was almost always reproducing under 10m.

Adding the timing jitter fixed the Range merge timeout. Now it seems very likely that some change on the master branch has silently fixed this bug: I ran it on master for 110097 runs in 1h8m25s without any failures. So now I'm trying to track down the change that fixed it.

@irfansharif
Contributor

irfansharif commented Mar 31, 2020

I'd probably sanity-check by finding the SHA that introduced it; what you're seeing could very well be recent changes masking the fault, not fixing it (though it could be a fix too, if you're able to piece together how it got fixed from this information).

Are you using roachprod-stress workers, by any chance? Definitely check that out if you haven't already; it could reduce your search time by a lot.

@OwenQian

running into a cmake error that seems identical to the CI failures that are being discussed in the dev-inf slack channel

I fixed this with the solution mentioned in this issue (in the slack channel): https://github.com/cockroachdb/dev-inf/issues/66, except I did the version change in the reverse direction.

diff --git i/build/builder.sh w/build/builder.sh
index b6cdca1679..26b8037ba5 100755
--- i/build/builder.sh
+++ w/build/builder.sh
@@ -3,7 +3,8 @@
 set -euo pipefail

 image=cockroachdb/builder
-version=20200115-154509
+version=20200326-092324
+

Not really sure what's going on there, but I'm able to run it locally on dce269d4c2 again.

@OwenQian

OwenQian commented Apr 1, 2020

As a sanity check, this bug is still reproducing on ce6b98a438.

@nvanbenschoten
Member

Nice, that's reassuring. Any luck with the bisect strategy we discussed?

@OwenQian

OwenQian commented Apr 1, 2020

The process seems to be going smoothly enough, but I haven't narrowed it down to the root cause commit yet.

@OwenQian

OwenQian commented Apr 7, 2020

This bisect process now seems to be going less than smoothly. I've run into a multitude of extraneous issues that are getting in the way of seeing whether this bug exists on a given SHA.

There seems to be another test timeout problem that's distinct from the Range merges; it occurs even with the jitter.

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sum      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0

** Compaction Stats [default] **
Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Uptime(secs): 20.0 total, 20.0 interval
Flush(GB): cumulative 0.000, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
estimated_pending_compaction_bytes: 0 B
I200407 21:29:54.688889 167 server/status/runtime.go:498  [n1] runtime stats: 102 MiB RSS, 171 goroutines, 22 MiB/37 MiB/34 MiB GO alloc/idle/total, 12 MiB/18 MiB CGO alloc/total, 1313.8 CGO/sec, 4.5/1.4 %(u/s)time, 0.0 %gc (1x), 1.6 MiB/1.6 MiB (r/w)net
I200407 21:29:55.062060 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586294985.067760145,0 lagging by 9.994283215s
I200407 21:29:55.084139 1203 kv/kvnemesis/kvnemesis.go:63
  0 OP  db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  0 OP    b := &Batch{}
  0 OP    b.Put(ctx, /Table/50/"4285d360", v-33) // nil
  0 OP    b.Get(ctx, /Table/50/"53338c85") // ("v-18", nil)
  0 OP    b.Put(ctx, /Table/50/"69106ef0", v-34) // nil
  0 OP    txn.CommitInBatch(ctx, b) // nil
  0 OP    return nil
  0 OP  }) // nil
I200407 21:29:55.084227 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586294991.038052901,0 lagging by 4.046164021s
I200407 21:29:55.084258 1003 kv/kvnemesis/watcher.go:162  rangefeed Put /Table/50/"4285d360" 1586294991.639472783,2 -> /BYTES/v-33 (prev /BYTES/v-24)
I200407 21:29:55.084301 1003 kv/kvnemesis/watcher.go:162  rangefeed Put /Table/50/"69106ef0" 1586294991.639472783,2 -> /BYTES/v-34 (prev /<empty>)
I200407 21:29:55.239658 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586294991.639472783,0 lagging by 3.600168536s
I200407 21:29:55.839758 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586294992.239582491,0 lagging by 3.600160583s
I200407 21:29:56.439864 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586294992.839690043,0 lagging by 3.600160078s
I200407 21:29:57.040275 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586294993.439765571,0 lagging by 3.600491339s
I200407 21:29:57.640434 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586294994.040176959,0 lagging by 3.600237842s
I200407 21:29:58.240507 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586294994.640312267,0 lagging by 3.600178517s
I200407 21:29:58.840644 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586294995.061403412,0 lagging by 3.779205622s
I200407 21:30:04.689112 167 server/status/runtime.go:498  [n1] runtime stats: 104 MiB RSS, 169 goroutines, 16 MiB/41 MiB/36 MiB GO alloc/idle/total, 12 MiB/18 MiB CGO alloc/total, 1549.9 CGO/sec, 5.1/1.1 %(u/s)time, 0.0 %gc (2x), 1.7 MiB/1.7 MiB (r/w)net
I200407 21:30:04.706284 3630 kv/kvserver/replica_consistency.go:255  [n1,consistencyChecker,s1,r4/1:/System{/tsd-tse}] triggering stats recomputation to resolve delta of {ContainsEstimates:4062 LastUpdateNanos:1586294994719624489 IntentAge:0 GCBytesAge:0 LiveBytes:17804 LiveCount:-650 KeyBytes:-31503 KeyCount:-650 ValBytes:49307 ValCount:-650 IntentBytes:0 IntentCount:0 SysBytes:0 SysCount:0}
I200407 21:30:05.313245 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295001.249115947,0 lagging by 4.064112154s
I200407 21:30:05.449454 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295001.849219340,0 lagging by 3.600216141s
I200407 21:30:06.049492 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295002.449323928,0 lagging by 3.600150011s
I200407 21:30:06.649599 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295003.049392057,0 lagging by 3.600192825s
I200407 21:30:07.249690 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295003.649510581,0 lagging by 3.600153465s
I200407 21:30:07.851196 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295004.249598009,0 lagging by 3.601584731s
I200407 21:30:08.451597 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295004.851102170,0 lagging by 3.600477925s
I200407 21:30:09.051721 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295005.312338837,0 lagging by 3.739367567s
I200407 21:30:14.689200 167 server/status/runtime.go:498  [n1] runtime stats: 106 MiB RSS, 170 goroutines, 13 MiB/42 MiB/36 MiB GO alloc/idle/total, 13 MiB/20 MiB CGO alloc/total, 1633.1 CGO/sec, 5.8/0.4 %(u/s)time, 0.0 %gc (2x), 1.9 MiB/1.7 MiB (r/w)net
I200407 21:30:15.560854 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295011.452610227,0 lagging by 4.108229856s
I200407 21:30:15.653011 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295012.052709463,0 lagging by 3.600286031s
I200407 21:30:16.253496 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295012.652902667,0 lagging by 3.60057384s
I200407 21:30:16.853592 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295013.253401627,0 lagging by 3.600173234s
I200407 21:30:17.453719 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295013.853507488,0 lagging by 3.600194102s
I200407 21:30:18.053837 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295014.453624040,0 lagging by 3.600196105s
I200407 21:30:18.653925 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295015.053742722,0 lagging by 3.600165386s
I200407 21:30:19.254055 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295015.560135714,0 lagging by 3.693901545s
I200407 21:30:24.689214 167 server/status/runtime.go:498  [n1] runtime stats: 107 MiB RSS, 170 goroutines, 20 MiB/38 MiB/35 MiB GO alloc/idle/total, 13 MiB/20 MiB CGO alloc/total, 1488.8 CGO/sec, 4.6/1.3 %(u/s)time, 0.0 %gc (1x), 1.7 MiB/1.7 MiB (r/w)net
I200407 21:30:25.810521 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295021.656219917,0 lagging by 4.15428572s
I200407 21:30:25.860652 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295022.256341372,0 lagging by 3.604296537s
I200407 21:30:26.460782 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295022.860581980,0 lagging by 3.600183317s
I200407 21:30:27.060916 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295023.460697998,0 lagging by 3.600201741s
I200407 21:30:27.661043 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295024.060817773,0 lagging by 3.600207258s
I200407 21:30:28.261164 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295024.660935737,0 lagging by 3.60020997s
I200407 21:30:28.861299 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295025.261069937,0 lagging by 3.60021141s
I200407 21:30:29.465637 1003 kv/kvnemesis/watcher.go:197  watcher reached frontier 1586295025.809705674,0 lagging by 3.655913981s
panic: test timed out after 1m0s

@nvanbenschoten
Member

Do you have a sense for how long this specific failure mode takes to reproduce on SHAs that you are certain do contain the bug? How does that compare to the other issues you see cropping up?

If we're not able to bisect this then we'll have to go back to the basics and debug on the SHA that we can reliably hit the issue on. That would likely require adding logging throughout the transaction evaluation path and trying to determine how the transaction ended up performing a write at a different timestamp than its read. Here's an example of the kind of logging that might be helpful for this kind of investigation: e43a9eb.

@OwenQian

OwenQian commented Apr 8, 2020

On SHAs that do contain the bug, it usually reproduces under 10m. The other issues I've been running into seem to appear much more reliably, usually in 2 to 3m. Would it be possible to stitch together ~10m of runtime across multiple runs that terminate with other failures and assume this failure doesn't reproduce if it doesn't appear?

@nvanbenschoten
Member

I would recommend bumping the test timeout up to 20 minutes while doing the bisect. If you can stitch together 20 minutes of runtime across multiple runs, then it would be possible to make such an assumption.

@OwenQian

OwenQian commented Apr 16, 2020

Bisect update: the bisect finished and resulted in this result:

The property has changed between dcf8c44 and [22544bf].

The original commit that introduced this bug was ce6b98a, call this C1. This commit was a WIP and later cleaned up and merged into master as dce269d, call this C2.

The first old/good SHA was the parent of C1:
old: [22544bf] Merge #45444
Let's call this P1.

The new/bad SHA was the parent of dce269d (C2):
new: [7241972] Merge #45668 #46227
git bisect new 7241972
Let's call this P2.

What's pretty suspicious is that over the entire bisect process of 15 steps, this bug didn't reproduce a single time, i.e., they were all on the new side of the bisect.

The only SHA I've seen the bug reproduce on is C1 itself; it does not reproduce on C2. Both of these commits were addressing #45542.

Given that the bug doesn't seem to reproduce on any of the bisect steps between the parents of these two commits, the possibilities that come to mind for me are:

  1. C1 introduced a change that caused this bug to surface but the cleanup between C1 and C2 resolved the bug.
  2. C1 and C2 both introduced a change causing this failure, but since there was a long period of time elapsed between when C1 was written and when C2 was merged in, some change between P1 and P2 resolved the issue.
  3. A combination of changes from C1 to C2 and from P1 to P2 resolved the issue.

I'll try to test out hypothesis 1 by applying C2 directly onto P1 and see if the bug reproduces there.

Another thing that came to mind: if hypothesis 2 is true, then I think I may have made a mistake in my bisecting process. I was running the bisect without applying C1 or C2 onto the SHA selected by the bisect, so if the changes in C1/C2 are necessary for this failure to surface, it would never surface the way I've done it.

@nvanbenschoten
Member

Nice job getting to the end of the bisect.

C1 and C2 both introduced a change causing this failure, but since there was a long period of time elapsed between when C1 was written and when C2 was merged in, some change between P1 and P2 resolved the issue.

This is what I had been thinking we'd find, but I'm not sure the bisect confirmed or denied this. You're right that it is very suspicious that we never saw the failure during the 15 step bisect.

Another alternative is that C1 itself was buggy. We could determine this by cherry-picking C2 onto P1 and trying to repro, which is essentially what you said here:

I'll try to test out hypothesis 1 by applying C2 directly onto P1 and see if the bug reproduces there.

If we can't reproduce on P1+C2 then I think we can conclude that this was caused by an issue in C1 itself and close this issue.

@irfansharif
Contributor

Closing it out based on my reading of the above (may be wrong).
