kv: txn with inconsistent read & write timestamps #45586
Here's a log from a run with this issue as of ce6b98a438cfa316527e8b0357c987f6d3308d7e (timestamps are abbreviated to make things obvious). There is one txn that writes at 1582928736.783261063,1 but whose read of /Table/50/"872aa197" observed v-18, a value that was only visible in the window [1582928736.779542484,0, 1582928736.781063302,0). This means the allowable window for that read does not contain the txn's write timestamp, so the txn's operations could not all have occurred at a single timestamp. Reproduces in minutes with roachprod-stress:
=== RUN TestKVNemesisSingleNode I200228 22:25:36.517183 66 base/addr_validation.go:296 [n?] server certificate addresses: IP=127.0.0.1,::1; DNS=localhost,*.local; CN=node I200228 22:25:36.517270 66 base/addr_validation.go:342 [n?] web UI certificate addresses: IP=127.0.0.1,::1; DNS=localhost,*.local; CN=node W200228 22:25:36.526858 66 server/status/runtime.go:308 [n?] Could not parse build timestamp: parsing time "" as "2006/01/02 15:04:05": cannot parse "" as "2006" I200228 22:25:36.550327 66 server/server.go:1025 [n?] monitoring forward clock jumps based on server.clock.forward_jump_check_enabled I200228 22:25:36.580218 66 server/config.go:570 [n?] 1 storage engine initialized I200228 22:25:36.580242 66 server/config.go:573 [n?] RocksDB cache size: 128 MiB I200228 22:25:36.580250 66 server/config.go:573 [n?] store 0: in-memory, size 0 B I200228 22:25:36.580446 66 server/server.go:1480 [n?] no stores bootstrapped and --join flag specified, awaiting init command or join with an already initialized node. I200228 22:25:36.580490 66 server/server.go:1490 [n?] **** add additional nodes by specifying --join=127.0.0.1:37085 I200228 22:25:36.586490 66 server/node.go:323 [n?] 
**** cluster 065ddae7-e51d-4e3d-ba95-b595e77630df has been created I200228 22:25:36.586767 66 gossip/gossip.go:395 [n1] NodeDescriptor set to node_id:1 address: attrs:<> locality: tiers: > ServerVersion: build_tag:"v20.1.0-beta.1-471-g669f68f-dirty" started_at:1582928736586580669 cluster_name:"" sql_address: W200228 22:25:36.591143 202 storage/store.go:1538 [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown I200228 22:25:36.591245 66 server/node.go:427 [n1] initialized store [n1,s1]: disk (capacity=512 MiB, available=512 MiB, used=0 B, logicalBytes=16 KiB), ranges=32, leases=0, queries=0.00, writes=0.00, bytesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=166.00 pMax=13271.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00 pMax=0.00} I200228 22:25:36.591374 66 storage/stores.go:231 [n1] read 0 node addresses from persistent storage I200228 22:25:36.591477 66 server/node.go:644 [n1] connecting to gossip network to verify cluster ID... 
I200228 22:25:36.591506 66 server/node.go:664 [n1] node connected via gossip and verified as part of cluster "065ddae7-e51d-4e3d-ba95-b595e77630df" I200228 22:25:36.591561 66 server/node.go:511 [n1] node=1: started with [=] engine(s) and attributes [] I200228 22:25:36.591618 66 server/server.go:2172 [n1] Could not start goroutine dumper worker due to: directory to store dumps could not be determined I200228 22:25:36.591731 66 server/server.go:1610 [n1] starting https server at 127.0.0.1:43561 (use: 127.0.0.1:43561) I200228 22:25:36.591766 66 server/server.go:1615 [n1] starting postgres server at 127.0.0.1:41117 (use: 127.0.0.1:41117) I200228 22:25:36.591787 66 server/server.go:1617 [n1] starting grpc server at 127.0.0.1:37085 I200228 22:25:36.591808 66 server/server.go:1618 [n1] advertising CockroachDB node at 127.0.0.1:37085 W200228 22:25:36.646461 202 storage/store.go:1538 [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown I200228 22:25:36.666708 330 sql/event_log.go:132 [n1,intExec=optInToDiagnosticsStatReporting] Event: "set_cluster_setting", target: 0, info: {SettingName:diagnostics.reporting.enabled Value:true User:root} I200228 22:25:36.672887 257 sql/event_log.go:132 [n1,intExec=set-setting] Event: "set_cluster_setting", target: 0, info: {SettingName:version Value:19.2-14 User:root} I200228 22:25:36.689822 733 sql/event_log.go:132 [n1,intExec=initializeClusterSecret] Event: "set_cluster_setting", target: 0, info: {SettingName:cluster.secret Value:92c7c58f-8c5d-4f64-b998-186471886923 User:root} I200228 22:25:36.696688 788 sql/event_log.go:132 [n1,intExec=create-default-db] Event: "create_database", target: 50, info: {DatabaseName:defaultdb Statement:CREATE DATABASE IF NOT EXISTS defaultdb User:root} I200228 22:25:36.700642 753 sql/event_log.go:132 [n1,intExec=create-default-db] Event: "create_database", target: 51, info: {DatabaseName:postgres Statement:CREATE 
DATABASE IF NOT EXISTS postgres User:root} I200228 22:25:36.725784 66 server/server.go:1696 [n1] done ensuring all necessary migrations have run I200228 22:25:36.725843 66 server/server.go:1977 [n1] serving sql connections I200228 22:25:36.726484 66 testutils/testcluster/testcluster.go:760 WaitForFullReplication took: 3.874µs I200228 22:25:36.730006 1049 sql/event_log.go:132 [n1] Event: "node_join", target: 1, info: {Descriptor:{NodeID:1 Address:127.0.0.1:37085 Attrs: Locality:region=test,dc=dc1 ServerVersion:19.2-14 BuildTag:v20.1.0-beta.1-471-g669f68f-dirty StartedAt:1582928736586580669 LocalityAddress:[] ClusterName: SQLAddress:127.0.0.1:41117} ClusterID:065ddae7-e51d-4e3d-ba95-b595e77630df StartedAt:1582928736586580669 LastUp:1582928736586580669} I200228 22:25:36.737803 1047 server/server_update.go:55 [n1] no need to upgrade, cluster already at the newest version I200228 22:25:36.760304 1028 sql/event_log.go:132 [n1,client=127.0.0.1:45322,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.rangefeed.enabled Value:true User:root} I200228 22:25:36.761708 66 kv/kvnemesis/watcher.go:134 watcher waiting for 1582928736.761683056,0 I200228 22:25:36.762846 1152 storage/replica_rangefeed.go:608 [n1,s1,r32/1:/{Table/36-Max}] RangeFeed closed timestamp is empty I200228 22:25:36.764947 1028 sql/event_log.go:132 [n1,client=127.0.0.1:45322,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.closed_timestamp.target_duration Value:00:00:00.001 User:root} I200228 22:25:36.768663 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.767300681,0 lagging by 1.340201ms I200228 22:25:36.768920 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.767595811,0 lagging by 1.314487ms I200228 22:25:36.769346 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.767879049,0 lagging by 1.454447ms I200228 22:25:36.769641 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.768288577,0 
lagging by 1.341444ms I200228 22:25:36.770070 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.768589993,0 lagging by 1.470179ms I200228 22:25:36.770412 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.769032434,0 lagging by 1.369582ms I200228 22:25:36.770436 1148 kv/kvnemesis/watcher.go:207 watcher notifying 1582928736.761683056,0 I200228 22:25:36.770680 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.769337201,0 lagging by 1.320038ms I200228 22:25:36.770940 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.769615731,0 lagging by 1.315854ms I200228 22:25:36.771212 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.769900422,0 lagging by 1.303958ms I200228 22:25:36.771510 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.770179178,0 lagging by 1.321315ms I200228 22:25:36.771780 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.770468271,0 lagging by 1.304565ms I200228 22:25:36.772055 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.770747130,0 lagging by 1.300883ms I200228 22:25:36.772360 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.771026206,0 lagging by 1.326647ms I200228 22:25:36.772632 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.771328264,0 lagging by 1.295833ms I200228 22:25:36.772933 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.771588382,0 lagging by 1.332202ms I200228 22:25:36.773490 1028 sql/event_log.go:132 [n1,client=127.0.0.1:45322,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.closed_timestamp.target_duration Value:DEFAULT User:root} I200228 22:25:36.775754 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.774939169,0 4 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 4 OP txn.Put(ctx, /Table/50/"7b2e13b1", v-0) // nil 4 OP txn.Get(ctx, /Table/50/"34d481fb") // (nil, nil) 4 OP b := 
&Batch{} 4 OP b.Get(ctx, /Table/50/"0a98d198") // (nil, nil) 4 OP b.Get(ctx, /Table/50/"27db36ad") // (nil, nil) 4 OP txn.CommitInBatch(ctx, b) // nil 4 OP return nil 4 OP }) // nil 1582928736.774939693,0 after: 1582928736.775712344,0 I200228 22:25:36.776643 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.775779571,0 4 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 4 OP txn.Put(ctx, /Table/50/"7b2e13b1", v-1) // nil 4 OP txn.Put(ctx, /Table/50/"7b2e13b1", v-2) // nil 4 OP txn.Get(ctx, /Table/50/"7b2e13b1") // ("v-2", nil) 4 OP return errors.New("rollback") 4 OP }) // rollback after: 1582928736.776616416,0 I200228 22:25:36.776678 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.776660776,0 4 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 4 OP return nil 4 OP }) // nil 1582928736.776661202,0 after: 1582928736.776668834,0 I200228 22:25:36.777273 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.776694928,0 4 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 4 OP txn.Put(ctx, /Table/50/"45258014", v-3) // nil 4 OP txn.Get(ctx, /Table/50/"45258014") // ("v-3", nil) 4 OP return nil 4 OP }) // nil 1582928736.776695227,0 after: 1582928736.777244737,0 I200228 22:25:36.777528 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.777292691,0 4 OP db0.Put(ctx, /Table/50/"7b2e13b1", v-4) // nil after: 1582928736.777508139,0 I200228 22:25:36.777648 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.777546885,0 4 OP db0.AdminMerge(ctx, /Table/50/"9564ef5c") // storage/replica_command.go:776: merge failed: cannot merge final range after: 1582928736.777625582,0 I200228 22:25:36.777680 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.777661271,0 4 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 4 OP b := &Batch{} 4 OP txn.CommitInBatch(ctx, b) // nil 4 OP return nil 4 OP }) // nil 1582928736.777661683,0 after: 1582928736.777670176,0 I200228 22:25:36.777774 1300 kv/kvnemesis/kvnemesis.go:65 
before: 1582928736.777689466,0 4 OP db0.Get(ctx, /Table/50/"04831939") // (nil, nil) after: 1582928736.777758808,0 I200228 22:25:36.777806 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.777788114,0 4 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 4 OP b := &Batch{} 4 OP txn.CommitInBatch(ctx, b) // nil 4 OP return nil 4 OP }) // nil 1582928736.777788512,0 after: 1582928736.777796808,0 I200228 22:25:36.777902 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.777819587,0 4 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 4 OP b := &Batch{} 4 OP b.Get(ctx, /Table/50/"90025020") // (nil, nil) 4 OP txn.CommitInBatch(ctx, b) // nil 4 OP return nil 4 OP }) // nil 1582928736.777819887,0 after: 1582928736.777887041,0 I200228 22:25:36.777979 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.777915347,0 4 OP db0.Get(ctx, /Table/50/"96bf0183") // (nil, nil) after: 1582928736.777965718,0 I200228 22:25:36.778154 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.777998726,0 4 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 4 OP txn.Get(ctx, /Table/50/"45258014") // ("v-3", nil) 4 OP b := &Batch{} 4 OP b.Get(ctx, /Table/50/"5f5cb3a8") // (nil, nil) 4 OP b.Get(ctx, /Table/50/"95ccf533") // (nil, nil) 4 OP b.Get(ctx, /Table/50/"7b2e13b1") // ("v-4", nil) 4 OP txn.CommitInBatch(ctx, b) // nil 4 OP return nil 4 OP }) // nil 1582928736.777999088,0 after: 1582928736.778122892,0 I200228 22:25:36.778359 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"7b2e13b1" 1582928736.774939693,0 -> /BYTES/v-0 (prev /) I200228 22:25:36.778407 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"45258014" 1582928736.776695227,0 -> /BYTES/v-3 (prev /) I200228 22:25:36.778432 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"7b2e13b1" 1582928736.777318540,0 -> /BYTES/v-4 (prev /BYTES/v-0) I200228 22:25:36.778832 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.778173756,0 4 OP db0.Txn(ctx, func(ctx 
context.Context, txn *client.Txn) error { 4 OP txn.Put(ctx, /Table/50/"7b2e13b1", v-5) // nil 4 OP txn.Put(ctx, /Table/50/"872aa197", v-6) // nil 4 OP return errors.New("rollback") 4 OP }) // rollback after: 1582928736.778806599,0 I200228 22:25:36.778936 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.778861181,0 4 OP db0.AdminMerge(ctx, /Table/50/"2dcdefcb") // storage/replica_command.go:776: merge failed: cannot merge final range after: 1582928736.778923402,0 I200228 22:25:36.778973 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.778955717,0 4 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 4 OP b := &Batch{} 4 OP txn.CommitInBatch(ctx, b) // nil 4 OP return nil 4 OP }) // nil 1582928736.778956096,0 after: 1582928736.778963259,0 I200228 22:25:36.779430 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"7b2e13b1" 1582928736.778571270,0 -> /BYTES/v-7 (prev /BYTES/v-4) I200228 22:25:36.779485 1089 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.778507464,0 1 OP db0.Put(ctx, /Table/50/"7b2e13b1", v-7) // nil after: 1582928736.779473364,0 I200228 22:25:36.779440 1088 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.778614415,0 0 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 0 OP txn.Put(ctx, /Table/50/"9db33549", v-8) // nil 0 OP txn.Get(ctx, /Table/50/"45258014") // ("v-3", nil) 0 OP return nil 0 OP }) // nil 1582928736.778614912,0 after: 1582928736.779418598,0 I200228 22:25:36.781035 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.778998790,0 4 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 4 OP txn.Put(ctx, /Table/50/"45258014", v-9) // nil 4 OP txn.Get(ctx, /Table/50/"872aa197") // (nil, nil) 4 OP txn.Put(ctx, /Table/50/"3cb8b431", v-10) // nil 4 OP b := &Batch{} 4 OP txn.CommitInBatch(ctx, b) // nil 4 OP return nil 4 OP }) // nil 1582928736.778999096,0 after: 1582928736.781005095,0 I200228 22:25:36.781240 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"9db33549" 
1582928736.778614912,0 -> /BYTES/v-8 (prev /) I200228 22:25:36.782345 1299 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.779542124,0 3 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 3 OP txn.Put(ctx, /Table/50/"e5192bc8", v-17) // nil 3 OP txn.Put(ctx, /Table/50/"872aa197", v-18) // nil 3 OP return nil 3 OP }) // nil 1582928736.779542484,0 after: 1582928736.782318683,0 I200228 22:25:36.782441 1299 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.782370805,0 3 OP db0.AdminMerge(ctx, /Table/50/"69e0c8cf") // storage/replica_command.go:776: merge failed: cannot merge final range after: 1582928736.782428397,0 I200228 22:25:36.782614 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"3cb8b431" 1582928736.778999096,0 -> /BYTES/v-10 (prev /) I200228 22:25:36.782656 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"45258014" 1582928736.778999096,0 -> /BYTES/v-9 (prev /BYTES/v-3) I200228 22:25:36.783234 1089 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.779520628,0 1 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 1 OP txn.Put(ctx, /Table/50/"c9ee039a", v-12) // nil 1 OP txn.Put(ctx, /Table/50/"131be9e4", v-13) // nil 1 OP txn.Put(ctx, /Table/50/"131be9e4", v-14) // nil 1 OP b := &Batch{} 1 OP b.Put(ctx, /Table/50/"28645076", v-15) // nil 1 OP b.Get(ctx, /Table/50/"872aa197") // (nil, nil) 1 OP b.Put(ctx, /Table/50/"3cb8b431", v-16) // nil 1 OP txn.CommitInBatch(ctx, b) // nil 1 OP return nil 1 OP }) // nil 1582928736.779521145,0 after: 1582928736.783191745,0 I200228 22:25:36.783935 1089 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.783260581,0 1 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 1 OP txn.Get(ctx, /Table/50/"2a87e3a8") // (nil, nil) 1 OP txn.Put(ctx, /Table/50/"3cb8b431", v-21) // nil 1 OP b := &Batch{} 1 OP b.Get(ctx, /Table/50/"9db33549") // ("v-8", nil) 1 OP b.Get(ctx, /Table/50/"57d1a821") // (nil, nil) 1 OP txn.CommitInBatch(ctx, b) // nil 1 OP return nil 1 OP }) // nil 
1582928736.783261063,0 after: 1582928736.783899491,0 I200228 22:25:36.784030 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"131be9e4" 1582928736.779521145,0 -> /BYTES/v-14 (prev /) I200228 22:25:36.784065 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"28645076" 1582928736.779521145,0 -> /BYTES/v-15 (prev /) I200228 22:25:36.784086 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"3cb8b431" 1582928736.779521145,0 -> /BYTES/v-16 (prev /BYTES/v-10) I200228 22:25:36.784170 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"c9ee039a" 1582928736.779521145,0 -> /BYTES/v-12 (prev /) I200228 22:25:36.784194 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"3cb8b431" 1582928736.783261063,0 -> /BYTES/v-21 (prev /BYTES/v-16) I200228 22:25:36.784503 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"872aa197" 1582928736.779542484,0 -> /BYTES/v-18 (prev /) I200228 22:25:36.784554 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"e5192bc8" 1582928736.779542484,0 -> /BYTES/v-17 (prev /) I200228 22:25:36.790137 1088 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.779870860,0 0 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 0 OP txn.Get(ctx, /Table/50/"e2374831") // (nil, nil) 0 OP txn.Get(ctx, /Table/50/"e5192bc8") // ("v-17", nil) 0 OP return errors.New("rollback") 0 OP }) // rollback after: 1582928736.790110591,0 I200228 22:25:36.790874 1299 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.782487298,0 3 OP db0.Put(ctx, /Table/50/"e5192bc8", v-20) // nil after: 1582928736.790852675,0 I200228 22:25:36.791101 1298 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.779218875,0 2 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 2 OP b := &Batch{} 2 OP b.Put(ctx, /Table/50/"3cb8b431", v-11) // nil 2 OP b.Get(ctx, /Table/50/"872aa197") // ("v-18", nil) 2 OP b.Get(ctx, /Table/50/"81e0cbc0") // (nil, nil) 2 OP txn.CommitInBatch(ctx, b) // nil 2 OP return nil 2 OP }) // nil 1582928736.783261063,1 
after: 1582928736.791068621,0 I200228 22:25:36.791212 1298 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.791129624,0 2 OP db0.Get(ctx, /Table/50/"603ced73") // (nil, nil) after: 1582928736.791198076,0 I200228 22:25:36.791263 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"3cb8b431" 1582928736.783261063,1 -> /BYTES/v-11 (prev /BYTES/v-21) I200228 22:25:36.791313 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"e5192bc8" 1582928736.782504859,0 -> /BYTES/v-20 (prev /BYTES/v-17) I200228 22:25:36.791390 1089 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.783952442,0 1 OP db0.Get(ctx, /Table/50/"e5192bc8") // ("v-20", nil) after: 1582928736.791368983,0 I200228 22:25:36.791422 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"a57481b1" 1582928736.790170558,0 -> /BYTES/v-22 (prev /) I200228 22:25:36.791485 1300 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.781062882,0 4 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 4 OP txn.Get(ctx, /Table/50/"25d76e13") // (nil, nil) 4 OP txn.Put(ctx, /Table/50/"872aa197", v-19) // nil 4 OP return nil 4 OP }) // nil 1582928736.781063302,0 after: 1582928736.791449751,0 I200228 22:25:36.791771 1299 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.790909909,0 3 OP db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 3 OP txn.Get(ctx, /Table/50/"a57481b1") // ("v-22", nil) 3 OP txn.Put(ctx, /Table/50/"c61c1493", v-23) // nil 3 OP b := &Batch{} 3 OP b.Put(ctx, /Table/50/"4f40f9fe", v-24) // nil 3 OP b.Put(ctx, /Table/50/"45258014", v-25) // nil 3 OP b.Put(ctx, /Table/50/"9db33549", v-26) // nil 3 OP txn.CommitInBatch(ctx, b) // nil 3 OP return nil 3 OP }) // nil 1582928736.790910461,0 after: 1582928736.791725299,0 I200228 22:25:36.791939 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"872aa197" 1582928736.781063302,0 -> /BYTES/v-19 (prev /BYTES/v-18) I200228 22:25:36.792077 1088 kv/kvnemesis/kvnemesis.go:65 before: 1582928736.790170038,0 0 OP db0.Txn(ctx, func(ctx 
context.Context, txn *client.Txn) error { 0 OP txn.Get(ctx, /Table/50/"f890cec8") // (nil, nil) 0 OP b := &Batch{} 0 OP b.Get(ctx, /Table/50/"07781c98") // (nil, nil) 0 OP b.Put(ctx, /Table/50/"a57481b1", v-22) // nil 0 OP txn.CommitInBatch(ctx, b) // nil 0 OP return nil 0 OP }) // nil 1582928736.790170558,0 after: 1582928736.790878479,0 I200228 22:25:36.792101 66 kv/kvnemesis/watcher.go:134 watcher waiting for 1582928736.791725299,0 I200228 22:25:36.792372 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"45258014" 1582928736.790910461,0 -> /BYTES/v-25 (prev /BYTES/v-9) I200228 22:25:36.792530 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"4f40f9fe" 1582928736.790910461,0 -> /BYTES/v-24 (prev /) I200228 22:25:36.792558 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"9db33549" 1582928736.790910461,0 -> /BYTES/v-26 (prev /BYTES/v-8) I200228 22:25:36.792673 1148 kv/kvnemesis/watcher.go:165 rangefeed Put /Table/50/"c61c1493" 1582928736.790910461,0 -> /BYTES/v-23 (prev /) I200228 22:25:36.795127 1028 sql/event_log.go:132 [n1,client=127.0.0.1:45322,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.closed_timestamp.target_duration Value:00:00:00.001 User:root} I200228 22:25:36.796661 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.771880485,0 lagging by 24.769339ms I200228 22:25:36.796943 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.795582419,0 lagging by 1.342785ms I200228 22:25:36.796981 1148 kv/kvnemesis/watcher.go:207 watcher notifying 1582928736.791725299,0 I200228 22:25:36.797201 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.795860079,0 lagging by 1.330671ms I200228 22:25:36.797461 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.796147657,0 lagging by 1.303048ms I200228 22:25:36.797713 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.796401138,0 lagging by 1.301896ms I200228 22:25:36.798029 1148 
kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.796659853,0 lagging by 1.360986ms I200228 22:25:36.798300 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.796992948,0 lagging by 1.293998ms I200228 22:25:36.799668 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.797256937,0 lagging by 2.398627ms I200228 22:25:36.799905 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.798585909,0 lagging by 1.307073ms I200228 22:25:36.800065 1028 sql/event_log.go:132 [n1,client=127.0.0.1:45322,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.closed_timestamp.target_duration Value:DEFAULT User:root} I200228 22:25:36.800640 1148 kv/kvnemesis/watcher.go:201 watcher reached frontier 1582928736.798836304,0 lagging by 1.789708ms I200228 22:25:36.801210 66 kv/kvnemesis/kvnemesis.go:89 reproduction steps: g := ctxgroup.WithContext(ctx) g.GoCtx(func(ctx context.Context) error { db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { txn.Put(ctx, /Table/50/"9db33549", v-8) // nil txn.Get(ctx, /Table/50/"45258014") // ("v-3", nil) return nil }) // nil 1582928736.778614912,0 db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { txn.Get(ctx, /Table/50/"e2374831") // (nil, nil) txn.Get(ctx, /Table/50/"e5192bc8") // ("v-17", nil) return errors.New("rollback") }) // rollback db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { txn.Get(ctx, /Table/50/"f890cec8") // (nil, nil) b := &Batch{} b.Get(ctx, /Table/50/"07781c98") // (nil, nil) b.Put(ctx, /Table/50/"a57481b1", v-22) // nil txn.CommitInBatch(ctx, b) // nil return nil }) // nil 1582928736.790170558,0 return nil }) g.GoCtx(func(ctx context.Context) error { db0.Put(ctx, /Table/50/"7b2e13b1", v-7) // nil db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { txn.Put(ctx, /Table/50/"c9ee039a", v-12) // nil txn.Put(ctx, /Table/50/"131be9e4", v-13) // nil txn.Put(ctx, /Table/50/"131be9e4", v-14) // nil b := 
&Batch{} b.Put(ctx, /Table/50/"28645076", v-15) // nil b.Get(ctx, /Table/50/"872aa197") // (nil, nil) b.Put(ctx, /Table/50/"3cb8b431", v-16) // nil txn.CommitInBatch(ctx, b) // nil return nil }) // nil 1582928736.779521145,0 db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { txn.Get(ctx, /Table/50/"2a87e3a8") // (nil, nil) txn.Put(ctx, /Table/50/"3cb8b431", v-21) // nil b := &Batch{} b.Get(ctx, /Table/50/"9db33549") // ("v-8", nil) b.Get(ctx, /Table/50/"57d1a821") // (nil, nil) txn.CommitInBatch(ctx, b) // nil return nil }) // nil 1582928736.783261063,0 db0.Get(ctx, /Table/50/"e5192bc8") // ("v-20", nil) return nil }) g.GoCtx(func(ctx context.Context) error { db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { b := &Batch{} b.Put(ctx, /Table/50/"3cb8b431", v-11) // nil b.Get(ctx, /Table/50/"872aa197") // ("v-18", nil) b.Get(ctx, /Table/50/"81e0cbc0") // (nil, nil) txn.CommitInBatch(ctx, b) // nil return nil }) // nil 1582928736.783261063,1 db0.Get(ctx, /Table/50/"603ced73") // (nil, nil) return nil }) g.GoCtx(func(ctx context.Context) error { db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { txn.Put(ctx, /Table/50/"e5192bc8", v-17) // nil txn.Put(ctx, /Table/50/"872aa197", v-18) // nil return nil }) // nil 1582928736.779542484,0 db0.AdminMerge(ctx, /Table/50/"69e0c8cf") // storage/replica_command.go:776: merge failed: cannot merge final range db0.Put(ctx, /Table/50/"e5192bc8", v-20) // nil db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { txn.Get(ctx, /Table/50/"a57481b1") // ("v-22", nil) txn.Put(ctx, /Table/50/"c61c1493", v-23) // nil b := &Batch{} b.Put(ctx, /Table/50/"4f40f9fe", v-24) // nil b.Put(ctx, /Table/50/"45258014", v-25) // nil b.Put(ctx, /Table/50/"9db33549", v-26) // nil txn.CommitInBatch(ctx, b) // nil return nil }) // nil 1582928736.790910461,0 return nil }) g.GoCtx(func(ctx context.Context) error { db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { txn.Put(ctx, /Table/50/"7b2e13b1", 
v-0) // nil txn.Get(ctx, /Table/50/"34d481fb") // (nil, nil) b := &Batch{} b.Get(ctx, /Table/50/"0a98d198") // (nil, nil) b.Get(ctx, /Table/50/"27db36ad") // (nil, nil) txn.CommitInBatch(ctx, b) // nil return nil }) // nil 1582928736.774939693,0 db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { txn.Put(ctx, /Table/50/"7b2e13b1", v-1) // nil txn.Put(ctx, /Table/50/"7b2e13b1", v-2) // nil txn.Get(ctx, /Table/50/"7b2e13b1") // ("v-2", nil) return errors.New("rollback") }) // rollback db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { return nil }) // nil 1582928736.776661202,0 db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { txn.Put(ctx, /Table/50/"45258014", v-3) // nil txn.Get(ctx, /Table/50/"45258014") // ("v-3", nil) return nil }) // nil 1582928736.776695227,0 db0.Put(ctx, /Table/50/"7b2e13b1", v-4) // nil db0.AdminMerge(ctx, /Table/50/"9564ef5c") // storage/replica_command.go:776: merge failed: cannot merge final range db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { b := &Batch{} txn.CommitInBatch(ctx, b) // nil return nil }) // nil 1582928736.777661683,0 db0.Get(ctx, /Table/50/"04831939") // (nil, nil) db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { b := &Batch{} txn.CommitInBatch(ctx, b) // nil return nil }) // nil 1582928736.777788512,0 db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { b := &Batch{} b.Get(ctx, /Table/50/"90025020") // (nil, nil) txn.CommitInBatch(ctx, b) // nil return nil }) // nil 1582928736.777819887,0 db0.Get(ctx, /Table/50/"96bf0183") // (nil, nil) db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { txn.Get(ctx, /Table/50/"45258014") // ("v-3", nil) b := &Batch{} b.Get(ctx, /Table/50/"5f5cb3a8") // (nil, nil) b.Get(ctx, /Table/50/"95ccf533") // (nil, nil) b.Get(ctx, /Table/50/"7b2e13b1") // ("v-4", nil) txn.CommitInBatch(ctx, b) // nil return nil }) // nil 1582928736.777999088,0 db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error { 
  txn.Put(ctx, /Table/50/"7b2e13b1", v-5) // nil
  txn.Put(ctx, /Table/50/"872aa197", v-6) // nil
  return errors.New("rollback")
}) // rollback
db0.AdminMerge(ctx, /Table/50/"2dcdefcb") // storage/replica_command.go:776: merge failed: cannot merge final range
db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  b := &Batch{}
  txn.CommitInBatch(ctx, b) // nil
  return nil
}) // nil 1582928736.778956096,0
db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  txn.Put(ctx, /Table/50/"45258014", v-9) // nil
  txn.Get(ctx, /Table/50/"872aa197") // (nil, nil)
  txn.Put(ctx, /Table/50/"3cb8b431", v-10) // nil
  b := &Batch{}
  txn.CommitInBatch(ctx, b) // nil
  return nil
}) // nil 1582928736.778999096,0
db0.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
  txn.Get(ctx, /Table/50/"25d76e13") // (nil, nil)
  txn.Put(ctx, /Table/50/"872aa197", v-19) // nil
  return nil
}) // nil 1582928736.781063302,0
return nil
})
g.Wait()
I200228 22:25:36.801385 66 kv/kvnemesis/kvnemesis.go:90 kvs (recorded from rangefeed):
/Table/50/"131be9e4" 1582928736.779521145,0 -> /BYTES/v-14
/Table/50/"28645076" 1582928736.779521145,0 -> /BYTES/v-15
/Table/50/"3cb8b431" 1582928736.783261063,1 -> /BYTES/v-11
/Table/50/"3cb8b431" 1582928736.783261063,0 -> /BYTES/v-21
/Table/50/"3cb8b431" 1582928736.779521145,0 -> /BYTES/v-16
/Table/50/"3cb8b431" 1582928736.778999096,0 -> /BYTES/v-10
/Table/50/"45258014" 1582928736.790910461,0 -> /BYTES/v-25
/Table/50/"45258014" 1582928736.778999096,0 -> /BYTES/v-9
/Table/50/"45258014" 1582928736.776695227,0 -> /BYTES/v-3
/Table/50/"4f40f9fe" 1582928736.790910461,0 -> /BYTES/v-24
/Table/50/"7b2e13b1" 1582928736.778571270,0 -> /BYTES/v-7
/Table/50/"7b2e13b1" 1582928736.777318540,0 -> /BYTES/v-4
/Table/50/"7b2e13b1" 1582928736.774939693,0 -> /BYTES/v-0
/Table/50/"872aa197" 1582928736.781063302,0 -> /BYTES/v-19
/Table/50/"872aa197" 1582928736.779542484,0 -> /BYTES/v-18
/Table/50/"9db33549" 1582928736.790910461,0 -> /BYTES/v-26
/Table/50/"9db33549" 1582928736.778614912,0 -> /BYTES/v-8
/Table/50/"a57481b1" 1582928736.790170558,0 -> /BYTES/v-22
/Table/50/"c61c1493" 1582928736.790910461,0 -> /BYTES/v-23
/Table/50/"c9ee039a" 1582928736.779521145,0 -> /BYTES/v-12
/Table/50/"e5192bc8" 1582928736.782504859,0 -> /BYTES/v-20
/Table/50/"e5192bc8" 1582928736.779542484,0 -> /BYTES/v-17
I200228 22:25:36.801598 66 kv/kvnemesis/kvnemesis.go:101 kvs (scan of latest values according to crdb):
/Table/50/"131be9e4" 1582928736.779521145,0 -> /BYTES/v-14
/Table/50/"28645076" 1582928736.779521145,0 -> /BYTES/v-15
/Table/50/"3cb8b431" 1582928736.783261063,1 -> /BYTES/v-11
/Table/50/"45258014" 1582928736.790910461,0 -> /BYTES/v-25
/Table/50/"4f40f9fe" 1582928736.790910461,0 -> /BYTES/v-24
/Table/50/"7b2e13b1" 1582928736.778571270,0 -> /BYTES/v-7
/Table/50/"872aa197" 1582928736.781063302,0 -> /BYTES/v-19
/Table/50/"9db33549" 1582928736.790910461,0 -> /BYTES/v-26
/Table/50/"a57481b1" 1582928736.790170558,0 -> /BYTES/v-22
/Table/50/"c61c1493" 1582928736.790910461,0 -> /BYTES/v-23
/Table/50/"c9ee039a" 1582928736.779521145,0 -> /BYTES/v-12
/Table/50/"e5192bc8" 1582928736.782504859,0 -> /BYTES/v-20
I200228 22:25:36.801770 66 util/stop/stopper.go:539 quiescing
I200228 22:25:36.801798 1451 util/stop/stopper.go:539 quiescing
W200228 22:25:36.802041 500 storage/intentresolver/intent_resolver.go:833 failed to gc transaction record: could not GC completed transaction anchored at /Table/SystemConfigSpan/Start: node unavailable; try another peer
--- FAIL: TestKVNemesisSingleNode (0.30s)
    kvnemesis_test.go:48: failure: error with attached stack trace:
        github.com/cockroachdb/cockroach/pkg/kv/kvnemesis.(*validator).checkCommittedTxn
            /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/validator.go:374
        github.com/cockroachdb/cockroach/pkg/kv/kvnemesis.(*validator).checkAtomic
            /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/validator.go:293
        github.com/cockroachdb/cockroach/pkg/kv/kvnemesis.(*validator).processOp
            /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/validator.go:279
        github.com/cockroachdb/cockroach/pkg/kv/kvnemesis.Validate
            /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/validator.go:72
        github.com/cockroachdb/cockroach/pkg/kv/kvnemesis.RunNemesis
            /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/kvnemesis.go:86
        github.com/cockroachdb/cockroach/pkg/kv/kvnemesis.TestKVNemesisSingleNode
            /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/kvnemesis_test.go:44
        testing.tRunner
            /usr/local/go/src/testing/testing.go:909
        runtime.goexit
            /usr/local/go/src/runtime/asm_amd64.s:1357
        - error with embedded safe details: %s: %s
          -- arg 1:
          -- arg 2:
        - committed txn non-atomic timestamps: [w]/Table/50/"3cb8b431":1582928736.783261063,1->v-11 [r]/Table/50/"872aa197":[1582928736.779542484,0,1582928736.781063302,0)->v-18 [r]/Table/50/"81e0cbc0":[0,0,)->
FAIL
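The last line of the failure encodes the violation: the committed txn wrote /Table/50/"3cb8b431" at 1582928736.783261063,1, but the v-18 it read for /Table/50/"872aa197" was only the latest value during [1582928736.779542484,0, 1582928736.781063302,0), a window that closes before the write timestamp, so no single commit timestamp can explain both. A minimal sketch of that containment check, using plain integer timestamps and type names of my own choosing (not the actual kvnemesis validator types):

```go
package main

import "fmt"

// tsWindow is the half-open interval [start, end) during which an observed
// value was the latest value written to its key. Plain int64s stand in for
// CockroachDB's (wall time, logical) HLC timestamps.
type tsWindow struct{ start, end int64 }

// contains reports whether ts falls inside the half-open window.
// end == 0 means the window is still open (the value is still the latest).
func (w tsWindow) contains(ts int64) bool {
	return ts >= w.start && (w.end == 0 || ts < w.end)
}

func main() {
	// Nanosecond fractions from the failure above (the common prefix
	// 1582928736 dropped): the txn wrote at .783261063, but v-18 was only
	// the latest value of "872aa197" during [.779542484, .781063302).
	commitTS := int64(783261063)
	readWindow := tsWindow{start: 779542484, end: 781063302}
	fmt.Println(readWindow.contains(commitTS)) // prints false: read and write can't be atomic
}
```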
Previous discussion in #45542 (comment). This certainly looks real. Dan mentioned that this didn't occur when we ran without CommitInBatch operations, so I'm hopeful that, like #44315, this is only possible with reads in the same batch as an EndTxn, which we don't issue in practice today. It certainly wouldn't surprise me if that were the case. I'm looking forward to digging in deeper, though I won't be able to get to this until at least next week.
@OwenQian do you mind posting an update about your progress here when you get a chance? Are you able to reproduce this?
I was able to reproduce the bug two weeks ago when Dan's branch was at

I looked through the PR for #45586, and it seems like the configs were changed to disable CommitInBatch by setting the probability of GetExisting and GetMissing to 0 while this issue exists, so while trying to reproduce the bug I commented out those lines in

I've been running

on

When the above does run into a failure, it seems to always be this error instead:
Thanks for the update. The error you're seeing is fixed by #46625. Is the issue that you're hitting this bug too quickly to be able to run for long enough, or do you suspect that it was actually fixed? If you have reason to believe that it was already fixed then it would be good to determine what fixed it.
I think it's most likely that the concurrency manager bug is preempting this bug; I don't currently have any evidence that suggests that the bug has fixed itself. I'll apply #46625 and see if the bug becomes reproducible.
After upgrading to Catalina, I've been running into some issues trying to get

Running it with #45542 checked out, along with cherry-picking #46625 (to fix the non-conflicting lock concurrency_manager error) and #46673 (to fix an Xcode versioning issue), runs into a CMake error that seems identical to the CI failures being discussed in the dev-inf slack channel, but I'm not sure if they're related.
However, between #45542 and master, something seems to have fixed this CMake error, as roachprod-stress does seem to work on the master branch as of commit
This does not seem to be reproducing easily on the master branch. After 13731 runs (8m30s), a run of the test timed out. I've seen this happen several times: a run of the stress test times out before the inconsistent timestamp bug appears. I've bumped the timeout up to 90s from 60s to see if that makes a difference. Does it seem like I'm doing anything wrong in the reproduction steps? If not, it's seeming more plausible that this bug has been silently resolved. @danhhz when you first discovered this bug, was it reproducing reliably?
It seems like even with the timeout bumped to 90s, it's still timing out rather than reproducing the bug, with 20083 runs completed over 12m27s.
I suspect that the timeout you're seeing is due to #46639. I was seeing the same thing last week. Until we fix that more generally, you'll probably either want to disable Range merges (if that doesn't prevent the bug from reproducing) or apply the following patch:
How frequently were you seeing this reproduce before?
Previously it was almost always reproducing in under 10m. Adding the timing jitter fixed the Range merge timeout; now it seems very likely that some change in the master branch has silently fixed this bug. I ran it on master for 110097 runs in 1h8m25s without any failures. So now I'm trying to track down the change that fixed it.
I'd probably sanity check by finding the SHA that introduced it; what you're seeing could very well be recent changes masking the fault, not fixing it (though it could be that too, if you're able to piece together how it got fixed from this information). Are you using roachprod-stress workers by any chance? Definitely check that out if you haven't already; it could reduce your search time by a lot.
I fixed this with the solution mentioned in this issue (in the slack channel): https://github.com/cockroachdb/dev-inf/issues/66, except I did the version change in the reverse direction.
Not really sure what's going on there, but I'm able to run it locally on
As a sanity check, this bug is still reproducing on ce6b98a438.
Nice, that's reassuring. Any luck with the bisect strategy we discussed?
The process seems to be going smoothly enough, but I haven't narrowed it down to the root cause commit yet.
This bisect process now seems to be going less than smoothly. I've run into a multitude of extraneous issues that are getting in the way of determining whether this bug exists at a given SHA. There seems to be another test timeout problem that's distinct from the Range merges; it occurs even with the jitter.
Do you have a sense for how long this specific failure mode takes to reproduce on SHAs that you are certain do contain the bug? How does that compare to the other issues you see cropping up? If we're not able to bisect this then we'll have to go back to the basics and debug on the SHA that we can reliably hit the issue on. That would likely require adding logging throughout the transaction evaluation path and trying to determine how the transaction ended up performing a write at a different timestamp than its read. Here's an example of the kind of logging that might be helpful for this kind of investigation: e43a9eb.
On SHAs that do contain the bug, it usually reproduces under 10m. The other issues I've been running into seem to appear much more reliably, usually in 2 to 3m. Would it be possible to stitch together ~10m of runtime across multiple runs that terminate with other failures and assume this failure doesn't reproduce if it doesn't appear?
I would recommend bumping up the test timeout to 20 minutes while doing the bisect. If you can stitch together 20 minutes of runtime across multiple runs then it would be possible to make such an assumption.
Bisect update: the bisect finished and produced this result:

The property has changed between dcf8c44 and 22544bf.

The original commit that introduced this bug was ce6b98a, call this C1. This commit was a WIP that was later cleaned up and merged into master as dce269d, call this C2. The first old/good SHA was the parent of C1 (call this P1), and the new/bad SHA was the parent of dce269d. The thing that is pretty suspicious is that along the entire bisect process of 15 steps, this bug didn't reproduce a single time, i.e., they were all on the new side of the bisect. The only SHA I've seen the bug reproduce on is C1 itself; it does not reproduce on C2. Both of these commits were addressing #45542. Given that the bug doesn't seem to reproduce on any of the bisect steps between the parents of these two commits, the possibilities that come to mind for me are:
I'll try to test out hypothesis 1 by applying C2 directly onto P1 and seeing if the bug reproduces there. Another thing that came to mind: if hypothesis 2 is true, then I think I may have made a mistake in my bisecting process. I was running the bisect without applying C1 or C2 onto the SHA selected by the bisect, so if the changes in C1/C2 are necessary for this failure to surface, it would never surface the way that I've done it.
Nice job getting to the end of the bisect.
This is what I had been thinking we'd find, but I'm not sure the bisect confirmed or denied this. You're right that it is very suspicious that we never saw the failure during the 15 step bisect. Another alternative is that C1 itself was buggy. We could determine this by cherry-picking C2 on to P1 and trying to repro, essentially what you said here:
If we can't reproduce on P1+C2, then I think we can conclude that this was caused by an issue in C1 itself and close this issue.
Closing it out based on my reading of the above (may be wrong).
#45542 adds support to kvnemesis for verifying that a txn's reads are consistent with its writes. It seems to be seeing a read that was only valid for a window entirely before the timestamp it ended up committing at. I haven't fully confirmed this, but the issue seems to go away if I disable the use of CommitInBatch in kvnemesis.
Found by kvnemesis.