
ccl/changefeedccl: TestAvroEncoder failed under stress #34819

Closed
cockroach-teamcity opened this issue Feb 12, 2019 · 14 comments · Fixed by #35682
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

Comments

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/965525a5deb3c4fab5ab232e069846ecc08f632d

Parameters:

TAGS=
GOFLAGS=-race

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestAvroEncoder PKG=github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1133737&tab=buildLog

I190212 07:34:23.816867 39761 sql/event_log.go:135  [n1,intExec=create-default-db] Event: "create_database", target: 51, info: {DatabaseName:postgres Statement:CREATE DATABASE IF NOT EXISTS postgres User:root}
I190212 07:34:23.843475 39528 server/server.go:1645  [n1] done ensuring all necessary migrations have run
I190212 07:34:23.843592 39528 server/server.go:1648  [n1] serving sql connections
I190212 07:34:23.870680 39789 sql/event_log.go:135  [n1] Event: "node_join", target: 1, info: {Descriptor:{NodeID:1 Address:127.0.0.1:44729 Attrs: Locality: ServerVersion:2.1-6 BuildTag:v2.2.0-alpha.20190211-59-g965525a StartedAt:1549956863221167191 LocalityAddress:[] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} ClusterID:0767d590-e3a2-4b92-839a-d55b5e214481 StartedAt:1549956863221167191 LastUp:1549956863221167191}
I190212 07:34:23.873853 39786 server/server_update.go:67  [n1] no need to upgrade, cluster already at the newest version
I190212 07:34:23.916926 39497 sql/event_log.go:135  [n1,client=127.0.0.1:52492,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.rangefeed.enabled Value:true User:root}
I190212 07:34:23.945303 39497 sql/event_log.go:135  [n1,client=127.0.0.1:52492,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.closed_timestamp.target_duration Value:00:00:01 User:root}
I190212 07:34:23.982669 39497 sql/event_log.go:135  [n1,client=127.0.0.1:52492,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:changefeed.experimental_poll_interval Value:00:00:00.01 User:root}
I190212 07:34:24.009996 39497 sql/event_log.go:135  [n1,client=127.0.0.1:52492,user=root] Event: "create_database", target: 52, info: {DatabaseName:d Statement:CREATE DATABASE d User:root}
I190212 07:34:24.045299 39497 sql/event_log.go:135  [n1,client=127.0.0.1:52492,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:changefeed.push.enabled Value:false User:root}
I190212 07:34:24.077778 39497 sql/event_log.go:135  [n1,client=127.0.0.1:52492,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:changefeed.experimental_poll_interval Value:00:00:00.01 User:root}
I190212 07:34:24.105689 39497 sql/event_log.go:135  [n1,client=127.0.0.1:52492,user=root] Event: "create_table", target: 53, info: {TableName:d.public.foo Statement:CREATE TABLE foo (a INT8 PRIMARY KEY, b STRING) User:root}
I190212 07:34:24.120701 39912 storage/replica_command.go:245  [n1,split,s1,r20/1:/{Table/24-Max}] initiating a split of this range at key /Table/53 [r21] (zone config)
I190212 07:34:24.474823 39528 util/stop/stopper.go:536  quiescing; tasks left:
1      [async] transport racer
1      [async] storage.IntentResolver: cleanup txn records
1      [async] intent_resolver_batcher
1      [async] closedts-rangefeed-subscriber
1      [async] changefeed-poller
1      [async] async-rollback
I190212 07:34:24.485080 39528 util/stop/stopper.go:536  quiescing; tasks left:
1      [async] transport racer
1      [async] storage.IntentResolver: cleanup txn records
1      [async] closedts-rangefeed-subscriber
W190212 07:34:24.485542 39896 storage/intentresolver/intent_resolver.go:721  failed to gc transaction record: could not GC completed transaction anchored at /Local/Range/Table/24/RangeDescriptor: node unavailable; try another peer
I190212 07:34:24.485810 39528 util/stop/stopper.go:536  quiescing; tasks left:
1      [async] transport racer
1      [async] closedts-rangefeed-subscriber
I190212 07:34:24.486039 39528 util/stop/stopper.go:536  quiescing; tasks left:
1      [async] transport racer
I190212 07:34:24.570205 39659 kv/transport_race.go:113  transport race promotion: ran 12 iterations on up to 604 requests
    --- FAIL: TestAvroEncoder/poller (1.49s)
        <autogenerated>:1: expected
              foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},"updated":{"string":"1549956864460181344.0000000000"}}
            got
              foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},"updated":{"string":"1549956864451566598.0000000001"}}

cockroach-teamcity added this to the 2.2 milestone Feb 12, 2019
cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Feb 12, 2019
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/10f8010fa5778e740c057905e2d7664b5fd5d647

Parameters:

TAGS=
GOFLAGS=-race

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestAvroEncoder PKG=github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1135873&tab=buildLog

I190213 08:03:44.087778 39787 server/server.go:1648  [n1] serving sql connections
I190213 08:03:44.134447 40065 server/server_update.go:67  [n1] no need to upgrade, cluster already at the newest version
I190213 08:03:44.136415 40164 sql/event_log.go:135  [n1] Event: "node_join", target: 1, info: {Descriptor:{NodeID:1 Address:127.0.0.1:38655 Attrs: Locality: ServerVersion:2.1-6 BuildTag:v2.2.0-alpha.20190211-100-g10f8010 StartedAt:1550045023446407575 LocalityAddress:[] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} ClusterID:5fc734b7-1dcb-4b9e-9f10-20aa59bb23b2 StartedAt:1550045023446407575 LastUp:1550045023446407575}
I190213 08:03:44.169978 40137 sql/event_log.go:135  [n1,client=127.0.0.1:59328,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.rangefeed.enabled Value:true User:root}
I190213 08:03:44.215123 40137 sql/event_log.go:135  [n1,client=127.0.0.1:59328,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.closed_timestamp.target_duration Value:00:00:01 User:root}
I190213 08:03:44.251681 40137 sql/event_log.go:135  [n1,client=127.0.0.1:59328,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:changefeed.experimental_poll_interval Value:00:00:00.01 User:root}
I190213 08:03:44.309152 40137 sql/event_log.go:135  [n1,client=127.0.0.1:59328,user=root] Event: "create_database", target: 52, info: {DatabaseName:d Statement:CREATE DATABASE d User:root}
I190213 08:03:44.353301 40137 sql/event_log.go:135  [n1,client=127.0.0.1:59328,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:changefeed.push.enabled Value:false User:root}
I190213 08:03:44.392545 40137 sql/event_log.go:135  [n1,client=127.0.0.1:59328,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:changefeed.experimental_poll_interval Value:00:00:00.01 User:root}
I190213 08:03:44.437119 40137 sql/event_log.go:135  [n1,client=127.0.0.1:59328,user=root] Event: "create_table", target: 53, info: {TableName:d.public.foo Statement:CREATE TABLE foo (a INT8 PRIMARY KEY, b STRING) User:root}
I190213 08:03:44.459158 40242 storage/replica_command.go:245  [n1,split,s1,r20/1:/{Table/24-Max}] initiating a split of this range at key /Table/53 [r21] (zone config)
I190213 08:03:44.848236 39787 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] transport racer
1      [async] storage.IntentResolver: cleanup txn records
1      [async] intent_resolver_batcher
1      [async] closedts-rangefeed-subscriber
1      [async] changefeed-poller
W190213 08:03:44.848313 40171 storage/intentresolver/intent_resolver.go:721  failed to gc transaction record: could not GC completed transaction anchored at /Local/Range/Table/24/RangeDescriptor: node unavailable; try another peer
I190213 08:03:44.848589 39787 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] transport racer
1      [async] storage.IntentResolver: cleanup txn records
1      [async] intent_resolver_batcher
1      [async] closedts-rangefeed-subscriber
I190213 08:03:44.848796 39787 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] transport racer
1      [async] intent_resolver_batcher
1      [async] closedts-rangefeed-subscriber
I190213 08:03:44.849114 39787 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] transport racer
1      [async] closedts-rangefeed-subscriber
I190213 08:03:44.849332 39787 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] transport racer
W190213 08:03:44.849458 40213 internal/client/txn.go:520  [n1,client=127.0.0.1:59336,user=root] failure aborting transaction: node unavailable; try another peer; abort caused by: context canceled
I190213 08:03:44.924444 40082 kv/transport_race.go:113  transport race promotion: ran 10 iterations on up to 623 requests
    --- FAIL: TestAvroEncoder/poller (1.65s)
        <autogenerated>:1: expected
              foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},"updated":{"string":"1550045024829413917.0000000000"}}
            got
              foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},"updated":{"string":"1550045024815525369.0000000001"}}

@danhhz

danhhz commented Feb 13, 2019

Huh, guess this is real. I wonder what changed

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/22119c85d2e7aca50f21f61408ae55207679569d

Parameters:

TAGS=
GOFLAGS=-race

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestAvroEncoder PKG=github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1140564&tab=buildLog

  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask()
      /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:320 +0x14d
  github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resume()
      /go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:534 +0x307
  github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).StartJob()
      /go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:224 +0x2f7
  github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.changefeedPlanHook.func2()
      /go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_stmt.go:248 +0xf33
  github.com/cockroachdb/cockroach/pkg/sql.(*hookFnNode).startExec.func1()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/planhook.go:136 +0x108
==================
I190215 08:22:21.865895 40350 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470739165216769 new resolved timestamp 1550218941.729762618,0 is behind by 136.037081ms
I190215 08:22:21.916404 40350 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470739165216769 new resolved timestamp 1550218941.800727678,0 is behind by 115.590636ms
I190215 08:22:21.947317 40445 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470740993507329 new resolved timestamp 1550218941.405653573,0 is behind by 541.530005ms
I190215 08:22:21.947618 40445 ccl/changefeedccl/changefeed_processors.go:562  [n1] job 426470740993507329 span /Table/53/{1-2} is behind by 541.530005ms
I190215 08:22:21.991576 40350 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470739165216769 new resolved timestamp 1550218941.869488947,0 is behind by 121.938943ms
I190215 08:22:22.048809 40445 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470740993507329 new resolved timestamp 1550218941.845130517,0 is behind by 203.58131ms
I190215 08:22:22.080475 40350 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470739165216769 new resolved timestamp 1550218941.918866937,0 is behind by 161.517554ms
I190215 08:22:22.152191 40445 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470740993507329 new resolved timestamp 1550218941.952422905,0 is behind by 199.685797ms
I190215 08:22:22.162285 40350 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470739165216769 new resolved timestamp 1550218941.995299653,0 is behind by 166.888872ms
I190215 08:22:22.210719 40350 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470739165216769 new resolved timestamp 1550218942.080909441,0 is behind by 129.686477ms
I190215 08:22:22.231518 40445 ccl/changefeedccl/changefeed_stmt.go:442  [n1] CHANGEFEED job 426470740993507329 returning with error: cannot update progress on canceled job (id 426470740993507329)
I190215 08:22:22.299970 40350 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470739165216769 new resolved timestamp 1550218942.162657577,0 is behind by 137.184519ms
I190215 08:22:22.406194 39981 util/stop/stopper.go:547  quiescing; tasks left:
3      [async] storage.IntentResolver: cleanup txn records
1      [async] transport racer
1      [async] job-426470739165216769
1      [async] intent_resolver_batcher
1      [async] closedts-rangefeed-subscriber
1      [async] changefeed-poller
W190215 08:22:22.406794 40462 storage/intentresolver/intent_resolver.go:722  failed to gc transaction record: could not GC completed transaction anchored at /Table/SystemConfigSpan/Start: node unavailable; try another peer
I190215 08:22:22.410154 40350 ccl/changefeedccl/changefeed_stmt.go:442  [n1] CHANGEFEED job 426470739165216769 returning with error: cannot update progress on canceled job (id 426470739165216769)
I190215 08:22:22.411282 39981 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] transport racer
1      [async] closedts-rangefeed-subscriber
I190215 08:22:22.415881 39981 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] transport racer
I190215 08:22:22.564731 40242 kv/transport_race.go:113  transport race promotion: ran 35 iterations on up to 1338 requests
    --- FAIL: TestAvroEncoder/enterprise (2.84s)
        testing.go:771: race detected during execution of test

@danhhz

danhhz commented Feb 26, 2019

This reproduces locally under stressrace pretty easily. I turned on verbose changefeed logging and found something odd. I had assumed the bug was the avro encoder reusing some memory that it shouldn't, but I think I'm seeing a kv get written at a different timestamp than the one returned by cluster_logical_timestamp().

The test does:

		var ts2 string
		require.NoError(t, crdb.ExecuteTx(ctx, db, nil /* txopts */, func(tx *gosql.Tx) error {
			return tx.QueryRow(
				`INSERT INTO foo VALUES (3, 'baz') RETURNING cluster_logical_timestamp()`,
			).Scan(&ts2)
		}))
		assertPayloadsAvro(t, reg, fooUpdated, []string{
			`foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},` +
				`"updated":{"string":"` + ts2 + `"}}`,
		})

With verbose changefeed logging on (--vmodule=changefeed=3), I'm seeing this:

I190226 19:28:51.117848 549 ccl/changefeedccl/changefeed.go:139  [n1,client=127.0.0.1:56320,user=root] changed key /Table/53/1/3/0 1551209331.061806000,1
...
--- FAIL: TestAvroEncoder (2.91s)
    --- FAIL: TestAvroEncoder/poller (2.91s)
        <autogenerated>:1: expected
              foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},"updated":{"string":"1551209331098368000.0000000000"}}
            got
              foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},"updated":{"string":"1551209331061806000.0000000001"}}

The "expected" in the error is ts2, which comes back through sql from RETURNING cluster_logical_timestamp() on the insert. The "got" matches the timestamp we received from the changed kv, which comes right out of (in the case of this "poller" subtest" the ExportRequest).

@danhhz
Copy link
Contributor

danhhz commented Feb 26, 2019

TESTS=AvroEncoder/sinkless doesn't seem to fail as readily. If poller is broken, I'd expect to be seeing more tests fail, but so far it's just this one.

@danhhz

danhhz commented Feb 26, 2019

Okay, I did another repro with --vmodule=export=3, which prints out the kvs that ExportRequest is getting as it iterates. The timestamp agreed with the value the changefeed emitted. It really seems like RETURNING cluster_logical_timestamp() is giving me back a different value than the one this kv is written at. @tbg @nvanbenschoten is that expected?

I'm still confused why I'm not seeing this with rangefeed, so maybe it has something to do with the ExportRequest implementation (which notably uses a time-bound iterator). I'll try making the test use a normal iterator to double-check what's coming back from ExportRequest.

@nvanbenschoten

I don't know much past what you've already said -- we shouldn't be seeing a kv with any timestamp other than the one returned from cluster_logical_timestamp().

If you can't reproduce this with rangefeed, then I think you're correct to start digging into the ExportRequest implementation. Is it possible that an ExportRequest is pushing an intent and causing the transaction to restart? It would be worth tracing (or just instrumenting) that INSERT to see if anything funky is happening.
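A rough sketch of that instrumentation with session tracing, reusing the test's db handle (the INSERT runs in an implicit transaction here for brevity, and the SHOW TRACE column set varies by version, so treat this as illustrative rather than the test's actual code):

// Hypothetical instrumentation sketch, not the committed test code.
if _, err := db.Exec(`SET tracing = on`); err != nil {
	t.Fatal(err)
}
var ts2 string
if err := db.QueryRow(
	`INSERT INTO foo VALUES (3, 'baz') RETURNING cluster_logical_timestamp()`,
).Scan(&ts2); err != nil {
	t.Fatal(err)
}
if _, err := db.Exec(`SET tracing = off`); err != nil {
	t.Fatal(err)
}
rows, err := db.Query(`SELECT message FROM [SHOW TRACE FOR SESSION]`)
if err != nil {
	t.Fatal(err)
}
defer rows.Close()
for rows.Next() {
	var msg string
	if err := rows.Scan(&msg); err != nil {
		t.Fatal(err)
	}
	t.Log(msg) // look for intent pushes or txn restarts around the INSERT
}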

@danhhz

danhhz commented Feb 26, 2019

The plot thickens:

Export        /Table/53/1/3/0/1551215710.662443000,1 /TUPLE/2:2:Bytes/baz
Export verify /Table/53/1/3/0/1551215710.676995000,0 9d1668540a260362617a
Export        /Table/53/1/3/0/1551215710.676995000,0 /TUPLE/2:2:Bytes/baz
Export verify /Table/53/1/3/0/1551215710.676995000,0 9d1668540a260362617a

Smells like time-bound iterator, bummer :-/

@nvanbenschoten

We'll have to wait for more logging, but the fact that the sanity check iterator in MVCCIncrementalIterator.sanityCheckMetadataKey doesn't use a max timestamp hint

i.sanityIter = i.e.NewIterator(engine.IterOptions{UpperBound: i.upperBound})

while the time-bound iterator, which is where the provisional value is pulled from, does, seems very suspicious to me. I don't have a full story that explains what you're seeing here, but it certainly feels like that could cause issues. If you feel up to it, I'd try reproducing with a max timestamp hint in sanityCheckMetadataKey.
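A minimal sketch of that experiment (not the eventual fix), assuming the iterator keeps an endTime field next to upperBound and that engine.IterOptions accepts a MaxTimestampHint here the same way it does for the time-bound iterator:

// Hypothetical diagnostic change inside sanityCheckMetadataKey, for repro only.
i.sanityIter = i.e.NewIterator(engine.IterOptions{
	UpperBound:       i.upperBound,
	MaxTimestampHint: i.endTime, // assumed field; mirrors the TBI's upper bound
})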

@danhhz

danhhz commented Feb 27, 2019

I just got the following repro with additional logging in 19d76f7ea0bdae258808de3937b82cc3c7d8ad42:

engineccl/mvcc.go:150  tbi /Table/53/1/3/0 0a2c0a1042e1403d35e64ddc9a9ef323b9bcaebe1a04bd898b882a0c08e0ecd0c093d0c3c315100130cdf81d3801120c08e0ecd0c093d0c3c31510011800200c280a
engineccl/mvcc.go:161  meta tbi {Txn:<nil> Timestamp:1551224544.182102000,0 Deleted:false KeyBytes:0 ValBytes:0 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil> XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
engineccl/mvcc.go:256  provisional normal /Table/53/1/3/0/1551224544.832166000,0 9d1668540a260362617a
engineccl/mvcc.go:185  meta normal {Txn:id:42e1403d-35e6-4ddc-9a9e-f323b9bcaebe key:"\275\211\213\210" epoch:1 timestamp:<wall_time:1551224544832166000 > priority:490573 sequence:1  Timestamp:1551224544.832166000,0 Deleted:false KeyBytes:12 ValBytes:10 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil> XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
engineccl/mvcc.go:199  time bounds=(1551224544.806844000,0,1551224544.818651000,0] meta={Txn:id:42e1403d-35e6-4ddc-9a9e-f323b9bcaebe key:"\275\211\213\210" epoch:1 timestamp:<wall_time:1551224544832166000 > priority:490573 sequence:1  Timestamp:1551224544.832166000,0 Deleted:false KeyBytes:12 ValBytes:10 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil> XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
engineccl/mvcc.go:150  tbi /Table/53/1/3/0/1551224544.806844000,1 9d1668540a260362617a

Sorry, the debug messages I added are haphazard and oddly ordered, but I'll walk through what happens here.

line 1: TBI hits an intent
line 2: TBI intent metadata has no txn (don't know if this is relevant)
line 3: sanity iter sees the intent and the result of next (its provisional value) has a timestamp of 1551224544.832166000,0
line 4: sanity iter's intent metadata does have a txn and the timestamp matches the provisional value
line 5: TBI sees a provisional value with a lower timestamp

So at this point, there's a new provisional value (as seen by the sanity iter), which Nathan assures me means the old provisional value will have been deleted. But the TBI is still seeing some previous one at an older timestamp. Provisional value deletions do have a timestamp, so they should be reflected in TBI sstable metadata, which means the TBI shouldn't see this intent. I'm not sure yet exactly what is breaking down.

danhhz added a commit to danhhz/cockroach that referenced this issue Mar 11, 2019
RangeFeed originally intended to use the time-bound iterator
performance optimization. However, these iterators have had correctness
issues in the past (cockroachdb#28358, cockroachdb#34819) and no one has the time for
the due diligence necessary to be confident in their correctness going
forward. Not using them causes the total time spent in RangeFeed catchup
on changefeed over tpcc-1000 to go from 40s -> 4853s, which is quite
large but still workable.

Closes cockroachdb#35122

Release note (enterprise change): In exchange for increased correctness
confidence, `CHANGEFEED`s using `changefeed.push.enabled` (the default)
now take slightly more resources on startup and range
rebalancing/splits.
craig bot pushed a commit that referenced this issue Mar 12, 2019
35470: rangefeed: stop using time-bound iterator for catchup scan r=tbg a=danhhz

Co-authored-by: Daniel Harrison <[email protected]>
@nvanbenschoten
Copy link
Member

Ok, I think I see what's going on. I'm running to a meeting so I'll write the explanation up later, but here's a hint - this fixes it:

diff --git a/pkg/storage/spanlatch/manager.go b/pkg/storage/spanlatch/manager.go
index 8ee9df8c67..e8369c8d58 100644
--- a/pkg/storage/spanlatch/manager.go
+++ b/pkg/storage/spanlatch/manager.go
@@ -350,7 +350,7 @@ func (m *Manager) wait(ctx context.Context, lg *Guard, snap snapshot) error {
                                case spanset.SpanReadOnly:
                                        // Wait for writes at equal or lower timestamps.
                                        it := tr[spanset.SpanReadWrite].MakeIter()
-                                       if err := m.iterAndWait(ctx, timer, &it, latch, ignoreLater); err != nil {
+                                       if err := m.iterAndWait(ctx, timer, &it, latch, ignoreNothing); err != nil {
                                                return err
                                        }
                                case spanset.SpanReadWrite:
@@ -366,7 +366,7 @@ func (m *Manager) wait(ctx context.Context, lg *Guard, snap snapshot) error {
                                        }
                                        // Wait for reads at equal or higher timestamps.
                                        it = tr[spanset.SpanReadOnly].MakeIter()
-                                       if err := m.iterAndWait(ctx, timer, &it, latch, ignoreEarlier); err != nil {
+                                       if err := m.iterAndWait(ctx, timer, &it, latch, ignoreNothing); err != nil {
                                                return err
                                        }
                                default:

@nvanbenschoten

The issue is that the Export request acquires read latches at timestamp 1 and concurrently an intent is moved forward while holding write latches at timestamp 2. Writes in the future should not interact with reads in the past, so these requests are allowed to execute concurrently. However, changes to uncommitted transactional state make the rules here a little complicated, and it's better to think of "interact" in the logical sense instead of the physical sense (e.g. a write at a higher timestamp could remove an intent and race with a reader, but the effect of observing the intent vs. not observing it will be the same for a concurrent reader after concurrency control).

The hack in MVCCIncrementalIterator to re-check the intent key and ignore it breaks this independence. It allows the Export request to see a partially applied update at a higher timestamp that affects uncommitted state in its timestamp range in a way that doesn't preserve logical equivalence. Concretely, we're seeing a case where MVCCIncrementalIterator.iter observes a state before an intent push is applied (old intent key, old provisional key) and MVCCIncrementalIterator.sanityIter observes a state after the intent push is applied (new intent key, new provisional key). On its own, this isn't necessarily an issue, but the problem occurs when we switch back and forth between them. We ignore the old intent key in favor of the new intent key, but then we switch back to MVCCIncrementalIterator.iter and return the old provisional key. This causes us to consider an uncommitted and already-removed intent as committed.

So what do we do about this? I see a few options:

  • pull both iterators from the same snapshot (a rough sketch follows below)
  • grab the sanityIter before grabbing the iter
  • check the status of the old provisional key using the sanityIter to determine whether it is committed or aborted

This will affect any user of MVCCIncrementalIterator, which is at least backup, right @danhhz?
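For the first option, a rough sketch of what pulling both iterators from the same snapshot could look like, assuming the caller can hand NewMVCCIncrementalIterator a snapshot taken from an engine.Engine rather than a bare Reader (names and plumbing here are illustrative, not a concrete API proposal):

// Hypothetical caller-side change: both iter and sanityIter would then be
// created from the same consistent point-in-time view.
snap := eng.NewSnapshot()
defer snap.Close()
it := engineccl.NewMVCCIncrementalIterator(snap, engineccl.IterOptions{
	StartTime:  startTime,
	EndTime:    endTime,
	UpperBound: endKey,
})
defer it.Close()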

@nvanbenschoten

nvanbenschoten commented Mar 12, 2019

grab the sanityIter before grabbing the iter

diff --git a/pkg/ccl/storageccl/engineccl/mvcc.go b/pkg/ccl/storageccl/engineccl/mvcc.go
index 48fa07b..61e01d1 100644
--- a/pkg/ccl/storageccl/engineccl/mvcc.go
+++ b/pkg/ccl/storageccl/engineccl/mvcc.go
@@ -85,6 +85,9 @@ func NewMVCCIncrementalIterator(e engine.Reader, opts IterOptions) *MVCCIncremen
        return &MVCCIncrementalIterator{
                e:          e,
                upperBound: opts.UpperBound,
+               sanityIter: e.NewIterator(engine.IterOptions{
+                       UpperBound: opts.UpperBound,
+               }),
                iter: e.NewIterator(engine.IterOptions{
                        // The call to startTime.Next() converts our exclusive start bound into
                        // the inclusive start bound that MinTimestampHint expects. This is
@@ -231,11 +234,6 @@ func (i *MVCCIncrementalIterator) advance() {
 // sees that exact key. Otherwise, it returns false. It's used in the workaround
 // in `advance` for a time-bound iterator bug.
 func (i *MVCCIncrementalIterator) sanityCheckMetadataKey() ([]byte, bool, error) {
-       if i.sanityIter == nil {
-               // The common case is that we'll won't need the sanityIter for a given
-               // MVCCIncrementalIterator, so create it lazily.
-               i.sanityIter = i.e.NewIterator(engine.IterOptions{UpperBound: i.upperBound})
-       }
        unsafeKey := i.iter.UnsafeKey()
        i.sanityIter.Seek(unsafeKey)
        if ok, err := i.sanityIter.Valid(); err != nil {

Yeah, this fixes it, but the reason why is pretty subtle. By creating the sanityIter (non-tbi) before iter (tbi), we ensure that the only way for the two iterators to see different intents on a key is if the iter intent has a timestamp outside of the endTime (the timestamp at which it acquired its read latch), which it will ignore when it iterates to the next key (the provisional kv). Doing this aligns the ambiguity of the phantom intent problem of tbis with the ambiguity of the multiple-iterator race. We then rely on the sanityIter to distinguish between these cases and collapse the ambiguity.

@danhhz

danhhz commented Mar 12, 2019

I'm still wrapping my head around this, but great find! I think I've been assuming that all iterators created from the engine.ReadWriter that is passed to evalExport would see the exact same thing (like they would if they were attached to a RocksDB snapshot), but now that it's pointed out to me, I realize of course that's not true.

Yes, this will affect all users of MVCCIncrementalIterator, which is currently both full and incremental backup as well as poller-based changefeeds and the initial scan and tabledesc history of any changefeed. There is a bug that makes full backup and changefeed initial scan use tbis, but it's an easy fix. I've also just proposed that incremental backup also stop using tbis (see #35671). tabledesc history polling also shouldn't need tbis, which leaves only poller-based changefeeds.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 13, 2019
…ents

Fixes cockroachdb#34819.

349ff61 introduced a workaround for cockroachdb#28358 into MVCCIncrementalIterator.
This workaround created a second (non-time-bound) iterator to verify
possibly-phantom MVCCMetadata keys during iteration.

We found in cockroachdb#34819 that it is necessary for correctness that sanityIter
be created before the original iter. This is because the provided Reader
that both iterators are created from may not be a consistent snapshot, so
the two iterators could end up observing different information. The hack
around sanityCheckMetadataKey only works properly if all possible
discrepancies between the two iterators lead to intents and values falling
outside of the timestamp range **from the time-bound iterator's perspective**.
This allows us to simply ignore discrepancies that we notice in advance().

This commit makes this change. It also adds a test that failed regularly
before the fix under stress and no longer fails after the fix.

Release note: None
craig bot pushed a commit that referenced this issue Mar 13, 2019
35300: sql: add checks after all referenced columns have been backfilled r=lucy-zhang a=lucy-zhang

Previously, if a column was added with a check constraint that also referenced
another column that was public, writes to that public column would erroneously
fail (and, in the worst case, result in a panic) while the column being added
was not yet public. With this change, the schema changer will now wait to add
the check constraint to the table descriptor until after all the columns that
were added in the same transaction have been backfilled.

A new optional field has been added to `ConstraintToValidate` for the check
constraint itself, so that it can be added to the table descriptor at the
correct step in the schema change process.

I ended up adding this field to the existing mutation instead of creating a new
type of mutation to add the constraint to the descriptor, since it ultimately
seemed to me that a mutation that simply adds a schema element in its backfill
step would be too inconsistent with what mutations are, especially since all
the mutations for a single transaction are basically executed at the same time.
To support NOT VALID in the future, we could add more flags to the protobuf to
indicate that either the addition of the constraint or the validation should be
skipped, so that they can be executed separately.

Fixes #35258, fixes #35193

Release note: None

35682: engineccl/mvcc: fix time-bound iterator's interaction with moving intents r=nvanbenschoten a=nvanbenschoten

Fixes #34819.


35689: roachtest: add large node kv tests and batching kv tests r=nvanbenschoten a=nvanbenschoten

This commit adds support for running `kv` with a `--batch` parameter. It
then adds the following new roachtest configurations:
- kv0/enc=false/nodes=3/batch=16
- kv95/enc=false/nodes=3/batch=16
- kv0/enc=false/nodes=3/cpu=96
- kv95/enc=false/nodes=3/cpu=96
- kv50/enc=false/nodes=4/cpu=96/batch=64

The last test is currently skipped because of #34241. I confirmed that
it triggers the corresponding assertion on both AWS and GCE.

My request for more m5d.24xlarge quota just succeeded, but I may need to
request more quota for n1-highcpu-96 VMs for these to run nightly.

Release note: None

Co-authored-by: Lucy Zhang <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot closed this as completed in #35682 Mar 13, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 20, 2019
danhhz added a commit to danhhz/cockroach that referenced this issue Mar 26, 2019
Time-bound iterators are a performance optimization that allows us to
entirely skip over sstables in RocksDB that don't have data relevant to
the time bounds in a request.

This can have a dramatic impact on performance, but we've seen a number
of extremely subtle and hard to detect correctness issues with this (see
issues cockroachdb#28358 cockroachdb#34819). As a result, we've decided to skip the
optimization everywhere that it isn't absolutely necessary for the
feature to work (leaving one place: poller-based changefeeds, which are
being phased out anyway). This will both give increased confidence in
correctness as well as eliminate any need to consider and investigate
time-bound iterators when/if someone hits a correctness bug.

This commit introduces the plumbing necessary for an individual
ExportRequest to control whether time-bound iterators are allowed or
disallowed.

A bool is introduced to the ExportRequest proto that explicitly allows
time-bound iterators. This means that in a rolling upgrade, it's
possible for an old changefeed-poller node to send a request without the
field set to a new node, which sees the unset field as false, disabling
the optimization. An alternative is to invert the semantics of the bool
(zero-value = enable, true = disable the optimization), but in case any
new uses of ExportRequest are introduced, I'd much prefer the zero-value
of this field be the safer default of disabled.

As part of the investigation for whether we could turn them off for
incremental BACKUP (cockroachdb#35671), I reran some of the initial measurements of
time-bound iterator impact on incremental backup. I installed tpcc-1000
on a 3 node n1-standard-16 cluster, then ran a full backup, then ran
load for 1 hour, noting the time T. With load running, I ran 6
incremental backups from T alternating between tbi and no-tbi:

    tbi incremental backup runtimes: 3m45s,3m56s,4m6s
    no-tbi incremental backup runtimes: 6m45s,6m27s,6m48s

Any impact on normal traffic (latencies etc) seemed to be in the noise.

Closes cockroachdb#35671.

Release note (enterprise change): In exchange for increased correctness
confidence, `BACKUP`s using `INCREMENTAL FROM` now take slightly longer.
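As a rough sketch of the poller-side plumbing this commit message describes, only the changefeed poller would opt in to the optimization; the ExportRequest field name below is illustrative, not taken verbatim from the diff:

// Hypothetical poller-side request construction; only the changefeed poller
// sets the opt-in bool, so other callers default to the safer non-TBI path.
req := &roachpb.ExportRequest{
	RequestHeader: roachpb.RequestHeader{Key: span.Key, EndKey: span.EndKey},
	StartTime:     startTime,
	MVCCFilter:    roachpb.MVCCFilter_All,
	EnableTimeBoundIteratorOptimization: true, // assumed field name
}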
danhhz added a commit to danhhz/cockroach that referenced this issue Mar 28, 2019
craig bot pushed a commit that referenced this issue Mar 28, 2019
36191: storageccl: disable time-bound iteration optimization in BACKUP r=dt,tbg a=danhhz

Co-authored-by: Daniel Harrison <[email protected]>
danhhz added a commit to danhhz/cockroach that referenced this issue Mar 28, 2019