
ccl/changefeedccl: TestAvroEncoder failed under stress #34819

Closed
cockroach-teamcity opened this issue Feb 12, 2019 · 14 comments · Fixed by #35682
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

Comments

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/965525a5deb3c4fab5ab232e069846ecc08f632d

Parameters:

TAGS=
GOFLAGS=-race

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestAvroEncoder PKG=github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1133737&tab=buildLog

I190212 07:34:23.816867 39761 sql/event_log.go:135  [n1,intExec=create-default-db] Event: "create_database", target: 51, info: {DatabaseName:postgres Statement:CREATE DATABASE IF NOT EXISTS postgres User:root}
I190212 07:34:23.843475 39528 server/server.go:1645  [n1] done ensuring all necessary migrations have run
I190212 07:34:23.843592 39528 server/server.go:1648  [n1] serving sql connections
I190212 07:34:23.870680 39789 sql/event_log.go:135  [n1] Event: "node_join", target: 1, info: {Descriptor:{NodeID:1 Address:127.0.0.1:44729 Attrs: Locality: ServerVersion:2.1-6 BuildTag:v2.2.0-alpha.20190211-59-g965525a StartedAt:1549956863221167191 LocalityAddress:[] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} ClusterID:0767d590-e3a2-4b92-839a-d55b5e214481 StartedAt:1549956863221167191 LastUp:1549956863221167191}
I190212 07:34:23.873853 39786 server/server_update.go:67  [n1] no need to upgrade, cluster already at the newest version
I190212 07:34:23.916926 39497 sql/event_log.go:135  [n1,client=127.0.0.1:52492,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.rangefeed.enabled Value:true User:root}
I190212 07:34:23.945303 39497 sql/event_log.go:135  [n1,client=127.0.0.1:52492,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.closed_timestamp.target_duration Value:00:00:01 User:root}
I190212 07:34:23.982669 39497 sql/event_log.go:135  [n1,client=127.0.0.1:52492,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:changefeed.experimental_poll_interval Value:00:00:00.01 User:root}
I190212 07:34:24.009996 39497 sql/event_log.go:135  [n1,client=127.0.0.1:52492,user=root] Event: "create_database", target: 52, info: {DatabaseName:d Statement:CREATE DATABASE d User:root}
I190212 07:34:24.045299 39497 sql/event_log.go:135  [n1,client=127.0.0.1:52492,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:changefeed.push.enabled Value:false User:root}
I190212 07:34:24.077778 39497 sql/event_log.go:135  [n1,client=127.0.0.1:52492,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:changefeed.experimental_poll_interval Value:00:00:00.01 User:root}
I190212 07:34:24.105689 39497 sql/event_log.go:135  [n1,client=127.0.0.1:52492,user=root] Event: "create_table", target: 53, info: {TableName:d.public.foo Statement:CREATE TABLE foo (a INT8 PRIMARY KEY, b STRING) User:root}
I190212 07:34:24.120701 39912 storage/replica_command.go:245  [n1,split,s1,r20/1:/{Table/24-Max}] initiating a split of this range at key /Table/53 [r21] (zone config)
I190212 07:34:24.474823 39528 util/stop/stopper.go:536  quiescing; tasks left:
1      [async] transport racer
1      [async] storage.IntentResolver: cleanup txn records
1      [async] intent_resolver_batcher
1      [async] closedts-rangefeed-subscriber
1      [async] changefeed-poller
1      [async] async-rollback
I190212 07:34:24.485080 39528 util/stop/stopper.go:536  quiescing; tasks left:
1      [async] transport racer
1      [async] storage.IntentResolver: cleanup txn records
1      [async] closedts-rangefeed-subscriber
W190212 07:34:24.485542 39896 storage/intentresolver/intent_resolver.go:721  failed to gc transaction record: could not GC completed transaction anchored at /Local/Range/Table/24/RangeDescriptor: node unavailable; try another peer
I190212 07:34:24.485810 39528 util/stop/stopper.go:536  quiescing; tasks left:
1      [async] transport racer
1      [async] closedts-rangefeed-subscriber
I190212 07:34:24.486039 39528 util/stop/stopper.go:536  quiescing; tasks left:
1      [async] transport racer
I190212 07:34:24.570205 39659 kv/transport_race.go:113  transport race promotion: ran 12 iterations on up to 604 requests
    --- FAIL: TestAvroEncoder/poller (1.49s)
        <autogenerated>:1: expected
              foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},"updated":{"string":"1549956864460181344.0000000000"}}
            got
              foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},"updated":{"string":"1549956864451566598.0000000001"}}

cockroach-teamcity added this to the 2.2 milestone Feb 12, 2019
cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Feb 12, 2019
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/10f8010fa5778e740c057905e2d7664b5fd5d647

Parameters:

TAGS=
GOFLAGS=-race

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestAvroEncoder PKG=github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1135873&tab=buildLog

I190213 08:03:44.087778 39787 server/server.go:1648  [n1] serving sql connections
I190213 08:03:44.134447 40065 server/server_update.go:67  [n1] no need to upgrade, cluster already at the newest version
I190213 08:03:44.136415 40164 sql/event_log.go:135  [n1] Event: "node_join", target: 1, info: {Descriptor:{NodeID:1 Address:127.0.0.1:38655 Attrs: Locality: ServerVersion:2.1-6 BuildTag:v2.2.0-alpha.20190211-100-g10f8010 StartedAt:1550045023446407575 LocalityAddress:[] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} ClusterID:5fc734b7-1dcb-4b9e-9f10-20aa59bb23b2 StartedAt:1550045023446407575 LastUp:1550045023446407575}
I190213 08:03:44.169978 40137 sql/event_log.go:135  [n1,client=127.0.0.1:59328,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.rangefeed.enabled Value:true User:root}
I190213 08:03:44.215123 40137 sql/event_log.go:135  [n1,client=127.0.0.1:59328,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:kv.closed_timestamp.target_duration Value:00:00:01 User:root}
I190213 08:03:44.251681 40137 sql/event_log.go:135  [n1,client=127.0.0.1:59328,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:changefeed.experimental_poll_interval Value:00:00:00.01 User:root}
I190213 08:03:44.309152 40137 sql/event_log.go:135  [n1,client=127.0.0.1:59328,user=root] Event: "create_database", target: 52, info: {DatabaseName:d Statement:CREATE DATABASE d User:root}
I190213 08:03:44.353301 40137 sql/event_log.go:135  [n1,client=127.0.0.1:59328,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:changefeed.push.enabled Value:false User:root}
I190213 08:03:44.392545 40137 sql/event_log.go:135  [n1,client=127.0.0.1:59328,user=root] Event: "set_cluster_setting", target: 0, info: {SettingName:changefeed.experimental_poll_interval Value:00:00:00.01 User:root}
I190213 08:03:44.437119 40137 sql/event_log.go:135  [n1,client=127.0.0.1:59328,user=root] Event: "create_table", target: 53, info: {TableName:d.public.foo Statement:CREATE TABLE foo (a INT8 PRIMARY KEY, b STRING) User:root}
I190213 08:03:44.459158 40242 storage/replica_command.go:245  [n1,split,s1,r20/1:/{Table/24-Max}] initiating a split of this range at key /Table/53 [r21] (zone config)
I190213 08:03:44.848236 39787 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] transport racer
1      [async] storage.IntentResolver: cleanup txn records
1      [async] intent_resolver_batcher
1      [async] closedts-rangefeed-subscriber
1      [async] changefeed-poller
W190213 08:03:44.848313 40171 storage/intentresolver/intent_resolver.go:721  failed to gc transaction record: could not GC completed transaction anchored at /Local/Range/Table/24/RangeDescriptor: node unavailable; try another peer
I190213 08:03:44.848589 39787 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] transport racer
1      [async] storage.IntentResolver: cleanup txn records
1      [async] intent_resolver_batcher
1      [async] closedts-rangefeed-subscriber
I190213 08:03:44.848796 39787 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] transport racer
1      [async] intent_resolver_batcher
1      [async] closedts-rangefeed-subscriber
I190213 08:03:44.849114 39787 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] transport racer
1      [async] closedts-rangefeed-subscriber
I190213 08:03:44.849332 39787 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] transport racer
W190213 08:03:44.849458 40213 internal/client/txn.go:520  [n1,client=127.0.0.1:59336,user=root] failure aborting transaction: node unavailable; try another peer; abort caused by: context canceled
I190213 08:03:44.924444 40082 kv/transport_race.go:113  transport race promotion: ran 10 iterations on up to 623 requests
    --- FAIL: TestAvroEncoder/poller (1.65s)
        <autogenerated>:1: expected
              foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},"updated":{"string":"1550045024829413917.0000000000"}}
            got
              foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},"updated":{"string":"1550045024815525369.0000000001"}}

@danhhz

danhhz commented Feb 13, 2019

Huh, guess this is real. I wonder what changed

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/22119c85d2e7aca50f21f61408ae55207679569d

Parameters:

TAGS=
GOFLAGS=-race

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestAvroEncoder PKG=github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1140564&tab=buildLog

  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask()
      /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:320 +0x14d
  github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resume()
      /go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:534 +0x307
  github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).StartJob()
      /go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:224 +0x2f7
  github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.changefeedPlanHook.func2()
      /go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_stmt.go:248 +0xf33
  github.com/cockroachdb/cockroach/pkg/sql.(*hookFnNode).startExec.func1()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/planhook.go:136 +0x108
==================
I190215 08:22:21.865895 40350 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470739165216769 new resolved timestamp 1550218941.729762618,0 is behind by 136.037081ms
I190215 08:22:21.916404 40350 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470739165216769 new resolved timestamp 1550218941.800727678,0 is behind by 115.590636ms
I190215 08:22:21.947317 40445 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470740993507329 new resolved timestamp 1550218941.405653573,0 is behind by 541.530005ms
I190215 08:22:21.947618 40445 ccl/changefeedccl/changefeed_processors.go:562  [n1] job 426470740993507329 span /Table/53/{1-2} is behind by 541.530005ms
I190215 08:22:21.991576 40350 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470739165216769 new resolved timestamp 1550218941.869488947,0 is behind by 121.938943ms
I190215 08:22:22.048809 40445 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470740993507329 new resolved timestamp 1550218941.845130517,0 is behind by 203.58131ms
I190215 08:22:22.080475 40350 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470739165216769 new resolved timestamp 1550218941.918866937,0 is behind by 161.517554ms
I190215 08:22:22.152191 40445 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470740993507329 new resolved timestamp 1550218941.952422905,0 is behind by 199.685797ms
I190215 08:22:22.162285 40350 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470739165216769 new resolved timestamp 1550218941.995299653,0 is behind by 166.888872ms
I190215 08:22:22.210719 40350 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470739165216769 new resolved timestamp 1550218942.080909441,0 is behind by 129.686477ms
I190215 08:22:22.231518 40445 ccl/changefeedccl/changefeed_stmt.go:442  [n1] CHANGEFEED job 426470740993507329 returning with error: cannot update progress on canceled job (id 426470740993507329)
I190215 08:22:22.299970 40350 ccl/changefeedccl/changefeed_processors.go:550  [n1] job 426470739165216769 new resolved timestamp 1550218942.162657577,0 is behind by 137.184519ms
I190215 08:22:22.406194 39981 util/stop/stopper.go:547  quiescing; tasks left:
3      [async] storage.IntentResolver: cleanup txn records
1      [async] transport racer
1      [async] job-426470739165216769
1      [async] intent_resolver_batcher
1      [async] closedts-rangefeed-subscriber
1      [async] changefeed-poller
W190215 08:22:22.406794 40462 storage/intentresolver/intent_resolver.go:722  failed to gc transaction record: could not GC completed transaction anchored at /Table/SystemConfigSpan/Start: node unavailable; try another peer
I190215 08:22:22.410154 40350 ccl/changefeedccl/changefeed_stmt.go:442  [n1] CHANGEFEED job 426470739165216769 returning with error: cannot update progress on canceled job (id 426470739165216769)
I190215 08:22:22.411282 39981 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] transport racer
1      [async] closedts-rangefeed-subscriber
I190215 08:22:22.415881 39981 util/stop/stopper.go:547  quiescing; tasks left:
1      [async] transport racer
I190215 08:22:22.564731 40242 kv/transport_race.go:113  transport race promotion: ran 35 iterations on up to 1338 requests
    --- FAIL: TestAvroEncoder/enterprise (2.84s)
        testing.go:771: race detected during execution of test

@danhhz

danhhz commented Feb 26, 2019

This reproduces locally under stressrace pretty easily. I turned on verbose changefeed logging and found something odd. I had assumed the bug was the avro encoder reusing some memory that it shouldn't, but I think I'm seeing a kv get written at a different timestamp than the one returned by cluster_logical_timestamp().

The test does:

		var ts2 string
		require.NoError(t, crdb.ExecuteTx(ctx, db, nil /* txopts */, func(tx *gosql.Tx) error {
			return tx.QueryRow(
				`INSERT INTO foo VALUES (3, 'baz') RETURNING cluster_logical_timestamp()`,
			).Scan(&ts2)
		}))
		assertPayloadsAvro(t, reg, fooUpdated, []string{
			`foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},` +
				`"updated":{"string":"` + ts2 + `"}}`,
		})

With verbose changefeed logging on (--vmodule=changefeed=3), I'm seeing this:

I190226 19:28:51.117848 549 ccl/changefeedccl/changefeed.go:139  [n1,client=127.0.0.1:56320,user=root] changed key /Table/53/1/3/0 1551209331.061806000,1
...
--- FAIL: TestAvroEncoder (2.91s)
    --- FAIL: TestAvroEncoder/poller (2.91s)
        <autogenerated>:1: expected
              foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},"updated":{"string":"1551209331098368000.0000000000"}}
            got
              foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},"updated":{"string":"1551209331061806000.0000000001"}}

The "expected" in the error is ts2, which comes back through sql from RETURNING cluster_logical_timestamp() on the insert. The "got" matches the timestamp we received from the changed kv, which comes right out of (in the case of this "poller" subtest" the ExportRequest).

@danhhz
Copy link
Contributor

danhhz commented Feb 26, 2019

TESTS=AvroEncoder/sinkless doesn't seem to fail as readily. If poller is broken, I'd expect to be seeing more tests fail, but so far it's just this one.

@danhhz

danhhz commented Feb 26, 2019

Okay, I did another repro with --vmodule=export=3, which prints out the kvs that ExportRequest is getting as it iterates. The timestamp agreed with the value the changefeed emitted. It really seems like RETURNING cluster_logical_timestamp() is giving me back a different value than the one this kv is written at. @tbg @nvanbenschoten is that expected?

I'm still confused why I'm not seeing this with rangefeed, so maybe it has something to do with the ExportRequest implementation (which notably uses a time-bound iterator). I'll try making the test use a normal iterator to double-check what's coming back from ExportRequest.

@nvanbenschoten

I don't know much past what you've already said -- we shouldn't be seeing a kv with any timestamp other than the one returned from cluster_logical_timestamp().

If you can't reproduce this with rangefeed, then I think you're correct to start digging into the ExportRequest implementation. Is it possible that an ExportRequest is pushing an intent and causing the transaction to restart? It would be worth tracing (or just instrumenting) that INSERT to see if anything funky is happening.
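A rough sketch of that instrumentation with session tracing, reusing the test's db handle (the INSERT runs in an implicit transaction here for brevity, and the SHOW TRACE column set varies by version, so treat this as illustrative rather than the test's actual code):

// Hypothetical instrumentation sketch, not the committed test code.
if _, err := db.Exec(`SET tracing = on`); err != nil {
	t.Fatal(err)
}
var ts2 string
if err := db.QueryRow(
	`INSERT INTO foo VALUES (3, 'baz') RETURNING cluster_logical_timestamp()`,
).Scan(&ts2); err != nil {
	t.Fatal(err)
}
if _, err := db.Exec(`SET tracing = off`); err != nil {
	t.Fatal(err)
}
rows, err := db.Query(`SELECT message FROM [SHOW TRACE FOR SESSION]`)
if err != nil {
	t.Fatal(err)
}
defer rows.Close()
for rows.Next() {
	var msg string
	if err := rows.Scan(&msg); err != nil {
		t.Fatal(err)
	}
	t.Log(msg) // look for intent pushes or txn restarts around the INSERT
}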

@danhhz

danhhz commented Feb 26, 2019

The plot thickens:

Export        /Table/53/1/3/0/1551215710.662443000,1 /TUPLE/2:2:Bytes/baz
Export verify /Table/53/1/3/0/1551215710.676995000,0 9d1668540a260362617a
Export        /Table/53/1/3/0/1551215710.676995000,0 /TUPLE/2:2:Bytes/baz
Export verify /Table/53/1/3/0/1551215710.676995000,0 9d1668540a260362617a

Smells like time-bound iterator, bummer :-/

@nvanbenschoten

We'll have to wait for more logging, but the fact that the sanity check iterator in MVCCIncrementalIterator.sanityCheckMetadataKey doesn't use a max timestamp hint

i.sanityIter = i.e.NewIterator(engine.IterOptions{UpperBound: i.upperBound})

while the time-bound iterator, which is where the provisional value is pulled from, does, seems very suspicious to me. I don't have a full story that explains what you're seeing here, but it certainly feels like that could cause issues. If you feel up to it, I'd try reproducing with a max timestamp hint in sanityCheckMetadataKey.
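A minimal sketch of that experiment (not the eventual fix), assuming the iterator keeps an endTime field next to upperBound and that engine.IterOptions accepts a MaxTimestampHint here the same way it does for the time-bound iterator:

// Hypothetical diagnostic change inside sanityCheckMetadataKey, for repro only.
i.sanityIter = i.e.NewIterator(engine.IterOptions{
	UpperBound:       i.upperBound,
	MaxTimestampHint: i.endTime, // assumed field; mirrors the TBI's upper bound
})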

@danhhz

danhhz commented Feb 27, 2019

I just got the following repro with additional logging in 19d76f7ea0bdae258808de3937b82cc3c7d8ad42:

engineccl/mvcc.go:150  tbi /Table/53/1/3/0 0a2c0a1042e1403d35e64ddc9a9ef323b9bcaebe1a04bd898b882a0c08e0ecd0c093d0c3c315100130cdf81d3801120c08e0ecd0c093d0c3c31510011800200c280a
engineccl/mvcc.go:161  meta tbi {Txn:<nil> Timestamp:1551224544.182102000,0 Deleted:false KeyBytes:0 ValBytes:0 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil> XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
engineccl/mvcc.go:256  provisional normal /Table/53/1/3/0/1551224544.832166000,0 9d1668540a260362617a
engineccl/mvcc.go:185  meta normal {Txn:id:42e1403d-35e6-4ddc-9a9e-f323b9bcaebe key:"\275\211\213\210" epoch:1 timestamp:<wall_time:1551224544832166000 > priority:490573 sequence:1  Timestamp:1551224544.832166000,0 Deleted:false KeyBytes:12 ValBytes:10 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil> XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
engineccl/mvcc.go:199  time bounds=(1551224544.806844000,0,1551224544.818651000,0] meta={Txn:id:42e1403d-35e6-4ddc-9a9e-f323b9bcaebe key:"\275\211\213\210" epoch:1 timestamp:<wall_time:1551224544832166000 > priority:490573 sequence:1  Timestamp:1551224544.832166000,0 Deleted:false KeyBytes:12 ValBytes:10 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil> XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
engineccl/mvcc.go:150  tbi /Table/53/1/3/0/1551224544.806844000,1 9d1668540a260362617a

Sorry, the debug messages I added are haphazard and oddly ordered, but I'll walk through what happens here.

line 1: TBI hits an intent
line 2: TBI intent metadata has no txn (don't know if this is relevant)
line 3: sanity iter sees the intent and the result of next (its provisional value) has a timestamp of 1551224544.832166000,0
line 4: sanity iter's intent metadata does have a txn and the timestamp matches the provisional value
line 5: TBI sees a provisional value with a lower timestamp

So at this point, there's a new provisional value (as seen by the sanity iter), which Nathan assures me means the old provisional value will have been deleted. But the TBI is still seeing some previous one at an older timestamp. Provisional value deletions do have a timestamp, so they should be reflected in TBI sstable metadata, which means the TBI shouldn't see this intent. I'm not sure yet exactly what is breaking down.

danhhz added a commit to danhhz/cockroach that referenced this issue Mar 11, 2019
RangeFeed originally intended to use the time-bound iterator
performance optimization. However, these iterators have had correctness
issues in the past (cockroachdb#28358, cockroachdb#34819) and no one has the time for
the due diligence necessary to be confident in their correctness going
forward. Not using them causes the total time spent in RangeFeed catchup
on changefeed over tpcc-1000 to go from 40s -> 4853s, which is quite
large but still workable.

Closes cockroachdb#35122

Release note (enterprise change): In exchange for increased correctness
confidence, `CHANGEFEED`s using `changefeed.push.enabled` (the default)
now take slightly more resources on startup and range
rebalancing/splits.
craig bot pushed a commit that referenced this issue Mar 12, 2019
35470: rangefeed: stop using time-bound iterator for catchup scan r=tbg a=danhhz

Co-authored-by: Daniel Harrison <[email protected]>
@nvanbenschoten
Copy link
Member

Ok, I think I see what's going on. I'm running to a meeting so I'll write the explanation up later, but here's a hint - this fixes it:

diff --git a/pkg/storage/spanlatch/manager.go b/pkg/storage/spanlatch/manager.go
index 8ee9df8c67..e8369c8d58 100644
--- a/pkg/storage/spanlatch/manager.go
+++ b/pkg/storage/spanlatch/manager.go
@@ -350,7 +350,7 @@ func (m *Manager) wait(ctx context.Context, lg *Guard, snap snapshot) error {
                                case spanset.SpanReadOnly:
                                        // Wait for writes at equal or lower timestamps.
                                        it := tr[spanset.SpanReadWrite].MakeIter()
-                                       if err := m.iterAndWait(ctx, timer, &it, latch, ignoreLater); err != nil {
+                                       if err := m.iterAndWait(ctx, timer, &it, latch, ignoreNothing); err != nil {
                                                return err
                                        }
                                case spanset.SpanReadWrite:
@@ -366,7 +366,7 @@ func (m *Manager) wait(ctx context.Context, lg *Guard, snap snapshot) error {
                                        }
                                        // Wait for reads at equal or higher timestamps.
                                        it = tr[spanset.SpanReadOnly].MakeIter()
-                                       if err := m.iterAndWait(ctx, timer, &it, latch, ignoreEarlier); err != nil {
+                                       if err := m.iterAndWait(ctx, timer, &it, latch, ignoreNothing); err != nil {
                                                return err
                                        }
                                default:

@nvanbenschoten

The issue is that the Export request acquires read latches at timestamp 1 and concurrently an intent is moved forward while holding write latches at timestamp 2. Writes in the future should not interact with reads in the past, so these requests are allowed to execute concurrently. However, changes to uncommitted transactional state make the rules here a little complicated, and it's better to think of "interact" in the logical sense instead of the physical sense (e.g. a write at a higher timestamp could remove an intent and race with a reader, but the effect of observing the intent vs. not observing it will be the same for a concurrent reader after concurrency control).

The hack in MVCCIncrementalIterator to re-check the intent key and ignore it breaks this independence. It allows the Export request to see a partially applied update at a higher timestamp that affects uncommitted state in its timestamp range in a way that doesn't preserve logical equivalence. Concretely, we're seeing a case where MVCCIncrementalIterator.iter observes a state before an intent push is applied (old intent key, old provisional key) and MVCCIncrementalIterator.sanityIter observes a state after the intent push is applied (new intent key, new provisional key). On its own, this isn't necessarily an issue, but the problem occurs when we switch back and forth between them. We ignore the old intent key in favor of the new intent key, but then we switch back to MVCCIncrementalIterator.iter and return the old provisional key. This causes us to consider an uncommitted and already-removed intent as committed.

So what do we do about this? I see a few options:

  • pull both iterators from the same snapshot (a rough sketch follows below)
  • grab the sanityIter before grabbing the iter
  • check the status of the old provisional key using the sanityIter to determine whether it is committed or aborted

This will affect any user of MVCCIncrementalIterator, which is at least backup, right @danhhz?
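For the first option, a rough sketch of what pulling both iterators from the same snapshot could look like, assuming the caller can hand NewMVCCIncrementalIterator a snapshot taken from an engine.Engine rather than a bare Reader (names and plumbing here are illustrative, not a concrete API proposal):

// Hypothetical caller-side change: both iter and sanityIter would then be
// created from the same consistent point-in-time view.
snap := eng.NewSnapshot()
defer snap.Close()
it := engineccl.NewMVCCIncrementalIterator(snap, engineccl.IterOptions{
	StartTime:  startTime,
	EndTime:    endTime,
	UpperBound: endKey,
})
defer it.Close()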

@nvanbenschoten

nvanbenschoten commented Mar 12, 2019

grab the sanityIter before grabbing the iter

diff --git a/pkg/ccl/storageccl/engineccl/mvcc.go b/pkg/ccl/storageccl/engineccl/mvcc.go
index 48fa07b..61e01d1 100644
--- a/pkg/ccl/storageccl/engineccl/mvcc.go
+++ b/pkg/ccl/storageccl/engineccl/mvcc.go
@@ -85,6 +85,9 @@ func NewMVCCIncrementalIterator(e engine.Reader, opts IterOptions) *MVCCIncremen
        return &MVCCIncrementalIterator{
                e:          e,
                upperBound: opts.UpperBound,
+               sanityIter: e.NewIterator(engine.IterOptions{
+                       UpperBound: opts.UpperBound,
+               }),
                iter: e.NewIterator(engine.IterOptions{
                        // The call to startTime.Next() converts our exclusive start bound into
                        // the inclusive start bound that MinTimestampHint expects. This is
@@ -231,11 +234,6 @@ func (i *MVCCIncrementalIterator) advance() {
 // sees that exact key. Otherwise, it returns false. It's used in the workaround
 // in `advance` for a time-bound iterator bug.
 func (i *MVCCIncrementalIterator) sanityCheckMetadataKey() ([]byte, bool, error) {
-       if i.sanityIter == nil {
-               // The common case is that we'll won't need the sanityIter for a given
-               // MVCCIncrementalIterator, so create it lazily.
-               i.sanityIter = i.e.NewIterator(engine.IterOptions{UpperBound: i.upperBound})
-       }
        unsafeKey := i.iter.UnsafeKey()
        i.sanityIter.Seek(unsafeKey)
        if ok, err := i.sanityIter.Valid(); err != nil {

Yeah, this fixes it, but the reason why is pretty subtle. By creating the sanityIter (non-tbi) before iter (tbi), we ensure that the only way for the two iterators to see different intents on a key is if the iter intent has a timestamp outside of the endTime (the timestamp at which it acquired its read latch), which it will ignore when it iterates to the next key (the provisional kv). Doing this aligns the ambiguity of the phantom intent problem of tbis with the ambiguity of the multiple-iterator race. We then rely on the sanityIter to distinguish between these cases and collapse the ambiguity.

@danhhz

danhhz commented Mar 12, 2019

I'm still wrapping my head around this, but great find! I think I've been assuming that all iterators created from the engine.ReadWriter that is passed to evalExport would see the exact same thing (like they would if they were attached to a RocksDB snapshot), but now that it's pointed out to me, I realize of course that's not true.

Yes, this will affect all users of MVCCIncrementalIterator, which is currently both full and incremental backup as well as poller-based changefeeds and the initial scan and tabledesc history of any changefeed. There is a bug that makes full backup and changefeed initial scan use tbis, but it's an easy fix. I've also just proposed that incremental backup also stop using tbis (see #35671). tabledesc history polling also shouldn't need tbis, which leaves only poller-based changefeeds.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 13, 2019
…ents

Fixes cockroachdb#34819.

349ff61 introduced a workaround for cockroachdb#28358 into MVCCIncrementalIterator.
This workaround created a second (non-time-bound) iterator to verify
possibly-phantom MVCCMetadata keys during iteration.

We found in cockroachdb#34819 that it is necessary for correctness that sanityIter
be created before the original iter. This is because the provided Reader
that both iterators are created from may not be a consistent snapshot, so
the two iterators could end up observing different information. The hack
around sanityCheckMetadataKey only works properly if all possible
discrepancies between the two iterators lead to intents and values falling
outside of the timestamp range **from the time-bound iterator's perspective**.
This allows us to simply ignore discrepancies that we notice in advance().

This commit makes this change. It also adds a test that failed regularly
before the fix under stress and no longer fails after the fix.

Release note: None
craig bot pushed a commit that referenced this issue Mar 13, 2019
35300: sql: add checks after all referenced columns have been backfilled r=lucy-zhang a=lucy-zhang

Previously, if a column was added with a check constraint that also referenced
another column that was public, writes to that public column would erroneously
fail (and, in the worst case, result in a panic) while the column being added
was not yet public. With this change, the schema changer will now wait to add
the check constraint to the table descriptor until after all the columns that
were added in the same transaction have been backfilled.

A new optional field has been added to `ConstraintToValidate` for the check
constraint itself, so that it can be added to the table descriptor at the
correct step in the schema change process.

I ended up adding this field to the existing mutation instead of creating a new
type of mutation to add the constraint to the descriptor, since it ultimately
seemed to me that a mutation that simply adds a schema element in its backfill
step would be too inconsistent with what mutations are, especially since all
the mutations for a single transaction are basically executed at the same time.
To support NOT VALID in the future, we could add more flags to the protobuf to
indicate that either the addition of the constraint or the validation should be
skipped, so that they can be executed separately.

Fixes #35258, fixes #35193

Release note: None

35682: engineccl/mvcc: fix time-bound iterator's interaction with moving intents r=nvanbenschoten a=nvanbenschoten

Fixes #34819.


35689: roachtest: add large node kv tests and batching kv tests r=nvanbenschoten a=nvanbenschoten

This commit adds support for running `kv` with a `--batch` parameter. It
then adds the following new roachtest configurations:
- kv0/enc=false/nodes=3/batch=16
- kv95/enc=false/nodes=3/batch=16
- kv0/enc=false/nodes=3/cpu=96
- kv95/enc=false/nodes=3/cpu=96
- kv50/enc=false/nodes=4/cpu=96/batch=64

The last test is currently skipped because of #34241. I confirmed that
it triggers the corresponding assertion on both AWS and GCE.

My request for more m5d.24xlarge quota just succeeded, but I may need to
request more quota for n1-highcpu-96 VMs for these to run nightly.

Release note: None

Co-authored-by: Lucy Zhang <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot closed this as completed in #35682 Mar 13, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 20, 2019
danhhz added a commit to danhhz/cockroach that referenced this issue Mar 26, 2019
Time-bound iterators are a performance optimization that allows us to
entirely skip over sstables in RocksDB that don't have data relevant to
the time bounds in a request.

This can have a dramatic impact on performance, but we've seen a number
of extremely subtle and hard to detect correctness issues with this (see
issues cockroachdb#28358 cockroachdb#34819). As a result, we've decided to skip the
optimization everywhere that it isn't absolutely necessary for the
feature to work (leaving one place: poller-based changefeeds, which are
being phased out anyway). This will both give increased confidence in
correctness as well as eliminate any need to consider and investigate
time-bound iterators when/if someone hits a correctness bug.

This commit introduces the plumbing necessary for an individual
ExportRequest to control whether time-bound iterators are allowed or
disallowed.

A bool is introduced to the ExportRequest proto that explicitly allows
time-bound iterators. This means that in a rolling upgrade, it's
possible for an old changefeed-poller node to send a request without the
field set to a new node, which sees the unset field as false, disabling
the optimization. An alternative is to invert the semantics of the bool
(zero-value = enable, true = disable the optimization), but in case any
new uses of ExportRequest are introduced, I'd much prefer the zero-value
of this field be the safer default of disabled.

As part of the investigation for whether we could turn them off for
incremental BACKUP (cockroachdb#35671), I reran some of the initial measurements of
time-bound iterator impact on incremental backup. I installed tpcc-1000
on a 3 node n1-standard-16 cluster, then ran a full backup, then ran
load for 1 hour, noting the time T. With load running, I ran 6
incremental backups from T alternating between tbi and no-tbi:

    tbi incremental backup runtimes: 3m45s,3m56s,4m6s
    no-tbi incremental backup runtimes: 6m45s,6m27s,6m48s

Any impact on normal traffic (latencies etc) seemed to be in the noise.

Closes cockroachdb#35671.

Release note (enterprise change): In exchange for increased correctness
confidence, `BACKUP`s using `INCREMENTAL FROM` now take slightly longer.
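As a rough sketch of the poller-side plumbing this commit message describes, only the changefeed poller would opt in to the optimization; the ExportRequest field name below is illustrative, not taken verbatim from the diff:

// Hypothetical poller-side request construction; only the changefeed poller
// sets the opt-in bool, so other callers default to the safer non-TBI path.
req := &roachpb.ExportRequest{
	RequestHeader: roachpb.RequestHeader{Key: span.Key, EndKey: span.EndKey},
	StartTime:     startTime,
	MVCCFilter:    roachpb.MVCCFilter_All,
	EnableTimeBoundIteratorOptimization: true, // assumed field name
}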
danhhz added a commit to danhhz/cockroach that referenced this issue Mar 28, 2019
craig bot pushed a commit that referenced this issue Mar 28, 2019
36191: storageccl: disable time-bound iteration optimization in BACKUP r=dt,tbg a=danhhz

Co-authored-by: Daniel Harrison <[email protected]>
danhhz added a commit to danhhz/cockroach that referenced this issue Mar 28, 2019