Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: TestLeaseAtLatestVersion failed under stress #45696

Closed
cockroach-teamcity opened this issue Mar 4, 2020 · 4 comments · Fixed by #52489
Closed

sql: TestLeaseAtLatestVersion failed under stress #45696

cockroach-teamcity opened this issue Mar 4, 2020 · 4 comments · Fixed by #52489
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/d651c17d58e4ce97477e2fa66d04487e9039c1fd

Parameters:

TAGS=
GOFLAGS=-race

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestLeaseAtLatestVersion PKG=github.com/cockroachdb/cockroach/pkg/sql TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1784444&tab=buildLog

I200304 11:07:39.431689 73833 server/server.go:1660  [n1] done ensuring all necessary migrations have run
I200304 11:07:39.431921 73833 server/server.go:1663  [n1] serving sql connections
I200304 11:07:39.462973 74255 server/server_update.go:67  [n1] no need to upgrade, cluster already at the newest version
I200304 11:07:39.467460 74258 sql/event_log.go:135  [n1] Event: "node_join", target: 1, info: {Descriptor:{NodeID:1 Address:127.0.0.1:35269 Attrs: Locality: ServerVersion:19.1 BuildTag:v19.1.8-10-gd651c17 StartedAt:1583320058201315627 LocalityAddress:[] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} ClusterID:dac366f4-53ce-4978-88a1-245f1d2cf79e StartedAt:1583320058201315627 LastUp:1583320058201315627}
I200304 11:07:39.596351 74295 sql/event_log.go:135  [n1,client=127.0.0.1:35146,user=root] Event: "create_database", target: 52, info: {DatabaseName:t Statement:CREATE DATABASE t User:root}
I200304 11:07:39.596656 74295 sql/event_log.go:135  [n1,client=127.0.0.1:35146,user=root] Event: "create_table", target: 53, info: {TableName:t.public.kv Statement:CREATE TABLE t.public.kv (k CHAR PRIMARY KEY, v CHAR) User:root}
I200304 11:07:39.596736 74295 sql/event_log.go:135  [n1,client=127.0.0.1:35146,user=root] Event: "create_table", target: 54, info: {TableName:t.public."timestamp" Statement:CREATE TABLE t.public."timestamp" (k CHAR PRIMARY KEY, v CHAR) User:root}
I200304 11:07:39.614337 74221 storage/replica_command.go:247  [n1,split,s1,r20/1:/{Table/24-Max}] initiating a split of this range at key /Table/53 [r21] (zone config)
I200304 11:07:39.699771 73833 sql/lease.go:353  publish: descID=53 (kv) version=2 mtime=2020-03-04 11:07:39.69787101 +0000 UTC
I200304 11:07:39.744357 74209 storage/replica_command.go:247  [n1,split,s1,r21/1:/{Table/53-Max}] initiating a split of this range at key /Table/54 [r22] (zone config)
I200304 11:07:39.808205 73833 util/stop/stopper.go:546  quiescing; tasks left:
2      node.Node: batch
1      [async] txnHeartbeater: aborting txn
1      [async] transport racer
1      [async] storage.split: processing replica
1      [async] storage.IntentResolver: cleanup txn records
1      [async] kv.TxnCoordSender: heartbeat loop
1      [async] intent_resolver_ir_batcher
1      [async] intent_resolver_gc_batcher
1      [async] closedts-rangefeed-subscriber
W200304 11:07:39.808229 74271 storage/intentresolver/intent_resolver.go:834  failed to gc transaction record: could not GC completed transaction anchored at /Local/Range/Table/24/RangeDescriptor: node unavailable; try another peer
W200304 11:07:39.809305 74209 internal/client/txn.go:509  [n1,split,s1,r21/1:/{Table/53-Max}] failure aborting transaction: node unavailable; try another peer; abort caused by: node unavailable; try another peer
E200304 11:07:39.809919 74209 storage/queue.go:822  [n1,split,s1,r21/1:/{Table/53-Max}] unable to split [n1,s1,r21/1:/{Table/53-Max}] at key "/Table/54": split at key /Table/54 failed: node unavailable; try another peer
I200304 11:07:39.810775 73833 util/stop/stopper.go:546  quiescing; tasks left:
1      node.Node: batch
1      [async] txnHeartbeater: aborting txn
1      [async] transport racer
1      [async] storage.split: processing replica
1      [async] closedts-rangefeed-subscriber
I200304 11:07:39.812009 73833 util/stop/stopper.go:546  quiescing; tasks left:
1      [async] transport racer
1      [async] storage.split: processing replica
1      [async] closedts-rangefeed-subscriber
I200304 11:07:39.812521 73833 util/stop/stopper.go:546  quiescing; tasks left:
1      [async] transport racer
I200304 11:07:39.855507 73930 kv/transport_race.go:113  transport race promotion: ran 62 iterations on up to 272 requests
--- FAIL: TestLeaseAtLatestVersion (1.92s)
    lease_test.go:1181: pq: restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY): "sql txn" id=689d15d9 key=/Table/54/1/"a"/0 rw=true pri=0.00146949 stat=PENDING epo=0 ts=1583320059.608940278,0 orig=1583320059.608940278,0 max=1583320059.703541547,0 wto=false seq=2


Failed to find issue assignee: 
old friend vivekmenezes is friendless
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Mar 4, 2020
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/39d0f63e3f57fa3cca92fdeda4815130776b0830

Parameters:

TAGS=
GOFLAGS=

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestLeaseAtLatestVersion PKG=github.com/cockroachdb/cockroach/pkg/sql TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1963100&tab=buildLog

I200523 06:16:05.110500 69935 sql/event_log.go:135  [n1,client=127.0.0.1:45680,user=root] Event: "create_database", target: 52, info: {DatabaseName:t Statement:CREATE DATABASE t User:root}
I200523 06:16:05.110526 69935 sql/event_log.go:135  [n1,client=127.0.0.1:45680,user=root] Event: "create_table", target: 53, info: {TableName:t.public.kv Statement:CREATE TABLE t.public.kv (k CHAR PRIMARY KEY, v CHAR) User:root}
I200523 06:16:05.110537 69935 sql/event_log.go:135  [n1,client=127.0.0.1:45680,user=root] Event: "create_table", target: 54, info: {TableName:t.public."timestamp" Statement:CREATE TABLE t.public."timestamp" (k CHAR PRIMARY KEY, v CHAR) User:root}
I200523 06:16:05.111203 69957 storage/replica_command.go:247  [n1,split,s1,r20/1:/{Table/24-Max}] initiating a split of this range at key /Table/53 [r21] (zone config)
I200523 06:16:05.116918 69567 sql/lease.go:353  publish: descID=53 (kv) version=2 mtime=2020-05-23 06:16:05.116823454 +0000 UTC
I200523 06:16:05.119412 69979 storage/replica_command.go:247  [n1,split,s1,r21/1:/{Table/53-Max}] initiating a split of this range at key /Table/54 [r22] (zone config)
I200523 06:16:05.120324 69567 util/stop/stopper.go:546  quiescing; tasks left:
7      [async] storage.IntentResolver: cleanup txn records
1      storage.IntentResolver: processing intents
1      node.Node: batch
1      [async] txnHeartbeater: aborting txn
1      [async] storage.split: processing replica
1      [async] kv.TxnCoordSender: heartbeat loop
1      [async] intent_resolver_ir_batcher
1      [async] intent_resolver_gc_batcher
1      [async] closedts-rangefeed-subscriber
I200523 06:16:05.120379 69567 util/stop/stopper.go:546  quiescing; tasks left:
6      [async] storage.IntentResolver: cleanup txn records
1      storage.IntentResolver: processing intents
1      node.Node: batch
1      [async] txnHeartbeater: aborting txn
1      [async] storage.split: processing replica
1      [async] kv.TxnCoordSender: heartbeat loop
1      [async] intent_resolver_ir_batcher
1      [async] intent_resolver_gc_batcher
1      [async] closedts-rangefeed-subscriber
W200523 06:16:05.120333 69633 storage/intentresolver/intent_resolver.go:834  failed to gc transaction record: could not GC completed transaction anchored at /Table/SystemConfigSpan/Start: node unavailable; try another peer
I200523 06:16:05.120581 69567 util/stop/stopper.go:546  quiescing; tasks left:
2      [async] storage.IntentResolver: cleanup txn records
1      [async] storage.split: processing replica
1      [async] closedts-rangefeed-subscriber
I200523 06:16:05.120608 69567 util/stop/stopper.go:546  quiescing; tasks left:
1      [async] storage.split: processing replica
1      [async] storage.IntentResolver: cleanup txn records
1      [async] closedts-rangefeed-subscriber
W200523 06:16:05.120610 69979 internal/client/txn.go:509  [n1,split,s1,r21/1:/{Table/53-Max}] failure aborting transaction: node unavailable; try another peer; abort caused by: log-range-event: kv/txn_interceptor_heartbeater.go:345: node already quiescing
I200523 06:16:05.120630 69567 util/stop/stopper.go:546  quiescing; tasks left:
1      [async] storage.split: processing replica
1      [async] closedts-rangefeed-subscriber
E200523 06:16:05.120655 69979 storage/queue.go:822  [n1,split,s1,r21/1:/{Table/53-Max}] unable to split [n1,s1,r21/1:/{Table/53-Max}] at key "/Table/54": split at key /Table/54 failed: log-range-event: kv/txn_interceptor_heartbeater.go:345: node already quiescing
I200523 06:16:05.120682 69567 util/stop/stopper.go:546  quiescing; tasks left:
1      [async] closedts-rangefeed-subscriber
--- FAIL: TestLeaseAtLatestVersion (0.16s)
    lease_test.go:1181: pq: restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY): "sql txn" id=35722395 key=/Table/54/1/"a"/0 rw=true pri=0.02037478 stat=PENDING epo=0 ts=1590214565.111134349,0 orig=1590214565.111134349,0 max=1590214565.114950993,0 wto=false seq=2


Failed to find issue assignee: 
old friend vivekmenezes is friendless

@knz
Copy link
Contributor

knz commented Jun 4, 2020

This looks like a legitimate error, the test is unable to deal with a retry error.

cc @ajwerner for triage

@ajwerner ajwerner self-assigned this Jun 4, 2020
@ajwerner
Copy link
Contributor

ajwerner commented Jun 4, 2020

It's a legitimate failure but not a particularly concerning one, will need to handle some retries. Adding to my backlog.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/0c04a92ba19eedd4762ca7feb8361433682f3ded

Parameters:

TAGS=
GOFLAGS=

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestLeaseAtLatestVersion PKG=github.com/cockroachdb/cockroach/pkg/sql TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=2054457&tab=buildLog

I200701 06:06:06.181795 70343 sql/event_log.go:135  [n1,client=127.0.0.1:49388,user=root] Event: "create_table", target: 53, info: {TableName:t.public.kv Statement:CREATE TABLE t.public.kv (k CHAR PRIMARY KEY, v CHAR) User:root}
I200701 06:06:06.181808 70343 sql/event_log.go:135  [n1,client=127.0.0.1:49388,user=root] Event: "create_table", target: 54, info: {TableName:t.public."timestamp" Statement:CREATE TABLE t.public."timestamp" (k CHAR PRIMARY KEY, v CHAR) User:root}
I200701 06:06:06.183013 70282 storage/replica_command.go:247  [n1,split,s1,r20/1:/{Table/24-Max}] initiating a split of this range at key /Table/53 [r21] (zone config)
I200701 06:06:06.186603 70005 sql/lease.go:353  publish: descID=53 (kv) version=2 mtime=2020-07-01 06:06:06.186525523 +0000 UTC
I200701 06:06:06.189593 69814 storage/replica_command.go:247  [n1,split,s1,r21/1:/{Table/53-Max}] initiating a split of this range at key /Table/54 [r22] (zone config)
I200701 06:06:06.195624 70005 util/stop/stopper.go:546  quiescing; tasks left:
7      [async] storage.IntentResolver: cleanup txn records
2      node.Node: batch
1      storage.IntentResolver: processing intents
1      kv.DistSender: sending partial batch
1      [async] txnHeartbeater: aborting txn
1      [async] storage.split: processing replica
1      [async] kv.TxnCoordSender: heartbeat loop
1      [async] intent_resolver_ir_batcher
1      [async] intent_resolver_gc_batcher
1      [async] closedts-rangefeed-subscriber
I200701 06:06:06.195725 70005 util/stop/stopper.go:546  quiescing; tasks left:
7      [async] storage.IntentResolver: cleanup txn records
2      node.Node: batch
1      storage.IntentResolver: processing intents
1      kv.DistSender: sending partial batch
1      [async] txnHeartbeater: aborting txn
1      [async] storage.split: processing replica
1      [async] kv.TxnCoordSender: heartbeat loop
1      [async] intent_resolver_gc_batcher
1      [async] closedts-rangefeed-subscriber
W200701 06:06:06.195732 70373 storage/intentresolver/intent_resolver.go:834  failed to gc transaction record: could not GC completed transaction anchored at /Table/SystemConfigSpan/Start: node unavailable; try another peer
I200701 06:06:06.195774 70005 util/stop/stopper.go:546  quiescing; tasks left:
6      [async] storage.IntentResolver: cleanup txn records
2      node.Node: batch
1      kv.DistSender: sending partial batch
1      [async] txnHeartbeater: aborting txn
1      [async] storage.split: processing replica
1      [async] closedts-rangefeed-subscriber
W200701 06:06:06.195937 69814 internal/client/txn.go:509  [n1,split,s1,r21/1:/{Table/53-Max}] failure aborting transaction: node unavailable; try another peer; abort caused by: failed to send RPC: sending to all 1 replicas failed; last error: (err: node unavailable; try another peer) <nil>
I200701 06:06:06.195950 70005 util/stop/stopper.go:546  quiescing; tasks left:
1      [async] storage.split: processing replica
1      [async] storage.IntentResolver: cleanup txn records
1      [async] closedts-rangefeed-subscriber
E200701 06:06:06.195982 69814 storage/queue.go:822  [n1,split,s1,r21/1:/{Table/53-Max}] unable to split [n1,s1,r21/1:/{Table/53-Max}] at key "/Table/54": split at key /Table/54 failed: failed to send RPC: sending to all 1 replicas failed; last error: (err: node unavailable; try another peer) <nil>
I200701 06:06:06.196018 70005 util/stop/stopper.go:546  quiescing; tasks left:
1      [async] storage.IntentResolver: cleanup txn records
1      [async] closedts-rangefeed-subscriber
I200701 06:06:06.196050 70005 util/stop/stopper.go:546  quiescing; tasks left:
1      [async] storage.IntentResolver: cleanup txn records
--- FAIL: TestLeaseAtLatestVersion (0.23s)
    lease_test.go:1181: pq: restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY): "sql txn" id=cb65e435 key=/Table/54/1/"a"/0 rw=true pri=0.00953600 stat=PENDING epo=0 ts=1593583566.182203166,0 orig=1593583566.182203166,0 max=1593583566.186730453,0 wto=false seq=2


Failed to find issue assignee: 
old friend vivekmenezes is friendless

craig bot pushed a commit that referenced this issue Sep 16, 2020
52489: sql/catalog/lease: attempt to fix flakey TestLeaseAtLatestVersion r=lucy-zhang a=ajwerner

The error is that the transaction sees a retry. Handle this explicitly.
I haven't actually been able to repro the failure but this will fix
what was in the issue.

Fixes #45696.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@craig craig bot closed this as completed in a4f108a Sep 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants