
roachtest: always use active node in backup schedule injection #97581

Merged (1 commit) on Feb 24, 2023

Conversation

@msbutler (Collaborator) commented Feb 23, 2023

Previously, the scheduled backup command injected on Start() would always run on the first node in the cluster, but if that node was not available, the command would fail. This patch ensures the injection runs on a node that was just started.

Fixes #97558, #97582, #97548, #97562, #97565, #97561

Release note: None
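
To illustrate the idea, here is a minimal, hedged Go sketch (not the actual roachprod code: injectBackupSchedule, connFor, and the abbreviated CREATE SCHEDULE statement are placeholders): instead of always dialing the first node in the cluster, the injection targets one of the nodes that the current Start() call just brought up.

```go
package sketch

import (
	"context"
	gosql "database/sql"
	"errors"
)

// injectBackupSchedule is a hypothetical stand-in for the roachprod logic that
// creates the test-only backup schedule on Start(). The key change sketched
// here: target a node that was just started instead of hard-coding node 1.
func injectBackupSchedule(
	ctx context.Context,
	connFor func(node int) (*gosql.DB, error), // opens a SQL connection to the given node
	startedNodes []int, // the nodes brought up by this Start() call
) error {
	if len(startedNodes) == 0 {
		return errors.New("no nodes were started")
	}
	// Any node in startedNodes is known to be up; previously this always used
	// node 1, which a test may have intentionally left down.
	db, err := connFor(startedNodes[0])
	if err != nil {
		return err
	}
	defer db.Close()
	// Abbreviated schedule statement; IF NOT EXISTS makes re-running it on
	// every Start() idempotent (compare the "already exists, skipping" NOTICE
	// in the logs later in this thread).
	_, err = db.ExecContext(ctx,
		`CREATE SCHEDULE IF NOT EXISTS test_only_backup
		   FOR BACKUP INTO 'gs://bucket/path'
		   RECURRING '*/15 * * * *'`)
	return err
}
```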

@msbutler added the T-testeng TestEng Team label Feb 23, 2023
@msbutler requested a review from renatolabs February 23, 2023 17:53
@msbutler requested a review from a team as a code owner February 23, 2023 17:53
@msbutler self-assigned this Feb 23, 2023
@msbutler requested review from herkolategan and removed request for a team February 23, 2023 17:53
@cockroach-teamcity (Member) commented

This change is Reviewable

@msbutler (Collaborator, Author) commented

@renatolabs I've verified that cdc/mixed-versions passes with this patch!

@renatolabs (Contributor) commented

Once we are confident this fixes all the issues we know of, I think we should kick off a nightly build with this patch to double-check that we don't see on-start backup-related failures anymore. A number of tests in the current nightly run (which is still in progress) have failed (examples: #97582, #97562, #97548).

I also noticed there are at least two error messages: file not found (e.g., #97558) and failed to connect (e.g., #97562).

@renatolabs (Contributor) commented Feb 23, 2023

If it helps, I was testing some changes on multitenant-upgrade, and that test seems to fail pretty consistently with the message below, even after I applied this patch:

Never mind, that test actually fails on master as well (see #97016).

@msbutler force-pushed the butler-restart-sched branch 2 times, most recently from 21db718 to 6e1ddc4 on February 23, 2023 22:14
@msbutler (Collaborator, Author) commented

For reasons that I do not completely understand, I've gotten multitenant-upgrade to pass using the current implementation in this PR. I'll run it a few more times and then spin up a nightly job.

@msbutler (Collaborator, Author) commented

Rats, it's failed on my branch. To prevent other tests from flaking, I will disable scheduled backups on the restart in this test for now.
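
For reference, opting a single test out of the injection might look roughly like the sketch below; the ScheduleBackups field name on the roachprod start options is an assumption, not a confirmed API.

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
	"github.com/cockroachdb/cockroach/pkg/roachprod/install"
)

// restartWithoutScheduledBackup restarts the given node while opting out of
// the scheduled-backup injection. ScheduleBackups is an assumed field name on
// the roachprod start options; the real knob may differ.
func restartWithoutScheduledBackup(
	ctx context.Context, t test.Test, c cluster.Cluster, node int,
) {
	startOpts := option.DefaultStartOpts()
	startOpts.RoachprodOpts.ScheduleBackups = false // skip CREATE SCHEDULE on this Start()
	c.Start(ctx, t.L(), startOpts, install.MakeClusterSettings(), c.Node(node))
}
```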

@renatolabs (Contributor) commented

> it's failed on my branch

Did it fail with an "Is the server running?" message? It might be that crdb itself crashed (and therefore not related to the backup logic). At least that's what I was seeing on my gceworker and also over on #97016.

@msbutler (Collaborator, Author) commented

Yup, it is failing with "Is the server running?".

@msbutler force-pushed the butler-restart-sched branch from 6e1ddc4 to 94d4876 on February 23, 2023 22:42
@msbutler (Collaborator, Author) commented

@renatolabs, is it worth merging this tonight to further limit the blast radius of failures caused by #97495?

@msbutler (Collaborator, Author) commented

It also seems the cdc/mixed-versions test has a different failure mode: the timeout here occurred after the scheduled backup cmd.

00:00:25 cockroach.go:847: SCHEDULED BACKUP 0
NOTICE: schedule %!q(*string=0xc003eb2340) already exists, skipping
00:00:29 versionupgrade.go:164: test status: versionUpgradeTest: starting step 12
00:00:29 mixed_version_cdc.go:162: test status: waiting for 5 resolved timestamps
00:01:42 mixed_version_cdc.go:273: 11 resolved timestamps validated, latest is 10m59.223895475s behind realtime
00:01:42 mixed_version_cdc.go:176: 1 of 5 timestamps resolved
00:03:42 mixed_version_cdc.go:273: 12 resolved timestamps validated, latest is 12m15.741057494s behind realtime
00:03:42 mixed_version_cdc.go:176: 2 of 5 timestamps resolved
I230224 00:05:17.895588 1592 github.com/Shopify/sarama/sarama.go:122  [T1,kafka-producer] 45  client/metadata fetching metadata for all topics from broker 35.245.39.176:9092
00:05:26 test_impl.go:344: test failure #1: full stack retained in failure_1.log: (test_runner.go:989).runTest: test timed out (30m0s)
I230224 00:05:41.138285 1514 github.com/Shopify/sarama/sarama.go:122  [T1,kafka-producer] 46  client/metadata fetching metadata for all topics from broker 35.245.39.176:9092
I230224 00:05:55.405181 1784 github.com/Shopify/sarama/consumer.go:911  [T1,kafka-producer] 47  consumer/broker/0 accumulated 1 new subscriptions
I230224 00:05:55.405331 1785 github.com/Shopify/sarama/consumer.go:951  [T1,kafka-producer] 48  consumer/broker/0 added subscription to bank/0
I230224 00:05:55.405367 1785 github.com/Shopify/sarama/consumer.go:957  [T1,kafka-producer] 49  consumer/broker/0 closed dead subscription to bank/0
I230224 00:05:55.405411 1691 github.com/Shopify/sarama/sarama.go:125  [T1,kafka-producer] 50  Closing Client
I230224 00:05:55.405497 27629 github.com/Shopify/sarama/sarama.go:122  [T1,kafka-producer] 51  Closed connection to broker 35.245.39.176:9092
I230224 00:05:55.405559 27628 github.com/Shopify/sarama/sarama.go:122  [T1,kafka-producer] 52  Closed connection to broker 35.245.39.176:9092
--- FAIL: cdc/mixed-versions (1829.35s)
test artifacts and logs in: artifacts/cdc/mixed-versions/run_1
(test_runner.go:989).runTest: test timed out (30m0s)
(mixed_version_cdc.go:270).1: read tcp 10.142.0.85:33362 -> 34.86.66.147:26257: read: connection reset by peer
00:05:56 test_runner.go:726: [w0] test failed: cdc/mixed-versions (run 1)

@renatolabs (Contributor) commented Feb 24, 2023

> It also seems the cdc/mixed-versions test has a different failure mode

This is probably because of the location of the machine you ran this test on (see [1]).

> is it worth merging this tonight to further limit the blast radius

Yes, let's merge. Could you go over the test failures on yesterday's build and double-check that we're closing all issues related to this, please?

[1]

// teamcityAgentZone is the zone used in this test. Since this test
// runs a lot of queries from the TeamCity agent to CRDB nodes, we
// make sure to create roachprod nodes that are in the same region
// as the TeamCity agents. Note that a change in the TeamCity agent
// region could lead to this test timing out (see #92649 for an
// occurrence of this).
teamcityAgentZone = "us-east4-b"
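
For context, a hedged sketch of how a zone constant like this is typically fed into the test's cluster spec (spec.Zones and the registration details are illustrative assumptions, not a copy of mixed_version_cdc.go):

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/spec"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
)

const teamcityAgentZone = "us-east4-b"

// registerCDCMixedVersionsSketch shows how the zone constant is fed into the
// cluster spec so the roachprod nodes land in the same region as the TeamCity
// agents. spec.Zones and the body of Run are illustrative assumptions.
func registerCDCMixedVersionsSketch(r registry.Registry) {
	r.Add(registry.TestSpec{
		Name:    "cdc/mixed-versions",
		Owner:   registry.OwnerCDC,
		Cluster: r.MakeClusterSpec(5, spec.Zones(teamcityAgentZone)),
		Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
			// ... the actual test body lives in mixed_version_cdc.go ...
		},
	})
}
```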

@msbutler (Collaborator, Author) commented

Thanks for the help here!

bors r=renatolabs

@craig bot (Contributor) commented Feb 24, 2023

Build succeeded:

Labels: T-testeng TestEng Team
Successfully merging this pull request may close these issues: roachtest: cdc/mixed-versions failed
3 participants