
roachtest: always use active node in backup schedule injection #97581

Merged (1 commit) on Feb 24, 2023

Conversation

@msbutler (Collaborator) commented Feb 23, 2023

Previously, the scheduled backup command injected on Start() would always run on the first node in the cluster, but if that node was not available, the command would fail. This patch ensures the injection runs on a node that was just started.

Fixes #97558, #97582, #97548, #97562, #97565, #97561

Release note: None
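
To illustrate the idea, here is a minimal, hedged Go sketch (not the actual roachprod code: injectBackupSchedule, connFor, and the abbreviated CREATE SCHEDULE statement are placeholders): instead of always dialing the first node in the cluster, the injection targets one of the nodes that the current Start() call just brought up.

```go
package sketch

import (
	"context"
	gosql "database/sql"
	"errors"
)

// injectBackupSchedule is a hypothetical stand-in for the roachprod logic that
// creates the test-only backup schedule on Start(). The key change sketched
// here: target a node that was just started instead of hard-coding node 1.
func injectBackupSchedule(
	ctx context.Context,
	connFor func(node int) (*gosql.DB, error), // opens a SQL connection to the given node
	startedNodes []int, // the nodes brought up by this Start() call
) error {
	if len(startedNodes) == 0 {
		return errors.New("no nodes were started")
	}
	// Any node in startedNodes is known to be up; previously this always used
	// node 1, which a test may have intentionally left down.
	db, err := connFor(startedNodes[0])
	if err != nil {
		return err
	}
	defer db.Close()
	// Abbreviated schedule statement; IF NOT EXISTS makes re-running it on
	// every Start() idempotent (compare the "already exists, skipping" NOTICE
	// in the logs later in this thread).
	_, err = db.ExecContext(ctx,
		`CREATE SCHEDULE IF NOT EXISTS test_only_backup
		   FOR BACKUP INTO 'gs://bucket/path'
		   RECURRING '*/15 * * * *'`)
	return err
}
```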

@msbutler added the T-testeng TestEng Team label Feb 23, 2023
@msbutler requested a review from renatolabs February 23, 2023 17:53
@msbutler requested a review from a team as a code owner February 23, 2023 17:53
@msbutler self-assigned this Feb 23, 2023
@msbutler requested review from herkolategan and removed request for a team February 23, 2023 17:53
@cockroach-teamcity (Member) commented

This change is Reviewable

@msbutler (Collaborator, Author) commented

@renatolabs I've verified that cdc/mixed-versions passes with this patch!

@renatolabs (Contributor) commented

Once we are confident this fixes all the issues we know of, I think we should kick off a nightly build with this patch to double-check that we don't see on-start backup-related failures anymore. A number of tests in the current nightly run (which is still in progress) have failed (examples: #97582, #97562, #97548).

I also noticed there are at least two error messages: file not found (e.g., #97558) and failed to connect (e.g., #97562).

@renatolabs (Contributor) commented Feb 23, 2023

If it helps, I was testing some changes on multitenant-upgrade, and that test seems to fail pretty consistently with the message below, even after I applied this patch:

Never mind, that test actually fails on master as well (see #97016).

@msbutler force-pushed the butler-restart-sched branch 2 times, most recently from 21db718 to 6e1ddc4 on February 23, 2023 22:14
@msbutler (Collaborator, Author) commented

For reasons that I do not completely understand, I've gotten multitenant-upgrade to pass using the current implementation in this PR. I'll run it a few more times and then spin up a nightly job.

@msbutler (Collaborator, Author) commented

Rats, it's failed on my branch. To prevent other tests from flaking, I will disable scheduled backups on the restart in this test for now.
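
For reference, opting a single test out of the injection might look roughly like the sketch below; the ScheduleBackups field name on the roachprod start options is an assumption, not a confirmed API.

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
	"github.com/cockroachdb/cockroach/pkg/roachprod/install"
)

// restartWithoutScheduledBackup restarts the given node while opting out of
// the scheduled-backup injection. ScheduleBackups is an assumed field name on
// the roachprod start options; the real knob may differ.
func restartWithoutScheduledBackup(
	ctx context.Context, t test.Test, c cluster.Cluster, node int,
) {
	startOpts := option.DefaultStartOpts()
	startOpts.RoachprodOpts.ScheduleBackups = false // skip CREATE SCHEDULE on this Start()
	c.Start(ctx, t.L(), startOpts, install.MakeClusterSettings(), c.Node(node))
}
```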

@renatolabs (Contributor) commented

> it's failed on my branch

Did it fail with an "Is the server running?" message? It might be that crdb itself crashed (and therefore not related to the backup logic). At least that's what I was seeing on my gceworker and also over on #97016.

@msbutler (Collaborator, Author) commented

Yup, it is failing with "Is the server running?".

@msbutler force-pushed the butler-restart-sched branch from 6e1ddc4 to 94d4876 on February 23, 2023 22:42
@msbutler (Collaborator, Author) commented

@renatolabs, is it worth merging this tonight to further limit the blast radius of failures caused by #97495?

@msbutler (Collaborator, Author) commented

It also seems the cdc/mixed-versions test has a different failure mode: the timeout here occurred after the scheduled backup cmd.

00:00:25 cockroach.go:847: SCHEDULED BACKUP 0
NOTICE: schedule %!q(*string=0xc003eb2340) already exists, skipping
00:00:29 versionupgrade.go:164: test status: versionUpgradeTest: starting step 12
00:00:29 mixed_version_cdc.go:162: test status: waiting for 5 resolved timestamps
00:01:42 mixed_version_cdc.go:273: 11 resolved timestamps validated, latest is 10m59.223895475s behind realtime
00:01:42 mixed_version_cdc.go:176: 1 of 5 timestamps resolved
00:03:42 mixed_version_cdc.go:273: 12 resolved timestamps validated, latest is 12m15.741057494s behind realtime
00:03:42 mixed_version_cdc.go:176: 2 of 5 timestamps resolved
I230224 00:05:17.895588 1592 github.com/Shopify/sarama/sarama.go:122  [T1,kafka-producer] 45  client/metadata fetching metadata for all topics from broker 35.245.39.176:9092
00:05:26 test_impl.go:344: test failure #1: full stack retained in failure_1.log: (test_runner.go:989).runTest: test timed out (30m0s)
I230224 00:05:41.138285 1514 github.com/Shopify/sarama/sarama.go:122  [T1,kafka-producer] 46  client/metadata fetching metadata for all topics from broker 35.245.39.176:9092
I230224 00:05:55.405181 1784 github.com/Shopify/sarama/consumer.go:911  [T1,kafka-producer] 47  consumer/broker/0 accumulated 1 new subscriptions
I230224 00:05:55.405331 1785 github.com/Shopify/sarama/consumer.go:951  [T1,kafka-producer] 48  consumer/broker/0 added subscription to bank/0
I230224 00:05:55.405367 1785 github.com/Shopify/sarama/consumer.go:957  [T1,kafka-producer] 49  consumer/broker/0 closed dead subscription to bank/0
I230224 00:05:55.405411 1691 github.com/Shopify/sarama/sarama.go:125  [T1,kafka-producer] 50  Closing Client
I230224 00:05:55.405497 27629 github.com/Shopify/sarama/sarama.go:122  [T1,kafka-producer] 51  Closed connection to broker 35.245.39.176:9092
I230224 00:05:55.405559 27628 github.com/Shopify/sarama/sarama.go:122  [T1,kafka-producer] 52  Closed connection to broker 35.245.39.176:9092
--- FAIL: cdc/mixed-versions (1829.35s)
test artifacts and logs in: artifacts/cdc/mixed-versions/run_1
(test_runner.go:989).runTest: test timed out (30m0s)
(mixed_version_cdc.go:270).1: read tcp 10.142.0.85:33362 -> 34.86.66.147:26257: read: connection reset by peer
00:05:56 test_runner.go:726: [w0] test failed: cdc/mixed-versions (run 1)

@renatolabs (Contributor) commented Feb 24, 2023

> It also seems the cdc/mixed-versions test has a different failure mode

This is probably because of the location of the machine you ran this test on (see [1]).

> is it worth merging this tonight to further limit the blast radius

Yes, let's merge. Could you go over the test failures on yesterday's build and double-check that we're closing all issues related to this, please?

[1]

// teamcityAgentZone is the zone used in this test. Since this test
// runs a lot of queries from the TeamCity agent to CRDB nodes, we
// make sure to create roachprod nodes that are in the same region
// as the TeamCity agents. Note that a change in the TeamCity agent
// region could lead to this test timing out (see #92649 for an
// occurrence of this).
teamcityAgentZone = "us-east4-b"
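
For context, a hedged sketch of how a zone constant like this is typically fed into the test's cluster spec (spec.Zones and the registration details are illustrative assumptions, not a copy of mixed_version_cdc.go):

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/spec"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
)

const teamcityAgentZone = "us-east4-b"

// registerCDCMixedVersionsSketch shows how the zone constant is fed into the
// cluster spec so the roachprod nodes land in the same region as the TeamCity
// agents. spec.Zones and the body of Run are illustrative assumptions.
func registerCDCMixedVersionsSketch(r registry.Registry) {
	r.Add(registry.TestSpec{
		Name:    "cdc/mixed-versions",
		Owner:   registry.OwnerCDC,
		Cluster: r.MakeClusterSpec(5, spec.Zones(teamcityAgentZone)),
		Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
			// ... the actual test body lives in mixed_version_cdc.go ...
		},
	})
}
```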

@msbutler (Collaborator, Author) commented

Thanks for the help here!

bors r=renatolabs

@craig bot (Contributor) commented Feb 24, 2023

Build succeeded:

Labels: T-testeng TestEng Team
Successfully merging this pull request may close these issues: roachtest: cdc/mixed-versions failed
3 participants