Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: Add sstable-corruption roachtest #70845

Merged
merged 1 commit into from
Oct 18, 2021

Conversation

itsbilal
Copy link
Member

Adds a new roachtest, sstable-corruption/table, that
imports TPCC, then finds an sstable that contains
table keys (which are replicated) on one node, then
calls dd to write random bytes into the middle
of them. It then checks that the corrupt node
either crashes on restart, or crashes after the
tpcc workload is run on it.

Informs #67568.

Release note: None.

@itsbilal itsbilal requested a review from a team September 28, 2021 21:10
@itsbilal itsbilal self-assigned this Sep 28, 2021
@itsbilal itsbilal requested a review from a team as a code owner September 28, 2021 21:10
@itsbilal itsbilal requested review from stevendanna and removed request for a team September 28, 2021 21:10
@cockroach-teamcity
Copy link
Member

This change is Reviewable

pkg/cmd/roachtest/tests/sstable_corruption.go Outdated Show resolved Hide resolved
pkg/cmd/roachtest/tests/sstable_corruption.go Outdated Show resolved Hide resolved
pkg/cmd/roachtest/tests/sstable_corruption.go Outdated Show resolved Hide resolved
pkg/cmd/roachtest/tests/sstable_corruption.go Outdated Show resolved Hide resolved
@itsbilal itsbilal force-pushed the sstable-corruption-roachtest branch from d2bfa7a to 2fd4810 Compare October 14, 2021 19:19
@itsbilal
Copy link
Member Author

TFTR! I did the monitor refactor and got rid of the channel. The test also passes more reliably now, as we're no longer tied to the more finicky output of NumExpectedDeaths.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: after comments

Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal and @stevendanna)


pkg/cmd/roachtest/tests/sstable_corruption.go, line 46 at r2 (raw file):

			return nil
		})
		m.Wait()

Could you indent this whole block so that m is in a nested scope?

{
  m := ...
  ...
  m.Wait()
}

This prevents reuse of monitors, which has lots of snags (actually using just a single monitor already unfortunately has snags :sadface:)


pkg/cmd/roachtest/tests/sstable_corruption.go, line 47 at r2 (raw file):

		})
		m.Wait()
		m.ExpectDeaths(3)

This line isn't doing anything, since you're not using this m any more and it also isn't watching the nodes any more (since Wait() returned).


pkg/cmd/roachtest/tests/sstable_corruption.go, line 88 at r2 (raw file):

		m.Go(func(ctx context.Context) error {
			_ = c.RunE(ctx, workloadNode, "./cockroach workload run tpcc --warehouses=100 "+
				fmt.Sprintf("--tolerate-errors --duration=%s", 2*time.Minute))

nit: unusual use of Sprintf here, why don't you Sprintf the entire args if you're going to do it to part of it?


pkg/cmd/roachtest/tests/sstable_corruption.go, line 89 at r2 (raw file):

			_ = c.RunE(ctx, workloadNode, "./cockroach workload run tpcc --warehouses=100 "+
				fmt.Sprintf("--tolerate-errors --duration=%s", 2*time.Minute))
			return nil

// Don't return an error from the workload. We want outcome of WaitE to be determined by the monitor noticing that a node died. The workload may also fail, despite --tolerate-errors, if a node crashes too early.


pkg/cmd/roachtest/tests/sstable_corruption.go, line 92 at r2 (raw file):

		})
		require.Error(t, m.WaitE())
		_ = c.WipeE(ctx, t.L(), c.Node(corruptNode))

// Exempt corrupted node from roachtest harness' post-test liveness checks.

Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/cmd/roachtest/tests/sstable_corruption.go, line 46 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Could you indent this whole block so that m is in a nested scope?

{
  m := ...
  ...
  m.Wait()
}

This prevents reuse of monitors, which has lots of snags (actually using just a single monitor already unfortunately has snags :sadface:)

I was already putting all uses of m in this one nested scope, but I outdented the lines below that aren't using m to make it clear. Hope that works!

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/cmd/roachtest/tests/sstable_corruption.go, line 46 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I was already putting all uses of m in this one nested scope, but I outdented the lines below that aren't using m to make it clear. Hope that works!

The goal is that m goes out of scope once m.Wait() returns. Otherwise I'm just nervously scanning everything after that that has m visible for errant uses.

@itsbilal itsbilal force-pushed the sstable-corruption-roachtest branch from 2fd4810 to 607db33 Compare October 14, 2021 20:23
@itsbilal
Copy link
Member Author

Oops, forgot a push. Here we go.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @stevendanna)

@itsbilal itsbilal force-pushed the sstable-corruption-roachtest branch from 607db33 to 5e0f68e Compare October 15, 2021 20:25
@itsbilal
Copy link
Member Author

TFTR! Latest push now corrupts 6 table SSTs on every node, not just on one node. Should eliminate the flakiness.

@itsbilal itsbilal force-pushed the sstable-corruption-roachtest branch from 5e0f68e to 022a0c4 Compare October 18, 2021 14:46
Adds a new roachtest, sstable-corruption/table, that
imports TPCC, then finds an sstable that contains
table keys (which are replicated) on one node, then
calls `dd` to write random bytes into the middle
of them. It then checks that the corrupt node
either crashes on restart, or crashes after the
tpcc workload is run on it.

Informs cockroachdb#67568.

Release note: None.
@itsbilal itsbilal force-pushed the sstable-corruption-roachtest branch from 022a0c4 to 8f7f7bf Compare October 18, 2021 15:51
@itsbilal
Copy link
Member Author

bors r=tbg

@craig
Copy link
Contributor

craig bot commented Oct 18, 2021

Build succeeded:

@craig craig bot merged commit 7669013 into cockroachdb:master Oct 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants