
cdc: Unskip TestChangefeedNodeShutdown #32232

Closed · mrtracy opened this issue Nov 12, 2018 · 9 comments · Fixed by #99934
Labels
A-cdc Change Data Capture branch-master Failures and bugs on the master branch. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). GA-blocker skipped-test T-cdc
Milestone: 2.2

Comments

mrtracy (Contributor) commented Nov 12, 2018

TestChangefeedNodeShutdown is a new test which is partially complete, but has been checked in Skip()ed due to unfortunate flakiness issues, of which there are several:

  • Primarily, while the job is usually picked up by another node, the test occasionally takes several minutes to complete, and may never complete at all; this is the primary flake causing me to pause this work.
  • SQL statements issued after the node shutdown sometimes return an ambiguous result. This could possibly be fixed with a succeedsSoon, but that might further exacerbate the nondeterministic nature of the test.
  • The output of the changefeed is not deterministic across the shutdown; there may be duplicates. This means it is not appropriate to use the assertPayloads() infrastructure, and the test will instead require a verifier setup.
  • Finally, it is my intuition that these three issues may be combining to mask other existing flakes.
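To illustrate the third point, here is a minimal, hypothetical sketch of what a duplicate-tolerant verifier might do instead of exact payload matching: since a node restart gives at-least-once delivery, the verifier compares the deduplicated, sorted set of emitted payloads against expectations. The `payload`/`dedupe` names are illustrative, not the real test infrastructure.

```go
package main

import (
	"fmt"
	"sort"
)

// payload is a hypothetical (key, value) pair emitted by a changefeed.
type payload struct {
	Key, Value string
}

// dedupe collapses duplicate emissions. At-least-once delivery means the
// same payload may appear more than once after a restart, so a verifier
// should compare the deduplicated, sorted set against expectations rather
// than the raw stream, which exact assertPayloads-style matching cannot
// tolerate.
func dedupe(raw []payload) []payload {
	seen := make(map[payload]bool)
	var out []payload
	for _, p := range raw {
		if !seen[p] {
			seen[p] = true
			out = append(out, p)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}

func main() {
	// A node restart may cause the same rows to be re-emitted.
	raw := []payload{
		{"foo/0", "initial"},
		{"foo/0", "initial"}, // duplicate after restart
		{"foo/3", "third"},
	}
	fmt.Println(dedupe(raw))
}
```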

This unit test was intended as a faster-running replacement for the cdc/crdbChaos roachtest, so this functionality is not going untested. We are logging this follow-up issue and skipping the test so that we get this needed (and verified) functionality checked in.

Jira issue: CRDB-4756

@knz knz added C-test-failure Broken test (automatically or manually discovered). A-cdc Change Data Capture labels Nov 26, 2018
@knz knz added this to the 2.2 milestone Nov 26, 2018
mrtracy (Contributor, author) commented Dec 20, 2018

I've been heads-down on this for a few days, here's what I've been working towards:

  • The first thing that is verified (and needed for additional verification) is the "availability history" of the cluster, which is synthesized from cluster events and the current node liveness values. This gives a point-in-time picture of how many nodes were available at any timestamp in the cluster. The history itself can be verified; for instance, it may be required that there is never more than one node down at a time. However, this history can also be used to adjust expectations elsewhere in the verification.

  • The primary verification is to query different time series values across the whole history of the cluster.

    • We break down time into fixed periods, and verify that each period meets expectations; we adjust those expectations according to the values in the availability history; for example, how many nodes there are, how many nodes are down, whether we are near the beginning of the test, etc.
    • The time series being verified can be configured; or rather, we will modularize the verification so that we only select modules that are appropriate. For example, if we are measuring an expected txn/s target, we might adjust that based on the workload that is being run against the cluster.
    • Initial time series being verified will be per-machine latencies and manually configurable txn/s based on workload.
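The period-based scheme above can be sketched roughly as follows. This is a standalone illustration under assumed names (`sample`, `verifyPeriods`, `nodesUp` are all hypothetical): time is broken into fixed periods, and each period's mean txn/s is checked against a per-node target scaled by how many nodes the availability history says were up.

```go
package main

import "fmt"

// sample is a hypothetical (timestamp, txn/s) time series point.
type sample struct {
	TS    int64   // seconds since test start
	TxnPS float64 // observed transactions per second
}

// verifyPeriods splits the samples into fixed-width periods and checks each
// period's mean against a per-node target scaled by the number of nodes the
// availability history reports as up at that time.
func verifyPeriods(samples []sample, periodSec int64, perNodeTarget float64, nodesUp func(ts int64) int) []string {
	byPeriod := map[int64][]float64{}
	for _, s := range samples {
		byPeriod[s.TS/periodSec] = append(byPeriod[s.TS/periodSec], s.TxnPS)
	}
	var failures []string
	for period, vals := range byPeriod {
		var sum float64
		for _, v := range vals {
			sum += v
		}
		mean := sum / float64(len(vals))
		// Adjust the expectation by the number of available nodes.
		expected := perNodeTarget * float64(nodesUp(period*periodSec))
		if mean < expected*0.5 { // arbitrary 50% slack for restarts, ramp-up, etc.
			failures = append(failures, fmt.Sprintf("period %d: mean %.1f < %.1f", period, mean, expected*0.5))
		}
	}
	return failures
}

func main() {
	// Availability history: one node down between t=30s and t=60s.
	nodesUp := func(ts int64) int {
		if ts >= 30 && ts < 60 {
			return 2
		}
		return 3
	}
	samples := []sample{{10, 290}, {40, 205}, {70, 310}}
	fmt.Println(verifyPeriods(samples, 30, 100, nodesUp))
}
```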

A third angle of verification would be to verify the replication history; that is, make sure replica counts make sense, replicas are spread evenly, and that there isn't excess replication. This might be possible through time series, though, if we are unconcerned with specifics. However, I am leaving this out of my V1 product.

@knz knz unassigned mrtracy Jan 13, 2019
tbg (Member) commented Feb 14, 2019

@danhhz can you clarify if this test provides anything in addition to the cdc-chaos roachtests? This sounds like we're mostly shaving yaks.

danhhz (Contributor) commented Feb 14, 2019

My intention was to replace the cdc/crdb-chaos roachtest. Roachtests are necessary for testing large and long-running things, but when something can be a TestCluster-based unit test, I always prefer that. Unit tests are certainly easier while developing something, but in this case it's mostly because Roachtests flake at a dramatically higher rate, and when that ongoing load can be avoided, I'd like to.

danhhz (Contributor) commented Feb 14, 2019

Also, I just remembered that I'd like to solve some of these anyway. I'm about to start working on a jepsen-style test for changefeeds. The overall correctness of changefeeds depends on them maintaining our user-guaranteed invariants across anything we throw at them: crdb chaos, sink chaos, schema changes, job pause/resume, etc. There are a huge number of moving pieces that all have to work together exactly right, and this is currently under-tested. Being able to do this in a go test with TestCluster would be huge.

tbg (Member) commented Feb 14, 2019

I agree that it's worthwhile but my main concern is that getting these tests to run fast enough to run on every CI is not worth the hacks. They seem to live at a stage between roachtests and unit tests, where we want something that's really easy to run but we'd rather not run it in CI (but, say, in a nightly stress test).

I think if we skipped the test if !testutils.NightlyStress() we'd get the desired behavior (or you can run it at any cadence you like, just have to set up the jobs the right way). This could also be the pattern used for this jepsen-style suite you're working on. Thoughts?

var stress = envutil.EnvOrDefaultBool("COCKROACH_NIGHTLY_STRESS", false)

// NightlyStress returns true iff the process is running as part of
// CockroachDB's nightly stress tests.
func NightlyStress() bool {
	return stress
}

danhhz (Contributor) commented Feb 14, 2019

I'm certainly amenable to the argument, but I don't see how it applies to this case? What about this test makes you worried that it will take too long to run on every CI?

@tbg tbg self-assigned this Feb 19, 2019
tbg (Member) commented May 7, 2019

Looking at this test now. Right now it fails on pretty much every run. I'll look into why that is.

tbg (Member) commented May 7, 2019

It doesn't fail for any CDC-related reasons. Basically, after shutting down the first node, things fall apart. I can reduce it to:

func TestChangefeedNodeShutdown(t *testing.T) {
	defer leaktest.AfterTest(t)()

	args := base.TestServerArgs{
		UseDatabase: "d",
	}

	tc := serverutils.StartTestCluster(t, 3, base.TestClusterArgs{
		ServerArgs: args,
	})
	defer tc.Stopper().Stop(context.Background())

	db := tc.ServerConn(1)
	sqlDB := sqlutils.MakeSQLRunner(db)

	sqlDB.Exec(t, `CREATE DATABASE d`)
	sqlDB.Exec(t, `CREATE TABLE foo (a INT PRIMARY KEY, b STRING)`)
	sqlDB.Exec(t, `INSERT INTO foo VALUES (0, 'initial')`)

	// Give the cluster time to upreplicate before stopping a node.
	time.Sleep(10 * time.Second)
	tc.StopServer(0)

	sqlDB.Exec(t, `UPSERT INTO foo VALUES(0, 'updated')`)
	sqlDB.Exec(t, `INSERT INTO foo VALUES (3, 'third')`)
}
--- FAIL: TestChangefeedNodeShutdown (24.95s)
    changefeed_test.go:1741: error executing 'UPSERT INTO foo VALUES(0, 'updated')': pq: result is ambiguous (error=failed to connect to n1 at 127.0.0.1:59716: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:59716: connect: connection refused" [exhausted])
FAIL

There are plenty of failed heartbeats before that. I'm not quite sure why; the ranges should be upreplicated at this point. We'll see.

@tbg tbg added the branch-master Failures and bugs on the master branch. label Jan 22, 2020
@tbg tbg removed their assignment May 31, 2022
@shermanCRL shermanCRL added GA-blocker branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 labels Mar 8, 2023
miretskiy (Contributor) commented:

@shermanCRL why GA-blocker on an issue that's 5 years old?

craig bot pushed a commit that referenced this issue Apr 3, 2023
99934: changefeedccl: Remove skipped tests that decayed over time r=miretskiy a=miretskiy

Fixes #32232
Remove TestChangefeedNodeShutdown. This test has been disabled since 2018; other tests exist (e.g. `TestChangefeedHandlesDrainingNodes`) that verify restart behavior.

Fixes #51842
Remove the BenchmarkChangefeedTicks benchmark. This benchmark has been skipped since 2019. Attempts could be made to revive it; however, it carried a lot of code that accomplished questionable goals. The benchmark itself was unrepresentative (it used dependency injection), too small to be meaningful (1000 rows), and would most likely be too noisy and inconclusive. We have added other microbenchmarks over time, and we conduct large-scale testing, including with roachtests.

Release note: None

100342: upgrades: remove migration that waits for schema changes r=rafiss a=rafiss

We can also remove some skipped tests, since they no longer apply.

informs: #96751
Release note: None

100345: upgrades: unskip TestIsAtLeastVersionBuiltin r=rafiss a=rafiss

informs: #96751
Release note: None

100484: server,testutils: add some extra logging for TestStatusEngineStatsJson r=abarganier a=knz

Informs #99261

Release note: None
Epic: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig craig bot closed this as completed in f42c0f1 Apr 3, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 12, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 12, 2023 (both commit messages duplicate the #99934 PR description above)
8 participants