
cdc: Unskip TestChangefeedNodeShutdown #32232

Closed · mrtracy opened this issue Nov 12, 2018 · 9 comments · Fixed by #99934
Labels
A-cdc Change Data Capture branch-master Failures and bugs on the master branch. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). GA-blocker skipped-test T-cdc
Milestone: 2.2

Comments

mrtracy (Contributor) commented Nov 12, 2018

TestChangefeedNodeShutdown is a new test which is partially complete, but has been checked in Skip()ed due to unfortunate flakiness issues, of which there are several:

  • Primarily, while the job is usually picked up by another node, the test occasionally takes several minutes to complete, and may never complete at all; this is the primary flake causing me to pause this work.
  • SQL statements issued after the node shutdown sometimes return an ambiguous result. This could possibly be fixed with a succeedsSoon, but that might further exacerbate the nondeterministic nature of the test.
  • The output of the changefeed is not deterministic across the shutdown; there may be duplicates. This means it is not appropriate to use the assertPayloads() infrastructure, and the test will instead require a verifier setup.
  • Finally, it is my intuition that these three issues may be combining to mask other existing flakes.
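To illustrate the third point, here is a minimal, hypothetical sketch of what a duplicate-tolerant verifier might do instead of exact payload matching: since a node restart gives at-least-once delivery, the verifier compares the deduplicated, sorted set of emitted payloads against expectations. The `payload`/`dedupe` names are illustrative, not the real test infrastructure.

```go
package main

import (
	"fmt"
	"sort"
)

// payload is a hypothetical (key, value) pair emitted by a changefeed.
type payload struct {
	Key, Value string
}

// dedupe collapses duplicate emissions. At-least-once delivery means the
// same payload may appear more than once after a restart, so a verifier
// should compare the deduplicated, sorted set against expectations rather
// than the raw stream, which exact assertPayloads-style matching cannot
// tolerate.
func dedupe(raw []payload) []payload {
	seen := make(map[payload]bool)
	var out []payload
	for _, p := range raw {
		if !seen[p] {
			seen[p] = true
			out = append(out, p)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}

func main() {
	// A node restart may cause the same rows to be re-emitted.
	raw := []payload{
		{"foo/0", "initial"},
		{"foo/0", "initial"}, // duplicate after restart
		{"foo/3", "third"},
	}
	fmt.Println(dedupe(raw))
}
```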

This unit test was intended as a faster-running replacement for the cdc/crdbChaos roachtest, so this functionality is not going untested. We are logging this follow-up issue and skipping the test so that we get this needed (and verified) functionality checked in.

Jira issue: CRDB-4756

@knz knz added C-test-failure Broken test (automatically or manually discovered). A-cdc Change Data Capture labels Nov 26, 2018
@knz knz added this to the 2.2 milestone Nov 26, 2018
mrtracy (Contributor, author) commented Dec 20, 2018

I've been heads-down on this for a few days, here's what I've been working towards:

  • The first thing that is verified (and needed for additional verification) is the "availability history" of the cluster, which is synthesized from cluster events and the current node liveness values. This gives a point-in-time picture of how many nodes were available at any timestamp in the cluster. The history itself can be verified; for instance, it may be required that there is never more than one node down at a time. However, this history can also be used to adjust expectations elsewhere in the verification.

  • The primary verification is to query different time series values across the whole history of the cluster.

    • We break down time into fixed periods, and verify that each period meets expectations; we adjust those expectations according to the values in the availability history; for example, how many nodes there are, how many nodes are down, whether we are near the beginning of the test, etc.
    • The time series being verified can be configured; or rather, we will modularize the verification so that we only select modules that are appropriate. For example, if we are measuring an expected txn/s target, we might adjust that based on the workload that is being run against the cluster.
    • Initial time series being verified will be per-machine latencies and manually configurable txn/s based on workload.
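The period-based scheme above can be sketched roughly as follows. This is a standalone illustration under assumed names (`sample`, `verifyPeriods`, `nodesUp` are all hypothetical): time is broken into fixed periods, and each period's mean txn/s is checked against a per-node target scaled by how many nodes the availability history says were up.

```go
package main

import "fmt"

// sample is a hypothetical (timestamp, txn/s) time series point.
type sample struct {
	TS    int64   // seconds since test start
	TxnPS float64 // observed transactions per second
}

// verifyPeriods splits the samples into fixed-width periods and checks each
// period's mean against a per-node target scaled by the number of nodes the
// availability history reports as up at that time.
func verifyPeriods(samples []sample, periodSec int64, perNodeTarget float64, nodesUp func(ts int64) int) []string {
	byPeriod := map[int64][]float64{}
	for _, s := range samples {
		byPeriod[s.TS/periodSec] = append(byPeriod[s.TS/periodSec], s.TxnPS)
	}
	var failures []string
	for period, vals := range byPeriod {
		var sum float64
		for _, v := range vals {
			sum += v
		}
		mean := sum / float64(len(vals))
		// Adjust the expectation by the number of available nodes.
		expected := perNodeTarget * float64(nodesUp(period*periodSec))
		if mean < expected*0.5 { // arbitrary 50% slack for restarts, ramp-up, etc.
			failures = append(failures, fmt.Sprintf("period %d: mean %.1f < %.1f", period, mean, expected*0.5))
		}
	}
	return failures
}

func main() {
	// Availability history: one node down between t=30s and t=60s.
	nodesUp := func(ts int64) int {
		if ts >= 30 && ts < 60 {
			return 2
		}
		return 3
	}
	samples := []sample{{10, 290}, {40, 205}, {70, 310}}
	fmt.Println(verifyPeriods(samples, 30, 100, nodesUp))
}
```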

A third angle of verification would be to verify the replication history; that is, make sure replica counts make sense, replicas are spread evenly, and that there isn't excess replication. This might be possible through time series, though, if we are unconcerned with specifics. However, I am leaving this out of my V1 product.

@knz knz unassigned mrtracy Jan 13, 2019
tbg (Member) commented Feb 14, 2019

@danhhz can you clarify if this test provides anything in addition to the cdc-chaos roachtests? This sounds like we're mostly shaving yaks.

danhhz (Contributor) commented Feb 14, 2019

My intention was to replace the cdc/crdb-chaos roachtest. Roachtests are necessary for testing large and long-running things, but when something can be a TestCluster-based unit test, I always prefer that. Unit tests are certainly easier while developing something, but in this case it's mostly because Roachtests flake at a dramatically higher rate, and when that ongoing load can be avoided, I'd like to.

danhhz (Contributor) commented Feb 14, 2019

Also, I just remembered that I'd like to solve some of these anyway. I'm about to start working on a jepsen-style test for changefeeds. The overall correctness of changefeeds depends on them maintaining our user-guaranteed invariants across anything we throw at them: crdb chaos, sink chaos, schema changes, job pause/resume, etc. There are a huge number of moving pieces that all have to work together exactly right, and this is currently under-tested. Being able to do this in a go test with TestCluster would be huge.

tbg (Member) commented Feb 14, 2019

I agree that it's worthwhile but my main concern is that getting these tests to run fast enough to run on every CI is not worth the hacks. They seem to live at a stage between roachtests and unit tests, where we want something that's really easy to run but we'd rather not run it in CI (but, say, in a nightly stress test).

I think if we skipped the test if !testutils.NightlyStress() we'd get the desired behavior (or you can run it at any cadence you like, just have to set up the jobs the right way). This could also be the pattern used for this jepsen-style suite you're working on. Thoughts?

var stress = envutil.EnvOrDefaultBool("COCKROACH_NIGHTLY_STRESS", false)

// NightlyStress returns true iff the process is running as part of
// CockroachDB's nightly stress tests.
func NightlyStress() bool {
	return stress
}

danhhz (Contributor) commented Feb 14, 2019

I'm certainly amenable to the argument, but I don't see how it applies to this case? What about this test makes you worried that it will take too long to run on every CI?

@tbg tbg self-assigned this Feb 19, 2019
tbg (Member) commented May 7, 2019

Looking at this test now. Right now it fails on pretty much every run. I'll look into why that is.

tbg (Member) commented May 7, 2019

It doesn't fail for any CDC-related reasons. Basically, after shutting down the first node, things fall apart. I can reduce it to:

func TestChangefeedNodeShutdown(t *testing.T) {
	defer leaktest.AfterTest(t)()

	args := base.TestServerArgs{
		UseDatabase: "d",
	}

	tc := serverutils.StartTestCluster(t, 3, base.TestClusterArgs{
		ServerArgs: args,
	})
	defer tc.Stopper().Stop(context.Background())

	db := tc.ServerConn(1)
	sqlDB := sqlutils.MakeSQLRunner(db)

	sqlDB.Exec(t, `CREATE DATABASE d`)
	sqlDB.Exec(t, `CREATE TABLE foo (a INT PRIMARY KEY, b STRING)`)
	sqlDB.Exec(t, `INSERT INTO foo VALUES (0, 'initial')`)

	// Give the cluster time to upreplicate before stopping a node.
	time.Sleep(10 * time.Second)
	tc.StopServer(0)

	sqlDB.Exec(t, `UPSERT INTO foo VALUES(0, 'updated')`)
	sqlDB.Exec(t, `INSERT INTO foo VALUES (3, 'third')`)
}
--- FAIL: TestChangefeedNodeShutdown (24.95s)
    changefeed_test.go:1741: error executing 'UPSERT INTO foo VALUES(0, 'updated')': pq: result is ambiguous (error=failed to connect to n1 at 127.0.0.1:59716: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:59716: connect: connection refused" [exhausted])
FAIL

There are plenty of failed heartbeats before that. I'm not quite sure why; the ranges should be upreplicated at this point. We'll see.

@tbg tbg added the branch-master Failures and bugs on the master branch. label Jan 22, 2020
@tbg tbg removed their assignment May 31, 2022
@shermanCRL shermanCRL added GA-blocker branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 labels Mar 8, 2023
miretskiy (Contributor) commented:

@shermanCRL why GA-blocker on an issue that's 5 years old?

craig bot pushed a commit that referenced this issue Apr 3, 2023
99934: changefeedccl: Remove skipped tests that decayed over time r=miretskiy a=miretskiy

Fixes #32232
Remove TestChangefeedNodeShutdown. This test has been disabled since 2018; other tests exist (e.g. `TestChangefeedHandlesDrainingNodes`) that verify restart behavior.

Fixes #51842
Remove the BenchmarkChangefeedTicks benchmark. This benchmark has been skipped since 2019. Attempts could be made to revive it; however, it carried a lot of code that accomplished questionable goals. The benchmark itself was unrepresentative (it used dependency injection), too small to be meaningful (1000 rows), and would most likely be too noisy and inconclusive. We have added other microbenchmarks over time, and we conduct large-scale testing, including with roachtests.

Release note: None

100342: upgrades: remove migration that waits for schema changes r=rafiss a=rafiss

We can also remove some skipped tests, since they no longer apply.

informs: #96751
Release note: None

100345: upgrades: unskip TestIsAtLeastVersionBuiltin r=rafiss a=rafiss

informs: #96751
Release note: None

100484: server,testutils: add some extra logging for TestStatusEngineStatsJson r=abarganier a=knz

Informs #99261

Release note: None
Epic: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig craig bot closed this as completed in f42c0f1 Apr 3, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 12, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 12, 2023 (both commit messages duplicate the #99934 PR description above)
8 participants