
testutils/testcluster: quiesce servers in parallel #9338

Merged

Conversation

petermattis
Collaborator

@petermattis petermattis commented Sep 13, 2016

Similar to how multiTestContext shuts down, we need to quiesce the
per-server stoppers in parallel to avoid deadlock. Helps with
TestBasicManualReplication flakiness, but does not fully deflake.


This change is Reviewable
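The shutdown pattern the PR describes can be sketched as follows. This is a toy illustration, not the real testcluster code: `stopper` is a hypothetical stand-in for CockroachDB's `stop.Stopper`, whose `Quiesce` blocks until the server's in-flight tasks have drained.

```go
package main

import (
	"fmt"
	"sync"
)

// stopper is a hypothetical stand-in for stop.Stopper; in the real type,
// Quiesce blocks until all of the server's registered tasks have drained.
type stopper struct{ id int }

func (s *stopper) Quiesce() {
	// The real Stopper waits here for in-flight tasks to finish.
}

// quiesceAll quiesces every per-server stopper in parallel and returns the
// number quiesced. Quiescing serially risks deadlock: a still-running
// server may be blocked on a Raft command that can never commit once a
// quorum of its peers has already been fully stopped.
func quiesceAll(stoppers []*stopper) int {
	var wg sync.WaitGroup
	for _, s := range stoppers {
		wg.Add(1)
		go func(s *stopper) {
			defer wg.Done()
			s.Quiesce()
		}(s)
	}
	wg.Wait()
	return len(stoppers)
}

func main() {
	fmt.Println(quiesceAll([]*stopper{{1}, {2}, {3}}), "servers quiesced")
}
```

Once every server has quiesced (no new work is being admitted and in-flight work has drained), the per-server stoppers can then be fully stopped in any order.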

@andreimatei
Contributor

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


testutils/testcluster/testcluster.go, line 84 at r1 (raw file):

  defer tc.mu.Unlock()

  // Quiesce the servers in parallel to avoid deadlocks.

mind going into more details here? I'm looking at multiTestContext too and I see something about tasks stuck in addWriteCommand... But I'm not quite sure what that's about.
Is this problem related to the fact that the servers are sharing a stopper? I.e. is it also a problem in production?
I kinda feel like there should be something else to be done here...


Comments from Reviewable

@petermattis
Collaborator Author

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


testutils/testcluster/testcluster.go, line 84 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

mind going into more details here? I'm looking at multiTestContext too and I see something about tasks stuck in addWriteCommand... But I'm not quite sure what that's about.
Is this problem related to the fact that the servers are sharing a stopper? I.e. is it also a problem in production?
I kinda feel like there should be something else to be done here...

This is mirroring the code in `multiTestContext.Stop`. If you `Stop` the servers serially, the last server might deadlock because the previous servers have been fully stopped, not just quiesced. I saw this failure during `TestBasicManualReplication` stress testing.

There is another problem with that test which sounds like what you're seeing: tasks stuck in addWriteCommand. I think that has arisen due to removing the DistSender send timeout. We're probably not handling some shutdown situation correctly.



@andreimatei
Contributor

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


testutils/testcluster/testcluster.go, line 84 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is mirroring the code in multiTestContext.Stop. If you Stop the servers serially, the last server might deadlock because the previous servers have been fully stopped, not just quiesced. I saw this failure during TestBasicManualReplication stress testing.

There is another problem with that test which sounds like what you're seeing: tasks stuck in addWriteCommand. I think that has arisen due to removing the DistSender send timeout. We're probably not handling some shutdown situation correctly.

ok, so all the servers but one have been fully stopped. So why does the last one deadlock exactly?


@petermattis
Collaborator Author

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


testutils/testcluster/testcluster.go, line 84 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

ok, so all the servers but one have been fully stopped. So why does the last one deadlock exactly?

I believe the answer is "various reasons". @tschottdorf?


@tbg
Member

tbg commented Sep 14, 2016

The one real reason is waiting on a raft command which can never commit
because a majority is unavailable.


@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


testutils/testcluster/testcluster.go, line 84 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I believe the answer is "various reasons". @tschottdorf?

A quiesced node returns NodeUnavailableError, which is generally considered retryable (in production, the node should be given a chance to come back). In tests, we know that's not going to happen, but the DistSender doesn't. With this change to quiesce in parallel, we should in theory be able to fix all remaining issues with proper use of Tasks, but it's tricky. We need to make sure that all "first mover" operations are run in Tasks, and in the absence of timeouts in DistSender they must always plumb their stopper's ShouldQuiesce channel through to the DistSender for cancellation purposes.


@petermattis petermattis force-pushed the pmattis/stop-testcluster branch from 8555135 to 7eaff2a Compare September 14, 2016 13:58
@petermattis
Collaborator Author

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


testutils/testcluster/testcluster.go, line 84 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

A quiesced node returns NodeUnavailableError, which is generally considered retryable (in production, the node should be given a chance to come back). In tests, we know that's not going to happen, but the DistSender doesn't. With this change to quiesce in parallel, we should in theory be able to fix all remaining issues with proper use of Tasks, but it's tricky. We need to make sure that all "first mover" operations are run in Tasks, and in the absence of timeouts in DistSender they must always plumb their stopper's ShouldQuiesce channel through to the DistSender for cancellation purposes.

Expanded the comment here. Even though this change doesn't eliminate the flakiness, it is a necessary step.


@andreimatei
Contributor

:lgtm:


Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.



@petermattis petermattis merged commit 7133d01 into cockroachdb:master Sep 14, 2016
@petermattis petermattis deleted the pmattis/stop-testcluster branch September 14, 2016 19:17
craig bot pushed a commit that referenced this pull request Feb 18, 2021
59490: add licensing info to README r=bdarnell a=gemma-shay

Added a section on licensing info.
Links to the new Licensing doc (#9338) will not work yet, so will wait to merge until they are up.
Closes cockroachdb/docs#7441

60513: sql: ensure REGIONAL BY ROW statements roundtrip r=ajstorm a=otan

Resolves #59362

See individual commits for details.

60552: sql: add descriptor validation on write r=postamar a=postamar

Previously, we didn't systematically validate descriptors when they were
written. Furthermore, there existed no common method to validate
descriptors across all descriptor subtypes.
This commit adds three methods to the catalog.Descriptor interface:
  1. ValidateSelf(context.Context) error
  2. Validate(context.Context, catalog.DescGetter) error
  3. ValidateTxnCommit(context.Context, catalog.DescGetter) error

Each performs a subset of the checks performed by the next. ValidateSelf
contains all checks which can be performed in isolation, Validate also
performs all those involving DescGetters (i.e. cross-reference checks)
and ValidateTxnCommit also performs checks which should only be done at
commit-time. An example of the latter is checking that a table has
a primary key: dropping the PK is allowed within a transaction as long
as a new PK is subsequently provided before committing.

This commit adds new validation calls when writing descriptors:
  1. ValidateSelf is called prior to Collection adding a descriptor Put
     to a kv.Batch. At this point, we want descriptors to be at least
     internally-consistent, both to catch validation errors early and
     because it's not possible to do cross-reference checking at this
     point (changes on FKs for instance involve multiple descriptors).
  2. ValidateTxnCommit is called on the descs.Collection's uncommitted
     descriptors when the corresponding txn is about to commit, just
     prior to the two-version-invariant check.

These validations may be disabled using a new cluster setting:
    sql.catalog.descs.validate_on_write.enabled
Setting this to false makes it possible to corrupt the descriptor state
using the crdb_internal.unsafe_* functions.

Release note: None
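The layered relationship between the three validation methods can be illustrated with a toy sketch. This is not the real catalog API: `tableDesc` and its lowercase methods are hypothetical stand-ins, and the real methods also take a context and a DescGetter.

```go
package main

import (
	"errors"
	"fmt"
)

// tableDesc is a toy descriptor used to illustrate the layered checks.
type tableDesc struct {
	Name       string
	HasPK      bool
	RefsBroken bool
}

// validateSelf: checks that need only the descriptor itself.
func (d *tableDesc) validateSelf() error {
	if d.Name == "" {
		return errors.New("descriptor has no name")
	}
	return nil
}

// validate: validateSelf plus cross-reference checks (the part that would
// need a DescGetter in the real API).
func (d *tableDesc) validate() error {
	if err := d.validateSelf(); err != nil {
		return err
	}
	if d.RefsBroken {
		return errors.New("dangling cross-reference")
	}
	return nil
}

// validateTxnCommit: validate plus commit-time-only checks. A table may
// transiently lack a primary key inside a transaction, but not at commit.
func (d *tableDesc) validateTxnCommit() error {
	if err := d.validate(); err != nil {
		return err
	}
	if !d.HasPK {
		return errors.New("table must have a primary key at commit time")
	}
	return nil
}

func main() {
	d := &tableDesc{Name: "t", HasPK: false}
	// Passes mid-transaction validation, fails the commit-time check.
	fmt.Println(d.validate() == nil, d.validateTxnCommit() != nil)
}
```

Each layer calling the previous one is what makes "a subset of the checks performed by the next" hold by construction.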

60616: builtins: add builtin to retrieve the payload(s) for a span. r=knz,irfansharif a=angelapwen

Part of addressing #55733 

The `payloads_for_span` builtin retrieves all payloads for
a given span ID, given that the span is part of an active trace.
The payloads are returned in JSONB format. If the span is not
found, or if the span does not have any payloads, the builtin
returns an empty JSON object.

With the appropriate usage of this builtin and the
`crdb_internal.trace_id` builtin as shown in the `contention_event`
logic test, all payloads for the current trace may be surfaced.

Release note (sql change): add `payloads_for_span` builtin that
takes in a span ID and returns its payloads in JSONB format. If
the span is not found, or if the span does not have any payloads,
the builtin returns an empty JSON object.

60692: sql: add tests for inverted indexes on virtual columns r=mgartner a=mgartner

No code changes were needed to support inverted indexes on virtual
columns.

Release note: None

Co-authored-by: Gemma Shay <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Marius Posta <[email protected]>
Co-authored-by: angelapwen <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>