
testutils/testcluster: quiesce servers in parallel #9338

Merged

Conversation

petermattis
Collaborator

@petermattis petermattis commented Sep 13, 2016

Similar to how multiTestContext shuts down, we need to quiesce the
per-server stoppers in parallel to avoid deadlock. Helps with
TestBasicManualReplication flakiness, but does not fully deflake.


This change is Reviewable
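The shutdown pattern the PR describes can be sketched as follows. This is a toy illustration, not the real testcluster code: `stopper` is a hypothetical stand-in for CockroachDB's `stop.Stopper`, whose `Quiesce` blocks until the server's in-flight tasks have drained.

```go
package main

import (
	"fmt"
	"sync"
)

// stopper is a hypothetical stand-in for stop.Stopper; in the real type,
// Quiesce blocks until all of the server's registered tasks have drained.
type stopper struct{ id int }

func (s *stopper) Quiesce() {
	// The real Stopper waits here for in-flight tasks to finish.
}

// quiesceAll quiesces every per-server stopper in parallel and returns the
// number quiesced. Quiescing serially risks deadlock: a still-running
// server may be blocked on a Raft command that can never commit once a
// quorum of its peers has already been fully stopped.
func quiesceAll(stoppers []*stopper) int {
	var wg sync.WaitGroup
	for _, s := range stoppers {
		wg.Add(1)
		go func(s *stopper) {
			defer wg.Done()
			s.Quiesce()
		}(s)
	}
	wg.Wait()
	return len(stoppers)
}

func main() {
	fmt.Println(quiesceAll([]*stopper{{1}, {2}, {3}}), "servers quiesced")
}
```

Once every server has quiesced (no new work is being admitted and in-flight work has drained), the per-server stoppers can then be fully stopped in any order.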

@andreimatei
Contributor

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


testutils/testcluster/testcluster.go, line 84 at r1 (raw file):

  defer tc.mu.Unlock()

  // Quiesce the servers in parallel to avoid deadlocks.

mind going into more details here? I'm looking at multiTestContext too and I see something about tasks stuck in addWriteCommand... But I'm not quite sure what that's about.
Is this problem related to the fact that the servers are sharing a stopper? I.e. is it also a problem in production?
I kinda feel like there should be something else to be done here...


Comments from Reviewable

@petermattis
Collaborator Author

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


testutils/testcluster/testcluster.go, line 84 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

mind going into more details here? I'm looking at multiTestContext too and I see something about tasks stuck in addWriteCommand... But I'm not quite sure what that's about.
Is this problem related to the fact that the servers are sharing a stopper? I.e. is it also a problem in production?
I kinda feel like there should be something else to be done here...

This is mirroring the code in `multiTestContext.Stop`. If you `Stop` the servers serially, the last server might deadlock because the previous servers have been fully stopped, not just quiesced. I saw this failure during `TestBasicManualReplication` stress testing.

There is another problem with that test which sounds like what you're seeing: tasks stuck in addWriteCommand. I think that has arisen due to removing the DistSender send timeout. We're probably not handling some shutdown situation correctly.



@andreimatei
Contributor

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


testutils/testcluster/testcluster.go, line 84 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is mirroring the code in multiTestContext.Stop. If you Stop the servers serially, the last server might deadlock because the previous servers have been fully stopped, not just quiesced. I saw this failure during TestBasicManualReplication stress testing.

There is another problem with that test which sounds like what you're seeing: tasks stuck in addWriteCommand. I think that has arisen due to removing the DistSender send timeout. We're probably not handling some shutdown situation correctly.

ok, so all the servers but one have been fully stopped. So why does the last one deadlock exactly?


@petermattis
Collaborator Author

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


testutils/testcluster/testcluster.go, line 84 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

ok, so all the servers but one have been fully stopped. So why does the last one deadlock exactly?

I believe the answer is "various reasons". @tschottdorf?


@tbg
Member

tbg commented Sep 14, 2016

The one real reason is waiting on a raft command which can never commit
because a majority is unavailable.


@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


testutils/testcluster/testcluster.go, line 84 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I believe the answer is "various reasons". @tschottdorf?

A quiesced node returns NodeUnavailableError, which is generally considered retryable (in production, the node should be given a chance to come back). In tests, we know that's not going to happen, but the DistSender doesn't. With this change to quiesce in parallel, we should in theory be able to fix all remaining issues with proper use of Tasks, but it's tricky. We need to make sure that all "first mover" operations are run in Tasks, and in the absence of timeouts in DistSender they must always plumb their stopper's ShouldQuiesce channel through to the DistSender for cancellation purposes.


@petermattis petermattis force-pushed the pmattis/stop-testcluster branch from 8555135 to 7eaff2a Compare September 14, 2016 13:58
@petermattis
Collaborator Author

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


testutils/testcluster/testcluster.go, line 84 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

A quiesced node returns NodeUnavailableError, which is generally considered retryable (in production, the node should be given a chance to come back). In tests, we know that's not going to happen, but the DistSender doesn't. With this change to quiesce in parallel, we should in theory be able to fix all remaining issues with proper use of Tasks, but it's tricky. We need to make sure that all "first mover" operations are run in Tasks, and in the absence of timeouts in DistSender they must always plumb their stopper's ShouldQuiesce channel through to the DistSender for cancellation purposes.

Expanded the comment here. Even though this change doesn't eliminate the flakiness, it is a necessary step.


@andreimatei
Contributor

:lgtm:


Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.



@petermattis petermattis merged commit 7133d01 into cockroachdb:master Sep 14, 2016
@petermattis petermattis deleted the pmattis/stop-testcluster branch September 14, 2016 19:17
craig bot pushed a commit that referenced this pull request Feb 18, 2021
59490: add licensing info to README r=bdarnell a=gemma-shay

Added a section on licensing info.
Links to the new Licensing doc (#9338) will not work yet, so will wait to merge until they are up.
Closes cockroachdb/docs#7441

60513: sql: ensure REGIONAL BY ROW statements roundtrip r=ajstorm a=otan

Resolves #59362

See individual commits for details.

60552: sql: add descriptor validation on write r=postamar a=postamar

Previously, we didn't systematically validate descriptors when they were
written. Furthermore, there existed no common method to validate
descriptors across all descriptor subtypes.
This commit adds three methods to the catalog.Descriptor interface:
  1. ValidateSelf(context.Context) error
  2. Validate(context.Context, catalog.DescGetter) error
  3. ValidateTxnCommit(context.Context, catalog.DescGetter) error

Each performs a subset of the checks performed by the next. ValidateSelf
contains all checks which can be performed in isolation, Validate also
performs all those involving DescGetters (i.e. cross-reference checks)
and ValidateTxnCommit also performs checks which should only be done at
commit-time. An example of the latter is checking that a table has
a primary key: dropping the PK is allowed within a transaction as long
as a new PK is subsequently provided before committing.

This commit adds new validation calls when writing descriptors:
  1. ValidateSelf is called prior to Collection adding a descriptor Put
     to a kv.Batch. At this point, we want descriptors to be at least
     internally-consistent, both to catch validation errors early and
     because it's not possible to do cross-reference checking at this
     point (changes on FKs for instance involve multiple descriptors).
  2. ValidateTxnCommit is called on the descs.Collection's uncommitted
     descriptors when the corresponding txn is about to commit, just
     prior to the two-version-invariant check.

These validations may be disabled using a new cluster setting:
    sql.catalog.descs.validate_on_write.enabled
Setting this to false makes it possible to corrupt the descriptor state
using the crdb_internal.unsafe_* functions.

Release note: None
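The layered relationship between the three validation methods can be illustrated with a toy sketch. This is not the real catalog API: `tableDesc` and its lowercase methods are hypothetical stand-ins, and the real methods also take a context and a DescGetter.

```go
package main

import (
	"errors"
	"fmt"
)

// tableDesc is a toy descriptor used to illustrate the layered checks.
type tableDesc struct {
	Name       string
	HasPK      bool
	RefsBroken bool
}

// validateSelf: checks that need only the descriptor itself.
func (d *tableDesc) validateSelf() error {
	if d.Name == "" {
		return errors.New("descriptor has no name")
	}
	return nil
}

// validate: validateSelf plus cross-reference checks (the part that would
// need a DescGetter in the real API).
func (d *tableDesc) validate() error {
	if err := d.validateSelf(); err != nil {
		return err
	}
	if d.RefsBroken {
		return errors.New("dangling cross-reference")
	}
	return nil
}

// validateTxnCommit: validate plus commit-time-only checks. A table may
// transiently lack a primary key inside a transaction, but not at commit.
func (d *tableDesc) validateTxnCommit() error {
	if err := d.validate(); err != nil {
		return err
	}
	if !d.HasPK {
		return errors.New("table must have a primary key at commit time")
	}
	return nil
}

func main() {
	d := &tableDesc{Name: "t", HasPK: false}
	// Passes mid-transaction validation, fails the commit-time check.
	fmt.Println(d.validate() == nil, d.validateTxnCommit() != nil)
}
```

Each layer calling the previous one is what makes "a subset of the checks performed by the next" hold by construction.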

60616: builtins: add builtin to retrieve the payload(s) for a span. r=knz,irfansharif a=angelapwen

Part of addressing #55733 

The `payloads_for_span` builtin retrieves all payloads for
a given span ID, given that the span is part of an active trace.
The payloads are returned in JSONB format. If the span is not
found, or if the span does not have any payloads, the builtin
returns an empty JSON object.

With the appropriate usage of this builtin and the
`crdb_internal.trace_id` builtin as shown in the `contention_event`
logic test, all payloads for the current trace may be surfaced.

Release note (sql change): add `payloads_for_span` builtin that
takes in a span ID and returns its payloads in JSONB format. If
the span is not found, or if the span does not have any payloads,
the builtin returns an empty JSON object.

60692: sql: add tests for inverted indexes on virtual columns r=mgartner a=mgartner

No code changes were needed to support inverted indexes on virtual
columns.

Release note: None

Co-authored-by: Gemma Shay <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Marius Posta <[email protected]>
Co-authored-by: angelapwen <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>