storage: improve message on slow Raft proposal #34296

Merged
merged 1 commit into cockroachdb:master from fix/timeout-on-propose
Feb 11, 2019

Conversation

@tbg (Member) commented Jan 28, 2019

Touches #33007.

Release note: None

@tbg tbg requested a review from a team January 28, 2019 16:17
@cockroach-teamcity (Member) commented

This change is Reviewable

@tbg tbg force-pushed the fix/timeout-on-propose branch from 603dc97 to 3ff95c4 Compare January 28, 2019 16:17
@tbg tbg requested a review from petermattis January 28, 2019 16:19
@knz (Contributor) commented Jan 29, 2019

Doesn't this change deserve a release note that says more than "none"?

@tbg (Member, Author) commented Jan 29, 2019 via email

@spencerkimball (Member) commented

This is a fundamental change to the way CRDB handles problems. Because errors are returned only after 60s of waiting, the system will still effectively seize up, but applications will begin to receive errors. Perhaps an alternative mechanism makes sense: a cluster setting (on by default) under which the range itself tracks its last forward progress and, after 60s (the threshold could also be a cluster setting), clears all waiting proposals and begins to refuse new proposals immediately. That feels like a more understandable error condition than each new request hitting the range taking an additional 60s to report an error.
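A minimal Go sketch of the kind of mechanism proposed above, with hypothetical names (stallGuard, recordProgress, errStalledRange are illustrations, not CockroachDB APIs): the range records its last forward progress and refuses new proposals outright once it has stalled longer than a threshold.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errStalledRange is a hypothetical error returned instead of queueing new
// proposals on a range that has made no forward progress recently.
var errStalledRange = errors.New("range has made no recent forward progress; refusing new proposals")

// stallGuard is an illustrative helper, not the CockroachDB implementation.
type stallGuard struct {
	mu           sync.Mutex
	lastProgress time.Time
	threshold    time.Duration // e.g. 60s, imagined here as a cluster setting
}

// recordProgress would be called whenever a command applies successfully.
func (g *stallGuard) recordProgress(now time.Time) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.lastProgress = now
}

// checkProposal rejects new proposals once the range has stalled too long,
// rather than letting each request wait out its own 60s before erroring.
func (g *stallGuard) checkProposal(now time.Time) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if !g.lastProgress.IsZero() && now.Sub(g.lastProgress) > g.threshold {
		return errStalledRange
	}
	return nil
}

func main() {
	g := &stallGuard{threshold: 60 * time.Second}
	g.recordProgress(time.Now().Add(-2 * time.Minute)) // simulate a stalled range
	fmt.Println(g.checkProposal(time.Now()))           // errStalledRange
}
```

Whether the threshold (and the mechanism itself) should be controlled by cluster settings, as suggested above, is left open here.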

@tbg tbg force-pushed the fix/timeout-on-propose branch from 3ff95c4 to 95a811c Compare February 11, 2019 13:29
@tbg tbg changed the title storage: return errors from Raft proposals instead of hanging storage: improve message on slow Raft proposal Feb 11, 2019
@tbg (Member, Author) commented Feb 11, 2019

PTAL, I've removed the proactive return path but kept the improved logging.

@petermattis (Collaborator) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @petermattis and @tbg)


pkg/storage/replica_write.go, line 227 at r1 (raw file):

			log.Warningf(ctx, `have been waiting %.2fs for proposing command %s.
This range is likely unavailable.
Please submit this message along with

Submit an issue? Do you want to include the github new-issue link?


pkg/storage/replica_write.go, line 242 at r1 (raw file):

				contextStr := ""
				if err := ctx.Err(); err != nil {
					contextStr = " with context cancellation"

Shouldn't this be using err.Error()? What if there is a different error than cancellation here?
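For context, the suggestion amounts to something like the following self-contained sketch (illustrative only, not the actual diff; as the thread below shows, the PR ultimately resolved this differently):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	time.Sleep(5 * time.Millisecond) // let the context expire

	// Include the actual error text instead of assuming cancellation.
	contextStr := ""
	if err := ctx.Err(); err != nil {
		contextStr = fmt.Sprintf(" with context error: %v", err)
	}
	fmt.Printf("slow proposal%s\n", contextStr)
}
```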

@tbg tbg force-pushed the fix/timeout-on-propose branch 2 times, most recently from e706574 to 2e5512e Compare February 11, 2019 19:26
@tbg (Member, Author) left a comment

TFTR!

bors r=petermattis

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @petermattis)


pkg/storage/replica_write.go, line 227 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Submit an issue? Do you want to include the github new-issue link?

Done.


pkg/storage/replica_write.go, line 242 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Shouldn't this be using err.Error()? What if there is a different error than cancellation here?

ctx.Err() is always context.{DeadlineExceeded,Canceled}, so its .Error() isn't relevant. However, I realized that we already have a naked return for pErr, which is much better! I didn't want to introduce it because it's ugly, but now that it's there we might as well use it.
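To illustrate the pattern being discussed, here is a minimal, self-contained Go sketch (hypothetical names; not the actual replica_write.go code) of a wait loop that warns after a delay and surfaces the context error through a named return value, mirroring the naked return mentioned above:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"
)

// waitForProposal is an illustrative stand-in for waiting on a Raft proposal.
func waitForProposal(ctx context.Context, done <-chan struct{}) (err error) {
	slowTimer := time.NewTimer(60 * time.Second) // illustrative threshold
	defer slowTimer.Stop()
	for {
		select {
		case <-done:
			return // proposal applied; err stays nil
		case <-slowTimer.C:
			log.Printf("have been waiting 60s for proposing command; " +
				"this range is likely unavailable")
		case <-ctx.Done():
			// ctx.Err() is context.Canceled or context.DeadlineExceeded.
			err = ctx.Err()
			return
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(waitForProposal(ctx, make(chan struct{})))
}
```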

@craig (craig bot, Contributor) commented Feb 11, 2019

Build failed

@tbg tbg force-pushed the fix/timeout-on-propose branch from 2e5512e to 0c36412 Compare February 11, 2019 21:39
@tbg (Member, Author) commented Feb 11, 2019

Linter.

bors r=petermattis

craig bot pushed a commit that referenced this pull request Feb 11, 2019
34296: storage: improve message on slow Raft proposal r=petermattis a=tbg

Touches #33007.

Release note: None

34589: importccl: fix flaky test TestImportCSVStmt r=rytaft a=rytaft

`TestImportCSVStmt` tests that `IMPORT` jobs appear in a certain order
in the `system.jobs` table. Automatic statistics were causing this
test to be flaky since `CreateStats` jobs were present in the jobs
table as well, in an unpredictable order. This commit fixes the problem
by only selecting `IMPORT` jobs from the jobs table.

Fixes #34568

Release note: None

34660: storage: make RaftTruncatedState unreplicated r=bdarnell a=tbg

This isn't 100% polished yet, but generally ready for review.

----

See #34287.

Today, Raft (or preemptive) snapshots include the past Raft log, that is,
log entries which are already reflected in the state of the snapshot.
Fundamentally, this is because we have historically used a replicated
TruncatedState.

TruncatedState essentially tells us what the first index in the log is
(though it also includes a Term). If the TruncatedState cannot diverge
across replicas, we *must* send the whole log in snapshots, as the first
log index must match what the TruncatedState claims it is.

The Raft log is typically, but not necessarily, small. Log truncations are
driven by a queue and use a complex decision process. That decision process
can be faulty, and even if it isn't, the queue could be held up. Besides,
even when the Raft log contains only a few entries, these entries may be
quite large (see SSTable ingestion during RESTORE).

All of this means we don't want to (or be forced to) send the Raft log as
part of snapshots, and in turn the TruncatedState needs to be unreplicated.

This change migrates the TruncatedState into unreplicated keyspace. It does
not yet allow snapshots to avoid sending the past Raft log, but that is a
relatively straightforward follow-up change.

VersionUnreplicatedRaftTruncatedState, when active, moves the truncated
state into unreplicated keyspace on log truncations.

The migration works as follows:

1. At any log position, the replicas of a Range either use the new
(unreplicated) key or the old one, and exactly one of the two exists.

2. When a log truncation evaluates under the new cluster version, it
initiates the migration by deleting the old key. Under the old cluster
version, it behaves like today, updating the replicated truncated state.

3. The deletion signals new code downstream of Raft and triggers a write to
the new, unreplicated, key (atomic with the deletion of the old key).

4. Future log truncations no longer write any replicated data, but (like
before) send along the TruncatedState, which is written downstream of
Raft atomically with the deletion of the log entries. This actually uses
the same code as step 3. What's new is that the truncated state needs to be
verified before replacing a previous one. If replicas disagree about their
truncated state, it's possible for replica X at FirstIndex=100 to apply a
truncated state update that sets FirstIndex to, say, 50 (proposed by a
replica with a "longer" historical log). In that case, the truncated state
update must be ignored; this is straightforward downstream-of-Raft code (see
the sketch after this list).

5. When a split trigger evaluates, it seeds the RHS with the legacy key iff
the LHS uses the legacy key, and the unreplicated key otherwise. This makes
sure that the invariant that all replicas agree on the state of the
migration is upheld.

6. When a snapshot is applied, the receiver is told whether the snapshot
contains a legacy key. If not, it writes the truncated state (which is part
of the snapshot metadata) in its unreplicated version. Otherwise it doesn't
have to do anything (the range will migrate later).
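A minimal Go sketch of the verification described in step 4 (simplified, illustrative types and names such as truncatedState and maybeApplyTruncatedState; not the actual CockroachDB code): a proposed truncated state replaces the current one only if it moves the truncation point forward.

```go
package main

import "fmt"

// truncatedState is a simplified, illustrative stand-in for the real
// TruncatedState; here Index is taken to be the resulting first log index.
type truncatedState struct {
	Index uint64
	Term  uint64
}

// maybeApplyTruncatedState applies a proposed truncated state downstream of
// Raft only if it advances the truncation point; otherwise the update is
// ignored (e.g. a proposal from a replica with a "longer" historical log).
func maybeApplyTruncatedState(cur, proposed truncatedState) (truncatedState, bool) {
	if proposed.Index <= cur.Index {
		return cur, false // regressive update: ignore it
	}
	return proposed, true
}

func main() {
	cur := truncatedState{Index: 100, Term: 5}
	// A replica at FirstIndex=100 sees an update that would set it to 50.
	_, applied := maybeApplyTruncatedState(cur, truncatedState{Index: 50, Term: 5})
	fmt.Println(applied) // false: the update is ignored
}
```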

The following diagram visualizes the above. Note that it abuses sequence
diagrams to get a nice layout; the vertical lines belonging to NewState and
OldState don't imply any particular ordering of operations.

```
┌────────┐                            ┌────────┐
│OldState│                            │NewState│
└───┬────┘                            └───┬────┘
   │                        Bootstrap under old version
   │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
   │                                     │
   │                                     │     Bootstrap under new version
   │                                     │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
   │                                     │
   │─ ─ ┐
   │    | Log truncation under old version
   │< ─ ┘
   │                                     │
   │─ ─ ┐                                │
   │    | Snapshot                       │
   │< ─ ┘                                │
   │                                     │
   │                                     │─ ─ ┐
   │                                     │    | Snapshot
   │                                     │< ─ ┘
   │                                     │
   │   Log truncation under new version  │
   │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─>│
   │                                     │
   │                                     │─ ─ ┐
   │                                     │    | Log truncation under new version
   │                                     │< ─ ┘
   │                                     │
   │                                     │─ ─ ┐
   │                                     │    | Log truncation under old version
   │                                     │< ─ ┘ (necessarily running new binary)
```

Release note: None

34762: distsqlplan: fix error in union planning r=jordanlewis a=jordanlewis

Previously, if two inputs to a UNION ALL had identical post-processing
except for renders, further post-processing on top of that UNION ALL
could invalidate the plan and cause errors or crashes.

Fixes #34437.

Release note (bug fix): fix a planning crash during UNION ALL operations
that had projections, filters or renders directly on top of the UNION
ALL in some cases.

34767: sql: reuse already allocated memory for the cache in a row container r=yuzefovich a=yuzefovich

Previously, we would always allocate new memory for every row that
we put in the cache of DiskBackedIndexedRowContainer and simply
discard the memory underlying the row that we remove from the cache.
Now, we're reusing that memory.

Release note: None
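As a generic illustration of the reuse pattern described above (not the actual DiskBackedIndexedRowContainer code; the names here are hypothetical), a cache can recycle the backing slice of an evicted entry instead of allocating fresh memory for every row it stores:

```go
package main

import "fmt"

// rowCache is a toy cache that reuses the backing storage of evicted rows
// instead of allocating new memory for every insertion.
type rowCache struct {
	rows map[int][]int64 // key -> row datums (simplified)
	free [][]int64       // backing slices recovered from evicted rows
}

func newRowCache() *rowCache {
	return &rowCache{rows: make(map[int][]int64)}
}

// evict removes a row from the cache but keeps its memory for reuse.
func (c *rowCache) evict(key int) {
	if row, ok := c.rows[key]; ok {
		delete(c.rows, key)
		c.free = append(c.free, row[:0])
	}
}

// add stores a row, reusing a previously allocated slice when one is free.
func (c *rowCache) add(key int, datums []int64) {
	var row []int64
	if n := len(c.free); n > 0 {
		row, c.free = c.free[n-1], c.free[:n-1]
	}
	row = append(row, datums...)
	c.rows[key] = row
}

func main() {
	c := newRowCache()
	c.add(1, []int64{10, 20})
	c.evict(1)
	c.add(2, []int64{30, 40}) // reuses the slice recovered above
	fmt.Println(c.rows[2])    // [30 40]
}
```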

34779: opt: add stats to tpch xform test r=justinj a=justinj

Since we have stats by default now, this should be the default testing
mechanism. I left in tpch-no-stats since we also have that for tpcc,
just for safety.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Justin Jaffray <[email protected]>
@craig (craig bot, Contributor) commented Feb 11, 2019

Build succeeded

@craig craig bot merged commit 0c36412 into cockroachdb:master Feb 11, 2019
@tbg tbg deleted the fix/timeout-on-propose branch March 13, 2019 11:51