
backport-2.1: storage: improve logging and log Raft truncation #32972

Merged: 9 commits merged into cockroachdb:release-2.1 from backport2.1-31728-32137-32437-32439 on Dec 10, 2018

Conversation

@tbg tbg (Member) commented Dec 10, 2018

Backport:

Note that this does not backport the other commit from #31875, which solved the split-trigger/snapshot race in an unfortunate way that is currently being reworked in #32782 (though that rework may itself not get backported).

Please see individual PRs for details.

/cc @cockroachdb/release

@tbg tbg requested a review from a team December 10, 2018 10:10
@cockroach-teamcity (Member)

This change is Reviewable

tbg added 9 commits December 10, 2018 12:05
Queues can run into a variety of errors which are "harmless" in nature
(assuming they are not permanent).

Introduce a mechanism to detect such errors and avoid having them
increment the error metrics (and pollute the logs).

Touches cockroachdb#31373.

Release note: None
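
For illustration, a minimal Go sketch of the classification idea described above; the names (`benignError`, `recordError`) are hypothetical, not CockroachDB's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// benignError wraps an error that is expected during normal operation
// (e.g. a replica that was removed while queued) so the queue can log it
// quietly instead of bumping the failure metrics. Hypothetical sketch.
type benignError struct {
	cause error
}

func (be *benignError) Error() string { return "benign: " + be.cause.Error() }

// recordError sketches how a queue could branch on the classification.
func recordError(err error, failures *int) {
	if _, ok := err.(*benignError); ok {
		fmt.Println("queue: (benign)", err)
		return
	}
	*failures++
	fmt.Println("queue: error:", err)
}

func main() {
	var failures int
	recordError(&benignError{cause: errors.New("replica no longer exists")}, &failures)
	recordError(errors.New("disk corruption"), &failures)
	fmt.Println("failure metric:", failures) // 1: only the non-benign error counted
}
```
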
I didn't notice a concrete problem here, but it's more robust to walk
the error chain.

Release note: None

If either of the two special-cased errors ever implemented Causer, the
old code would jump right past them.

Release note: None
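
A minimal Go sketch of walking the error chain rather than inspecting only the outermost error, assuming the `Causer` interface from github.com/pkg/errors; the concrete error types here are stand-ins, not the real ones:

```go
package main

import "fmt"

// causer mirrors the Causer interface from github.com/pkg/errors: an error
// that wraps another error exposes its cause.
type causer interface {
	Cause() error
}

// wrapErr stands in for any error type that wraps a cause.
type wrapErr struct {
	msg   string
	cause error
}

func (e *wrapErr) Error() string { return e.msg + ": " + e.cause.Error() }
func (e *wrapErr) Cause() error  { return e.cause }

// specialErr stands in for one of the special-cased error types.
type specialErr struct{}

func (specialErr) Error() string { return "special-cased condition" }

// isSpecial walks the entire chain instead of calling Cause() once up
// front, so a special-cased error is still recognized whether it sits at
// the top of the chain, is wrapped, or itself implements Cause.
func isSpecial(err error) bool {
	for err != nil {
		if _, ok := err.(specialErr); ok {
			return true
		}
		c, ok := err.(causer)
		if !ok {
			return false
		}
		err = c.Cause()
	}
	return false
}

func main() {
	err := &wrapErr{msg: "processing replica", cause: specialErr{}}
	fmt.Println(isSpecial(err)) // true: found by walking the chain
}
```
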
This makes it a lot easier to log descriptive debug messages indicating
how a truncation decision was arrived at, and in particular allows
pointing the finger at truncations that lead to Raft snapshots, which
is relevant in the context of cockroachdb#32046.

Release note: None
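
To make the idea concrete, here is a hedged Go sketch of gathering the inputs and the outcome of a truncation decision in one struct so a single log line can explain it; `truncateDecision` and its fields are illustrative, not the actual CockroachDB type:

```go
package main

import "fmt"

// truncateDecision collects the inputs and outcome of a Raft log
// truncation decision so it can be logged in one descriptive line and
// flag truncations that will force a Raft snapshot. Illustrative sketch.
type truncateDecision struct {
	NewFirstIndex    uint64 // first log index remaining after truncation
	MinFollowerIndex uint64 // lowest index acknowledged by any follower
	LogSizeBytes     int64
	ChosenVia        string // which rule picked NewFirstIndex
}

// forcesSnapshot reports whether the truncation removes entries that some
// follower has not acknowledged yet, i.e. whether that follower will need
// a Raft snapshot to catch up.
func (d truncateDecision) forcesSnapshot() bool {
	return d.NewFirstIndex > d.MinFollowerIndex+1
}

func (d truncateDecision) String() string {
	return fmt.Sprintf("truncate to %d (via %s); log size %d bytes; forces snapshot: %t",
		d.NewFirstIndex, d.ChosenVia, d.LogSizeBytes, d.forcesSnapshot())
}

func main() {
	d := truncateDecision{NewFirstIndex: 105, MinFollowerIndex: 90,
		LogSizeBytes: 5 << 20, ChosenVia: "quorum index (log too large)"}
	fmt.Println(d)
}
```
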
Whenever the "max raft log size" is exceeded, log truncations become
more aggressive in that they aim at the quorum commit index, potentially
cutting off followers (which then need Raft snapshots).

The effective log size threshold is 4mb for replicas larger than 4mb, and
the replica size otherwise. The latter case can be problematic since
replicas can be persistently small despite having steady log progress
(for example, range 4 receives node status updates which are large
inline puts). If in such a range a follower falls behind just slightly,
it'll need a snapshot. This isn't in itself the biggest deal since the
snapshot is fairly rare (the required log entries are usually already
in transit to the follower) and would be small, but it's not ideal.

Always use a 4mb threshold instead. Note that we also truncate the log
to the minimum replicated index if the log size is above 64kb. This is
similarly aggressive but respects followers (until they fall behind by
4mb or more).

My expectation is that this will not functionally change anything. It
might leave behind a little bit more Raft log on quiescent ranges, but I
think the solution here is performing "one last truncation" for ranges
that are quiescent to make sure they shed the remainder of their Raft
log.

Touches cockroachdb#32046.

Release note: None
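
A minimal Go sketch of the decision described above under these assumptions: the constant names are illustrative stand-ins for the relevant settings (4mb truncation threshold, 64kb "small log" threshold), not CockroachDB's actual identifiers:

```go
package main

import "fmt"

const (
	// Illustrative stand-ins for the thresholds discussed above.
	raftLogTruncationThreshold = 4 << 20  // 4mb: always truncate aggressively above this
	raftLogSmallSize           = 64 << 10 // 64kb: above this, truncate but respect followers
)

// truncationTarget sketches the rule: if the Raft log exceeds the fixed
// 4mb threshold, truncate to the quorum commit index (possibly cutting off
// slow followers); if it merely exceeds 64kb, truncate only up to the
// minimum index replicated to all followers.
func truncationTarget(logSize, quorumIndex, minReplicatedIndex uint64) (index uint64, ok bool) {
	switch {
	case logSize > raftLogTruncationThreshold:
		return quorumIndex, true
	case logSize > raftLogSmallSize:
		return minReplicatedIndex, true
	default:
		return 0, false // log is small; no truncation needed
	}
}

func main() {
	idx, ok := truncationTarget(5<<20, 120, 100)
	fmt.Println(idx, ok) // 120 true: size-based truncation to the quorum index
}
```
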
This is a more natural place to put them a) because there's no reason to
have them sit on Replica and b) because this will allow me to write
better code when I query this map in the log truncation code.

Release note: None

When a (preemptive or Raft) snapshot is inflight, a range should not
truncate its log index to one ahead of the pending snapshot as doing so
would require yet another snapshot.

The code prior to this commit attempted to achieve that by treating the
pending snapshot as an additional follower in the computation of the
quorum index. This works as intended as long as the Raft log is below
its configured maximum size but usually leads to abandoning the snapshot
index when truncating based on size (to the quorum commit index). This
situation occurs frequently during the split+scatter phase of data
imports, where (mostly empty) ranges are rapidly upreplicated and split.

This isn't limited to small replicas, however. Assume a range is ~32mb
in size and a (say, preemptive) snapshot is underway. Preemptive
snapshots are (at the time of writing) throttled to 2mb/s, so it will
take approximately 16s to go through. At the same time, the Raft log may
easily grow in size by the size truncation threshold (4mb at time of
writing), allowing a log truncation that abandons the snapshot. A
similar phenomenon applies to Raft snapshots, though in that case the
quota pool restricts forward Raft progress, except when a single command
exceeds the size restriction (as is common during RESTORE operations,
where individual Raft commands are large). I haven't
conclusively observed this in practice (though there has been enough
badness to suspect that it happened), but in principle this could lead
to a potentially infinite number of snapshots being sent out, a very
expensive way of keeping a follower up to date.

After this change, the pending snapshot index is managed more carefully
(and is managed for Raft snapshots as well, not only for preemptive
ones) and is never truncated away. As a result, in regular operation a
snapshot can now always be followed by regular Raft log catch-up. More
precisely, we keep a map of pending snapshot index to
deadline. The deadline is zero while the snapshot is ongoing and is set
when the snapshot is completed. This avoids races between the snapshot
completing and the replication change completing (which would occur
even if the snapshot is only registering as completed after the
replication change completes).
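
A hedged Go sketch of such a map of pending snapshots and of how truncation could query it; the type and function names are illustrative, not the actual CockroachDB data structure:

```go
package main

import (
	"fmt"
	"time"
)

// pendingSnapshot records the log index a snapshot was generated at. The
// deadline is zero while the snapshot is in flight and is set once the
// snapshot has completed, after which the entry can be ignored a short
// while later. Illustrative sketch of the mechanism described above.
type pendingSnapshot struct {
	index    uint64
	deadline time.Time
}

// minPendingSnapshotIndex returns the lowest index that must be retained
// so that no in-flight (or just-completed) snapshot is truncated away.
func minPendingSnapshotIndex(pending map[string]pendingSnapshot, now time.Time) (uint64, bool) {
	var min uint64
	found := false
	for _, p := range pending {
		if !p.deadline.IsZero() && now.After(p.deadline) {
			continue // snapshot completed long enough ago; ignore it
		}
		if !found || p.index < min {
			min, found = p.index, true
		}
	}
	return min, found
}

func main() {
	pending := map[string]pendingSnapshot{
		"snap-a": {index: 95},                                         // still in flight
		"snap-b": {index: 80, deadline: time.Now().Add(-time.Minute)}, // completed a while ago
	}
	idx, ok := minPendingSnapshotIndex(pending, time.Now())
	fmt.Println(idx, ok) // 95 true: truncation must not go past index 95
}
```
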

There's an opportunity for a bigger refactor here by using learner
replicas instead of preemptive snapshots. The idea is to first add the
replica as a learner (ignoring it for the quota pool until it has
received the first snapshot), and to use the regular Raft snapshot
mechanism to catch it up. Once achieved, another configuration change
would convert the learner into a regular follower. This change in itself
will likely make our code more principled, but it is a more invasive
change that is left for the future.

Similarly, there is knowledge in the quota pool that it seems we should
be using for log truncation decisions. The quota pool essentially knows
about the size of each log entry whereas the Raft truncations only know
the accumulated approximate size of the full log. For instance, instead
of blindly truncating to the quorum commit index when the Raft log is
too large, we could truncate it to an index that reduces the log size to
about 50% of the maximum, in effect reducing the number of snapshots
that are necessary due to quorum truncation. It's unclear whether this
improvement will matter in practice.

The following script reproduces redundant Raft snapshots (needs a
somewhat beefy machine such as a gceworker). Note that this script will
also incur Raft snapshots due to the splits, which are fixed in a
follow-up commit.

```
set -euxo pipefail

killall -9 cockroach || true
killall -9 workload || true
sleep 1
rm -rf cockroach-data || true
mkdir -p cockroach-data

./cockroach start --insecure --host=localhost --port=26257 --http-port=26258 --store=cockroach-data/1 --cache=256MiB --background
./cockroach start --insecure --host=localhost --port=26259 --http-port=26260 --store=cockroach-data/2 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26261 --http-port=26262 --store=cockroach-data/3 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26263 --http-port=26264 --store=cockroach-data/4 --cache=256MiB --join=localhost:26257 --background

sleep 5

./cockroach sql --insecure -e 'set cluster setting kv.range_merge.queue_enabled = false;'

./cockroach sql --insecure <<EOF
CREATE DATABASE csv;
IMPORT TABLE csv.lineitem CREATE USING 'gs://cockroach-fixtures/tpch-csv/schema/lineitem.sql'
CSV DATA (
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.1',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.2',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.3',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.4',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.5',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.6',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.7',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.8'
) WITH delimiter='|' ;
EOF

sleep 5

for port in 26257 26259 26261 26263; do
 ./cockroach sql --insecure -e "select name, value from crdb_internal.node_metrics where name like '%raftsn%' order by name desc" --port "${port}"
done
```

Release note: None

This is yet another root cause of Raft snapshots during import/restore.
We create a number of new ranges which may experience Raft leadership
changes early on and ingest SSTs at the same time. This leads to a
situation in which we have followers whose acknowledged log index isn't
known, which in turn leads to bad truncation decisions. Respect such
followers (which are usually only in this state for very brief amounts
of time).

Release note: None
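
A minimal Go sketch of respecting followers with an unknown acknowledged index when picking a truncation index; `progress` and `safeTruncationIndex` are hypothetical stand-ins, not the actual CockroachDB code:

```go
package main

import "fmt"

// progress is a minimal stand-in for a follower's Raft progress: Match is
// the highest log index known to be replicated to that follower, or zero
// if we have not heard from it yet (e.g. right after a leadership change).
type progress struct {
	Match uint64
}

// safeTruncationIndex refuses to truncate past the first index of the log
// when any follower's acknowledged index is unknown (Match == 0), since we
// cannot tell how far behind that follower really is.
func safeTruncationIndex(firstIndex, proposed uint64, followers []progress) uint64 {
	for _, p := range followers {
		if p.Match == 0 {
			return firstIndex // unknown progress: don't cut this follower off
		}
		if p.Match < proposed {
			proposed = p.Match
		}
	}
	return proposed
}

func main() {
	followers := []progress{{Match: 120}, {Match: 0}} // second follower just appeared
	fmt.Println(safeTruncationIndex(10, 115, followers)) // 10: hold truncation back
}
```
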
@tbg tbg force-pushed the backport2.1-31728-32137-32437-32439 branch from bac6de3 to 6d4bf7a on December 10, 2018 11:06
@tbg tbg requested a review from petermattis December 10, 2018 12:35
@petermattis petermattis (Collaborator) left a comment

:lgtm:

I assume all these cherry-picks were clean, right?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@tbg tbg (Member, Author) commented Dec 10, 2018

They weren't clean, but the conflicts were minor. As far as I could tell, they were all due to 2692aba and were straightforward to fix.

@tbg tbg merged commit 64f2725 into cockroachdb:release-2.1 Dec 10, 2018
@tbg tbg deleted the backport2.1-31728-32137-32437-32439 branch December 10, 2018 22:39
@tbg tbg (Member, Author) commented Dec 10, 2018

TFTR!
