
backport-2.1: storage: improve logging and log Raft truncation #32972

Merged: 9 commits merged into cockroachdb:release-2.1 from backport2.1-31728-32137-32437-32439 on Dec 10, 2018

Conversation

@tbg tbg (Member) commented Dec 10, 2018

Backport:

Note that this does not backport the other commit from #31875, which solved the split-trigger/snapshot race in an unfortunate way that is currently being reworked in #32782 (though that rework may itself not get backported).

Please see individual PRs for details.

/cc @cockroachdb/release

@tbg tbg requested a review from a team December 10, 2018 10:10
@cockroach-teamcity (Member)

This change is Reviewable

tbg added 9 commits December 10, 2018 12:05
Queues can run into a variety of errors which are "harmless" in nature
(assuming they are not permanent).

Introduce a mechanism to detect such errors and avoid having them
increment the error metrics (and pollute the logs).

Touches cockroachdb#31373.

Release note: None
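
For illustration, a minimal Go sketch of the classification idea described above; the names (`benignError`, `recordError`) are hypothetical, not CockroachDB's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// benignError wraps an error that is expected during normal operation
// (e.g. a replica that was removed while queued) so the queue can log it
// quietly instead of bumping the failure metrics. Hypothetical sketch.
type benignError struct {
	cause error
}

func (be *benignError) Error() string { return "benign: " + be.cause.Error() }

// recordError sketches how a queue could branch on the classification.
func recordError(err error, failures *int) {
	if _, ok := err.(*benignError); ok {
		fmt.Println("queue: (benign)", err)
		return
	}
	*failures++
	fmt.Println("queue: error:", err)
}

func main() {
	var failures int
	recordError(&benignError{cause: errors.New("replica no longer exists")}, &failures)
	recordError(errors.New("disk corruption"), &failures)
	fmt.Println("failure metric:", failures) // 1: only the non-benign error counted
}
```
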
I didn't notice a concrete problem here, but it's more robust to walk
the error chain.

Release note: None

If either of the two special-cased errors ever implemented Causer, the
old code would jump right past them.

Release note: None
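
A minimal Go sketch of walking the error chain rather than inspecting only the outermost error, assuming the `Causer` interface from github.com/pkg/errors; the concrete error types here are stand-ins, not the real ones:

```go
package main

import "fmt"

// causer mirrors the Causer interface from github.com/pkg/errors: an error
// that wraps another error exposes its cause.
type causer interface {
	Cause() error
}

// wrapErr stands in for any error type that wraps a cause.
type wrapErr struct {
	msg   string
	cause error
}

func (e *wrapErr) Error() string { return e.msg + ": " + e.cause.Error() }
func (e *wrapErr) Cause() error  { return e.cause }

// specialErr stands in for one of the special-cased error types.
type specialErr struct{}

func (specialErr) Error() string { return "special-cased condition" }

// isSpecial walks the entire chain instead of calling Cause() once up
// front, so a special-cased error is still recognized whether it sits at
// the top of the chain, is wrapped, or itself implements Cause.
func isSpecial(err error) bool {
	for err != nil {
		if _, ok := err.(specialErr); ok {
			return true
		}
		c, ok := err.(causer)
		if !ok {
			return false
		}
		err = c.Cause()
	}
	return false
}

func main() {
	err := &wrapErr{msg: "processing replica", cause: specialErr{}}
	fmt.Println(isSpecial(err)) // true: found by walking the chain
}
```
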
This makes it a lot easier to log descriptive debug messages indicating
how a truncation decision was arrived at, and in particular allows
pointing the finger at truncations that lead to Raft snapshots, which
is relevant in the context of cockroachdb#32046.

Release note: None
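
To make the idea concrete, here is a hedged Go sketch of gathering the inputs and the outcome of a truncation decision in one struct so a single log line can explain it; `truncateDecision` and its fields are illustrative, not the actual CockroachDB type:

```go
package main

import "fmt"

// truncateDecision collects the inputs and outcome of a Raft log
// truncation decision so it can be logged in one descriptive line and
// flag truncations that will force a Raft snapshot. Illustrative sketch.
type truncateDecision struct {
	NewFirstIndex    uint64 // first log index remaining after truncation
	MinFollowerIndex uint64 // lowest index acknowledged by any follower
	LogSizeBytes     int64
	ChosenVia        string // which rule picked NewFirstIndex
}

// forcesSnapshot reports whether the truncation removes entries that some
// follower has not acknowledged yet, i.e. whether that follower will need
// a Raft snapshot to catch up.
func (d truncateDecision) forcesSnapshot() bool {
	return d.NewFirstIndex > d.MinFollowerIndex+1
}

func (d truncateDecision) String() string {
	return fmt.Sprintf("truncate to %d (via %s); log size %d bytes; forces snapshot: %t",
		d.NewFirstIndex, d.ChosenVia, d.LogSizeBytes, d.forcesSnapshot())
}

func main() {
	d := truncateDecision{NewFirstIndex: 105, MinFollowerIndex: 90,
		LogSizeBytes: 5 << 20, ChosenVia: "quorum index (log too large)"}
	fmt.Println(d)
}
```
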
Whenever the "max raft log size" is exceeded, log truncations become
more aggressive in that they aim at the quorum commit index, potentially
cutting off followers (which then need Raft snapshots).

The effective log size threshold is 4mb for replicas larger than 4mb, and
the replica size otherwise. The latter case can be problematic since
replicas can be persistently small despite having steady log progress
(for example, range 4 receives node status updates which are large
inline puts). If in such a range a follower falls behind just slightly,
it'll need a snapshot. This isn't in itself the biggest deal since the
snapshot is fairly rare (the required log entries are usually already
in transit to the follower) and would be small, but it's not ideal.

Always use a 4mb threshold instead. Note that we also truncate the log
to the minimum replicated index if the log size is above 64kb. This is
similarly aggressive but respects followers (until they fall behind by
4mb or more).

My expectation is that this will not functionally change anything. It
might leave behind a little bit more Raft log on quiescent ranges, but I
think the solution here is performing "one last truncation" for ranges
that are quiescent to make sure they shed the remainder of their Raft
log.

Touches cockroachdb#32046.

Release note: None
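
A minimal Go sketch of the decision described above under these assumptions: the constant names are illustrative stand-ins for the relevant settings (4mb truncation threshold, 64kb "small log" threshold), not CockroachDB's actual identifiers:

```go
package main

import "fmt"

const (
	// Illustrative stand-ins for the thresholds discussed above.
	raftLogTruncationThreshold = 4 << 20  // 4mb: always truncate aggressively above this
	raftLogSmallSize           = 64 << 10 // 64kb: above this, truncate but respect followers
)

// truncationTarget sketches the rule: if the Raft log exceeds the fixed
// 4mb threshold, truncate to the quorum commit index (possibly cutting off
// slow followers); if it merely exceeds 64kb, truncate only up to the
// minimum index replicated to all followers.
func truncationTarget(logSize, quorumIndex, minReplicatedIndex uint64) (index uint64, ok bool) {
	switch {
	case logSize > raftLogTruncationThreshold:
		return quorumIndex, true
	case logSize > raftLogSmallSize:
		return minReplicatedIndex, true
	default:
		return 0, false // log is small; no truncation needed
	}
}

func main() {
	idx, ok := truncationTarget(5<<20, 120, 100)
	fmt.Println(idx, ok) // 120 true: size-based truncation to the quorum index
}
```
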
This is a more natural place to put them a) because there's no reason to
have them sit on Replica and b) because this will allow me to write
better code when I query this map in the log truncation code.

Release note: None

When a (preemptive or Raft) snapshot is inflight, a range should not
truncate its log index to one ahead of the pending snapshot as doing so
would require yet another snapshot.

The code prior to this commit attempted to achieve that by treating the
pending snapshot as an additional follower in the computation of the
quorum index. This works as intended as long as the Raft log is below
its configured maximum size but usually leads to abandoning the snapshot
index when truncating based on size (to the quorum commit index). This
situation occurs frequently during the split+scatter phase of data
imports, where (mostly empty) ranges are rapidly upreplicated and split.

This isn't limited to small replicas, however. Assume a range is ~32mb
in size and a (say, preemptive) snapshot is underway. Preemptive
snapshots are (at the time of writing) throttled to 2mb/s, so it will
take approximately 16s to go through. At the same time, the Raft log may
easily grow in size by the size truncation threshold (4mb at time of
writing), allowing a log truncation that abandons the snapshot. A
similar phenomenon applies to Raft snapshots, though in that case the
quota pool restricts forward Raft progress, except when a single command
exceeds the size restriction (as is common during RESTORE operations,
where individual Raft commands are large). I haven't
conclusively observed this in practice (though there has been enough
badness to suspect that it happened), but in principle this could lead
to a potentially infinite number of snapshots being sent out, a very
expensive way of keeping a follower up to date.

After this change, the pending snapshot index is managed more carefully
(and is managed for Raft snapshots as well, not only for preemptive
ones) and is never truncated away. As a result, in regular operation a
snapshot can now always be followed by regular Raft log catch-up. More
precisely, we keep a map of pending snapshot index to
deadline. The deadline is zero while the snapshot is ongoing and is set
when the snapshot is completed. This avoids races between the snapshot
completing and the replication change completing (which would occur
even if the snapshot is only registering as completed after the
replication change completes).
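
A hedged Go sketch of such a map of pending snapshots and of how truncation could query it; the type and function names are illustrative, not the actual CockroachDB data structure:

```go
package main

import (
	"fmt"
	"time"
)

// pendingSnapshot records the log index a snapshot was generated at. The
// deadline is zero while the snapshot is in flight and is set once the
// snapshot has completed, after which the entry can be ignored a short
// while later. Illustrative sketch of the mechanism described above.
type pendingSnapshot struct {
	index    uint64
	deadline time.Time
}

// minPendingSnapshotIndex returns the lowest index that must be retained
// so that no in-flight (or just-completed) snapshot is truncated away.
func minPendingSnapshotIndex(pending map[string]pendingSnapshot, now time.Time) (uint64, bool) {
	var min uint64
	found := false
	for _, p := range pending {
		if !p.deadline.IsZero() && now.After(p.deadline) {
			continue // snapshot completed long enough ago; ignore it
		}
		if !found || p.index < min {
			min, found = p.index, true
		}
	}
	return min, found
}

func main() {
	pending := map[string]pendingSnapshot{
		"snap-a": {index: 95},                                         // still in flight
		"snap-b": {index: 80, deadline: time.Now().Add(-time.Minute)}, // completed a while ago
	}
	idx, ok := minPendingSnapshotIndex(pending, time.Now())
	fmt.Println(idx, ok) // 95 true: truncation must not go past index 95
}
```
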

There's an opportunity for a bigger refactor here by using learner
replicas instead of preemptive snapshots. The idea is to first add the
replica as a learner (ignoring it for the quota pool until it has
received the first snapshot), and to use the regular Raft snapshot
mechanism to catch it up. Once achieved, another configuration change
would convert the learner into a regular follower. This change in itself
will likely make our code more principled, but it is a more invasive
change that is left for the future.

Similarly, there is knowledge in the quota pool that it seems we should
be using for log truncation decisions. The quota pool essentially knows
about the size of each log entry whereas the Raft truncations only know
the accumulated approximate size of the full log. For instance, instead
of blindly truncating to the quorum commit index when the Raft log is
too large, we could truncate it to an index that reduces the log size to
about 50% of the maximum, in effect reducing the number of snapshots
that are necessary due to quorum truncation. It's unclear whether this
improvement will matter in practice.

The following script reproduces redundant Raft snapshots (needs a
somewhat beefy machine such as a gceworker). Note that this script will
also incur Raft snapshots due to the splits, which are fixed in a
follow-up commit.

```
set -euxo pipefail

killall -9 cockroach || true
killall -9 workload || true
sleep 1
rm -rf cockroach-data || true
mkdir -p cockroach-data

./cockroach start --insecure --host=localhost --port=26257 --http-port=26258 --store=cockroach-data/1 --cache=256MiB --background
./cockroach start --insecure --host=localhost --port=26259 --http-port=26260 --store=cockroach-data/2 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26261 --http-port=26262 --store=cockroach-data/3 --cache=256MiB --join=localhost:26257 --background
./cockroach start --insecure --host=localhost --port=26263 --http-port=26264 --store=cockroach-data/4 --cache=256MiB --join=localhost:26257 --background

sleep 5

./cockroach sql --insecure -e 'set cluster setting kv.range_merge.queue_enabled = false;'

./cockroach sql --insecure <<EOF
CREATE DATABASE csv;
IMPORT TABLE csv.lineitem CREATE USING 'gs://cockroach-fixtures/tpch-csv/schema/lineitem.sql'
CSV DATA (
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.1',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.2',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.3',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.4',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.5',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.6',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.7',
 'gs://cockroach-fixtures/tpch-csv/sf-1/lineitem.tbl.8'
) WITH delimiter='|' ;
EOF

sleep 5

for port in 26257 26259 26261 26263; do
 ./cockroach sql --insecure -e "select name, value from crdb_internal.node_metrics where name like '%raftsn%' order by name desc" --port "${port}"
done
```

Release note: None

This is yet another root cause of Raft snapshots during import/restore.
We create a number of new ranges which may experience Raft leadership
changes early on and ingest SSTs at the same time. This leads to a
situation in which we have followers whose acknowledged log index isn't
known, which in turn leads to bad truncation decisions. Respect such
followers (which are usually only in this state for very brief amounts
of time).

Release note: None
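
A minimal Go sketch of respecting followers with an unknown acknowledged index when picking a truncation index; `progress` and `safeTruncationIndex` are hypothetical stand-ins, not the actual CockroachDB code:

```go
package main

import "fmt"

// progress is a minimal stand-in for a follower's Raft progress: Match is
// the highest log index known to be replicated to that follower, or zero
// if we have not heard from it yet (e.g. right after a leadership change).
type progress struct {
	Match uint64
}

// safeTruncationIndex refuses to truncate past the first index of the log
// when any follower's acknowledged index is unknown (Match == 0), since we
// cannot tell how far behind that follower really is.
func safeTruncationIndex(firstIndex, proposed uint64, followers []progress) uint64 {
	for _, p := range followers {
		if p.Match == 0 {
			return firstIndex // unknown progress: don't cut this follower off
		}
		if p.Match < proposed {
			proposed = p.Match
		}
	}
	return proposed
}

func main() {
	followers := []progress{{Match: 120}, {Match: 0}} // second follower just appeared
	fmt.Println(safeTruncationIndex(10, 115, followers)) // 10: hold truncation back
}
```
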
@tbg tbg force-pushed the backport2.1-31728-32137-32437-32439 branch from bac6de3 to 6d4bf7a on December 10, 2018 11:06
@tbg tbg requested a review from petermattis December 10, 2018 12:35
@petermattis petermattis (Collaborator) left a comment

:lgtm:

I assume all these cherry-picks were clean, right?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@tbg tbg (Member, Author) commented Dec 10, 2018

They weren't clean, but the conflicts were minor. As far as I could tell, they were all due to 2692aba and were straightforward to fix.

@tbg tbg merged commit 64f2725 into cockroachdb:release-2.1 Dec 10, 2018
@tbg tbg deleted the backport2.1-31728-32137-32437-32439 branch December 10, 2018 22:39
@tbg tbg (Member, Author) commented Dec 10, 2018

TFTR!
