
storage: splits don't seem to take into account range size properly #21689

Closed · jordanlewis opened this issue Jan 23, 2018 · 4 comments

@jordanlewis

I noticed this while playing with the tpcc 1000 warehouse dataset, but it's easy to reproduce with just 10 warehouses.

Under the tpcc data generation load, ranges get queued for splitting due to their size despite not actually being over the range size limit.

To observe this, simply run a 1-node cockroach cluster locally and run ./tpcc -load -warehouses=10. After a short while, you should notice that the cluster is performing a ton of splits on many different tables.

For example, after loading 10 warehouses, the stock table has 232 ranges. The first one is pretty large, containing more than a single 100,000 row warehouse. The second one is of similar size. The third is fairly small, containing just below 18,000 rows, and the rest are very small, containing only a few thousand rows each. Here's a snippet of the ranges:

root@:26257/tpcc> show testing_ranges from table stock;
+-----------+-----------+----------+----------+--------------+
| Start Key |  End Key  | Range ID | Replicas | Lease Holder |
+-----------+-----------+----------+----------+--------------+
| NULL      | /1/699    |       59 | {1}      |            1 |
| /1/699    | /2/1384   |       61 | {1}      |            1 |
| /2/1384   | /2/18000  |       62 | {1}      |            1 |
| /2/18000  | /2/20000  |       63 | {1}      |            1 |
| /2/20000  | /2/23000  |       64 | {1}      |            1 |
| /2/23000  | /2/27000  |       65 | {1}      |            1 |
| /2/27000  | /2/30000  |       66 | {1}      |            1 |
| /2/30000  | /2/34000  |       67 | {1}      |            1 |
| /2/34000  | /2/36000  |       68 | {1}      |            1 |
| /2/36000  | /2/41000  |       69 | {1}      |            1 |
| /2/41000  | /2/43000  |       70 | {1}      |            1 |
<snip>

As a total split queue novice, I poked around and added some debug output:

diff --git a/pkg/storage/split_queue.go b/pkg/storage/split_queue.go
index 8d9e5f9be..e26eea045 100644
--- a/pkg/storage/split_queue.go
+++ b/pkg/storage/split_queue.go
@@ -20,6 +20,8 @@ import (

        "github.com/pkg/errors"

+       "fmt"
+
        "github.com/cockroachdb/cockroach/pkg/config"
        "github.com/cockroachdb/cockroach/pkg/gossip"
        "github.com/cockroachdb/cockroach/pkg/internal/client"
@@ -83,6 +85,7 @@ func (sq *splitQueue) shouldQueue(
        // Add priority based on the size of range compared to the max
        // size for the zone it's in.
        if ratio := float64(repl.GetMVCCStats().Total()) / float64(repl.GetMaxBytes()); ratio > 1 {
+               fmt.Println("Ratio: ", ratio, desc.RangeID, desc.StartKey, desc.EndKey, repl.GetMVCCStats().Total(), repl.GetMaxBytes())
                priority += ratio
                shouldQ = true
        }

This log line fires many times for a particular range when a split happens, claiming that the result of Total() is in fact greater than 64 megabytes. That is empirically false: the rows in these small ranges are no larger, byte-wise, than those in the large ranges.
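For reference, the condition being logged boils down to a ratio test: queue the range for a split whenever its reported total size exceeds the zone's maximum range size (64 MiB here). Below is a stand-alone restatement with placeholder values (the constants and variable names are stand-ins; only the ratio logic mirrors the check in the diff above):

package main

import "fmt"

func main() {
	const maxBytes = 64 << 20        // zone's max range size (64 MiB)
	reportedTotal := int64(70 << 20) // what the replica's MVCCStats.Total() claims

	// Mirror of the shouldQueue check: ratio > 1 means "too big, split it",
	// and the ratio itself is added to the queue priority.
	if ratio := float64(reportedTotal) / float64(maxBytes); ratio > 1 {
		fmt.Printf("queue for split, priority += %.2f\n", ratio)
	}
}

So a range that merely reports a too-large total keeps getting queued, even if the data actually in it is tiny, which matches the symptom described here.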

So, my hypothesis is that something's overcounting range size. Since this behavior takes a while to kick in (a couple warehouses), I would guess that there's an issue during range splits themselves that causes the size of the new range to be overcounted.

cc @petermattis @tschottdorf as likely candidates for people who know about MVCC stats.

petermattis added this to the 2.0 milestone Jan 23, 2018
@nvanbenschoten

I just confirmed for my own sanity that this wasn't caused by #21562.

@petermattis

@tschottdorf You've had your head in the MVCC stats muck recently.

@tbg commented Jan 23, 2018

This was introduced by #21078, but I don't understand what's going wrong yet -- it doesn't seem to have anything to do with the new functionality introduced in that change. I'm still figuring out the details, but whatever it is, I think I'll get to this soon since it's so easy to repro. In the meantime, take a look at the logging I added below. stats LHS is what the left side of the split (post split) thinks it has; computed LHS is what it actually has. Ditto for RHS.

Reading these numbers, the LHS thinks it's empty, but it's actually about half of the data. The RHS thinks it's everything, but it's really only the other half (to the point where the numbers add up exactly in most columns). I'm very curious how this came to be, and even more so, how it's caused by the PR linked above.

stats LHS    enginepb.MVCCStats{ContainsEstimates:false, LastUpdateNanos:1516735821067912000, IntentAge:0, GCBytesAge:0, LiveBytes:0, LiveCount:0, KeyBytes:0, KeyCount:0, ValBytes:0, ValCount:0, IntentBytes:0, IntentCount:0, SysBytes:637, SysCount:6}
computed LHS enginepb.MVCCStats{ContainsEstimates:false, LastUpdateNanos:1516735821067912000, IntentAge:0, GCBytesAge:0, LiveBytes:33554320, LiveCount:100694, KeyBytes:2047617, KeyCount:100694, ValBytes:31506703, ValCount:100694, IntentBytes:0, IntentCount:0, SysBytes:1017, SysCount:12}
stats RHS    enginepb.MVCCStats{ContainsEstimates:false, LastUpdateNanos:1516735821067912000, IntentAge:0, GCBytesAge:0, LiveBytes:69638828, LiveCount:209000, KeyBytes:4247838, KeyCount:209000, ValBytes:65390990, ValCount:209000, IntentBytes:0, IntentCount:0, SysBytes:915, SysCount:16}
computed RHS enginepb.MVCCStats{ContainsEstimates:false, LastUpdateNanos:1516735821067912000, IntentAge:0, GCBytesAge:0, LiveBytes:36084508, LiveCount:108306, KeyBytes:2200221, KeyCount:108306, ValBytes:33884287, ValCount:108306, IntentBytes:0, IntentCount:0, SysBytes:535, SysCount:10}
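To make the "numbers add up" observation concrete: computed-LHS LiveBytes 33554320 + computed-RHS LiveBytes 36084508 = 69638828, which is exactly the LiveBytes the in-memory stats credit to the RHS, and LiveCount (100694 + 108306 = 209000), KeyBytes, and ValBytes line up the same way. That is the pattern you would expect if the LHS stats are recomputed by scanning its span, the RHS gets the pre-split total minus that, and the scan comes back empty. A toy model of that accounting with stand-in types (not the actual enginepb/storage code):

package main

import "fmt"

type stats struct{ LiveBytes, LiveCount int64 }

func (s stats) sub(o stats) stats {
	return stats{s.LiveBytes - o.LiveBytes, s.LiveCount - o.LiveCount}
}

func main() {
	preSplit := stats{LiveBytes: 69638828, LiveCount: 209000}

	// What a scan of the LHS span should have found...
	correctLHS := stats{LiveBytes: 33554320, LiveCount: 100694}
	// ...and what a broken (empty) scan reports instead.
	brokenLHS := stats{}

	fmt.Println("expected:", correctLHS, preSplit.sub(correctLHS)) // each side gets roughly half
	fmt.Println("observed:", brokenLHS, preSplit.sub(brokenLHS))   // LHS empty, RHS credited with everything
}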

@tbg commented Jan 23, 2018

With the logging below added to the split trigger (I added engineLeftMS), I see that this is the origin of the problem: leftMS has most fields zero (engineLeftMS does not).

	leftMS, err := rditer.ComputeStatsForRange(&split.LeftDesc, batch, ts.WallTime)
	if err != nil {
		return enginepb.MVCCStats{}, result.Result{}, errors.Wrap(err, "unable to compute stats for LHS range after split")
	}
	log.Event(ctx, "computed stats for left hand side range")

	log.Warningf(ctx, "orig ms %s\nleft ms %s", pretty.Sprint(origBothMS), pretty.Sprint(leftMS))

	engineLeftMS, err := rditer.ComputeStatsForRange(&split.LeftDesc, rec.Engine(), ts.WallTime)
	if err != nil {
		return enginepb.MVCCStats{}, result.Result{}, err
	}
	log.Warningf(ctx, "recomp left ms %s", pretty.Sprint(engineLeftMS))

The problem is in (*rocksDBBatch).NewTimeBoundIterator, which makes all future calls to NewIterator() return the time-bounded iterator. Sending a fix.
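For illustration only (hypothetical type and method names, not the actual RocksDB wrapper): the failure mode described above amounts to a batch caching its last-created iterator and handing that same iterator back from plain NewIterator() calls, so a later full scan silently skips every key that falls outside the old time bounds.

package main

import "fmt"

// iterator is a stand-in; a time-bound iterator only surfaces keys written
// within its time bounds.
type iterator struct {
	timeBound bool
}

// batch is a stand-in for a write batch that lazily caches its iterator.
type batch struct {
	cached *iterator
}

// NewTimeBoundIterator creates a time-bounded iterator and caches it.
func (b *batch) NewTimeBoundIterator() *iterator {
	b.cached = &iterator{timeBound: true}
	return b.cached
}

// NewIterator should hand back an ordinary iterator, but the buggy caching
// returns whatever was created last, time bounds and all.
func (b *batch) NewIterator() *iterator {
	if b.cached == nil {
		b.cached = &iterator{}
	}
	return b.cached // bug: may still be the time-bounded iterator
}

func main() {
	b := &batch{}
	_ = b.NewTimeBoundIterator()             // used earlier while evaluating the command
	it := b.NewIterator()                    // later: the split's LHS stats scan
	fmt.Println("time bound:", it.timeBound) // true, so the scan misses most keys
}

That would explain why leftMS comes out nearly empty while a recomputation against the engine (engineLeftMS) sees the real data.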

tbg closed this as completed in 61dd154 Jan 23, 2018
tbg added a commit to tbg/cockroach that referenced this issue Jan 23, 2018
Release note: None

Without cockroachdb#21721, this fails with

> saw 0 values in regular iterator, but expected 4.

Touches cockroachdb#21721.
Touches cockroachdb#21689.