
stability: disconnect + crash under load #4925

Closed
mberhault opened this issue Mar 6, 2016 · 19 comments
@mberhault (Contributor)

There is no clear and obvious culprit here, so I'll attempt to summarize as best I can.

build sha: e6793e1

4-node cluster with the photos example running against it.
Things are fine for a few hours, until the following happens:

Crash timelines and full logs (all crashes are OOM):

node4: 17:40:41
node4.crash.txt

node3: 18:04:32
node3.crash.txt

node2: 18:09:45
node2.crash.txt

node1: 21:15:49
node1.crash.txt

Now, looking for originating events is where it gets a bit tedious.
One of the first signs of badness is around 17:26, when we get:

17:26:54.224163 gossip/gossip.go:923  partition in gossip network; attempting new connection

This happens on nodes 1, 2, and 4 at that precise time, and 20 seconds later on node 3.
Lease holders were the following, with node 3 being the last one:

Node1:
I160306 17:24:47.189891 storage/replica_command.go:1255  range 1: new leader lease replica {1 1 10} 2016-03-06 17:24:43 +0000 UTC +1.820s
I160306 17:24:55.236673 storage/replica_command.go:1255  range 1: new leader lease replica {1 1 10} 2016-03-06 17:24:54 +0000 UTC +1.152s

Node2:
I160306 17:10:28.212065 storage/replica_command.go:1255  range 1: new leader lease replica {2 2 9} 2016-03-06 17:10:24 +0000 UTC +4.435s
I160306 17:10:38.194705 storage/replica_command.go:1255  range 1: new leader lease replica {2 2 9} 2016-03-06 17:10:34 +0000 UTC +4.438s

Node3:
I160306 17:24:47.324180 storage/replica_command.go:1255  range 1: new leader lease replica {3 3 11} 2016-03-06 17:24:46 +0000 UTC +1.391s
I160306 17:25:12.291490 storage/replica_command.go:1255  range 1: new leader lease replica {3 3 11} 2016-03-06 17:25:08 +0000 UTC +1.474s

Node4:
I160306 17:24:20.530929 storage/replica_command.go:1255  range 1: new leader lease replica {4 4 4} 2016-03-06 17:24:17 +0000 UTC +1.732s
I160306 17:24:24.417120 storage/replica_command.go:1255  range 1: new leader lease replica {4 4 4} 2016-03-06 17:24:23 +0000 UTC +1.003s
@mberhault (Contributor, Author)

I've also been grabbing heap profiles every minute. Here's the last non-empty profile for each node using go tool pprof --png --runtime <binary> <heap profile>.

Node1: profile1 (heap profile image)
Node2: profile2 (heap profile image)
Node3: profile3 (heap profile image)
Node4: profile4 (heap profile image)

@bdarnell (Contributor)

bdarnell commented Mar 6, 2016

The immediate cause is tons of async intent resolutions. Queueing those up (as I'm planning to do for #4881) will help, although it's unclear how things got into a state where nothing can make progress.

@petermattis (Collaborator)

Plan is to add a pool of intent resolution workers which pull intent resolution tasks from a shared channel. Back pressure is applied if the channel fills up, causing the calling goroutine to block.
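
For illustration only, a minimal sketch of that shape in Go: a fixed pool of workers draining a bounded channel, with senders blocking once the channel is full. Every name here (intentTask, resolver, resolveIntent) is made up for the sketch and is not the actual CockroachDB API.

package main

import "sync"

// intentTask stands in for whatever an async intent resolution needs
// to carry (transaction ID, intent spans, ...).
type intentTask struct {
    txnID string
}

// resolver owns a bounded queue and a fixed set of worker goroutines.
type resolver struct {
    tasks chan intentTask
    wg    sync.WaitGroup
}

func newResolver(workers, queueSize int) *resolver {
    r := &resolver{tasks: make(chan intentTask, queueSize)}
    for i := 0; i < workers; i++ {
        r.wg.Add(1)
        go func() {
            defer r.wg.Done()
            for t := range r.tasks {
                resolveIntent(t) // the actual resolution RPCs would happen here
            }
        }()
    }
    return r
}

// enqueue blocks once the channel is full; that block is the back
// pressure applied to the calling goroutine.
func (r *resolver) enqueue(t intentTask) {
    r.tasks <- t
}

func (r *resolver) stop() {
    close(r.tasks)
    r.wg.Wait()
}

func resolveIntent(t intentTask) { /* stub */ }

func main() {
    r := newResolver(4, 128)
    r.enqueue(intentTask{txnID: "abc"})
    r.stop()
}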

@petermattis petermattis added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 8, 2016
@petermattis (Collaborator)

Need to make sure this doesn't cause deadlock. We could drop the intent resolution on the floor if the channel is full.
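
Sticking with the hypothetical resolver sketched above, the drop-on-the-floor variant is just a non-blocking send; it avoids any chance of the caller deadlocking on a full queue, at the cost of leaving the intent to be re-discovered and cleaned up later.

// tryEnqueue is the non-blocking alternative to enqueue: if the queue
// is full, the intent resolution is dropped rather than blocking.
func (r *resolver) tryEnqueue(t intentTask) bool {
    select {
    case r.tasks <- t:
        return true
    default:
        return false // dropped; a later reader will run into the intent again
    }
}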

@petermattis (Collaborator)

Other brainstorming: intents are associated with a transaction. We could maintain a map of transaction to in-flight intent resolution so that we don't pile up a bunch of RPCs for the same transaction.
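
Again just a sketch of the brainstormed idea, reusing the made-up names (and the sync import) from the sketch above: a mutex-protected map keyed by transaction ID, so only the first caller for a given transaction actually launches a resolution.

// inFlight tracks transactions that already have an intent resolution
// in progress, so we don't pile up many RPCs for the same transaction.
type inFlight struct {
    mu   sync.Mutex
    txns map[string]struct{}
}

// tryStart reports whether the caller should resolve intents for
// txnID; it returns false if a resolution is already in flight.
func (f *inFlight) tryStart(txnID string) bool {
    f.mu.Lock()
    defer f.mu.Unlock()
    if _, ok := f.txns[txnID]; ok {
        return false
    }
    if f.txns == nil {
        f.txns = make(map[string]struct{})
    }
    f.txns[txnID] = struct{}{}
    return true
}

// finish marks the transaction's resolution as done so a later
// resolution for the same transaction can be started again.
func (f *inFlight) finish(txnID string) {
    f.mu.Lock()
    defer f.mu.Unlock()
    delete(f.txns, txnID)
}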

@petermattis (Collaborator)

Cc @nvanbenschoten.

@petermattis petermattis assigned bdarnell and unassigned petermattis Mar 10, 2016
bdarnell added a commit to bdarnell/cockroach that referenced this issue Mar 14, 2016
An explosion of these calls for the same transaction (or a small number
of transactions) was responsible for cockroachdb#4925

Fixes cockroachdb#4925
bdarnell added a commit to bdarnell/cockroach that referenced this issue Mar 14, 2016
This is a last line of defense against issues like cockroachdb#4925.
@tbg (Member)

tbg commented Mar 14, 2016

unrelated-ish, but

running single-node single-photo after 25min:

I160314 10:59:00.309653 user.go:116  89479 ops, 11 no-user, 0 errs (38.20/s)
I160314 10:59:10.312414 user.go:116  89912 ops, 11 no-user, 0 errs (43.30/s)
I160314 10:59:20.309660 user.go:116  90456 ops, 11 no-user, 0 errs (54.40/s)
I160314 10:59:30.309139 user.go:116  91004 ops, 11 no-user, 0 errs (54.80/s)
I160314 10:59:40.309233 user.go:116  91403 ops, 11 no-user, 0 errs (39.90/s)

That's really slow for single-node single-photo. This was running on #5219, but not much faster on master. Has anybody looked at this already?

@tbg (Member)

tbg commented Mar 14, 2016

2016/03/14 15:16:42.270775  0.055116    node 1
15:16:42.270776  .     1    ... node 1
15:16:42.270777  .     1    ... node
15:16:42.270784  .     8    ... read has no clock uncertainty
15:16:42.270790  .     6    ... executing 1 requests
15:16:42.270793  .     3    ... replica
15:16:42.270794  .     1    ... read-only path
15:16:42.270802  .     8    ... left command queue
15:16:42.270803  .     1    ... acquired read lock
15:16:42.325824  . 55020    ... executed batch: Scan [/Table/53/1/""/"s\x84\x14\xa8(\x18O۠@Gb\xf0 \x883"/5/1/NULL,/Table/53/1/"\x00")

Something's pretty slow down at RocksDB again. I'll make a conscious effort to plumb the tracing further down that path.

@tbg (Member)

tbg commented Mar 14, 2016

hey @RaduBerinde, with some more tracing I see this:

2016/03/14 15:58:43.763655  0.052631    node 1
15:58:43.763659  .     4    ... node 1
15:58:43.763673  .    14    ... read has no clock uncertainty
15:58:43.763685  .    12    ... executing 1 requests
15:58:43.763689  .     4    ... read-only path
15:58:43.816206  .     3    ... (10001 events discarded)
15:58:43.816210  .     4    ... collect result
15:58:43.816217  .     7    ... scan completed
15:58:43.816225  .     8    ... begin mvcc scan
15:58:43.816257  .    32    ... collect result
15:58:43.816262  .     5    ... scan completed

Basically there seem to be a lot of 10k-element scans going on. IIRC that's your chunking size limit? Can you tell from what the photo app is selecting whether that's going to be a query you're about to optimize away using the batch-wide max results? I doubt the photos app really wants the 10k rows back (haven't checked).

@RaduBerinde (Member)

Yeah, 10k is the chunk size limit. The batch-wide limit is not going to optimize anything away; it will just allow us to use the chunk limit in all scenarios - currently, we can only scan in chunks when we have a single span to scan. But this won't be a factor here: I looked at the SELECTs in photos/db.go and none of them should result in multiple spans.

I looked at the SELECTs locally using EXPLAIN and there is one that looks problematic:

EXPLAIN SELECT commentID, userID, message, timestamp FROM comments WHERE photoID = 10 ORDER BY timestamp DESC LIMIT 100;
+-------+-------+--------------------------+
| Level | Type  |       Description        |
+-------+-------+--------------------------+
|     0 | limit | count: 100, offset: 0    |
|     1 | sort  | -timestamp               |
|     2 | scan  | comments@primary /10-/11 |
+-------+-------+--------------------------+

This means we will scan all comments with that photoID and sort them. We aren't using the comments (photoID, timestamp) index because it's not a covering index, which would require a more expensive index join. But the fact that we have a limit should factor into the index selection.

We could force it to do what we want by doing something like:

SELECT commentID, userID, message, timestamp FROM comments
   WHERE photoID = $1 AND commentID in
      (SELECT commentID FROM comments WHERE photoID = $1 ORDER BY timestamp DESC LIMIT 100)
   ORDER BY timestamp DESC
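
For what it's worth, a sketch of how the (Go) photos app could issue that rewritten query through database/sql; the driver, connection string, column types, and function name below are assumptions for the sketch, not taken from photos/db.go:

package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/lib/pq" // assumed postgres-wire driver
)

// commentsByPhoto is the rewritten query from above; $1 is the photoID.
const commentsByPhoto = `
SELECT commentID, userID, message, timestamp FROM comments
   WHERE photoID = $1 AND commentID IN
      (SELECT commentID FROM comments WHERE photoID = $1 ORDER BY timestamp DESC LIMIT 100)
   ORDER BY timestamp DESC`

func recentComments(db *sql.DB, photoID int64) error {
    rows, err := db.Query(commentsByPhoto, photoID)
    if err != nil {
        return err
    }
    defer rows.Close()
    for rows.Next() {
        var commentID, userID int64 // guessed column types
        var message string
        var ts time.Time
        if err := rows.Scan(&commentID, &userID, &message, &ts); err != nil {
            return err
        }
        // ... render the comment ...
    }
    return rows.Err()
}

func main() {
    // Connection string is a guess at a local single-node setup.
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/photos?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
    if err := recentComments(db, 10); err != nil {
        log.Fatal(err)
    }
}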

@tbg (Member)

tbg commented Mar 14, 2016

Thanks for looking into this - I'm going to update the photos app with your query, at least for the time being.

Just curious, shouldn't the where clause show up in EXPLAIN above the scan?

@petermattis (Collaborator)

Rather than changing the query, I think the photos schema could be adjusted. We could add a STORING clause to the comments (photoID, timestamp) index. Additionally, I don't think that index should be UNIQUE.
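
A sketch of what that schema tweak might look like, expressed as DDL the app could run (reusing the database/sql handle from the sketch above). The index name and STORING list are guesses based on the query in this thread; columns already in the primary key come along implicitly in a secondary index, so the real photos schema may want something slightly different.

// Hypothetical non-UNIQUE covering index: keyed on (photoID, timestamp)
// and STORING the remaining queried columns so no index join is needed.
// commentID is assumed to be part of the primary key and therefore
// already available in the secondary index.
const createCommentsTimestampIndex = `
CREATE INDEX comments_photoID_timestamp
  ON comments (photoID, timestamp)
  STORING (userID, message)`

func createCommentsIndex(db *sql.DB) error {
    _, err := db.Exec(createCommentsTimestampIndex)
    return err
}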

@tbg (Member)

tbg commented Mar 14, 2016

sgtm, but I'll still go ahead and do the "stupid" update first to see whether that uncovers another performance issue currently hidden behind it.

tbg added a commit to cockroachdb/examples-go that referenced this issue Mar 14, 2016
@tbg (Member)

tbg commented Mar 14, 2016

Just curious, shouldn't the where clause show up in EXPLAIN above the scan?

Another thing: it'd be helpful if (at least) EXPLAIN (DEBUG) showed the "success" ratio for each filter - the above SELECT would have a pretty bad ratio, reading 10k entries for 100 results.

cc @RaduBerinde

@RaduBerinde (Member)

Just curious, shouldn't the where clause show up in EXPLAIN above the scan?

EXPLAIN shows the high-level plan; it doesn't show expressions. You can see, however, that the scan range is restricted to /10-/11, which is really the entire filter.

It'd be helpful if (at least) EXPLAIN (DEBUG) showed the "success" ratio for each filter - the above SELECT would have a pretty bad ratio, reading 10k entries for 100 results.

It's a bit more complicated than that here. It's not really a filter that has a low-pass ratio, it's the fact that we retrieve all results for the given photoID to sort them by their timestamps (before we can return the top 100). As of a PR I have out, EXPLAIN (DEBUG) will show all the rows that are retrieved and buffered internally.

@tbg (Member)

tbg commented Mar 14, 2016

does that mean EXPLAIN(DEBUG) would show 10k rows? Not sure that's helpful (ok, that's probably a signal that the query is bad - maybe that's what we want).

It's not really a filter that has a low-pass ratio, it's the fact that we retrieve all results for the given photoID to sort them by their timestamps (before we can return the top 100). As of a PR I have out, EXPLAIN (DEBUG) will show all the rows that are retrieved and buffered internally.

but wouldn't "oh we got these 10k results and did something but at the end we threw away all but 100" generally indicate the right thing, namely that whichever source the data is coming from is fetching the data in a way that doesn't use the right boundary condition?

@RaduBerinde (Member)

does that mean EXPLAIN(DEBUG) would show 10k rows?

Yes, DEBUG spits out all the keys that are queried plus per-row information as it travels through the system.

but wouldn't "oh we got these 10k results and did something but at the end we threw away all but 100" generally indicate the right thing, namely that whichever source the data is coming from is fetching the data in a way that doesn't use the right boundary condition?

Right, I was just saying it's not a property of a filter in this case.

Maybe we want an EXPLAIN (STATS) that runs the query and shows statistics about how many keys were processed, how many rows each layer received and generated, etc.

@tbg (Member)

tbg commented Mar 15, 2016

+1


RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Mar 21, 2016
In cockroachdb#4925, we observed ineffective planning for a query in the photos app. We
prefer to use the primary index and sort rather than use a non-covering index
which makes sense in general (non-covering indices require an expensive
indexJoin) but in this case we also had a limit. In such a case using the index
would require looking only at the first rows instead of getting all matching
rows and sorting.

In this change we tweak the index selection: if we have a reasonable limit, we
give a "boost" to all indices that match the ordering exactly. The boost exactly
offsets the non-covering index penalty.

In addition to the new tests, I also verified the photo app query in cockroachdb#4925 now
uses the index.

Fixes cockroachdb#5246.