
stability: disconnect + crash under load #4925

Closed
mberhault opened this issue Mar 6, 2016 · 19 comments
@mberhault (Contributor)

There is no clear and obvious culprit here, so I'll attempt to summarize as best I can.

build sha: e6793e1

4-node cluster with the photos example running against it.
Things are fine for a few hours, until the following happens:

Crash timelines and full logs (all crashes are OOM):

node4: 17:40:41
node4.crash.txt

node3: 18:04:32
node3.crash.txt

node2: 18:09:45
node2.crash.txt

node1: 21:15:49
node1.crash.txt

Now, looking for originating events is where it gets a bit tedious.
One of the first signs of badness is around 17:26, when we get:

17:26:54.224163 gossip/gossip.go:923  partition in gossip network; attempting new connection

This happens on nodes 1, 2, and 4 at that precise time, and 20 seconds later on node 3.
Lease holders were the following, with node 3 being the last one:

Node1:
I160306 17:24:47.189891 storage/replica_command.go:1255  range 1: new leader lease replica {1 1 10} 2016-03-06 17:24:43 +0000 UTC +1.820s
I160306 17:24:55.236673 storage/replica_command.go:1255  range 1: new leader lease replica {1 1 10} 2016-03-06 17:24:54 +0000 UTC +1.152s

Node2:
I160306 17:10:28.212065 storage/replica_command.go:1255  range 1: new leader lease replica {2 2 9} 2016-03-06 17:10:24 +0000 UTC +4.435s
I160306 17:10:38.194705 storage/replica_command.go:1255  range 1: new leader lease replica {2 2 9} 2016-03-06 17:10:34 +0000 UTC +4.438s

Node3:
I160306 17:24:47.324180 storage/replica_command.go:1255  range 1: new leader lease replica {3 3 11} 2016-03-06 17:24:46 +0000 UTC +1.391s
I160306 17:25:12.291490 storage/replica_command.go:1255  range 1: new leader lease replica {3 3 11} 2016-03-06 17:25:08 +0000 UTC +1.474s

Node4:
I160306 17:24:20.530929 storage/replica_command.go:1255  range 1: new leader lease replica {4 4 4} 2016-03-06 17:24:17 +0000 UTC +1.732s
I160306 17:24:24.417120 storage/replica_command.go:1255  range 1: new leader lease replica {4 4 4} 2016-03-06 17:24:23 +0000 UTC +1.003s
@mberhault (Contributor, Author)

I've also been grabbing heap profiles every minute. Here's the last non-empty profile for each node using go tool pprof --png --runtime <binary> <heap profile>.

Node1: profile1 (heap profile image)
Node2: profile2 (heap profile image)
Node3: profile3 (heap profile image)
Node4: profile4 (heap profile image)

@bdarnell (Contributor)

bdarnell commented Mar 6, 2016

The immediate cause is tons of async intent resolutions. Queueing those up (as I'm planning to do for #4881) will help, although it's unclear how things got into a state where nothing can make progress.

@petermattis (Collaborator)

Plan is to add a pool of intent resolution workers which pull intent resolution tasks from a shared channel. Back pressure is applied if the channel fills up, causing the calling goroutine to block.
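
For illustration only, a minimal sketch of that shape in Go: a fixed pool of workers draining a bounded channel, with senders blocking once the channel is full. Every name here (intentTask, resolver, resolveIntent) is made up for the sketch and is not the actual CockroachDB API.

package main

import "sync"

// intentTask stands in for whatever an async intent resolution needs
// to carry (transaction ID, intent spans, ...).
type intentTask struct {
    txnID string
}

// resolver owns a bounded queue and a fixed set of worker goroutines.
type resolver struct {
    tasks chan intentTask
    wg    sync.WaitGroup
}

func newResolver(workers, queueSize int) *resolver {
    r := &resolver{tasks: make(chan intentTask, queueSize)}
    for i := 0; i < workers; i++ {
        r.wg.Add(1)
        go func() {
            defer r.wg.Done()
            for t := range r.tasks {
                resolveIntent(t) // the actual resolution RPCs would happen here
            }
        }()
    }
    return r
}

// enqueue blocks once the channel is full; that block is the back
// pressure applied to the calling goroutine.
func (r *resolver) enqueue(t intentTask) {
    r.tasks <- t
}

func (r *resolver) stop() {
    close(r.tasks)
    r.wg.Wait()
}

func resolveIntent(t intentTask) { /* stub */ }

func main() {
    r := newResolver(4, 128)
    r.enqueue(intentTask{txnID: "abc"})
    r.stop()
}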

@petermattis petermattis added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 8, 2016
@petermattis (Collaborator)

Need to make sure this doesn't cause deadlock. We could drop the intent resolution on the floor if the channel is full.
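
Sticking with the hypothetical resolver sketched above, the drop-on-the-floor variant is just a non-blocking send; it avoids any chance of the caller deadlocking on a full queue, at the cost of leaving the intent to be re-discovered and cleaned up later.

// tryEnqueue is the non-blocking alternative to enqueue: if the queue
// is full, the intent resolution is dropped rather than blocking.
func (r *resolver) tryEnqueue(t intentTask) bool {
    select {
    case r.tasks <- t:
        return true
    default:
        return false // dropped; a later reader will run into the intent again
    }
}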

@petermattis (Collaborator)

Other brainstorming: intents are associated with a transaction. We could maintain a map of transaction to in-flight intent resolution so that we don't pile up a bunch of RPCs for the same transaction.
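
Again just a sketch of the brainstormed idea, reusing the made-up names (and the sync import) from the sketch above: a mutex-protected map keyed by transaction ID, so only the first caller for a given transaction actually launches a resolution.

// inFlight tracks transactions that already have an intent resolution
// in progress, so we don't pile up many RPCs for the same transaction.
type inFlight struct {
    mu   sync.Mutex
    txns map[string]struct{}
}

// tryStart reports whether the caller should resolve intents for
// txnID; it returns false if a resolution is already in flight.
func (f *inFlight) tryStart(txnID string) bool {
    f.mu.Lock()
    defer f.mu.Unlock()
    if _, ok := f.txns[txnID]; ok {
        return false
    }
    if f.txns == nil {
        f.txns = make(map[string]struct{})
    }
    f.txns[txnID] = struct{}{}
    return true
}

// finish marks the transaction's resolution as done so a later
// resolution for the same transaction can be started again.
func (f *inFlight) finish(txnID string) {
    f.mu.Lock()
    defer f.mu.Unlock()
    delete(f.txns, txnID)
}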

@petermattis (Collaborator)

Cc @nvanbenschoten.

@petermattis petermattis assigned bdarnell and unassigned petermattis Mar 10, 2016
bdarnell added a commit to bdarnell/cockroach that referenced this issue Mar 14, 2016
An explosion of these calls for the same transaction (or a small number
of transactions) was responsible for cockroachdb#4925

Fixes cockroachdb#4925
bdarnell added a commit to bdarnell/cockroach that referenced this issue Mar 14, 2016
This is a last line of defense against issues like cockroachdb#4925.
@tbg (Member)

tbg commented Mar 14, 2016

unrelated-ish, but

running single-node single-photo after 25min:

I160314 10:59:00.309653 user.go:116  89479 ops, 11 no-user, 0 errs (38.20/s)
I160314 10:59:10.312414 user.go:116  89912 ops, 11 no-user, 0 errs (43.30/s)
I160314 10:59:20.309660 user.go:116  90456 ops, 11 no-user, 0 errs (54.40/s)
I160314 10:59:30.309139 user.go:116  91004 ops, 11 no-user, 0 errs (54.80/s)
I160314 10:59:40.309233 user.go:116  91403 ops, 11 no-user, 0 errs (39.90/s)

That's really slow for single-node single-photo. This was running on #5219, but not much faster on master. Has anybody looked at this already?

@tbg (Member)

tbg commented Mar 14, 2016

2016/03/14 15:16:42.270775  0.055116    node 1
15:16:42.270776  .     1    ... node 1
15:16:42.270777  .     1    ... node
15:16:42.270784  .     8    ... read has no clock uncertainty
15:16:42.270790  .     6    ... executing 1 requests
15:16:42.270793  .     3    ... replica
15:16:42.270794  .     1    ... read-only path
15:16:42.270802  .     8    ... left command queue
15:16:42.270803  .     1    ... acquired read lock
15:16:42.325824  . 55020    ... executed batch: Scan [/Table/53/1/""/"s\x84\x14\xa8(\x18O۠@Gb\xf0 \x883"/5/1/NULL,/Table/53/1/"\x00")

Something's pretty slow down at RocksDB again. I'll make a conscious effort to plumb the tracing further down that path.

@tbg (Member)

tbg commented Mar 14, 2016

hey @RaduBerinde, with some more tracing I see this:

2016/03/14 15:58:43.763655  0.052631    node 1
15:58:43.763659  .     4    ... node 1
15:58:43.763673  .    14    ... read has no clock uncertainty
15:58:43.763685  .    12    ... executing 1 requests
15:58:43.763689  .     4    ... read-only path
15:58:43.816206  .     3    ... (10001 events discarded)
15:58:43.816210  .     4    ... collect result
15:58:43.816217  .     7    ... scan completed
15:58:43.816225  .     8    ... begin mvcc scan
15:58:43.816257  .    32    ... collect result
15:58:43.816262  .     5    ... scan completed

Basically there seem to be a lot of 10k-element scans going on. IIRC that's your chunking size limit? Can you tell from what the photo app is selecting whether that's going to be a query you're about to optimize away using the batch-wide max results? I doubt the photos app really wants the 10k rows back (haven't checked).

@RaduBerinde (Member)

Yeah, 10k is the chunk size limit. The batch-wide limit is not going to optimize anything away; it will just allow us to use the chunk limit in all scenarios - currently, we can only scan in chunks when we have a single span to scan. But this won't be a factor here: I looked at the SELECTs in photos/db.go and none of them should result in multiple spans.

I looked at the SELECTs locally using EXPLAIN and there is one that looks problematic:

EXPLAIN SELECT commentID, userID, message, timestamp FROM comments WHERE photoID = 10 ORDER BY timestamp DESC LIMIT 100;
+-------+-------+--------------------------+
| Level | Type  |       Description        |
+-------+-------+--------------------------+
|     0 | limit | count: 100, offset: 0    |
|     1 | sort  | -timestamp               |
|     2 | scan  | comments@primary /10-/11 |
+-------+-------+--------------------------+

This means we will scan all comments with that photoID and sort them. We aren't using the comments (photoID, timestamp) index because it's not a covering index, which would require a more expensive index join. But the fact that we have a limit should factor into the index selection.

We could force it to do what we want by doing something like:

SELECT commentID, userID, message, timestamp FROM comments
   WHERE photoID = $1 AND commentID in
      (SELECT commentID FROM comments WHERE photoID = $1 ORDER BY timestamp DESC LIMIT 100)
   ORDER BY timestamp DESC
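
For what it's worth, a sketch of how the (Go) photos app could issue that rewritten query through database/sql; the driver, connection string, column types, and function name below are assumptions for the sketch, not taken from photos/db.go:

package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/lib/pq" // assumed postgres-wire driver
)

// commentsByPhoto is the rewritten query from above; $1 is the photoID.
const commentsByPhoto = `
SELECT commentID, userID, message, timestamp FROM comments
   WHERE photoID = $1 AND commentID IN
      (SELECT commentID FROM comments WHERE photoID = $1 ORDER BY timestamp DESC LIMIT 100)
   ORDER BY timestamp DESC`

func recentComments(db *sql.DB, photoID int64) error {
    rows, err := db.Query(commentsByPhoto, photoID)
    if err != nil {
        return err
    }
    defer rows.Close()
    for rows.Next() {
        var commentID, userID int64 // guessed column types
        var message string
        var ts time.Time
        if err := rows.Scan(&commentID, &userID, &message, &ts); err != nil {
            return err
        }
        // ... render the comment ...
    }
    return rows.Err()
}

func main() {
    // Connection string is a guess at a local single-node setup.
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/photos?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
    if err := recentComments(db, 10); err != nil {
        log.Fatal(err)
    }
}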

@tbg (Member)

tbg commented Mar 14, 2016

Thanks for looking into this - I'm going to update the photos app with your query, at least for the time being.

Just curious, shouldn't the where clause show up in EXPLAIN above the scan?

@petermattis (Collaborator)

Rather than changing the query, I think the photos schema could be adjusted. We could add a STORING clause to the comments (photoID, timestamp) index. Additionally, I don't think that index should be UNIQUE.
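
A sketch of what that schema tweak might look like, expressed as DDL the app could run (reusing the database/sql handle from the sketch above). The index name and STORING list are guesses based on the query in this thread; columns already in the primary key come along implicitly in a secondary index, so the real photos schema may want something slightly different.

// Hypothetical non-UNIQUE covering index: keyed on (photoID, timestamp)
// and STORING the remaining queried columns so no index join is needed.
// commentID is assumed to be part of the primary key and therefore
// already available in the secondary index.
const createCommentsTimestampIndex = `
CREATE INDEX comments_photoID_timestamp
  ON comments (photoID, timestamp)
  STORING (userID, message)`

func createCommentsIndex(db *sql.DB) error {
    _, err := db.Exec(createCommentsTimestampIndex)
    return err
}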

@tbg (Member)

tbg commented Mar 14, 2016

sgtm, but I'll still go ahead and do the "stupid" update first to see whether that uncovers another performance issue currently hidden behind it.

tbg added a commit to cockroachdb/examples-go that referenced this issue Mar 14, 2016
@tbg (Member)

tbg commented Mar 14, 2016

Just curious, shouldn't the where clause show up in EXPLAIN above the scan?

Another thing: it'd be helpful if (at least) EXPLAIN (DEBUG) showed the "success" ratio for each filter - the above SELECT would have a pretty bad ratio, reading 10k entries for 100 results.

cc @RaduBerinde

@RaduBerinde (Member)

Just curious, shouldn't the where clause show up in EXPLAIN above the scan?

EXPLAIN shows the high-level plan; it doesn't show expressions. You can see, however, that the scan range is restricted to /10-/11, which is really the entire filter.

It'd be helpful if (at least) EXPLAIN (DEBUG) showed the "success" ratio for each filter - the above SELECT would have a pretty bad ratio, reading 10k entries for 100 results.

It's a bit more complicated than that here. It's not really a filter that has a low-pass ratio, it's the fact that we retrieve all results for the given photoID to sort them by their timestamps (before we can return the top 100). As of a PR I have out, EXPLAIN (DEBUG) will show all the rows that are retrieved and buffered internally.

@tbg (Member)

tbg commented Mar 14, 2016

does that mean EXPLAIN(DEBUG) would show 10k rows? Not sure that's helpful (ok, that's probably a signal that the query is bad - maybe that's what we want).

It's not really a filter that has a low-pass ratio, it's the fact that we retrieve all results for the given photoID to sort them by their timestamps (before we can return the top 100). As of a PR I have out, EXPLAIN (DEBUG) will show all the rows that are retrieved and buffered internally.

but wouldn't "oh we got these 10k results and did something but at the end we threw away all but 100" generally indicate the right thing, namely that whichever source the data is coming from is fetching the data in a way that doesn't use the right boundary condition?

@RaduBerinde (Member)

does that mean EXPLAIN(DEBUG) would show 10k rows?

Yes, DEBUG spits out all the keys that are queried plus per-row information as it travels through the system.

but wouldn't "oh we got these 10k results and did something but at the end we threw away all but 100" generally indicate the right thing, namely that whichever source the data is coming from is fetching the data in a way that doesn't use the right boundary condition?

Right, I was just saying it's not a property of a filter in this case.

Maybe we want an EXPLAIN (STATS) that runs the query and shows statistics about how many keys were processed, how many rows each layer received and generated, etc.

@tbg (Member)

tbg commented Mar 15, 2016

+1


RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Mar 21, 2016
In cockroachdb#4925, we observed ineffective planning for a query in the photos app. We
prefer to use the primary index and sort rather than use a non-covering index
which makes sense in general (non-covering indices require an expensive
indexJoin) but in this case we also had a limit. In such a case using the index
would require looking only at the first rows instead of getting all matching
rows and sorting.

In this change we tweak the index selection: if we have a reasonable limit, we
give a "boost" to all indices that match the ordering exactly. The boost exactly
offsets the non-covering index penalty.

In addition to the new tests, I also verified the photo app query in cockroachdb#4925 now
uses the index.

Fixes cockroachdb#5246.