
sql: Spring Boot generated query potentially led to internal error #41077

Closed
roncrdb opened this issue Sep 25, 2019 · 6 comments
Labels: A-sql-execution (Relating to SQL execution), C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior)

Comments

roncrdb commented Sep 25, 2019

Describe the problem
A user encountered an internal error. Their Spring Boot logs show the following (SQL has been anonymized):

org.springframework.jdbc.UncategorizedSQLException: PreparedStatementCallback; uncategorized SQLException for SQL [SELECT t1.resource_key, t1.total, t2.guid_id as t2_id FROM t1 t1_alias LEFT OUTER JOIN t2 t2_alias ON (t2.favourite = t1.id AND t2.customerid = ?) AS OF SYSTEM TIME '-1s' WHERE t1.jurisdiction = ? AND t1.brand = ? AND t1.resource_type = ? AND t1.device_group = ? AND t1.locale = ? AND t1.resource_key IN (?, ?, ?, ?, ?) ORDER BY total DESC, resource_key]; SQL state [XX000]; error code [0]; ERROR: internal error: uncaught error: no inbound stream connection
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.init
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow_registry.go:33
github.com/cockroachdb/cockroach/pkg/sql.init
<autogenerated>:1
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.init
<autogenerated>:1
github.com/cockroachdb/cockroach/pkg/ccl.init
<autogenerated>:1
main.init
<autogenerated>:1
runtime.main
/usr/local/go/src/runtime/proc.go:189
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1333
Detail: stack trace:
github.com/cockroachdb/cockroach/pkg/sql/distsqlpb/data.go:160: in NewError()
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/stream_encoder.go:109: in AddMetadata()
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/outbox.go:125: in addRow()
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/outbox.go:293: in mainLoop()
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/outbox.go:413: in run()

It's not clear whether the internal error was caused by an issue on our end, or whether the query using placeholders instead of actual values triggered something we did not know how to handle.
We do see that Node 6 was fairly unhappy:

W190925 07:20:49.827664 417 storage/node_liveness.go:523 [n6,hb] slow heartbeat took 38.3s

Expected behavior
Queries should execute correctly without any negative impact on the cluster.

Additional data / screenshots
Debug zip can be found here

Environment:

  • CockroachDB version 19.1.4
  • Server OS: [e.g. Linux/Distrib]
  • Client app: Spring Boot via JDBC (per the logs above)

Additional context
The user was unable to access the admin UI from node 3, and node 6 had periodic unavailability and under-replicated ranges. Node 6 was restarted, and the cluster and application returned to a healthy state.

roncrdb added the A-sql-execution label Sep 25, 2019
jordanlewis added the C-bug label Sep 25, 2019
nvanbenschoten (Member) commented

This is related to #31361, which we have seen in the past when a cluster is overloaded. The "slow heartbeat took 38.3s" message indicates extreme overload. Is the cluster having trouble writing to its disk?

tim-o (Contributor) commented Sep 26, 2019

@nvanbenschoten this is the same cluster with spikes in log commit latency:

[Screenshot: log commit latency graph, captured 2019-09-26 12:35:59]

Hard to tell without a mouseover, but the spike could easily be around ~07:30, so this might end up being the same root cause.

andreimatei (Contributor) commented

> It's not clear whether the internal error was caused by an issue on our end, or whether the query using placeholders instead of actual values triggered something we did not know how to handle.

This doesn't seem to have anything to do with placeholders. "no inbound stream connection" means that execution on node A of a distributed query timed out while waiting for the part of the query running on node B to connect to A. That timeout is 10s.

This happens on node failure, or on cluster overload. Looking at the logs, you see frequent pairs of:

E190925 08:00:07.446320 162344448 sql/distsqlrun/flow_registry.go:224  [n3] flow id:1aa63f90-fe44-4891-9b08-70f34303f459 : 1 inbound streams timed out after 10s; propagated error throughout flow
...
E190925 08:00:20.783833 162344722 sql/distsqlrun/server.go:614  [n3] flow 1aa63f90-fe44-4891-9b08-70f34303f459 not found

The first line shows the timeout that generates the error ultimately returned to the client.
The second line (note that it comes many seconds later) shows that the expected connection did eventually arrive, but too late. The logs are full of these pairs over a span of days.

One likely reason this might happen is running into the sql.distsql.max_running_flows limit, 500 by default. Once that many flows (roughly, distributed queries) are running concurrently on a node, further flows queue. Queued flows don't connect to their counterparts on other nodes, and those counterparts uselessly hold resources that count toward the limit for 10s, potentially causing cascading failures. You can determine whether queuing is going on by using logspy to look for messages reading "flow scheduler enqueuing flow". If that's the case, they can try raising the sql.distsql.max_running_flows cluster setting; this theoretically opens the door to even worse overload effects, but might also just solve the problem. You can make a judgment about whether the cluster is actually overloaded by looking at CPU consumption, etc. If it really is overloaded, then that's beyond the database's control.
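
For reference, checking and raising that cluster setting looks roughly like this (a minimal sketch; the value 1000 is only an illustrative example, not a tuned recommendation):

```sql
-- Check the current limit (500 by default).
SHOW CLUSTER SETTING sql.distsql.max_running_flows;

-- Raise it only if logspy shows "flow scheduler enqueuing flow" messages;
-- 1000 here is an arbitrary illustrative value, not a recommendation.
SET CLUSTER SETTING sql.distsql.max_running_flows = 1000;
```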

@asubiotto @ajwerner do we all agree to get rid of this flow queuing? Let's just do it; I think it does more harm than good, particularly now that we've isolated node liveness heartbeats somewhat.

As far as this customer issue is concerned, I think we should close it unless we continue seeing these errors without also seeing the "flow scheduler enqueuing flow" messages in logspy.

asubiotto (Contributor) commented

I'm personally for removing this queuing behavior, but I think it's up to @ajwerner to determine whether it's useful at all with respect to admission control.

In this particular case I don't think the queuing is having an effect. To be sure, the number of running flows can be inspected through the admin UI (Metrics -> SQL -> Active flows [...]). If it's a queuing problem, the graph should be flatlined at sql.distsql.max_running_flows. Based on the "slow heartbeat took 38.3s" message, though, the cluster just seems to be overloaded. 19.1.4 doesn't have any admission control at all as far as I know.
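
For reference, the same flow counters may also be visible from SQL via crdb_internal.node_metrics; the metric names below (sql.distsql.flows.*) are an assumption to verify against the admin UI on this version:

```sql
-- Assumed metric names (sql.distsql.flows.active / .queued / .total);
-- cross-check against the admin UI graph if nothing matches.
SELECT name, value
  FROM crdb_internal.node_metrics
 WHERE name LIKE 'sql.distsql.flows.%';
```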

awoods187 (Contributor) commented

Should we close this issue now that #31361 is closed?

awoods187 removed their assignment Mar 25, 2020
ajwerner (Contributor) commented

This can still happen. It should be rarer but the root cause still exists, especially if the query is run in parallel many times. We can close it and associate it with #34229.
