sql: OOM risk of EXPLAIN ANALYZE (DEBUG) of statement with large lookup or index join #103358

Open
michae2 opened this issue May 15, 2023 · 8 comments
Labels
A-sql-execution Relating to SQL execution. A-sql-memmon SQL memory monitoring A-tracing Relating to tracing in CockroachDB. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-performance Perf of queries or internals. Solution not expected to change functional behavior. E-quick-win Likely to be a quick win for someone experienced. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. T-sql-queries SQL Queries Team

Comments

@michae2
Collaborator

michae2 commented May 15, 2023

At the end of statement diagnostics bundle collection of a query with a large lookup or index join, there can be a large spike in memory usage. Sometimes this is enough to OOM a node. Here's a demonstration using a real cluster:

-- create a table with a secondary index and at least one leaseholder range per node
CREATE TABLE abc (a STRING, b STRING, c STRING, INDEX (a, b));
ALTER TABLE abc CONFIGURE ZONE USING range_max_bytes = 33554432, range_min_bytes = 1048576;

-- populate
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);

ANALYZE abc;
SHOW RANGES FROM TABLE abc;

-- use a vectorized, distributed query with 2.6m rows passing through an index join
EXPLAIN SELECT COUNT(DISTINCT c) FROM abc@abc_a_b_idx WHERE a > '2' AND b > '2' AND c > '2';

-- normal execution takes a few seconds and only has a small impact on total node memory usage
SELECT now(); SELECT COUNT(DISTINCT c) FROM abc@abc_a_b_idx WHERE a > '2' AND b > '2' AND c > '2';
SELECT now();

-- execution with analyze (statistics collection) takes about 10 seconds and again has a small impact on total node memory usage
SELECT now(); EXPLAIN ANALYZE SELECT COUNT(DISTINCT c) FROM abc@abc_a_b_idx WHERE a > '2' AND b > '2' AND c > '2';
SELECT now();

-- execution with verbose tracing takes over 3 minutes and causes a large spike in CPU and memory usage
SELECT now(); EXPLAIN ANALYZE (DEBUG) SELECT COUNT(DISTINCT c) FROM abc@abc_a_b_idx WHERE a > '2' AND b > '2' AND c > '2';
SELECT now();

Here's how that looked in the metrics:
[screenshot: node memory and CPU metrics during the three executions, 2023-05-15 16:03]

Even though the bundle said max memory usage was 67 MiB, we actually saw a spike > 1 GiB.

This is very similar to #90739 and may even be the same root cause (verbose tracing) but I wanted to document the spike in memory usage. We should try to fix this spike even if statement bundle collection takes longer than normal execution.

I believe the spike in memory on the gateway node is due to unmarshaling of this log message in traces:

log.VEventf(ctx, 2, "evaluated %s command %s, txn=%v : resp=%s, err=%v",
	args.Method(), trunc(args.String()), h.Txn, resp, err)

I have reproduced this on both v23.1.0 and v21.2.17.
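
For a sense of scale, here is a minimal standalone sketch (not CockroachDB code; the message count and size are made up) of why per-message truncation alone doesn't bound the gateway's allocations: unmarshaling a verbose trace still materializes every log entry as a fresh Go string, so the total scales with the number of messages rather than the size of any single one.

package main

import (
	"fmt"
	"runtime"
)

func main() {
	const (
		numMessages = 1 << 18 // total log messages across all spans (made-up figure)
		messageSize = 1 << 10 // each message already truncated to ~1 KiB
	)

	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	// Decoding a protobuf string field copies the bytes into a fresh Go
	// string; model that with an explicit copy per message.
	wire := make([]byte, messageSize)
	decoded := make([]string, 0, numMessages)
	for i := 0; i < numMessages; i++ {
		decoded = append(decoded, string(wire))
	}

	runtime.ReadMemStats(&after)
	fmt.Printf("decoded %d messages, ~%d MiB retained\n",
		len(decoded), (after.HeapAlloc-before.HeapAlloc)/(1<<20))
}

With these made-up numbers that is already ~256 MiB of live heap for a single trace, before any protobuf envelope or Jaeger conversion overhead.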

Jira issue: CRDB-27960

@michae2 michae2 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-performance Perf of queries or internals. Solution not expected to change functional behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. A-sql-execution Relating to SQL execution. A-sql-memmon SQL memory monitoring T-sql-queries SQL Queries Team labels May 15, 2023
@michae2 michae2 added the A-tracing Relating to tracing in CockroachDB. label May 15, 2023
@yuzefovich
Member

Nice find!

Even though the bundle said max memory usage was 67 MiB, we actually saw a spike > 1 GiB.

The max memory usage reported by the bundle only reflects the execution statistics collected during the query run for the operators in the plan; auxiliary work like unmarshalling trace data is never included in that number, so it's expected that this memory doesn't show up in the diagram. We did add accounting for LeafTxnFinalState metadata in #85285, and perhaps we should extend that accounting to all metadata types.
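
To illustrate the direction (only a rough sketch; the types below are stand-ins, not the real execinfrapb / mon APIs): grow a memory account by an estimated footprint before buffering any metadata object, not just LeafTxnFinalState.

package main

import (
	"errors"
	"fmt"
)

// account is a stand-in for a memory account bound to a budget.
type account struct {
	used, limit int64
}

func (a *account) grow(n int64) error {
	if a.used+n > a.limit {
		return errors.New("memory budget exceeded")
	}
	a.used += n
	return nil
}

// metadata is a stand-in for a remote metadata message (trace data,
// LeafTxnFinalState, errors, etc.).
type metadata struct {
	kind      string
	sizeBytes int64 // estimated footprint once unmarshaled
}

// bufferMetadata accounts for the metadata's footprint before keeping a
// reference to it, regardless of the metadata kind.
func bufferMetadata(acc *account, buf *[]metadata, meta metadata) error {
	if err := acc.grow(meta.sizeBytes); err != nil {
		return fmt.Errorf("buffering %s metadata: %w", meta.kind, err)
	}
	*buf = append(*buf, meta)
	return nil
}

func main() {
	acc := &account{limit: 64 << 20} // 64 MiB budget
	var buf []metadata
	// A verbose trace can arrive as many large chunks; each one is accounted
	// for before being buffered, so a budget error fires instead of an OOM.
	for i := 0; i < 1000; i++ {
		if err := bufferMetadata(acc, &buf, metadata{kind: "trace", sizeBytes: 256 << 10}); err != nil {
			fmt.Println(err)
			break
		}
	}
	fmt.Printf("buffered %d chunks, %d MiB accounted\n", len(buf), acc.used/(1<<20))
}

With the 64 MiB budget in the sketch, buffering stops after 256 chunks of 256 KiB rather than growing without bound.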

@yuzefovich
Member

If I'm reading the comments here

// maxRecordedSpansPerTrace limits the number of spans per recording, keeping
// recordings from getting too large.
maxRecordedSpansPerTrace = 1000
// maxRecordedBytesPerSpan limits the size of unstructured logs in a span.
maxLogBytesPerSpan = 256 * (1 << 10) // 256 KiB

right, then the log messages for a single trace are expected to take up to roughly 256 MB, so it appears that we're not respecting these two limits (or perhaps there is additional overhead from serializing / deserializing the logs to protobuf). We might need some guidance from Obs Infra to know whether this behavior is expected or not, cc @dhartunian @abarganier
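
For concreteness, the bound those two constants imply (just arithmetic over the values quoted above, ignoring protobuf framing and any per-node duplication):

package main

import "fmt"

func main() {
	const (
		maxRecordedSpansPerTrace = 1000
		maxLogBytesPerSpan       = 256 * (1 << 10) // 256 KiB
	)
	// Worst case: every span in the recording carries a full 256 KiB of logs.
	total := maxRecordedSpansPerTrace * maxLogBytesPerSpan
	fmt.Printf("up to %d MiB of log data per trace\n", total/(1<<20)) // 250 MiB
}

So ~250 MiB per trace is the ceiling if both limits hold, which fits the suspicion above that either the limits aren't being respected or serialization / deserialization overhead multiplies the footprint.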

@rytaft
Collaborator

rytaft commented May 16, 2023

From triage meeting:

  • The real fix should be done by observability infra
  • SQL Queries might be able to add some memory accounting
  • A quick fix: temporary accounting when we receive the trace data on the inbox side, and release it when the inbox exits (rough sketch below)
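
A rough sketch of that quick fix, on stand-in types (not the real colrpc.Inbox or mon.BoundAccount APIs): grow an account for each chunk of remote trace data as it arrives and release everything when the inbox exits.

package main

import "fmt"

type account struct{ used int64 }

func (a *account) grow(n int64) { a.used += n }
func (a *account) clear()       { a.used = 0 }

type traceChunk struct{ serializedBytes int64 }

type inbox struct {
	acc    *account
	traces []traceChunk
}

// onRemoteMetadata is called for each metadata message received from a
// remote flow; trace data stays accounted for as long as the inbox holds it.
func (ib *inbox) onRemoteMetadata(c traceChunk) {
	ib.acc.grow(c.serializedBytes)
	ib.traces = append(ib.traces, c)
}

// close releases the temporary accounting when the inbox exits.
func (ib *inbox) close() {
	ib.traces = nil
	ib.acc.clear()
}

func main() {
	ib := &inbox{acc: &account{}}
	for i := 0; i < 500; i++ {
		ib.onRemoteMetadata(traceChunk{serializedBytes: 256 << 10})
	}
	fmt.Printf("inbox holding ~%d MiB of trace data\n", ib.acc.used/(1<<20))
	ib.close()
	fmt.Printf("after close: %d bytes accounted\n", ib.acc.used)
}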

@rytaft rytaft added the E-quick-win Likely to be a quick win for someone experienced. label May 16, 2023
@yuzefovich
Member

I spent some time playing around with Michael's reproduction on a single-node cluster and collecting heap profiles. Some interesting findings:

  • when looking at inuse_space right after bundle collection, the conversion of the trace to Jaeger JSON is non-trivial
[screenshot: inuse_space heap profile, 2023-05-16 2:28 PM]
  • when looking at alloc_space (since the node started, after having collected the bundle a few times), txnSpanRefresher logging is the largest memory allocator
[screenshot: alloc_space heap profile, 2023-05-16 2:26 PM]

@DrewKimball
Collaborator

Could we benefit here by adding a pool for execinfrapb.ProducerMessage structs so that execinfrapb.distSQLFlowStreamServer doesn't have to allocate new ones in Recv()? That way we could reuse memory for fields like tracingpb.RecordedSpan.Logs that are contributing to the heap profiles we've seen.
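
Roughly what that could look like, sketched on a stand-in message type rather than the real execinfrapb.ProducerMessage (the real change would also have to hook into the generated gRPC Recv path):

package main

import (
	"fmt"
	"sync"
)

// producerMessage is a stand-in for the struct we would pool; the logs field
// models the large repeated fields (e.g. recorded-span logs) whose backing
// storage we would like to reuse across Recv calls.
type producerMessage struct {
	logs []byte
}

var messagePool = sync.Pool{
	New: func() any { return &producerMessage{} },
}

// recvInto simulates a stream Recv that decodes into a pooled message,
// reusing its byte slice when it is already large enough.
func recvInto(wire []byte) *producerMessage {
	msg := messagePool.Get().(*producerMessage)
	msg.logs = append(msg.logs[:0], wire...)
	return msg
}

// release hands the message back to the pool once the consumer is done with
// its contents.
func release(msg *producerMessage) {
	messagePool.Put(msg)
}

func main() {
	wire := make([]byte, 256<<10) // one 256 KiB chunk of serialized trace data
	for i := 0; i < 1000; i++ {
		msg := recvInto(wire)
		_ = msg.logs // process the message here
		release(msg)
	}
	fmt.Println("processed 1000 messages while reusing one buffer")
}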

@yuzefovich
Member

Could we benefit here by adding a pool for execinfrapb.ProducerMessage structs so that execinfrapb.distSQLFlowStreamServer doesn't have to allocate new ones in Recv()? That way we could reuse memory for fields like tracingpb.RecordedSpan.Logs that are contributing to the heap profiles we've seen.

I'm not sure if we have a precedent for this (we'd need to modify how the protobuf-generated code is generated, or perhaps disable generation of the Recv method so we can write it manually, and maybe also ServerStream.RecvMsg).

However, I'm a bit worried about sync-pooling some of these large tracingpb.RecordedSpan.Logs objects: presumably we'd use them very rarely (when an expensive query with lots of KV requests is being traced), yet we could keep the pooled objects in memory for a non-trivial amount of time, significantly increasing RSS. I think we should tackle this problem from a different angle: we should not be creating these large log messages / large traces in the first place (by either truncating or dropping stuff altogether).
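
Sketching that alternative on stand-in types (not the real tracing code): keep a byte budget shared by the whole recording and drop log records once it's exhausted, so the trace never grows past the budget no matter how many KV requests are traced.

package main

import "fmt"

type logRecord struct{ msg string }

// traceRecording holds a byte budget shared by every span in the trace.
type traceRecording struct {
	budgetBytes int64
	usedBytes   int64
	dropped     int
	logs        []logRecord
}

// record keeps the log record only while budget remains; afterwards it just
// counts what was dropped so the recording can note the truncation.
func (r *traceRecording) record(msg string) {
	if r.usedBytes+int64(len(msg)) > r.budgetBytes {
		r.dropped++
		return
	}
	r.usedBytes += int64(len(msg))
	r.logs = append(r.logs, logRecord{msg: msg})
}

func main() {
	rec := &traceRecording{budgetBytes: 256 << 10} // e.g. 256 KiB for the whole trace
	msg := string(make([]byte, 1<<10))             // ~1 KiB per log message
	for i := 0; i < 100000; i++ {                  // many traced KV requests
		rec.record(msg)
	}
	fmt.Printf("kept %d records (%d KiB), dropped %d\n",
		len(rec.logs), rec.usedBytes/(1<<10), rec.dropped)
}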

@tbg
Member

tbg commented Jun 6, 2023

I believe the spike in memory on the gateway node is due to unmarshaling of this log message in traces:

I'm just driving by, but are we talking about the right problem? I doubt anyone on KV would object to significantly clamping down on this log statement. We should never be logging unbounded amounts of data.

It's difficult to prevent the creation of large log messages, but perhaps via logging telemetry (Sentry?) we could highlight locations for which a message of >256 KB (arbitrary threshold) was ever seen, and at least take measures after the fact.
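
A sketch of that kind of after-the-fact signal, with a plain wrapper instead of the real log / Sentry plumbing (the threshold and reporting mechanism are placeholders): remember every call site that ever produced a message above the threshold so it can be surfaced later.

package main

import (
	"fmt"
	"runtime"
	"sync"
)

const oversizedThreshold = 256 << 10 // arbitrary 256 KB threshold

var (
	mu        sync.Mutex
	offenders = map[string]int{} // call site -> count of oversized messages
)

// vEventf is a stand-in for a verbose-logging call; it notes the caller
// whenever the formatted message crosses the threshold.
func vEventf(format string, args ...any) {
	msg := fmt.Sprintf(format, args...)
	if len(msg) > oversizedThreshold {
		if _, file, line, ok := runtime.Caller(1); ok {
			mu.Lock()
			offenders[fmt.Sprintf("%s:%d", file, line)]++
			mu.Unlock()
		}
	}
	_ = msg // the normal logging path would continue here
}

func main() {
	vEventf("huge response: %s", string(make([]byte, 1<<20)))
	vEventf("small message: %d", 42)
	for site, n := range offenders {
		// In a real system this would be reported via telemetry instead.
		fmt.Printf("oversized log message at %s (seen %d time(s))\n", site, n)
	}
}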

@yuzefovich
Member

It looks like that log message should already be truncated to roughly 1KiB, and IIUC the problem is that we have too many log messages which add up.

@mgartner mgartner moved this to 23.2 Release in SQL Queries Jul 24, 2023
@michae2 michae2 moved this from 23.2 Release to 24.1 Release in SQL Queries Sep 12, 2023
@mgartner mgartner moved this from 24.1 Release to 24.2 Release in SQL Queries Nov 28, 2023
@mgartner mgartner moved this from 24.2 Release to New Backlog in SQL Queries Mar 28, 2024