sql: OOM risk of EXPLAIN ANALYZE (DEBUG) of statement with large lookup or index join #103358

Open
michae2 opened this issue May 15, 2023 · 8 comments
Labels
A-sql-execution Relating to SQL execution. A-sql-memmon SQL memory monitoring A-tracing Relating to tracing in CockroachDB. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-performance Perf of queries or internals. Solution not expected to change functional behavior. E-quick-win Likely to be a quick win for someone experienced. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. T-sql-queries SQL Queries Team

Comments

@michae2
Collaborator

michae2 commented May 15, 2023

At the end of statement diagnostics bundle collection of a query with a large lookup or index join, there can be a large spike in memory usage. Sometimes this is enough to OOM a node. Here's a demonstration using a real cluster:

-- create a table with a secondary index and at least one leaseholder range per node
CREATE TABLE abc (a STRING, b STRING, c STRING, INDEX (a, b));
ALTER TABLE abc CONFIGURE ZONE USING range_max_bytes = 33554432, range_min_bytes = 1048576;

-- populate
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);

ANALYZE abc;
SHOW RANGES FROM TABLE abc;

-- use a vectorized, distributed query with 2.6m rows passing through an index join
EXPLAIN SELECT COUNT(DISTINCT c) FROM abc@abc_a_b_idx WHERE a > '2' AND b > '2' AND c > '2';

-- normal execution takes a few seconds and only has a small impact on total node memory usage
SELECT now(); SELECT COUNT(DISTINCT c) FROM abc@abc_a_b_idx WHERE a > '2' AND b > '2' AND c > '2';
SELECT now();

-- execution with analyze (statistics collection) takes about 10 seconds and again has a small impact on total node memory usage
SELECT now(); EXPLAIN ANALYZE SELECT COUNT(DISTINCT c) FROM abc@abc_a_b_idx WHERE a > '2' AND b > '2' AND c > '2';
SELECT now();

-- execution with verbose tracing takes over 3 minutes and causes a large spike in CPU and memory usage
SELECT now(); EXPLAIN ANALYZE (DEBUG) SELECT COUNT(DISTINCT c) FROM abc@abc_a_b_idx WHERE a > '2' AND b > '2' AND c > '2';
SELECT now();

Here's how that looked in the metrics:
[screenshot: node memory and CPU metrics during the three executions, 2023-05-15 16:03]

Even though the bundle said max memory usage was 67 MiB, we actually saw a spike > 1 GiB.

This is very similar to #90739 and may even be the same root cause (verbose tracing) but I wanted to document the spike in memory usage. We should try to fix this spike even if statement bundle collection takes longer than normal execution.

I believe the spike in memory on the gateway node is due to unmarshaling of this log message in traces:

log.VEventf(ctx, 2, "evaluated %s command %s, txn=%v : resp=%s, err=%v",
	args.Method(), trunc(args.String()), h.Txn, resp, err)

I have reproduced this on both v23.1.0 and v21.2.17.
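
For a sense of scale, here is a minimal standalone sketch (not CockroachDB code; the message count and size are made up) of why per-message truncation alone doesn't bound the gateway's allocations: unmarshaling a verbose trace still materializes every log entry as a fresh Go string, so the total scales with the number of messages rather than the size of any single one.

package main

import (
	"fmt"
	"runtime"
)

func main() {
	const (
		numMessages = 1 << 18 // total log messages across all spans (made-up figure)
		messageSize = 1 << 10 // each message already truncated to ~1 KiB
	)

	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	// Decoding a protobuf string field copies the bytes into a fresh Go
	// string; model that with an explicit copy per message.
	wire := make([]byte, messageSize)
	decoded := make([]string, 0, numMessages)
	for i := 0; i < numMessages; i++ {
		decoded = append(decoded, string(wire))
	}

	runtime.ReadMemStats(&after)
	fmt.Printf("decoded %d messages, ~%d MiB retained\n",
		len(decoded), (after.HeapAlloc-before.HeapAlloc)/(1<<20))
}

With these made-up numbers that is already ~256 MiB of live heap for a single trace, before any protobuf envelope or Jaeger conversion overhead.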

Jira issue: CRDB-27960

@michae2 michae2 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-performance Perf of queries or internals. Solution not expected to change functional behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. A-sql-execution Relating to SQL execution. A-sql-memmon SQL memory monitoring T-sql-queries SQL Queries Team labels May 15, 2023
@michae2 michae2 added the A-tracing Relating to tracing in CockroachDB. label May 15, 2023
@yuzefovich
Member

Nice find!

Even though the bundle said max memory usage was 67 MiB, we actually saw a spike > 1 GiB.

The max memory usage reported by the bundle only reflects the execution statistics collected during the query run for the operators in the plan; auxiliary work like unmarshalling trace data is never included in that number, so it's expected that this memory doesn't show up in the diagram. We did add accounting for LeafTxnFinalState metadata in #85285, and perhaps we should extend that accounting to all metadata types.
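
To illustrate the direction (only a rough sketch; the types below are stand-ins, not the real execinfrapb / mon APIs): grow a memory account by an estimated footprint before buffering any metadata object, not just LeafTxnFinalState.

package main

import (
	"errors"
	"fmt"
)

// account is a stand-in for a memory account bound to a budget.
type account struct {
	used, limit int64
}

func (a *account) grow(n int64) error {
	if a.used+n > a.limit {
		return errors.New("memory budget exceeded")
	}
	a.used += n
	return nil
}

// metadata is a stand-in for a remote metadata message (trace data,
// LeafTxnFinalState, errors, etc.).
type metadata struct {
	kind      string
	sizeBytes int64 // estimated footprint once unmarshaled
}

// bufferMetadata accounts for the metadata's footprint before keeping a
// reference to it, regardless of the metadata kind.
func bufferMetadata(acc *account, buf *[]metadata, meta metadata) error {
	if err := acc.grow(meta.sizeBytes); err != nil {
		return fmt.Errorf("buffering %s metadata: %w", meta.kind, err)
	}
	*buf = append(*buf, meta)
	return nil
}

func main() {
	acc := &account{limit: 64 << 20} // 64 MiB budget
	var buf []metadata
	// A verbose trace can arrive as many large chunks; each one is accounted
	// for before being buffered, so a budget error fires instead of an OOM.
	for i := 0; i < 1000; i++ {
		if err := bufferMetadata(acc, &buf, metadata{kind: "trace", sizeBytes: 256 << 10}); err != nil {
			fmt.Println(err)
			break
		}
	}
	fmt.Printf("buffered %d chunks, %d MiB accounted\n", len(buf), acc.used/(1<<20))
}

With the 64 MiB budget in the sketch, buffering stops after 256 chunks of 256 KiB rather than growing without bound.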

@yuzefovich
Member

If I'm reading the comments here

// maxRecordedSpansPerTrace limits the number of spans per recording, keeping
// recordings from getting too large.
maxRecordedSpansPerTrace = 1000
// maxRecordedBytesPerSpan limits the size of unstructured logs in a span.
maxLogBytesPerSpan = 256 * (1 << 10) // 256 KiB

right, then the log messages for a single trace are expected to take up to roughly 256 MB, so it appears that we're not respecting these two limits (or perhaps there is additional overhead from serializing / deserializing the logs to protobuf). We might need some guidance from Obs Infra to know whether this behavior is expected or not, cc @dhartunian @abarganier
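
For concreteness, the bound those two constants imply (just arithmetic over the values quoted above, ignoring protobuf framing and any per-node duplication):

package main

import "fmt"

func main() {
	const (
		maxRecordedSpansPerTrace = 1000
		maxLogBytesPerSpan       = 256 * (1 << 10) // 256 KiB
	)
	// Worst case: every span in the recording carries a full 256 KiB of logs.
	total := maxRecordedSpansPerTrace * maxLogBytesPerSpan
	fmt.Printf("up to %d MiB of log data per trace\n", total/(1<<20)) // 250 MiB
}

So ~250 MiB per trace is the ceiling if both limits hold, which fits the suspicion above that either the limits aren't being respected or serialization / deserialization overhead multiplies the footprint.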

@rytaft
Collaborator

rytaft commented May 16, 2023

From triage meeting:

  • The real fix should be done by observability infra
  • SQL Queries might be able to add some memory accounting
  • A quick fix: temporary accounting when we receive the trace data on the inbox side, and release it when the inbox exits (rough sketch below)
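
A rough sketch of that quick fix, on stand-in types (not the real colrpc.Inbox or mon.BoundAccount APIs): grow an account for each chunk of remote trace data as it arrives and release everything when the inbox exits.

package main

import "fmt"

type account struct{ used int64 }

func (a *account) grow(n int64) { a.used += n }
func (a *account) clear()       { a.used = 0 }

type traceChunk struct{ serializedBytes int64 }

type inbox struct {
	acc    *account
	traces []traceChunk
}

// onRemoteMetadata is called for each metadata message received from a
// remote flow; trace data stays accounted for as long as the inbox holds it.
func (ib *inbox) onRemoteMetadata(c traceChunk) {
	ib.acc.grow(c.serializedBytes)
	ib.traces = append(ib.traces, c)
}

// close releases the temporary accounting when the inbox exits.
func (ib *inbox) close() {
	ib.traces = nil
	ib.acc.clear()
}

func main() {
	ib := &inbox{acc: &account{}}
	for i := 0; i < 500; i++ {
		ib.onRemoteMetadata(traceChunk{serializedBytes: 256 << 10})
	}
	fmt.Printf("inbox holding ~%d MiB of trace data\n", ib.acc.used/(1<<20))
	ib.close()
	fmt.Printf("after close: %d bytes accounted\n", ib.acc.used)
}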

@rytaft rytaft added the E-quick-win Likely to be a quick win for someone experienced. label May 16, 2023
@yuzefovich
Member

I spent some time playing around with Michael's reproduction on a single-node cluster and collecting heap profiles. Some interesting findings:

  • when looking at inuse_space right after bundle collection, the conversion of the trace to Jaeger JSON is non-trivial
[screenshot: inuse_space heap profile, 2023-05-16 2:28 PM]
  • when looking at alloc_space (since the node started, after having collected the bundle a few times), txnSpanRefresher logging is the largest memory allocator
[screenshot: alloc_space heap profile, 2023-05-16 2:26 PM]

@DrewKimball
Collaborator

Could we benefit here by adding a pool for execinfrapb.ProducerMessage structs so that execinfrapb.distSQLFlowStreamServer doesn't have to allocate new ones in Recv()? That way we could reuse memory for fields like tracingpb.RecordedSpan.Logs that are contributing to the heap profiles we've seen.
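
Roughly what that could look like, sketched on a stand-in message type rather than the real execinfrapb.ProducerMessage (the real change would also have to hook into the generated gRPC Recv path):

package main

import (
	"fmt"
	"sync"
)

// producerMessage is a stand-in for the struct we would pool; the logs field
// models the large repeated fields (e.g. recorded-span logs) whose backing
// storage we would like to reuse across Recv calls.
type producerMessage struct {
	logs []byte
}

var messagePool = sync.Pool{
	New: func() any { return &producerMessage{} },
}

// recvInto simulates a stream Recv that decodes into a pooled message,
// reusing its byte slice when it is already large enough.
func recvInto(wire []byte) *producerMessage {
	msg := messagePool.Get().(*producerMessage)
	msg.logs = append(msg.logs[:0], wire...)
	return msg
}

// release hands the message back to the pool once the consumer is done with
// its contents.
func release(msg *producerMessage) {
	messagePool.Put(msg)
}

func main() {
	wire := make([]byte, 256<<10) // one 256 KiB chunk of serialized trace data
	for i := 0; i < 1000; i++ {
		msg := recvInto(wire)
		_ = msg.logs // process the message here
		release(msg)
	}
	fmt.Println("processed 1000 messages while reusing one buffer")
}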

@yuzefovich
Member

Could we benefit here by adding a pool for execinfrapb.ProducerMessage structs so that execinfrapb.distSQLFlowStreamServer doesn't have to allocate new ones in Recv()? That way we could reuse memory for fields like tracingpb.RecordedSpan.Logs that are contributing to the heap profiles we've seen.

I'm not sure if we have a precedent for this (we'd need to modify how the protobuf-generated code is generated, or perhaps disable generation of the Recv method so we can write it manually, and maybe also ServerStream.RecvMsg).

However, I'm a bit worried about sync-pooling some of these large tracingpb.RecordedSpan.Logs objects: presumably we'd use them very rarely (when an expensive query with lots of KV requests is being traced), yet we could keep the pooled objects in memory for a non-trivial amount of time, significantly increasing RSS. I think we should tackle this problem from a different angle: we should not be creating these large log messages / large traces in the first place (by either truncating or dropping stuff altogether).
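
Sketching that alternative on stand-in types (not the real tracing code): keep a byte budget shared by the whole recording and drop log records once it's exhausted, so the trace never grows past the budget no matter how many KV requests are traced.

package main

import "fmt"

type logRecord struct{ msg string }

// traceRecording holds a byte budget shared by every span in the trace.
type traceRecording struct {
	budgetBytes int64
	usedBytes   int64
	dropped     int
	logs        []logRecord
}

// record keeps the log record only while budget remains; afterwards it just
// counts what was dropped so the recording can note the truncation.
func (r *traceRecording) record(msg string) {
	if r.usedBytes+int64(len(msg)) > r.budgetBytes {
		r.dropped++
		return
	}
	r.usedBytes += int64(len(msg))
	r.logs = append(r.logs, logRecord{msg: msg})
}

func main() {
	rec := &traceRecording{budgetBytes: 256 << 10} // e.g. 256 KiB for the whole trace
	msg := string(make([]byte, 1<<10))             // ~1 KiB per log message
	for i := 0; i < 100000; i++ {                  // many traced KV requests
		rec.record(msg)
	}
	fmt.Printf("kept %d records (%d KiB), dropped %d\n",
		len(rec.logs), rec.usedBytes/(1<<10), rec.dropped)
}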

@tbg
Member

tbg commented Jun 6, 2023

I believe the spike in memory on the gateway node is due to unmarshaling of this log message in traces:

I'm just driving by, but are we talking about the right problem? I doubt anyone on KV would object to significantly clamping down on this log statement. We should never be logging unbounded amounts of data.

It's difficult to prevent the creation of large log messages, but perhaps via logging telemetry (Sentry?) we could highlight locations for which a message of >256 KB (arbitrary threshold) was ever seen, and at least take measures after the fact.
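
A sketch of that kind of after-the-fact signal, with a plain wrapper instead of the real log / Sentry plumbing (the threshold and reporting mechanism are placeholders): remember every call site that ever produced a message above the threshold so it can be surfaced later.

package main

import (
	"fmt"
	"runtime"
	"sync"
)

const oversizedThreshold = 256 << 10 // arbitrary 256 KB threshold

var (
	mu        sync.Mutex
	offenders = map[string]int{} // call site -> count of oversized messages
)

// vEventf is a stand-in for a verbose-logging call; it notes the caller
// whenever the formatted message crosses the threshold.
func vEventf(format string, args ...any) {
	msg := fmt.Sprintf(format, args...)
	if len(msg) > oversizedThreshold {
		if _, file, line, ok := runtime.Caller(1); ok {
			mu.Lock()
			offenders[fmt.Sprintf("%s:%d", file, line)]++
			mu.Unlock()
		}
	}
	_ = msg // the normal logging path would continue here
}

func main() {
	vEventf("huge response: %s", string(make([]byte, 1<<20)))
	vEventf("small message: %d", 42)
	for site, n := range offenders {
		// In a real system this would be reported via telemetry instead.
		fmt.Printf("oversized log message at %s (seen %d time(s))\n", site, n)
	}
}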

@yuzefovich
Member

It looks like that log message should already be truncated to roughly 1KiB, and IIUC the problem is that we have too many log messages which add up.

@mgartner mgartner moved this to 23.2 Release in SQL Queries Jul 24, 2023
@michae2 michae2 moved this from 23.2 Release to 24.1 Release in SQL Queries Sep 12, 2023
@mgartner mgartner moved this from 24.1 Release to 24.2 Release in SQL Queries Nov 28, 2023
@mgartner mgartner moved this from 24.2 Release to New Backlog in SQL Queries Mar 28, 2024