sqlstats: "get regions for node" logic is extremely inefficient #102170

knz · 2023-04-24T18:44:14Z

Split from #101299

Describe the problem

This PR #95449 has implemented "storing the regions in which a statement was executed" using some logic that is abusively expensive:

the computation in getNodesFromPlanner is way too expensive
- it extracts the full trace recording for the execution, which incurs a memory copy of the entire trace + a sort of the trace evens
- it calls ExtractNodesFromSpan which calls protoutil.Unmarshal, which incurs a heap allocation and a relatively expensive decode
- all of which just to store an unordered set of "node IDs" (SQL instance IDs) that the query was using.
- Observation: this work should be reduced to store the node IDs as separate structured metadata on the traces, and use a dedicated trace accessor function that only collects the SQL instance IDs (into an unordered set), so as to avoid the sort+unmarshaling steps.
- Additionally: ExtractNodesFromSpan returns an unordered set already (intsets.Fast). Then ForEach converts it into a sequence. Then the caller uses util.CombineUnique to transform the sequence back into a set. This is just silly and these steps should be combined.
the computation in getRegionsForNode is way too expensive
- it retrieves all the SQL instances (resolver.GetAllInstances) even though the specific query under consideration might use just 1 of them
  - This should use a different API that only retrieves region details for the instance IDs under consideration
- it uses an API that performs an internal data transformation (from cache to list, and then sorts the list), incurring CPU overhead, even though this logic does not need all the instance details nor does it need it to be sorted
  - This should use a different API that can retrieve just the region information without extra data transformation.
- it builds the map from instance to region every time, even though the mapping should change rarely (only when instances are added)
  - The mapping of instances to regions should be cached per node (SQL server), and not re-computed on every query.
- it sorts the list of regions on every call, even though the ordering is only relevant when the results are observed
  - The computation should not sort at all; instead the sorting should happen in the front-end during presentation

Expected behavior

The code should be refactored overall to remove the excess CPU and memory overheads.

Jira issue: CRDB-27322

The text was updated successfully, but these errors were encountered:

102192: sqlstats: only include local region in statement_statistics r=j82w a=matthewtodd Part of #89949. Addresses #98020. Addresses #99563. Related to cockroachdb/roachperf#129. Related to #102170. Previously, we attempted to record all the regions hit in a single statement execution in the sqlstats tables, leaning on the sqlAddressResolver to map traced nodeIDs to localities at execution time. While the sqlAddressResolver is generally non-blocking, the introduction of this code did cause some of the multiregion "this query shouldn't span regions" tests to start [flaking][] and it's more recently been [implicated][] in a 2.5% performance regression. Given that the probabilistic nature of the tracing meant that we generally weren't capturing all the relevant nodeIDs anyway, it seems like the most prudent thing to do here is take a step back and regroup. In the short term, let's stop even trying to gather all these regions. In the medium/long term, let's see if we can find a better approach. [flaking]: #98020 [implicated]: https://github.com/cockroachdb/roachperf/pull/129 Release note: None Co-authored-by: Matthew Todd <[email protected]>

106587: sql: fix StatementStatistics.Nodes list r=j82w a=j82w 1. Fixes the nodes list to show all the nodes in StatementStatistics.Nodes. 2. The node list is now empty if tracing is disabled. Previously it would always include the current gateway node id, but it would be missing all the other nodes. This causes confusion because it's uncertain whether the node list is complete or not. 3. Fixes regions on `EXPLAIN ANALYSE (DISTSQL)` to show regions information on secondary tenenants. It was not shown before because only system tenants have acces to gossip which is used under the covers to get the node descriptors. 4. Fixes the performance issues previously listed in #102170. 5. Fixes the test to actually validate the nodes list. The fix was done by adding the region name to the Flow ComponentID. This means the region name is now part of the traces for the Flow ComponentID, so it no longer needs figure out the region. It gets the region information from the same trace the SQL Instance ID is obtained. Moving the collection to the QueryLevelStats avoids iterating the traces multiple times. Fixes: #102170, fixes: #96647, fixes: #91219; Epic: none Release note (sql change): Fixes the StatementStatistics.Nodes to contain all the nodes involved in the query. Adds region info to `EXPLAIN ANALYSE (DISTSQL)` for seconary tenants. Co-authored-by: j82w <[email protected]>

blathers-crl bot added the T-cluster-observability label Apr 24, 2023

knz assigned maryliag Apr 24, 2023

knz mentioned this issue Apr 24, 2023

roachperf: gradual kv95 decline of 10% #101299

Closed

maryliag assigned j82w and unassigned maryliag Apr 24, 2023

matthewtodd mentioned this issue Apr 25, 2023

sqlstats: only include local region in statement_statistics #102192

Merged

This was referenced Apr 26, 2023

release-23.1.0: sqlstats: only include local region in statement_statistics #102330

Merged

release-23.1: sqlstats: only include local region in statement_statistics #102331

Merged

maryliag removed the GA-blocker label Apr 26, 2023

j82w mentioned this issue May 4, 2023

sql: add node ids and region to statement stats #102774

Closed

maryliag removed the todo-deprecate.branch-release-23.1.0 label Jul 7, 2023

j82w mentioned this issue Jul 11, 2023

sql: fix StatementStatistics.Nodes list #106587

Merged

craig bot closed this as completed in 44e6054 Jul 12, 2023

j82w mentioned this issue Jul 13, 2023

release-23.1: sql: fix StatementStatistics.Nodes list #106776

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sqlstats: "get regions for node" logic is extremely inefficient #102170

sqlstats: "get regions for node" logic is extremely inefficient #102170

knz commented Apr 24, 2023 •

edited

Loading

sqlstats: "get regions for node" logic is extremely inefficient #102170

sqlstats: "get regions for node" logic is extremely inefficient #102170

Comments

knz commented Apr 24, 2023 • edited Loading

knz commented Apr 24, 2023 •

edited

Loading