sql: always on EXPLAIN ANALYZE #54556
Labels
A-sql-execution
Relating to SQL execution.
A-sql-observability
Related to observability of the SQL layer
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
meta-issue
Contains a list of several other issues.
General
EXPLAIN ANALYZE is a SQL tool that runs a query with execution statistics collection enabled. A user can run EXPLAIN ANALYZE to figure out why a query is performing poorly. This tool is only useful if the user already knows that a query is not performing well, leaving the initial investigation and decision up to the user. EXPLAIN ANALYZE is also only available through the SQL shell or after an explicit user request for a statement diagnostics bundle, making it a lot less likely to be used.
Always on EXPLAIN ANALYZE is about surfacing more statistics to the statements page by unconditionally returning execution statistics without affecting query performance. The goal of this work is to reduce the friction a user experiences when investigating a slow query by making EXPLAIN ANALYZE information more readily available in the admin UI and eliminating the need to explicitly request this information. An example of a high-value use case is displaying maximum query memory usage in the statements page and allowing users to sort by this value to figure out which query is using excessive memory.
This is a WIP meta issue. Related issues will be added as concrete work items are fleshed out.
Goal
The goal is to surface these collected statistics unconditionally and display them in a useful way. To start with, we will surface the latest EXPLAIN ANALYZE diagram to the statements execution stats page. The UI of the EXPLAIN ANALYZE diagram will most likely need to be reworked to fit with the admin UI aesthetic.
Additionally, we will look at what EXPLAIN ANALYZE stats are useful in a non-diagram format. Aggregate per-node query memory usage, for example, should be surfaced to the top-level statements page. These kinds of stats can be collected and returned in aggregate form, not only for the latest query execution.
Note that aggregating multiple EXPLAIN ANALYZE plans together on a per-operator or per-node basis could be done in the future but is out of scope for this initial work.
Another thing to keep in mind is that although these stats are in principle always-on, it should be possible for the user to specify a sampling frequency if desired.
Work Items
Overview
On the backend, the KV team will work on a lightweight form of tracing that will focus on minimizing the performance of returning these statistics unconditionally. The cost of collecting them is most likely negligible compared to other types of overhead, but we will have to verify this.
After this work item is complete, the conn executor will now have unconditional access to these traces, where it will be able to transfer the collected SpanStats to the StatementStatistics protobuf when the current statement stats are recorded. The Admin UI will be able to access this structured data and decide how to display it.
There are three “players” in this project: KV, SQL Execution backend work, and the frontend work done by design/contractors/SQL Execution. SQL Execution will depend on the KV team for always-on tracing to be able to return execution statistics unconditionally (although the SQL Execution team will probably also be involved for feedback/reviews). The frontend will depend on the structured execution statistics to work on how to display them. The KV team has no dependencies for this project.
Phases
In order to increase parallelism, this project will be divided into separate phases
Phase 1
While KV is working on the tracing infrastructure, SQL Execution will not be able to return statistics unconditionally. However, in order to unblock the frontend work, SQL Execution will implement returning EXPLAIN ANALYZE statistics in the StatementStatistics protobuf when a statement diagnostics bundle is requested. The Admin UI work can proceed showing/hiding UI elements based on whether this field is set in StatementStatistics. This work will probably be hidden behind a cluster setting while the work is in progress and most likely until the design team has completed the UI aspect.
The goal of this phase is to show an EXPLAIN ANALYZE diagram in the statement statistics page when a statement diagnostics bundle is requested.
Collecting/surfacing EXPLAIN ANALYZE stats when a statement diagnostics bundle is requested
Specific stats to surface
Phase 2
Phase 2 will most likely be focused on the UI aspect. Once the admin UI has this structured data, it can decide how best to display this information. A nicer-looking EXPLAIN ANALYZE has always been something we’ve wanted, and this is the time to do it. We can use snowflake’s UI for inspiration:
Some other things for the design team to think about: what aggregated EXPLAIN ANALYZE information is useful and how should it be displayed (e.g. memory usage per node/execution time of operators)?
While this is going on, the SQL Execution team can also think about surfacing useful basic stats to other pages. The example used throughout this document is query memory usage, but another interesting datum is network latency.
Phase 3
More for completeness, since it doesn’t need to strictly happen after Phase 2. This phase happens once KV finishes the lightweight tracing infrastructure. The SQL Execution team will evaluate what work needs to be done to use the new API. This allows us to flip the switch from collecting execution statistics when requested by the user through a request for a statement diagnostics bundle to unconditionally doing so (while still respecting a sample rate)
The text was updated successfully, but these errors were encountered: