Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: always on EXPLAIN ANALYZE #54556

Closed
2 of 3 tasks
asubiotto opened this issue Sep 18, 2020 · 1 comment · Fixed by #59132
Closed
2 of 3 tasks

sql: always on EXPLAIN ANALYZE #54556

asubiotto opened this issue Sep 18, 2020 · 1 comment · Fixed by #59132
Assignees
Labels
A-sql-execution Relating to SQL execution. A-sql-observability Related to observability of the SQL layer C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) meta-issue Contains a list of several other issues.

Comments

@asubiotto
Copy link
Contributor

asubiotto commented Sep 18, 2020

General

EXPLAIN ANALYZE is a SQL tool that runs a query with execution statistics collection enabled. A user can run EXPLAIN ANALYZE to figure out why a query is performing poorly. This tool is only useful if the user already knows that a query is not performing well, leaving the initial investigation and decision up to the user. EXPLAIN ANALYZE is also only available through the SQL shell or after an explicit user request for a statement diagnostics bundle, making it a lot less likely to be used.

Always on EXPLAIN ANALYZE is about surfacing more statistics to the statements page by unconditionally returning execution statistics without affecting query performance. The goal of this work is to reduce the friction a user experiences when investigating a slow query by making EXPLAIN ANALYZE information more readily available in the admin UI and eliminating the need to explicitly request this information. An example of a high-value use case is displaying maximum query memory usage in the statements page and allowing users to sort by this value to figure out which query is using excessive memory.

This is a WIP meta issue. Related issues will be added as concrete work items are fleshed out.

Goal

The goal is to surface these collected statistics unconditionally and display them in a useful way. To start with, we will surface the latest EXPLAIN ANALYZE diagram to the statements execution stats page. The UI of the EXPLAIN ANALYZE diagram will most likely need to be reworked to fit with the admin UI aesthetic.

Additionally, we will look at what EXPLAIN ANALYZE stats are useful in a non-diagram format. Aggregate per-node query memory usage, for example, should be surfaced to the top-level statements page. These kinds of stats can be collected and returned in aggregate form, not only for the latest query execution.

Note that aggregating multiple EXPLAIN ANALYZE plans together on a per-operator or per-node basis could be done in the future but is out of scope for this initial work.

Another thing to keep in mind is that although these stats are in principle always-on, it should be possible for the user to specify a sampling frequency if desired.

Work Items

Overview

On the backend, the KV team will work on a lightweight form of tracing that will focus on minimizing the performance of returning these statistics unconditionally. The cost of collecting them is most likely negligible compared to other types of overhead, but we will have to verify this.

After this work item is complete, the conn executor will now have unconditional access to these traces, where it will be able to transfer the collected SpanStats to the StatementStatistics protobuf when the current statement stats are recorded. The Admin UI will be able to access this structured data and decide how to display it.

There are three “players” in this project: KV, SQL Execution backend work, and the frontend work done by design/contractors/SQL Execution. SQL Execution will depend on the KV team for always-on tracing to be able to return execution statistics unconditionally (although the SQL Execution team will probably also be involved for feedback/reviews). The frontend will depend on the structured execution statistics to work on how to display them. The KV team has no dependencies for this project.

Phases

In order to increase parallelism, this project will be divided into separate phases

Phase 1

While KV is working on the tracing infrastructure, SQL Execution will not be able to return statistics unconditionally. However, in order to unblock the frontend work, SQL Execution will implement returning EXPLAIN ANALYZE statistics in the StatementStatistics protobuf when a statement diagnostics bundle is requested. The Admin UI work can proceed showing/hiding UI elements based on whether this field is set in StatementStatistics. This work will probably be hidden behind a cluster setting while the work is in progress and most likely until the design team has completed the UI aspect.

The goal of this phase is to show an EXPLAIN ANALYZE diagram in the statement statistics page when a statement diagnostics bundle is requested.

Collecting/surfacing EXPLAIN ANALYZE stats when a statement diagnostics bundle is requested

Specific stats to surface

Phase 2

Phase 2 will most likely be focused on the UI aspect. Once the admin UI has this structured data, it can decide how best to display this information. A nicer-looking EXPLAIN ANALYZE has always been something we’ve wanted, and this is the time to do it. We can use snowflake’s UI for inspiration:

Some other things for the design team to think about: what aggregated EXPLAIN ANALYZE information is useful and how should it be displayed (e.g. memory usage per node/execution time of operators)?

While this is going on, the SQL Execution team can also think about surfacing useful basic stats to other pages. The example used throughout this document is query memory usage, but another interesting datum is network latency.

Phase 3

More for completeness, since it doesn’t need to strictly happen after Phase 2. This phase happens once KV finishes the lightweight tracing infrastructure. The SQL Execution team will evaluate what work needs to be done to use the new API. This allows us to flip the switch from collecting execution statistics when requested by the user through a request for a statement diagnostics bundle to unconditionally doing so (while still respecting a sample rate)

@asubiotto asubiotto added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) meta-issue Contains a list of several other issues. A-sql-execution Relating to SQL execution. A-sql-observability Related to observability of the SQL layer labels Sep 18, 2020
craig bot pushed a commit that referenced this issue Nov 2, 2020
55969: sql/ui: surface network bytes sent in statements page r=RaduBerinde,dhartunian a=asubiotto

Closes #55331 

This PR implements the conditional statistics collection and display described in phase 1 of the always-on EXPLAIN ANALYZE issue (#54556). The goal of this work is to provide a framework to surface more top-level statistics while always-on tracing is being worked on. The next steps here are to surface more top-level statistics we find interesting to the admin UI (e.g. query memory usage by @cathymw in #54340).

Once always-on tracing is completed, the code added to the conn executor to analyze trace data will be moved to `recordStatement`.

cc @jordanlewis 

Co-authored-by: Alfonso Subiotto Marques <[email protected]>
@asubiotto asubiotto self-assigned this Jan 19, 2021
@asubiotto
Copy link
Contributor Author

With #59132 we'll have sampled execution stats which completes Phase 3 of this work. Phase 2 is still in progress, but I think it'll be more productive to track the UI work separately.

@craig craig bot closed this as completed in 7d27d9a Jan 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-execution Relating to SQL execution. A-sql-observability Related to observability of the SQL layer C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) meta-issue Contains a list of several other issues.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant