-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: conditional statement diagnostics #57634
Comments
Yeah, eg a checkbox that says "only collect diagnostics if execution time is above: XXX" |
cc @Annebirzin |
@awoods187 @RaduBerinde I put together some designs for how this could work. There are 3 places from where a user can activate diagnostics:
For 1, we show a modal with a description about how diagnostics work. Like you suggested we can introduce the radio button selection below the description text. For 2 and 3, we do not show a modal so we would need to introduce a modal with the radio button selection. Figma designs are here |
Thanks Anne, these are very good! The only thing we should change is the phrasing. The result will always be a "full" diagnostics bundle, we are choosing whether to collect it on any instance of the query, or wait until we see an instance that slower than X. So maybe "Would you like to set a condition for the diagnostics collection? a) No, collect on the next instance of the statement; b) Yes, collect on the next instance of the statement that takes longer than ms" |
Thanks Radu, that makes sense. I've updated the figma designs and the screenshots in my previous comment to reflect some copy changes. One question, should the use be able to define the timeframe (eg. 100ms) or would we set a predefined time? |
Definitely needs to be controllable by the user. |
Ok I see, in that case I've updated the screens/figma link above to include a time picker so the user can control the time length for collecting diagnostics. |
We have seen multiple customer issues where this functionality would have been very useful. One wart is that we really need to be able to cancel such a diagnostics request. If we set a time bound that is too high, the request will be open forever, and all instances of that query will run slower. Probably the safest way to do this would be for the user to also have to choose an "expiration time" from the request. We should make this high priority for 21.2. I can work on the lower-level implementation but we will need UI changes. |
Another thing that came up on a customer call yesterday was the hesitation to enable diagnostics collection for a high-frequency query, because of the fear of the contention that will be caused by all nodes trying to write a bundle at the same time. One thing that was suggested was to optionally ask a single node to collect a bundle. |
Yeah I think that makes sense. We could also add a random delay on each node so the request doesn't become effective on all nodes at exactly the same time. That said, I'm not sure the hypothesized problem is real - there should be no contention when writing the actual bundle data. |
CC @vy-ton for help with prioritization. |
FYI - We've heard this quite a few times and the SQL Observability team is closely tracking this on our backlog. |
In recent customer support cases #1011 we find it challenging to identify the query and predicates used that had performance issues. Since the query is executed many times, and it doesn't use the same plan for each predicates. Having this feature would help us identify the root cause much faster |
On the PR #72491 I can see this is also adding an |
I would say that if a non-zero latency is selected, the |
72491: stmtdiagnostics: add min exec latency condition r=yuzefovich a=yuzefovich **stmtdiagnostics: don't hold the lock when using gossip** Release note: None **stmtdiagnostics: add min exec latency condition** This commit extends the statement diagnostics to support conditional collection: namely, it introduces `minExecutionLatency` parameter so that we could collect bundles on the slow outliers if necessary. Using the zero value for that parameter will give us the previous behavior (collecting the bundle on the next query fingerprint execution). When `minExecutionLatency` argument is non-zero, then all queries that match the specified fingerprint are traced until a slow enough query comes along. This approach was chosen over tracing at most one query at a time so that we don't miss the slow outlier when fast queries can "starve" it (imagine that a node is bombarded with many fast queries and a slow outlier occurs very rarely - it would be unfortunate to not trace the outlier because when it arrives, a fast query is already being traced). In order for some diagnostics requests not to stay "alive" forever, this commit also adds an `expiresAfter` interval. It'll be provided by the user when the conditional collection is desired. The zero value is treated as "never". Expired and not completed requests are not explicitly deleted from the system table and are being simply ignored once they expire. This commit then adds a migration to the `statement_diagnostics_requests` system table to add these two additional columns as well as to update the secondary index to store these columns as well. All requests created prior to the migration will never expire and won't have the min execution latency conditional. If a diagnostics request already exists for a fingerprint in the system table, a new one (even with different conditionals / expiration interval) cannot be created by the API - an error is returned. The follow-up work will extend an API to make it possible to cancel the existing request so that a new one could be created. This API is not hooked up to the DB Console yet, thus there is no release note. Supports: #57634. Release note: None Co-authored-by: Yahor Yuzefovich <[email protected]>
#72491 has been merged, so we are now able to support conditional statement diagnostics on the DB side. It'd be great if @cockroachdb/sql-observability could prioritize resolving this issue before 22.1. An additional flow that we need to work out as part of this issue is how to cancel an already created request for statement diagnostics. Imagine a scenario where a user creates a request with too large value for We probably want to introduce some flow to cancel an already existing request. This will allow the user to then create a new request with the updated |
I've updated the Conditional statement diagnostic designs to include the Expiration After input, and also added the cancellation flows a couple of open questions:
@stbof wanted to get this on your radar for a copy review at some point:
Let me know any other thoughts |
73882: stmtdiagnostics: add CancelRequest API r=yuzefovich a=yuzefovich This commit adds an API for a stmt diagnostics request to be canceled based on the query fingerprint. Internally, we choose to achieve that by marking the request (if found) as expired because this allows any ongoing query traces to write their bundles. The API is only available once 22.1 migrations have been completed, otherwise it'll error. This commit uses gossip with a separate key to tell other nodes about the cancellation of a request. Fixes: #73863. Informs: #57634. Paired with @lindseyjin to implement this. Release note: None 74140: sql: ignore shard column during unique constraint searching r=chengxiong-ruan a=chengxiong-ruan Fixes #69192 Referencing table does not know about the shard column. So need to ignore it when looking for a valid unique constraint. Release note (bug fix): Foreign key referencing hash sharded key won't fail anymore. Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Chengxiong Ruan <[email protected]>
Resolves cockroachdb#57634 Previously, we did not have the ability to create conditional statement diagnostics in the frontend. This commit adds in the ability to specify a minimum execution latency and an expiry time when creating a statement diagnostics request. These changes apply to both DB and CC console. Since expired requests are not surfaced at all in the frontend, we have also modified the statement diagnostics API response to not return already expired and incomplete requests. Lastly, this commit also deletes some unused code related to statement diagnostics modals. Release note (ui change): Add ability to create conditional statement diagnostics by adding two new fields 1) minimum execution latency, which specifies the limit for when a statement should be tracked and 2) expiry time, which specifies when a diagnostics request should expire.
Resolves cockroachdb#57634 Previously, we did not have the ability to create conditional statement diagnostics in the frontend. This commit adds in the ability to specify a minimum execution latency and an expiry time when creating a statement diagnostics request. These changes apply to both DB and CC console. Since expired requests are not surfaced at all in the frontend, we have also modified the statement diagnostics API response to not return already expired and incomplete requests. Lastly, this commit also deletes some unused code related to statement diagnostics modals. Release note (ui change): Add ability to create conditional statement diagnostics by adding two new fields 1) minimum execution latency, which specifies the limit for when a statement should be tracked and 2) expiry time, which specifies when a diagnostics request should expire.
Resolves cockroachdb#57634 Previously, we did not have the ability to create conditional statement diagnostics in the frontend. This commit adds in the ability to specify a minimum execution latency and an expiry time when creating a statement diagnostics request. These changes apply to both DB and CC console. Since expired requests are not surfaced at all in the frontend, we have also modified the statement diagnostics API response to not return already expired and incomplete requests. Lastly, this commit also deletes some unused code related to statement diagnostics modals. Release note (ui change): Add ability to create conditional statement diagnostics by adding two new fields 1) minimum execution latency, which specifies the limit for when a statement should be tracked and 2) expiry time, which specifies when a diagnostics request should expire.
74112: ui/server: add ability to create conditional statement diagnostics r=lindseyjin a=lindseyjin Resolves #57634 Previously, we did not have the ability to create conditional statement diagnostics in the frontend. This commit adds in the ability to specify a minimum execution latency and an expiry time when creating a statement diagnostics request. These changes apply to both DB and CC console. Since expired requests are not surfaced at all in the frontend, we have also modified the statement diagnostics API response to not return already expired and incomplete requests. Lastly, this commit also deletes some unused code related to statement diagnostics modals. Release note (ui change): Add ability to create conditional statement diagnostics by adding two new fields 1) minimum execution latency, which specifies the limit for when a statement should be tracked and 2) expiry time, which specifies when a diagnostics request should expire. 74145: streamingccl: Ensure appropriate privileges when replicating. r=miretskiy a=miretskiy Add checks to replication stream manager to ensure appropriate privileges and require enterprise license when executing streaming replication. Release Notes: None Co-authored-by: Lindsey Jin <[email protected]> Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Resolves cockroachdb#57634 Previously, we did not have the ability to create conditional statement diagnostics in the frontend. This commit adds in the ability to specify a minimum execution latency and an expiry time when creating a statement diagnostics request. These changes apply to both DB and CC console. Since expired requests are not surfaced at all in the frontend, we have also modified the statement diagnostics API response to not return already expired and incomplete requests. Lastly, this commit also deletes some unused code related to statement diagnostics modals. Release note (ui change): Add ability to create conditional statement diagnostics by adding two new fields 1) minimum execution latency, which specifies the limit for when a statement should be tracked and 2) expiry time, which specifies when a diagnostics request should expire.
We should extend the statement diagnostics to allow enabling on a specific statement with additional conditions - eg only generate diagnostics if exec time > 100ms.
We would enable tracing for all instances of the statement until one execution satisfies the condition. This would be very useful to debug occasional slow instances of a query (high p99 time).
Epic: CRDB-8649
CC @awoods187 @andreimatei
The text was updated successfully, but these errors were encountered: