Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: conditional statement diagnostics #57634

Closed
RaduBerinde opened this issue Dec 7, 2020 · 18 comments · Fixed by #74112
Closed

sql: conditional statement diagnostics #57634

RaduBerinde opened this issue Dec 7, 2020 · 18 comments · Fixed by #74112
Labels
A-sql-debug-bundle Issues related to statement bundle improvements C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) N-followup Needs followup. O-postmortem Originated from a Postmortem action item.

Comments

@RaduBerinde
Copy link
Member

RaduBerinde commented Dec 7, 2020

We should extend the statement diagnostics to allow enabling on a specific statement with additional conditions - eg only generate diagnostics if exec time > 100ms.

We would enable tracing for all instances of the statement until one execution satisfies the condition. This would be very useful to debug occasional slow instances of a query (high p99 time).

Epic: CRDB-8649

CC @awoods187 @andreimatei

@RaduBerinde RaduBerinde added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Dec 7, 2020
@awoods187
Copy link
Contributor

How would you imagine this working? Some kind of filter setting inside of the activate?
image

@RaduBerinde
Copy link
Member Author

RaduBerinde commented Dec 7, 2020

Yeah, eg a checkbox that says "only collect diagnostics if execution time is above: XXX"

@awoods187
Copy link
Contributor

cc @Annebirzin

@Annebirzin
Copy link

Annebirzin commented Dec 17, 2020

@awoods187 @RaduBerinde I put together some designs for how this could work. There are 3 places from where a user can activate diagnostics:

  1. the statement list page
  2. the statement detail page when no diagnostics have been activated
  3. the statement details page when there have been diagnostics activated

For 1, we show a modal with a description about how diagnostics work. Like you suggested we can introduce the radio button selection below the description text.

For 2 and 3, we do not show a modal so we would need to introduce a modal with the radio button selection.

Figma designs are here

@RaduBerinde
Copy link
Member Author

Thanks Anne, these are very good! The only thing we should change is the phrasing. The result will always be a "full" diagnostics bundle, we are choosing whether to collect it on any instance of the query, or wait until we see an instance that slower than X. So maybe "Would you like to set a condition for the diagnostics collection? a) No, collect on the next instance of the statement; b) Yes, collect on the next instance of the statement that takes longer than ms"

@Annebirzin
Copy link

Thanks Radu, that makes sense. I've updated the figma designs and the screenshots in my previous comment to reflect some copy changes. One question, should the use be able to define the timeframe (eg. 100ms) or would we set a predefined time?

@RaduBerinde
Copy link
Member Author

Definitely needs to be controllable by the user.

@Annebirzin
Copy link

Ok I see, in that case I've updated the screens/figma link above to include a time picker so the user can control the time length for collecting diagnostics.

@RaduBerinde
Copy link
Member Author

We have seen multiple customer issues where this functionality would have been very useful.

One wart is that we really need to be able to cancel such a diagnostics request. If we set a time bound that is too high, the request will be open forever, and all instances of that query will run slower. Probably the safest way to do this would be for the user to also have to choose an "expiration time" from the request.

We should make this high priority for 21.2. I can work on the lower-level implementation but we will need UI changes.

@andreimatei
Copy link
Contributor

Another thing that came up on a customer call yesterday was the hesitation to enable diagnostics collection for a high-frequency query, because of the fear of the contention that will be caused by all nodes trying to write a bundle at the same time. One thing that was suggested was to optionally ask a single node to collect a bundle.

@RaduBerinde
Copy link
Member Author

One thing that was suggested was to optionally ask a single node to collect a bundle.

Yeah I think that makes sense. We could also add a random delay on each node so the request doesn't become effective on all nodes at exactly the same time. That said, I'm not sure the hypothesized problem is real - there should be no contention when writing the actual bundle data.

@RaduBerinde
Copy link
Member Author

CC @vy-ton for help with prioritization.

@kevin-v-ngo
Copy link

FYI - We've heard this quite a few times and the SQL Observability team is closely tracking this on our backlog.

@lunevalex lunevalex added the N-followup Needs followup. label Jun 15, 2021
@lin-crl
Copy link
Contributor

lin-crl commented Jul 24, 2021

In recent customer support cases #1011 we find it challenging to identify the query and predicates used that had performance issues. Since the query is executed many times, and it doesn't use the same plan for each predicates. Having this feature would help us identify the root cause much faster

@maryliag
Copy link
Contributor

On the PR #72491 I can see this is also adding an expires after. The ui should have the input for that value too (with a default value of never expires)

@RaduBerinde
Copy link
Member Author

On the PR #72491 I can see this is also adding an expires after. The ui should have the input for that value too (with a default value of never expires)

I would say that if a non-zero latency is selected, the expires after should default to something like 15 minutes; and there should be a hint or tooltip that explains that outstanding requests with latency cause all "fast" instances of that query to run slower.

craig bot pushed a commit that referenced this issue Nov 23, 2021
72491: stmtdiagnostics: add min exec latency condition r=yuzefovich a=yuzefovich

**stmtdiagnostics: don't hold the lock when using gossip**

Release note: None

**stmtdiagnostics: add min exec latency condition**

This commit extends the statement diagnostics to support conditional
collection: namely, it introduces `minExecutionLatency` parameter so
that we could collect bundles on the slow outliers if necessary. Using
the zero value for that parameter will give us the previous behavior
(collecting the bundle on the next query fingerprint execution).

When `minExecutionLatency` argument is non-zero, then all queries that
match the specified fingerprint are traced until a slow enough query
comes along. This approach was chosen over tracing at most one query at
a time so that we don't miss the slow outlier when fast queries can
"starve" it (imagine that a node is bombarded with many fast queries and
a slow outlier occurs very rarely - it would be unfortunate to not trace
the outlier because when it arrives, a fast query is already being
traced).

In order for some diagnostics requests not to stay "alive" forever, this
commit also adds an `expiresAfter` interval. It'll be provided by the
user when the conditional collection is desired. The zero value is
treated as "never". Expired and not completed requests are not explicitly
deleted from the system table and are being simply ignored once they
expire.

This commit then adds a migration to the `statement_diagnostics_requests`
system table to add these two additional columns as well as to update
the secondary index to store these columns as well. All requests created
prior to the migration will never expire and won't have the min execution
latency conditional.

If a diagnostics request already exists for a fingerprint in the system
table, a new one (even with different conditionals / expiration
interval) cannot be created by the API - an error is returned. The
follow-up work will extend an API to make it possible to cancel the
existing request so that a new one could be created.

This API is not hooked up to the DB Console yet, thus there is no release
note.

Supports: #57634.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
@yuzefovich
Copy link
Member

#72491 has been merged, so we are now able to support conditional statement diagnostics on the DB side. It'd be great if @cockroachdb/sql-observability could prioritize resolving this issue before 22.1.

An additional flow that we need to work out as part of this issue is how to cancel an already created request for statement diagnostics. Imagine a scenario where a user creates a request with too large value for MinExecutionLatency that never expires - if no query instance runs slow enough, this will mean that every query instance matching the fingerprint will be traced every time, which will have non-trivial impact on the latency for that query fingerprint. (The only workaround right now is modifying the system table manually.)

We probably want to introduce some flow to cancel an already existing request. This will allow the user to then create a new request with the updated MinExecutionLatency or ExpiresAfter value (or make the request non-conditional).

@Annebirzin
Copy link

Annebirzin commented Dec 7, 2021

I've updated the Conditional statement diagnostic designs to include the Expiration After input, and also added the cancellation flows

a couple of open questions:

@stbof wanted to get this on your radar for a copy review at some point:

  • The description copy in the modal is pulled from the current copy on the Statement details > Diagnostics tab (empty state) with a minor tweak to reference the new conditional collection option.
  • There also may be a need for a tooltip on the Expiration field to let the user know that activating a bundle could cause the query to slow down

Let me know any other thoughts

cc @kevin-v-ngo @maryliag

craig bot pushed a commit that referenced this issue Dec 21, 2021
73882: stmtdiagnostics: add CancelRequest API r=yuzefovich a=yuzefovich

This commit adds an API for a stmt diagnostics request to be canceled
based on the query fingerprint. Internally, we choose to achieve that by
marking the request (if found) as expired because this allows any
ongoing query traces to write their bundles. The API is only available
once 22.1 migrations have been completed, otherwise it'll error. This
commit uses gossip with a separate key to tell other nodes about the
cancellation of a request.

Fixes: #73863.
Informs: #57634.

Paired with @lindseyjin to implement this.

Release note: None

74140: sql: ignore shard column during unique constraint searching r=chengxiong-ruan a=chengxiong-ruan

Fixes #69192

Referencing table does not know about the shard column. So
need to ignore it when looking for a valid unique constraint.

Release note (bug fix): Foreign key referencing hash sharded
key won't fail anymore.

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
lindseyjin added a commit to lindseyjin/cockroach that referenced this issue Dec 22, 2021
Resolves cockroachdb#57634

Previously, we did not have the ability to create conditional statement
diagnostics in the frontend. This commit adds in the ability to specify
a minimum execution latency and an expiry time when creating a statement
diagnostics request. These changes apply to both DB and CC console.

Since expired requests are not surfaced at all in the frontend, we have
also modified the statement diagnostics API response to not return
already expired and incomplete requests.

Lastly, this commit also deletes some unused code related to statement
diagnostics modals.

Release note (ui change): Add ability to create conditional statement
diagnostics by adding two new fields 1) minimum execution latency, which
specifies the limit for when a statement should be tracked and 2) expiry
time, which specifies when a diagnostics request should expire.
lindseyjin added a commit to lindseyjin/cockroach that referenced this issue Dec 23, 2021
Resolves cockroachdb#57634

Previously, we did not have the ability to create conditional statement
diagnostics in the frontend. This commit adds in the ability to specify
a minimum execution latency and an expiry time when creating a statement
diagnostics request. These changes apply to both DB and CC console.

Since expired requests are not surfaced at all in the frontend, we have
also modified the statement diagnostics API response to not return
already expired and incomplete requests.

Lastly, this commit also deletes some unused code related to statement
diagnostics modals.

Release note (ui change): Add ability to create conditional statement
diagnostics by adding two new fields 1) minimum execution latency, which
specifies the limit for when a statement should be tracked and 2) expiry
time, which specifies when a diagnostics request should expire.
lindseyjin added a commit to lindseyjin/cockroach that referenced this issue Dec 23, 2021
Resolves cockroachdb#57634

Previously, we did not have the ability to create conditional statement
diagnostics in the frontend. This commit adds in the ability to specify
a minimum execution latency and an expiry time when creating a statement
diagnostics request. These changes apply to both DB and CC console.

Since expired requests are not surfaced at all in the frontend, we have
also modified the statement diagnostics API response to not return
already expired and incomplete requests.

Lastly, this commit also deletes some unused code related to statement
diagnostics modals.

Release note (ui change): Add ability to create conditional statement
diagnostics by adding two new fields 1) minimum execution latency, which
specifies the limit for when a statement should be tracked and 2) expiry
time, which specifies when a diagnostics request should expire.
craig bot pushed a commit that referenced this issue Dec 23, 2021
74112: ui/server: add ability to create conditional statement diagnostics r=lindseyjin a=lindseyjin

Resolves #57634

Previously, we did not have the ability to create conditional statement
diagnostics in the frontend. This commit adds in the ability to specify
a minimum execution latency and an expiry time when creating a statement
diagnostics request. These changes apply to both DB and CC console.

Since expired requests are not surfaced at all in the frontend, we have
also modified the statement diagnostics API response to not return
already expired and incomplete requests.

Lastly, this commit also deletes some unused code related to statement
diagnostics modals.

Release note (ui change): Add ability to create conditional statement
diagnostics by adding two new fields 1) minimum execution latency, which
specifies the limit for when a statement should be tracked and 2) expiry
time, which specifies when a diagnostics request should expire.

74145: streamingccl: Ensure appropriate privileges when replicating. r=miretskiy a=miretskiy

Add checks to replication stream manager to ensure appropriate privileges
and require enterprise license when executing streaming replication.

Release Notes: None

Co-authored-by: Lindsey Jin <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
@craig craig bot closed this as completed in 6c1b611 Dec 23, 2021
gustasva pushed a commit to gustasva/cockroach that referenced this issue Jan 4, 2022
Resolves cockroachdb#57634

Previously, we did not have the ability to create conditional statement
diagnostics in the frontend. This commit adds in the ability to specify
a minimum execution latency and an expiry time when creating a statement
diagnostics request. These changes apply to both DB and CC console.

Since expired requests are not surfaced at all in the frontend, we have
also modified the statement diagnostics API response to not return
already expired and incomplete requests.

Lastly, this commit also deletes some unused code related to statement
diagnostics modals.

Release note (ui change): Add ability to create conditional statement
diagnostics by adding two new fields 1) minimum execution latency, which
specifies the limit for when a statement should be tracked and 2) expiry
time, which specifies when a diagnostics request should expire.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-debug-bundle Issues related to statement bundle improvements C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) N-followup Needs followup. O-postmortem Originated from a Postmortem action item.
Projects
None yet
Development

Successfully merging a pull request may close this issue.