Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL activity pages (sessions, transactions and statements) are empty during an active workload #68602

Closed
kevin-v-ngo opened this issue Aug 9, 2021 · 11 comments · Fixed by #71846
Assignees
Labels
A-sql-console-general SQL Observability issues on the DB console spanning multiple areas. Includes Cockroach Cloud Console C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@kevin-v-ngo
Copy link

CC dedicated clusters show no activity in the Statements, Sessions, or Transactions pages within the CC console. There’s only a red info icon with no text. This was during an active workload (TPCC).

Note these pages were working after initially creating the cluster. Screenshot:

image

@kevin-v-ngo kevin-v-ngo added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-observability A-sql-console-general SQL Observability issues on the DB console spanning multiple areas. Includes Cockroach Cloud Console labels Aug 9, 2021
Azhng added a commit to Azhng/cockroach that referenced this issue Sep 30, 2021
requests when the API requests fails

Previously, Statement Page and Transaction Page issues API requests
in conmponentDidUpdate() React lifecycle hook. However, when the
backend API returns an error, this lifecycle hook is constantly
triggered which resulted in infinite loop. This crashes the entire
frontend application.
This commit removes API calls from within the componentDidUpdate()
hooks.

Resolves cockroachdb#68602

Release note: None
@Azhng
Copy link
Contributor

Azhng commented Sep 30, 2021

Spent some times investigation this. This is the result of backend not properly surfacing the error message to frontend.

Specifically in this case, the intrusion was using an expired kubenet cookie, which resulted in a error in the backend DB Console proxy. (This can be reproduced using a hacked version of intrusion). As the result, it simply returns a HTTP response with 502 error code without any error message. This causes the frontend to surface an empty error string.

However, our error handling is severely flawed here due to the API calls within componentDidUpdate() method. This triggers an infinite loop that essentially causes the entire webpage to crash.

PR here addressed that issue.

@kevin-v-ngo
Copy link
Author

Thanks for the update Archer! Should there even be an error message surfaced? What's the expected behavior?

I would expect Statements/Transactions to still be populated in this view.

Also did you forget to link the PR?

@Azhng
Copy link
Contributor

Azhng commented Sep 30, 2021

Should there even be an error message surfaced?

This should definitely be an error. Though I am expecting the error should be coming from Console API backend.

What's the expected behavior?

Well, ideally the expected behavior is there's never going to be an error :P . But in this case, it's hard to say. If either the Console -> Intrusion connection or Intrusion -> Cluster breaks, this would means something really wrong is happening in the cloud. Last few times this happened we both declared incidents.

I would expect Statements/Transactions to still be populated in this view.

If the connection borked, we won't be able to get data from the backend. I suppose we can still load the page with our cached stale data until the cache TTL expires. But once the cache TTL expires there's not much we can do. Also we won't be able to change selected time ranges or go into statement detail page, since they would need additional API call.

@Azhng
Copy link
Contributor

Azhng commented Sep 30, 2021

Chatted with @maryliag offline. To summarize:

  1. In normal case, we would like to poll the backend API at a fixed interval (5 minutes).
  2. If we encounter an error, we would stop polling and display the error.
  • when we display the error, we also want to display a "Retry" button along with the error message. One user click on the Retry button, we will retry the calling the backend API:
    • if the retry API call succeeds, we would resume the polling
    • else, we do nothing

@Annebirzin , do you mind create a design for the error box with a Retry button?

@kevin-v-ngo
Copy link
Author

Thanks Archer. I was more wondering why we're consistently getting this (unexpected) error related cookie expiration and how we can make this more resilient.

So it seems like we'll be polling at a fixed interval which should reduce the likelihood of running into this error (expired cookie) but for edge cases, we need an error message and retry button. Correct?

@Azhng
Copy link
Contributor

Azhng commented Sep 30, 2021

why we're consistently getting this (unexpected) error related cookie expiration

This was due to erroneous logic within the Intrusion, which was recently patched.

So it seems like we'll be polling at a fixed interval which should reduce the likelihood of running into this error (expired cookie) but for edge cases, we need an error message and retry button. Correct?

Hmm running at a fixed interval is not going to reduce the likelihood of running into this error, since it's something completely out of our control.

This is more of a technical issue of how we are managing out API calls. Currently, we would attempt to issue an API call as soon as anything on the page has changed (e.g. filter, search bar, sorting order). This attempt is aborted if we have already requested for data from the API within last cache TTL (5 minutes).

This means we would constantly attempt to issue API calls, keep getting aborted in a normal case. But in unhappy case, er would keep attempt to issue API calls, keep getting errors and retry them. This is extremely wasteful.

The new approach would be instead of running this loop of:

  • attempt to call API
  • gets aborted

We simply poll at a fixed interval, and pause the polling if an error is encountered.

@kevin-v-ngo
Copy link
Author

Got it! Thanks for providing the context Archer.

@Annebirzin
Copy link

Annebirzin commented Oct 4, 2021

Here are error states for the statements and transactions pages, including a link to reload the page: https://www.figma.com/file/MoSPFkEyJfGipCmQvvLg1a/21.2_SQL-obsrv_query-performance?node-id=3524%3A74970

Screen Shot 2021-10-05 at 10 40 33 AM

Screen Shot 2021-10-05 at 10 39 53 AM

Not sure if we want to be more specific around the messaging cc @ianjevans

Also, does the same thing happen on the transaction page?

@Azhng
Copy link
Contributor

Azhng commented Oct 4, 2021

Yes, same thing would also happen on the Transaction Page

@ianjevans
Copy link
Contributor

I have a slight preference for "error" over "issue" on wording, and we can be more clear that the error isn't in the cluster but the console.
"This page had an unexpected error while loading transactions/statements. Reload this page."

@Annebirzin
Copy link

Sounds good, updated the designs to reflect those changes

Azhng added a commit to Azhng/cockroach that referenced this issue Oct 21, 2021
Resolves cockroachdb#68602

Release note (ui change): when an error is encoutered in the statement/
transaction/session page, user can now click on reload button to reload
the page.
Azhng added a commit to Azhng/cockroach that referenced this issue Oct 21, 2021
Resolves cockroachdb#68602

Release note (ui change): when an error is encoutered in the statement/
transaction/session page, user can now click on reload button to reload
the page.
Azhng added a commit to Azhng/cockroach that referenced this issue Oct 21, 2021
Resolves cockroachdb#68602

Release note (ui change): when an error is encoutered in the statement/
transaction/session page, user can now click on reload button to reload
the page.
Azhng added a commit to Azhng/cockroach that referenced this issue Oct 22, 2021
Resolves cockroachdb#68602

Release note (ui change): when an error is encoutered in the statement/
transaction/session page, user can now click on reload button to reload
the page.
craig bot pushed a commit that referenced this issue Oct 23, 2021
71846: ui: add reload button for Loading component r=maryliag a=Azhng

Resolves #68602



https://user-images.githubusercontent.com/9267198/138362614-96f15559-9761-4c4b-8a47-2843cad7978d.mov



Release note (ui change): when an error is encoutered in the statement/
transaction/session page, user can now click on reload button to reload
the page.

Co-authored-by: Azhng <[email protected]>
@craig craig bot closed this as completed in a5d1ed8 Oct 23, 2021
blathers-crl bot pushed a commit that referenced this issue Nov 24, 2021
Resolves #68602

Release note (ui change): when an error is encoutered in the statement/
transaction/session page, user can now click on reload button to reload
the page.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-console-general SQL Observability issues on the DB console spanning multiple areas. Includes Cockroach Cloud Console C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
4 participants