-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stats referencing dead nodes create unrecoverable failure in db console network page #59322
Comments
This PR may be helpful here: #56529 |
Hi @dhartunian, do you have an estimate when this will be fixed? thanks! |
@florence-crl not at the moment. let us know if this is critical for next release or for a backport and we can re-prioritize cc @thtruo |
Hi @dhartunian, for clarification, does PR: #56529 prevent this issue from happening? thanks! |
It appears that #56529 may fix the issue, I will reproduce by EOW and provide an update in this ticket. |
Previously, nodes could reference nodes we didn't know about in their stats protos. Our frontend wouldn't know those nodeIDs because they weren't in the list of known nodes that were returned from the DB and would fail to render the network stats table. Now we filter out any nodeIDs that are unknown to us so that we don't render a broken table. This is meant to be a targeted fix for a specific problem. Future work will be done that addresses properly accounting for the various node liveness states that have been changed on the backend and have not been fully accounted for in the frontend. Resolves cockroachdb#59322 Release note (ui change): Fixes a bug where nodes would reference dead cluster nodes in their network stats and cause the network DB Console page to crash.
I was able to reproduce this on 21.2.5 using roachprod
|
Hi @thtruo I've been trying to reproduce this on the latest master build - |
Alex and I synced offline. So it looks like this issue is reproducible for v22.1 but it still affects 21.2.5+ |
So it seemed a bit fishy to me that it wasn't occurring on latest but was on 21.2.x since the page flow is basically the same between the versions. I spent awhile testing and I actually can reproduce this on latest as well as 21.2.x. It seems to only occur when a nodes liveness status is |
76753: ui: create ui pages for active queries and transactions r=xinhaoz a=xinhaoz This commit creates the pages for active transactins and queries in db-console. Each page calls the /sessions API at an interval of 10 seconds to refresh the data. Release note (ui change): Users can now see actively running queries and transactions in the SQL Activity tab. The transactions and statements tabs in SQL activity now have a menu to show either active or historial transactions and statements data. ------------------------- https://www.loom.com/share/964ee2624f4e41d0b6a88d5a280ce6a3 78292: [CRDB-3303] ui: handle latency not defined on network page r=Santamaura a=Santamaura Previously when a cluster with multiple nodes had a node stopped and then another node quickly started, the network page would crash. This was due to the node object holding the latency value being undefined and the ui trying to read this key when that was undefined. This occurs when a node is in an `UNAVAILABLE` state. This patch resolves the issue by being more defensive on the front end by safely attempting to access latency and if it is undefined, set the value to 0. The existing code is able to handle this case afterward and will eventually set the user friendly latency status to `--`. Resolves: #59322 Release note (ui change): Fixes a bug where a node in the `UNAVAILABLE` state will not have latency defined and cause the network page to crash. Overview list page & network page when node status is `UNAVAILABLE`: ![Screen Shot 2022-03-22 at 5 14 47 PM](https://user-images.githubusercontent.com/17861665/159579337-a48fafb1-e8fd-4c47-a07b-df93de532f0a.png) ![Screen Shot 2022-03-22 at 5 14 59 PM](https://user-images.githubusercontent.com/17861665/159579355-dc8142c9-703b-4425-9890-aa702e354fb4.png) 78459: sql: add sharded bucket count to crdb_internal.table_indexes r=chengxiong-ruan a=chengxiong-ruan Fixes #73057 Release justification: lost risk and beneficial index info. Release note (sql change): bucket count of hash sharded index is now being shown from the crdb_internal.table_indexes table. 78605: CODEOWNERS: set test-eng as roachprod codeowners r=srosenberg a=jlinder The test engineering team took ownership of roachprod. This reflects that change in the CODEOWNERS file. Release note: None Co-authored-by: Xin Hao Zhang <[email protected]> Co-authored-by: Santamaura <[email protected]> Co-authored-by: Chengxiong Ruan <[email protected]> Co-authored-by: James H. Linder <[email protected]>
Previously when a cluster with multiple nodes had a node stopped and then another node quickly started, the network page would crash. This was due to the node object holding the latency value being undefined and the ui trying to read this key when that was undefined. This occurs when a node is in an `UNAVAILABLE` state. This patch resolves the issue by being more defensive on the front end by safely attempting to access latency and if it is undefined, set the value to 0. The existing code is able to handle this case afterward and will eventually set the user friendly latency status to `--`. Resolves: #59322 Release note (ui change): Fixes a bug where a node in the `UNAVAILABLE` state will not have latency defined and cause the network page to crash.
Previously when a cluster with multiple nodes had a node stopped and then another node quickly started, the network page would crash. This was due to the node object holding the latency value being undefined and the ui trying to read this key when that was undefined. This occurs when a node is in an `UNAVAILABLE` state. This patch resolves the issue by being more defensive on the front end by safely attempting to access latency and if it is undefined, set the value to 0. The existing code is able to handle this case afterward and will eventually set the user friendly latency status to `--`. Resolves: #59322 Release note (ui change): Fixes a bug where a node in the `UNAVAILABLE` state will not have latency defined and cause the network page to crash.
Previously when a cluster with multiple nodes had a node stopped and then another node quickly started, the network page would crash. This was due to the node object holding the latency value being undefined and the ui trying to read this key when that was undefined. This occurs when a node is in an `UNAVAILABLE` state. This patch resolves the issue by being more defensive on the front end by safely attempting to access latency and if it is undefined, set the value to 0. The existing code is able to handle this case afterward and will eventually set the user friendly latency status to `--`. Resolves: cockroachdb#59322 Release note (ui change): Fixes a bug where a node in the `UNAVAILABLE` state will not have latency defined and cause the network page to crash.
Describe the problem
This was a customer issue where their node stats contained references to NodeIDs which were marked as "DEAD" in the liveness struct. However the code still attempted to retrieve stats for them since LIVE nodes had retained some network stats for the missing nodes. When the code attempts to render the network stats table, the uncaught failure causes a dead page.
Expected behavior
The network page should render without the DEAD nodes referenced.
Jira issue: CRDB-3303
The text was updated successfully, but these errors were encountered: