stats referencing dead nodes create unrecoverable failure in db console network page #59322

dhartunian · 2021-01-22T22:43:51Z

Describe the problem

This was a customer issue where their node stats contained references to NodeIDs which were marked as "DEAD" in the liveness struct. However the code still attempted to retrieve stats for them since LIVE nodes had retained some network stats for the missing nodes. When the code attempts to render the network stats table, the uncaught failure causes a dead page.

Expected behavior
The network page should render without the DEAD nodes referenced.

Jira issue: CRDB-3303

dhartunian · 2021-02-01T21:02:07Z

This PR may be helpful here: #56529

florence-crl · 2021-02-24T22:36:26Z

Hi @dhartunian, do you have an estimate when this will be fixed? thanks!

dhartunian · 2021-02-26T23:35:45Z

@florence-crl not at the moment. let us know if this is critical for next release or for a backport and we can re-prioritize

cc @thtruo

florence-crl · 2021-03-23T20:46:03Z

Hi @dhartunian, for clarification, does PR: #56529 prevent this issue from happening? thanks!

dhartunian · 2021-03-30T16:33:52Z

It appears that #56529 may fix the issue, I will reproduce by EOW and provide an update in this ticket.

Previously, nodes could reference nodes we didn't know about in their stats protos. Our frontend wouldn't know those nodeIDs because they weren't in the list of known nodes that were returned from the DB and would fail to render the network stats table. Now we filter out any nodeIDs that are unknown to us so that we don't render a broken table. This is meant to be a targeted fix for a specific problem. Future work will be done that addresses properly accounting for the various node liveness states that have been changed on the backend and have not been fully accounted for in the frontend. Resolves cockroachdb#59322 Release note (ui change): Fixes a bug where nodes would reference dead cluster nodes in their network stats and cause the network DB Console page to crash.

thtruo · 2022-02-18T21:24:13Z

I was able to reproduce this on 21.2.5 using roachprod

I created a 5 node cluster and started 3 nodes (n1, n2, n3). I can load the Network Latency page
I stopped n3. The Network Latency page correctly shows n3 is inactive
I started n4. Now the Network Latency page errors out and is no longer accessible. I see this error in the browser dev console

- Only when I restart n3 does the Network Latency page work again - For fun, I decided to stop n3 again. This time, I started n5 and again I see that the Network Latency page breaks.

Santamaura · 2022-03-14T17:32:32Z

Hi @thtruo I've been trying to reproduce this on the latest master build - v22.1.0-alpha.2-360-geb4fc9a9c7 but i'm struggling to reproduce it. Specifically on step 2 of your above procedure after n3 is inactive I can no longer view/load any pages (I guess since the majority of nodes are down?). Any ideas on what i'm doing wrong/how I can reproduce it like you did? cc @dhartunian

thtruo · 2022-03-22T15:30:59Z

Alex and I synced offline. So it looks like this issue is reproducible for v22.1 but it still affects 21.2.5+
Our next step is to ensure the change that prevents the page from breaking on 22.1 is backported to 21.2.x

Santamaura · 2022-03-22T21:02:15Z

So it seemed a bit fishy to me that it wasn't occurring on latest but was on 21.2.x since the page flow is basically the same between the versions. I spent awhile testing and I actually can reproduce this on latest as well as 21.2.x. It seems to only occur when a nodes liveness status is UNKNOWN or UNAVAILABLE (this happens for a short amount of time when a node is stopped and isn't considered DEAD yet). What was mentioned in #62917 is indeed still happening where the node is still being reported because it is not in a DEAD state yet but since it also isn't LIVE there is no returned latency value for that node. I'm thinking the best solution here might be to make the FE code more defensive in the <Latency /> component and when no latency value is passed for a node we can set the value to --.

76753: ui: create ui pages for active queries and transactions r=xinhaoz a=xinhaoz This commit creates the pages for active transactins and queries in db-console. Each page calls the /sessions API at an interval of 10 seconds to refresh the data. Release note (ui change): Users can now see actively running queries and transactions in the SQL Activity tab. The transactions and statements tabs in SQL activity now have a menu to show either active or historial transactions and statements data. ------------------------- https://www.loom.com/share/964ee2624f4e41d0b6a88d5a280ce6a3 78292: [CRDB-3303] ui: handle latency not defined on network page r=Santamaura a=Santamaura Previously when a cluster with multiple nodes had a node stopped and then another node quickly started, the network page would crash. This was due to the node object holding the latency value being undefined and the ui trying to read this key when that was undefined. This occurs when a node is in an `UNAVAILABLE` state. This patch resolves the issue by being more defensive on the front end by safely attempting to access latency and if it is undefined, set the value to 0. The existing code is able to handle this case afterward and will eventually set the user friendly latency status to `--`. Resolves: #59322 Release note (ui change): Fixes a bug where a node in the `UNAVAILABLE` state will not have latency defined and cause the network page to crash. Overview list page & network page when node status is `UNAVAILABLE`: ![Screen Shot 2022-03-22 at 5 14 47 PM](https://user-images.githubusercontent.com/17861665/159579337-a48fafb1-e8fd-4c47-a07b-df93de532f0a.png) ![Screen Shot 2022-03-22 at 5 14 59 PM](https://user-images.githubusercontent.com/17861665/159579355-dc8142c9-703b-4425-9890-aa702e354fb4.png) 78459: sql: add sharded bucket count to crdb_internal.table_indexes r=chengxiong-ruan a=chengxiong-ruan Fixes #73057 Release justification: lost risk and beneficial index info. Release note (sql change): bucket count of hash sharded index is now being shown from the crdb_internal.table_indexes table. 78605: CODEOWNERS: set test-eng as roachprod codeowners r=srosenberg a=jlinder The test engineering team took ownership of roachprod. This reflects that change in the CODEOWNERS file. Release note: None Co-authored-by: Xin Hao Zhang <[email protected]> Co-authored-by: Santamaura <[email protected]> Co-authored-by: Chengxiong Ruan <[email protected]> Co-authored-by: James H. Linder <[email protected]>

Previously when a cluster with multiple nodes had a node stopped and then another node quickly started, the network page would crash. This was due to the node object holding the latency value being undefined and the ui trying to read this key when that was undefined. This occurs when a node is in an `UNAVAILABLE` state. This patch resolves the issue by being more defensive on the front end by safely attempting to access latency and if it is undefined, set the value to 0. The existing code is able to handle this case afterward and will eventually set the user friendly latency status to `--`. Resolves: #59322 Release note (ui change): Fixes a bug where a node in the `UNAVAILABLE` state will not have latency defined and cause the network page to crash.

Previously when a cluster with multiple nodes had a node stopped and then another node quickly started, the network page would crash. This was due to the node object holding the latency value being undefined and the ui trying to read this key when that was undefined. This occurs when a node is in an `UNAVAILABLE` state. This patch resolves the issue by being more defensive on the front end by safely attempting to access latency and if it is undefined, set the value to 0. The existing code is able to handle this case afterward and will eventually set the user friendly latency status to `--`. Resolves: cockroachdb#59322 Release note (ui change): Fixes a bug where a node in the `UNAVAILABLE` state will not have latency defined and cause the network page to crash.

dhartunian added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jan 22, 2021

dhartunian self-assigned this Jan 22, 2021

florence-crl mentioned this issue Jan 25, 2021

kv: clear network stats of a cluster's nodes #59400

Open

dhartunian mentioned this issue Mar 5, 2021

ui: node reconnection events confused with "dead" node events #61562

Closed

dhartunian mentioned this issue Mar 31, 2021

console: filter out stats referencing dead nodes #62917

Closed

jlinder added the T-observability label Jun 16, 2021

elinorgarcia removed the T-observability label Jun 17, 2021

dhartunian removed their assignment Oct 21, 2021

thtruo added the T-kv-observability label Feb 18, 2022

exalate-issue-sync bot assigned Santamaura Mar 9, 2022

Santamaura mentioned this issue Mar 22, 2022

[CRDB-3303] ui: handle latency not defined on network page #78292

Merged

craig bot closed this as completed in 372993d Mar 28, 2022

blathers-crl bot mentioned this issue Mar 28, 2022

release-21.2: [CRDB-3303] ui: handle latency not defined on network page #78627

Merged

blathers-crl bot mentioned this issue Mar 28, 2022

release-22.1: [CRDB-3303] ui: handle latency not defined on network page #78628

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

stats referencing dead nodes create unrecoverable failure in db console network page #59322

stats referencing dead nodes create unrecoverable failure in db console network page #59322

dhartunian commented Jan 22, 2021 •

edited by cockroach-jira-scripts

Loading

dhartunian commented Feb 1, 2021

florence-crl commented Feb 24, 2021

dhartunian commented Feb 26, 2021

florence-crl commented Mar 23, 2021

dhartunian commented Mar 30, 2021

thtruo commented Feb 18, 2022

Santamaura commented Mar 14, 2022

thtruo commented Mar 22, 2022

Santamaura commented Mar 22, 2022

stats referencing dead nodes create unrecoverable failure in db console network page #59322

stats referencing dead nodes create unrecoverable failure in db console network page #59322

Comments

dhartunian commented Jan 22, 2021 • edited by cockroach-jira-scripts Loading

dhartunian commented Feb 1, 2021

florence-crl commented Feb 24, 2021

dhartunian commented Feb 26, 2021

florence-crl commented Mar 23, 2021

dhartunian commented Mar 30, 2021

thtruo commented Feb 18, 2022

Santamaura commented Mar 14, 2022

thtruo commented Mar 22, 2022

Santamaura commented Mar 22, 2022

dhartunian commented Jan 22, 2021 •

edited by cockroach-jira-scripts

Loading