Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats referencing dead nodes create unrecoverable failure in db console network page #59322

Closed
dhartunian opened this issue Jan 22, 2021 · 9 comments · Fixed by #78292
Closed
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@dhartunian
Copy link
Collaborator

dhartunian commented Jan 22, 2021

Describe the problem

This was a customer issue where their node stats contained references to NodeIDs which were marked as "DEAD" in the liveness struct. However the code still attempted to retrieve stats for them since LIVE nodes had retained some network stats for the missing nodes. When the code attempts to render the network stats table, the uncaught failure causes a dead page.

Expected behavior
The network page should render without the DEAD nodes referenced.

Jira issue: CRDB-3303

@dhartunian dhartunian added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jan 22, 2021
@dhartunian dhartunian self-assigned this Jan 22, 2021
@dhartunian
Copy link
Collaborator Author

This PR may be helpful here: #56529

@florence-crl
Copy link

Hi @dhartunian, do you have an estimate when this will be fixed? thanks!

@dhartunian
Copy link
Collaborator Author

@florence-crl not at the moment. let us know if this is critical for next release or for a backport and we can re-prioritize

cc @thtruo

@florence-crl
Copy link

Hi @dhartunian, for clarification, does PR: #56529 prevent this issue from happening? thanks!

@dhartunian
Copy link
Collaborator Author

It appears that #56529 may fix the issue, I will reproduce by EOW and provide an update in this ticket.

dhartunian added a commit to dhartunian/cockroach that referenced this issue Mar 31, 2021
Previously, nodes could reference nodes we didn't know about in their
stats protos. Our frontend wouldn't know those nodeIDs because they
weren't in the list of known nodes that were returned from the DB and
would fail to render the network stats table.

Now we filter out any nodeIDs that are unknown to us so that we don't
render a broken table.

This is meant to be a targeted fix for a specific problem. Future work
will be done that addresses properly accounting for the various node
liveness states that have been changed on the backend and have not been
fully accounted for in the frontend.

Resolves cockroachdb#59322

Release note (ui change): Fixes a bug where nodes would reference dead
cluster nodes in their network stats and cause the network DB Console
page to crash.
@dhartunian dhartunian removed their assignment Oct 21, 2021
@thtruo
Copy link
Contributor

thtruo commented Feb 18, 2022

I was able to reproduce this on 21.2.5 using roachprod

  • I created a 5 node cluster and started 3 nodes (n1, n2, n3). I can load the Network Latency page
  • I stopped n3. The Network Latency page correctly shows n3 is inactive
  • I started n4. Now the Network Latency page errors out and is no longer accessible. I see this error in the browser dev console

Image 2022-02-18 at 4 12 00 PM

- Only when I restart n3 does the Network Latency page work again - For fun, I decided to stop n3 again. This time, I started n5 and again I see that the Network Latency page breaks.

@Santamaura
Copy link
Contributor

Hi @thtruo I've been trying to reproduce this on the latest master build - v22.1.0-alpha.2-360-geb4fc9a9c7 but i'm struggling to reproduce it. Specifically on step 2 of your above procedure after n3 is inactive I can no longer view/load any pages (I guess since the majority of nodes are down?). Any ideas on what i'm doing wrong/how I can reproduce it like you did? cc @dhartunian

@thtruo
Copy link
Contributor

thtruo commented Mar 22, 2022

Alex and I synced offline. So it looks like this issue is reproducible for v22.1 but it still affects 21.2.5+
Our next step is to ensure the change that prevents the page from breaking on 22.1 is backported to 21.2.x

@Santamaura
Copy link
Contributor

So it seemed a bit fishy to me that it wasn't occurring on latest but was on 21.2.x since the page flow is basically the same between the versions. I spent awhile testing and I actually can reproduce this on latest as well as 21.2.x. It seems to only occur when a nodes liveness status is UNKNOWN or UNAVAILABLE (this happens for a short amount of time when a node is stopped and isn't considered DEAD yet). What was mentioned in #62917 is indeed still happening where the node is still being reported because it is not in a DEAD state yet but since it also isn't LIVE there is no returned latency value for that node. I'm thinking the best solution here might be to make the FE code more defensive in the <Latency /> component and when no latency value is passed for a node we can set the value to --.

craig bot pushed a commit that referenced this issue Mar 28, 2022
76753: ui: create ui pages for active queries and transactions r=xinhaoz a=xinhaoz

This commit creates the pages for active transactins and
queries in db-console. Each page calls the /sessions API
at an interval of 10 seconds to refresh the data.

Release note (ui change): Users can now see actively running
queries and transactions in the SQL Activity tab. The transactions
and statements tabs in SQL activity now have a menu to show either
active or historial transactions and statements data.

-------------------------
https://www.loom.com/share/964ee2624f4e41d0b6a88d5a280ce6a3

78292: [CRDB-3303] ui: handle latency not defined on network page r=Santamaura a=Santamaura

Previously when a cluster with multiple nodes had a node stopped
and then another node quickly started, the network page would crash.
This was due to the node object holding the latency value being
undefined and the ui trying to read this key when that was undefined.
This occurs when a node is in an `UNAVAILABLE` state.
This patch resolves the issue by being more defensive on the front
end by safely attempting to access latency and if it is undefined,
set the value to 0. The existing code is able to handle this case
afterward and will eventually set the user friendly latency status
to `--`.
Resolves: #59322

Release note (ui change): Fixes a bug where a node in the
`UNAVAILABLE` state will not have latency defined and cause the
network page to crash.

Overview list page & network page when node status is `UNAVAILABLE`:

![Screen Shot 2022-03-22 at 5 14 47 PM](https://user-images.githubusercontent.com/17861665/159579337-a48fafb1-e8fd-4c47-a07b-df93de532f0a.png)

![Screen Shot 2022-03-22 at 5 14 59 PM](https://user-images.githubusercontent.com/17861665/159579355-dc8142c9-703b-4425-9890-aa702e354fb4.png)



78459: sql: add sharded bucket count to crdb_internal.table_indexes r=chengxiong-ruan a=chengxiong-ruan

Fixes #73057

Release justification: lost risk and beneficial index info.
Release note (sql change): bucket count of hash sharded index is
now being shown from the crdb_internal.table_indexes table.

78605: CODEOWNERS: set test-eng as roachprod codeowners r=srosenberg a=jlinder

The test engineering team took ownership of roachprod. This reflects
that change in the CODEOWNERS file.

Release note: None

Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Santamaura <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: James H. Linder <[email protected]>
@craig craig bot closed this as completed in 372993d Mar 28, 2022
blathers-crl bot pushed a commit that referenced this issue Mar 28, 2022
Previously when a cluster with multiple nodes had a node stopped
and then another node quickly started, the network page would crash.
This was due to the node object holding the latency value being
undefined and the ui trying to read this key when that was undefined.
This occurs when a node is in an `UNAVAILABLE` state.
This patch resolves the issue by being more defensive on the front
end by safely attempting to access latency and if it is undefined,
set the value to 0. The existing code is able to handle this case
afterward and will eventually set the user friendly latency status
to `--`.
Resolves: #59322

Release note (ui change): Fixes a bug where a node in the
`UNAVAILABLE` state will not have latency defined and cause the
network page to crash.
blathers-crl bot pushed a commit that referenced this issue Mar 28, 2022
Previously when a cluster with multiple nodes had a node stopped
and then another node quickly started, the network page would crash.
This was due to the node object holding the latency value being
undefined and the ui trying to read this key when that was undefined.
This occurs when a node is in an `UNAVAILABLE` state.
This patch resolves the issue by being more defensive on the front
end by safely attempting to access latency and if it is undefined,
set the value to 0. The existing code is able to handle this case
afterward and will eventually set the user friendly latency status
to `--`.
Resolves: #59322

Release note (ui change): Fixes a bug where a node in the
`UNAVAILABLE` state will not have latency defined and cause the
network page to crash.
fqazi pushed a commit to fqazi/cockroach that referenced this issue Apr 4, 2022
Previously when a cluster with multiple nodes had a node stopped
and then another node quickly started, the network page would crash.
This was due to the node object holding the latency value being
undefined and the ui trying to read this key when that was undefined.
This occurs when a node is in an `UNAVAILABLE` state.
This patch resolves the issue by being more defensive on the front
end by safely attempting to access latency and if it is undefined,
set the value to 0. The existing code is able to handle this case
afterward and will eventually set the user friendly latency status
to `--`.
Resolves: cockroachdb#59322

Release note (ui change): Fixes a bug where a node in the
`UNAVAILABLE` state will not have latency defined and cause the
network page to crash.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
6 participants