Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

console: filter out stats referencing dead nodes #62917

Conversation

dhartunian
Copy link
Collaborator

Previously, nodes could reference nodes we didn't know about in their
stats protos. Our frontend wouldn't know those nodeIDs because they
weren't in the list of known nodes that were returned from the DB and
would fail to render the network stats table.

Now we filter out any nodeIDs that are unknown to us so that we don't
render a broken table.

This is meant to be a targeted fix for a specific problem. Future work
will be done that addresses properly accounting for the various node
liveness states that have been changed on the backend and have not been
fully accounted for in the frontend.

Resolves #59322

Release note (ui change): Fixes a bug where nodes would reference dead
cluster nodes in their network stats and cause the network DB Console
page to crash.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Previously, nodes could reference nodes we didn't know about in their
stats protos. Our frontend wouldn't know those nodeIDs because they
weren't in the list of known nodes that were returned from the DB and
would fail to render the network stats table.

Now we filter out any nodeIDs that are unknown to us so that we don't
render a broken table.

This is meant to be a targeted fix for a specific problem. Future work
will be done that addresses properly accounting for the various node
liveness states that have been changed on the backend and have not been
fully accounted for in the frontend.

Resolves cockroachdb#59322

Release note (ui change): Fixes a bug where nodes would reference dead
cluster nodes in their network stats and cause the network DB Console
page to crash.
@dhartunian dhartunian force-pushed the 59322-filter-out-stats-for-dead-nodes branch from 7fafc19 to f3a324d Compare March 31, 2021 23:54
Copy link
Collaborator

@koorosh koorosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @nathanstilwell)

a discussion (no related file):
Current change looks good to me 👍🏽 but also Network page can crash when network activity items (in node status) don't contain "latency" record. Regarding to code on backend side it is possible case. See pkg/server/status/recorder.go Line 401 (getNetworkActivity method).

Please take a look on PR with suggested fix: dhartunian#2


Copy link
Contributor

@nathanstilwell nathanstilwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the "quick fix" nature of this PR, LGTM. I think Andrii's proposal is an interesting one, but can imagine a cleaner fix on the backend.

@koorosh
Copy link
Collaborator

koorosh commented Apr 16, 2021

Given the "quick fix" nature of this PR, LGTM. I think Andrii's proposal is an interesting one, but can imagine a cleaner fix on the backend.

Agree, this proposal isn't the best one, but probably a quick one 😃
On backend, latency field is represented as int64 type and cannot be set to nil, empty value for it is 0. It'll indicate zero latency between nodes even when there is no connection between nodes. So probably it'll require extra field to work around this issue.

@rafiss rafiss added dependencies Pull requests that update a dependency file and removed dependencies Pull requests that update a dependency file labels Apr 22, 2021
@rafiss rafiss modified the milestones: 20.2, 21.1 Apr 22, 2021
@dhartunian
Copy link
Collaborator Author

This PR is stale and I think #56529 addressed this on the server side so I'm closing it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stats referencing dead nodes create unrecoverable failure in db console network page
6 participants