ui: nodes_ui endpoint response should be reduced in size for very large clusters #129408

Closed
dhartunian opened this issue Aug 21, 2024 · 0 comments · Fixed by #135186 or #136005
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster P-3 Issues/test failures with no fix SLA T-observability

Comments

@dhartunian
Collaborator

dhartunian commented Aug 21, 2024

Today, the response from the nodes_ui endpoint that serves DB Console can balloon in size if the cluster has hundreds of nodes. We observed this on a cluster with hundreds of dead nodes, which made the payload grow to 37MiB.

This payload contains many pieces of information that are likely not immediately necessary for DB Console to function. We should either reduce the amount of information returned, or break it up into separate requests, so that the nodes list on the overview page loads quickly and the app becomes usable quickly when there are hundreds of nodes.

Screenshot 2024-08-21 at 10 38 23

Jira issue: CRDB-41527

@dhartunian dhartunian added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) P-3 Issues/test failures with no fix SLA T-observability labels Aug 21, 2024
@vidit-bhat vidit-bhat added the O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster label Aug 21, 2024
@craig craig bot closed this as completed in f927757 Nov 14, 2024
kyle-a-wong added a commit to kyle-a-wong/cockroach that referenced this issue Nov 14, 2024
The /_status/nodes_ui grpc API is used by many db-console
pages to show node-related information. This API
is extremely heavy and includes all node and node store
related metrics. To give some perspective, the current
drt-scale cluster's nodes_ui API call has a payload
size of ~8.4MB. As a result, this request takes
~2.75s to complete in db-console.

As a partial remedy, this patch filters down
the node and node store metrics to return only the metrics
needed by db-console.

This list of metrics was determined by the `MetricsConstants`
variable defined here:
https://github.com/cockroachdb/cockroach/blob/d5f328ea6f3efd8fbe631c97d59f7b74307d22f9/pkg/ui/workspaces/db-console/src/util/proto.ts#L55

This patch does not include any changes to the underlying
data in KV, meaning the full NodeStatus objects (which
include the metrics) are still fetched from KV and
unmarshalled. That being said, this patch reduces the
cost of marshalling the full metrics payload back into a
serverpb.NodeResponse protobuf, sending it over the
wire, and decoding it into json.

Testing locally with a demo tpcc cluster with 20 nodes,
the payload of nodes_ui on a new cluster was around
530kb before this change, and 8kb after.

Resolves: cockroachdb#129408
Epic: None
Release note (performance improvement): the /_status/nodes_ui
API no longer returns unnecessary metrics in its response. This
decreases the payload size of the API and improves the load time
of various db-console pages and components.
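
The filtering step described in the commit message above can be illustrated with a minimal Go sketch. This is not the code from the patch: `neededMetrics`, `filterMetrics`, and the metric names are hypothetical placeholders, and the real allowlist is derived from `MetricsConstants` in db-console's proto.ts. The sketch assumes metrics are held as a name-to-value map and shows only the idea of dropping every entry that is not on the allowlist before the response is built.

```go
package main

import (
	"fmt"
	"sort"
)

// neededMetrics is a hypothetical allowlist. The names are placeholders,
// not the actual list that db-console depends on.
var neededMetrics = map[string]struct{}{
	"example.needed.a": {},
	"example.needed.b": {},
}

// filterMetrics returns a new map containing only the entries of all
// whose names appear in allow. The input map is left unmodified.
func filterMetrics(all map[string]float64, allow map[string]struct{}) map[string]float64 {
	out := make(map[string]float64, len(allow))
	for name, value := range all {
		if _, ok := allow[name]; ok {
			out[name] = value
		}
	}
	return out
}

func main() {
	// A stand-in for the full set of metrics attached to one node status.
	all := map[string]float64{
		"example.needed.a": 100,
		"example.needed.b": 40,
		"example.unused.c": 1,
		"example.unused.d": 2,
	}

	kept := filterMetrics(all, neededMetrics)

	// Sort the names so the printed order is deterministic.
	names := make([]string, 0, len(kept))
	for name := range kept {
		names = append(names, name)
	}
	sort.Strings(names)

	fmt.Printf("metrics before: %d, after: %d, kept: %v\n", len(all), len(kept), names)
}
```

Because the filter runs after the full objects are read, a sketch like this only shrinks what is serialized and sent to the browser, which matches the commit message's note that the KV fetch and unmarshalling costs are unchanged.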
@exalate-issue-sync exalate-issue-sync bot reopened this Nov 18, 2024
kyle-a-wong added a commit to kyle-a-wong/cockroach that referenced this issue Nov 22, 2024
This reverts commit 35d00d5.

Fixes: cockroachdb#129408
Epic: none
Release note: none
craig bot pushed a commit that referenced this issue Nov 22, 2024
136005: server: reapply "server: decrease nodes_ui response size" r=kyle-a-wong a=kyle-a-wong

This reverts commit 35d00d5.

This commit was originally reverted because it broke the custom chart component. This component previously depended on the nodes_ui metrics to populate the list of queryable metrics from TSDB. This was fixed in #135705, so this commit can be reapplied.

Fixes: #129408
Epic: none
Release note: none

Co-authored-by: Kyle Wong <[email protected]>
@craig craig bot closed this as completed in 30cbc00 Nov 22, 2024
3 participants