Decommissioning is the standard method for removing a bad node from the cluster and adding a new one in its place. We have seen people run into a lot of slowness when decommissioning large nodes (e.g., nodes storing terabytes of data). It is understood that decommission time is typically a function of node size, node count, and snapshot rates, but we don't have any benchmarking or recommendations for how long one should expect a decommission to take.
The ask here is threefold:
create a benchmark (i.e. something in roachperf) to measure the performance of decommissioning
use the benchmark to identify potential bottlenecks in the decommission process
create a framework/function to calculate the expected duration of a decommission given a set of inputs, e.g. in a resource-unconstrained cluster with 10 nodes, a 2 TB node size, and 256 MB/s snapshot rates, it should take X minutes to decommission a node (see the estimation sketch after this list)
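As a starting point, here is a minimal sketch in Go of what such an estimation function could look like. It assumes the simplest possible model: the decommissioning node's outbound snapshot rate is the bottleneck, so the duration is just data volume divided by snapshot rate. The names `DecommissionInputs` and `EstimateDecommissionTime` are hypothetical, not existing CockroachDB code, and a real framework would also need to account for range count, receiver-side snapshot rates, and concurrent rebalancing.

```go
package main

import (
	"fmt"
	"time"
)

// DecommissionInputs captures the parameters the issue calls out as the main
// drivers of decommission time. All names here are illustrative placeholders.
type DecommissionInputs struct {
	NodeCount       int     // nodes remaining to receive replicas (unused in this naive model)
	BytesPerNode    float64 // logical bytes stored on the decommissioning node
	SnapshotRateBps float64 // configured snapshot rate limit, in bytes/sec
}

// EstimateDecommissionTime is a back-of-the-envelope model: assuming the
// decommissioning node's outbound snapshot rate is the bottleneck (i.e. the
// rest of the cluster is resource-unconstrained), the expected time is the
// data volume divided by the snapshot rate.
func EstimateDecommissionTime(in DecommissionInputs) time.Duration {
	secs := in.BytesPerNode / in.SnapshotRateBps
	return time.Duration(secs * float64(time.Second))
}

func main() {
	// The example from the issue: 10 nodes, 2 TB per node, 256 MB/s snapshots.
	in := DecommissionInputs{
		NodeCount:       10,
		BytesPerNode:    2 << 40,   // 2 TiB
		SnapshotRateBps: 256 << 20, // 256 MiB/s
	}
	fmt.Printf("estimated decommission time: %s\n", EstimateDecommissionTime(in))
}
```

Under this naive model the example works out to roughly 8,200 seconds (about 2.3 hours); the benchmark would tell us how far reality deviates from that lower bound.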
When we build this out, we should start with at least three variants of the test, exercising different node counts and store counts. Here's a strawman proposal (sketched as test specs after the list):
a 4 node cluster (8 vCPUs per node, 1 TB per node)
a 32 node cluster (8 vCPUs per node, 1 TB per node)
a 32 node cluster with 8 stores per node (32 vCPUs per node, 4 TB per node)
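To make the variant matrix concrete, here is a hedged sketch of how these configurations might be expressed as test specs in Go. The `decommissionBenchSpec` type and its fields are illustrative, not the actual roachtest API.

```go
package main

import "fmt"

// decommissionBenchSpec describes one proposed benchmark variant. The type and
// field names are placeholders for whatever the real test registration uses.
type decommissionBenchSpec struct {
	nodes         int
	cpusPerNode   int
	storesPerNode int
	tbPerNode     int
}

// The three strawman variants proposed above.
var proposedVariants = []decommissionBenchSpec{
	{nodes: 4, cpusPerNode: 8, storesPerNode: 1, tbPerNode: 1},
	{nodes: 32, cpusPerNode: 8, storesPerNode: 1, tbPerNode: 1},
	{nodes: 32, cpusPerNode: 32, storesPerNode: 8, tbPerNode: 4},
}

func main() {
	for _, v := range proposedVariants {
		fmt.Printf("nodes=%d cpus/node=%d stores/node=%d TB/node=%d\n",
			v.nodes, v.cpusPerNode, v.storesPerNode, v.tbPerNode)
	}
}
```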
Jira issue: CRDB-13606
Epic: CRDB-14621