server: raided disks can cause erroneous disk metrics #97867
Comments
I wasn't able to reproduce this with 8x RAID 0 disks on GCE. The numbers match NodeExporter. I didn't check iostat. Repro
|
No success reproducing this with 8x RAID 10 disks on GCE. I updated the roachprod script to use RAID 10 with this patch (mixed confidence on correctness). The same repro as above.
diff --git a/pkg/roachprod/vm/gce/utils.go b/pkg/roachprod/vm/gce/utils.go
index af8c343e1eb..4e5cbbd1017 100644
--- a/pkg/roachprod/vm/gce/utils.go
+++ b/pkg/roachprod/vm/gce/utils.go
@@ -105,14 +105,14 @@ elif [ "${#disks[@]}" -eq "1" ] || [ -n "$use_multiple_disks" ]; then
done
else
mountpoint="${mount_prefix}1"
- echo "${#disks[@]} disks mounted, creating ${mountpoint} using RAID 0"
+ echo "${#disks[@]} disks mounted, creating ${mountpoint} using RAID 10"
mkdir -p ${mountpoint}
{{ if .Zfs }}
zpool create -f $(basename $mountpoint) -m ${mountpoint} ${disks[@]}
# NOTE: we don't need an /etc/fstab entry for ZFS. It will handle this itself.
{{ else }}
raiddisk="/dev/md0"
- mdadm -q --create ${raiddisk} --level=0 --raid-devices=${#disks[@]} "${disks[@]}"
+ mdadm -q --create ${raiddisk} --level=10 --raid-devices=${#disks[@]} "${disks[@]}"
mkfs.ext4 -q -F ${raiddisk}
mount -o ${mount_opts} ${raiddisk} ${mountpoint}
echo "${raiddisk} ${mountpoint} ext4 ${mount_opts} 1 1" | tee -a /etc/fstab
|
I took a quick look at this, and it appears as though we are indeed susceptible to over-counting. We fetch the block device stats from If a host has a physical device with partitions, or a logical device with multiple physical devices underneath it, we'll include everything. For example:
The Prometheus node exporter has some logic that will ignore certain block devices. On Linux, the filter is here:
diskstatsDefaultIgnoredDevices = "^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\\d+n\\d+p)\\d+$"
It might be worth switching out the library we use to use the Prometheus library instead (i.e. github.com/prometheus/procfs). However, it's still brittle due to the device names. Anything not in the filter could be double counted (i.e.
Side note related to roachprod / RAID: there's no
After creating it manually and restarting the cluster, I see the results I expect (i.e. both series are the same, over-counting). |
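To make the behavior of the node exporter pattern quoted above concrete, here is a minimal Go sketch that applies it to a few device names. The helper name `isIgnored` and the sample device names are assumptions chosen for illustration; only the pattern itself comes from this thread. It suggests that the pattern targets partitions and pseudo-devices, so a RAID logical device named like `md0` would not be ignored by it.

```go
package main

import (
	"fmt"
	"regexp"
)

// ignoredDevices is the node exporter pattern quoted in this thread,
// written as a raw string (so `\\d` in the original becomes `\d`).
var ignoredDevices = regexp.MustCompile(`^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$`)

// isIgnored is a hypothetical helper: it reports whether a block device
// name matches the ignore pattern.
func isIgnored(name string) bool {
	return ignoredDevices.MatchString(name)
}

func main() {
	// Sample names are illustrative, not taken from a real host.
	names := []string{"sda", "sda1", "nvme0n1", "nvme0n1p1", "loop0", "md0"}
	for _, name := range names {
		fmt.Printf("%-10s ignored=%v\n", name, isIgnored(name))
	}
}
```

Whole disks (`sda`, `nvme0n1`) and the `md0` name are kept, while partitions (`sda1`, `nvme0n1p1`) and `loop0` are filtered out, so summing the remaining devices would still count a RAID array together with its members.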
An uncontroversial first step here could be to switch out the library we're using to fetch these metrics. For example, if the Alternatively, we could have some mechanism for filtering the devices that are included in the metrics. I'm going to move this into our backlog for now. I've also tagged Obs Inf, as it straddles the border between the two teams. |
Previously, we wouldn't exclude volumes from disk counters that are likely to be double-counted, such as RAID logical volumes that are composed of physical volumes that are also independently present in disk metrics. This change adds a regex-based filter, overridable with env vars, that excludes common double-counted volume patterns.
Fixes cockroachdb#97867.
Epic: none
Release note (bug fix): Avoids double-counting disk read/write bytes in disk metrics if Cockroach observes volumes that are likely to be duplicated in reported disk counters, such as RAID logical vs physical volumes.
104640: server: don't double-count RAID volumes in disk metrics r=RaduBerinde a=itsbilal
Previously, we wouldn't exclude volumes from disk counters that are likely to be double-counted, such as RAID logical volumes that are composed of physical volumes that are also independently present in disk metrics. This change adds a regex-based filter, overridable with env vars, that excludes common double-counted volume patterns.
Fixes #97867.
Epic: none
Release note (bug fix): Avoids double-counting disk read/write bytes in disk metrics if Cockroach observes volumes that are likely to be duplicated in reported disk counters, such as RAID logical vs physical volumes.
Co-authored-by: Bilal Akhtar
Should the fix for this be backported to 23.1? |
@itsbilal can it be backported to both 22.2 and 23.1? If so, both please. |
Describe the problem
When running CRDB on RAIDed disks, the sys.host.disk.write.bytes, sys.host.disk.read.bytes, sys.host.disk.write.count, and sys.host.disk.read.count metrics can be over-reported.
To Reproduce
Reproduction on 8x GCE Local Disks using RAID-1
Expected behavior
The above metrics are correct and reliable.
Environment:
Jira issue: CRDB-24931