server: raided disks can cause erroneous disk metrics #97867

Closed
kvoli opened this issue Mar 1, 2023 · 9 comments · Fixed by #104640
Labels
  • A-observability-inf
  • A-storage: Relating to our storage engine (Pebble) on-disk storage.
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs.
  • T-storage: Storage Team

Comments

kvoli (Collaborator) commented Mar 1, 2023

Describe the problem

When running CRDB on RAID-backed disks, the sys.host.disk.write.bytes, sys.host.disk.read.bytes, sys.host.disk.write.count, and sys.host.disk.read.count metrics can be over-reported.

To Reproduce

Reproduction on 8x GCE Local Disks using RAID-1

  1. Apply the following patch to roachprod.
diff --git a/pkg/roachprod/vm/gce/utils.go b/pkg/roachprod/vm/gce/utils.go
index af8c343e1eb..4e5cbbd1017 100644
--- a/pkg/roachprod/vm/gce/utils.go
+++ b/pkg/roachprod/vm/gce/utils.go
@@ -105,14 +105,14 @@ elif [ "${#disks[@]}" -eq "1" ] || [ -n "$use_multiple_disks" ]; then
   done
 else
   mountpoint="${mount_prefix}1"
-  echo "${#disks[@]} disks mounted, creating ${mountpoint} using RAID 0"
+  echo "${#disks[@]} disks mounted, creating ${mountpoint} using RAID 1"
   mkdir -p ${mountpoint}
 {{ if .Zfs }}
   zpool create -f $(basename $mountpoint) -m ${mountpoint} ${disks[@]}
   # NOTE: we don't need an /etc/fstab entry for ZFS. It will handle this itself.
 {{ else }}
   raiddisk="/dev/md0"
-  mdadm -q --create ${raiddisk} --level=0 --raid-devices=${#disks[@]} "${disks[@]}"
+  mdadm -q --create ${raiddisk} --level=1 --raid-devices=${#disks[@]} "${disks[@]}"
   mkfs.ext4 -q -F ${raiddisk}
   mount -o ${mount_opts} ${raiddisk} ${mountpoint}
   echo "${raiddisk} ${mountpoint} ext4 ${mount_opts} 1 1" | tee -a /etc/fstab
  2. Run this script:
export cluster=austen-test-raided-disks
roachprod create $cluster --gce-local-ssd-count 8 -n 2 --gce-machine-type=n2-standard-16 --os-volume-size=500
roachprod stage $cluster cockroach
roachprod start $cluster:1
roachprod grafana-start $cluster
roachprod run $cluster:2 -- "./cockroach workload run kv --drop --min-block-bytes=32000 --max-block-bytes=32000 {pgurl:1}"
  3. Observe disk write bandwidth by querying Grafana (in grafana.testeng) for:
# set node to be $cluster-0001
# crdb
rate(sys_host_disk_write_bytes{job="cockroachdb",cluster="$cluster",instance=~"$node"}[$rate_interval])
# node exporter
sum(rate(node_disk_written_bytes_total{cluster="$cluster",host=~"$node"}[$rate_interval]))

[screenshot]

Expected behavior

The above metrics are correct and reliable.

Environment:

  • CockroachDB version: v22.1.13; all versions are assumed to be affected.
  • Server OS: Ubuntu 18.04
  • Disks: 8 x GCE Local Instance Disk with RAID 1 configuration

Jira issue: CRDB-24931

kvoli added the C-bug and T-kv-observability labels Mar 1, 2023
tbg (Member) commented Mar 2, 2023

kvoli (Collaborator, Author) commented Mar 8, 2023

I wasn't able to reproduce this with 8x RAID 0 disks on GCE. The numbers match node_exporter. I didn't check iostat.

[screenshot]

Repro
export cluster=austen-test-raided-disks
roachprod create $cluster --gce-local-ssd-count 8 -n 2 --gce-machine-type=n2-standard-16
roachprod stage $cluster cockroach
roachprod start $cluster:1
roachprod grafana-start $cluster
roachprod run $cluster:2 -- "./cockroach workload run kv --drop --min-block-bytes=32000 --max-block-bytes=32000 {pgurl:1}"

kvoli (Collaborator, Author) commented Mar 8, 2023

No success reproducing this with 8x RAID 10 disks on GCE. I updated the roachprod script to use RAID 10 with the patch below (mixed confidence in its correctness) and ran the same repro as above.

diff --git a/pkg/roachprod/vm/gce/utils.go b/pkg/roachprod/vm/gce/utils.go
index af8c343e1eb..4e5cbbd1017 100644
--- a/pkg/roachprod/vm/gce/utils.go
+++ b/pkg/roachprod/vm/gce/utils.go
@@ -105,14 +105,14 @@ elif [ "${#disks[@]}" -eq "1" ] || [ -n "$use_multiple_disks" ]; then
   done
 else
   mountpoint="${mount_prefix}1"
-  echo "${#disks[@]} disks mounted, creating ${mountpoint} using RAID 0"
+  echo "${#disks[@]} disks mounted, creating ${mountpoint} using RAID 10"
   mkdir -p ${mountpoint}
 {{ if .Zfs }}
   zpool create -f $(basename $mountpoint) -m ${mountpoint} ${disks[@]}
   # NOTE: we don't need an /etc/fstab entry for ZFS. It will handle this itself.
 {{ else }}
   raiddisk="/dev/md0"
-  mdadm -q --create ${raiddisk} --level=0 --raid-devices=${#disks[@]} "${disks[@]}"
+  mdadm -q --create ${raiddisk} --level=10 --raid-devices=${#disks[@]} "${disks[@]}"
   mkfs.ext4 -q -F ${raiddisk}
   mount -o ${mount_opts} ${raiddisk} ${mountpoint}
   echo "${raiddisk} ${mountpoint} ext4 ${mount_opts} 1 1" | tee -a /etc/fstab

[screenshot]

kvoli (Collaborator, Author) commented Mar 8, 2023

Running with 8x RAID 1 disks on GCE, there is a discrepancy: CRDB reports 2x the actual disk bandwidth:

[screenshot]

kvoli added the O-support label and removed the A-kv-observability and T-kv-observability labels Mar 8, 2023
nicktrav added the T-storage label Mar 10, 2023
blathers-crl bot added the A-storage label Mar 10, 2023
nicktrav (Collaborator) commented:

I took a quick look at this, and it appears as though we are indeed susceptible to over-counting. We fetch the block device stats from /proc/diskstats and then sum over those values.

If a host has a physical device with partitions, or a logical device with multiple physical devices underneath it, we'll include everything.

For example:

$ cat /proc/diskstats
   7       0 loop0 1653 0 3898 3487 0 0 0 0 0 376 2060 0 0 0 0
   7       1 loop1 64 0 2116 95 0 0 0 0 0 84 40 0 0 0 0
   7       2 loop2 76 0 2168 129 0 0 0 0 0 108 56 0 0 0 0
   7       3 loop3 14922 0 30436 10131 0 0 0 0 0 1652 6272 0 0 0 0
   7       4 loop4 4 0 8 0 0 0 0 0 0 4 0 0 0 0 0
   7       5 loop5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   7       6 loop6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   7       7 loop7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   8       0 sda 621374 2156 77648362 455510 5680950 2200936 1930758330 15580577 0 1975256 5240940 0 0 0 0
   8       1 sda1 620273 2156 77627020 453839 5287817 2200928 1930758248 15561190 7 1975124 5240672 0 0 0 0
   8      14 sda14 139 0 1360 43 0 0 0 0 0 84 0 0 0 0 0
   8      15 sda15 904 0 17566 1589 2 0 2 1 0 124 268 0 0 0 0
 259       0 nvme0n1 415939 1797778 283282728 421471 789781 467500 462093488 2116528 0 1432468 2000588 6186 0 786167808 524963
 259       1 nvme0n2 222 0 7536 34 1202797 2267444 745350576 9075111 3 1810368 7732236 6186 0 786167808 534549
 259       2 nvme0n3 220 0 7520 40 1202786 2267455 745350576 8878143 3 1806584 7535092 6186 0 786167808 516163
 259       3 nvme0n4 220 0 7520 51 1202780 2267461 745350576 9143286 3 1807884 7785232 6186 0 786167808 528765
 259       4 nvme0n5 220 0 7520 92 1202775 2267465 745348784 9072208 4 1808120 7704140 6186 0 786167808 534315
 259       5 nvme0n6 240 0 7680 58 1202769 2267471 745348784 9329817 4 1806304 7941068 6186 0 786167808 550560
 259       6 nvme0n7 220 0 7520 37 1202768 2267473 745350576 8923372 3 1791864 7263712 6186 0 786167808 175940
 259       7 nvme0n8 224 0 7552 51 1202766 2267475 745350576 8953102 3 1793284 7300812 6186 0 786167808 183584
   9       0 md0 51 0 2112 0 0 0 0 0 0 0 0 0 0 0 0

The Prometheus node_exporter has logic that ignores certain block devices. On Linux, the default filter is:

	diskstatsDefaultIgnoredDevices = "^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\\d+n\\d+p)\\d+$"

It might be worth switching from the library we currently use to the Prometheus one (github.com/prometheus/procfs). However, that is still brittle due to the device names: anything not matched by the filter can be double-counted (e.g. md0 and its constituent devices).
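
For illustration, here is a minimal sketch (mine, not CockroachDB's actual collection code) that parses /proc/diskstats directly and applies node_exporter's default ignore pattern. Note that md0 and its nvme members all survive this filter, which is exactly the double-counting risk described above.

```go
// Sketch only: parse /proc/diskstats and aggregate I/O counters, skipping
// devices matched by node_exporter's default ignore pattern. Field positions
// follow the kernel's Documentation/admin-guide/iostats.rst; sector counts
// are in fixed 512-byte units.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
	"strings"
)

var ignored = regexp.MustCompile(`^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$`)

func main() {
	f, err := os.Open("/proc/diskstats")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var readOps, readBytes, writeOps, writeBytes uint64
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 10 {
			continue
		}
		name := fields[2]
		if ignored.MatchString(name) {
			continue // partitions, loop/ram devices, etc.
		}
		// A RAID device (md0) and its member disks both pass this filter,
		// so their traffic would be counted twice here.
		reads, _ := strconv.ParseUint(fields[3], 10, 64)
		readSectors, _ := strconv.ParseUint(fields[5], 10, 64)
		writes, _ := strconv.ParseUint(fields[7], 10, 64)
		writeSectors, _ := strconv.ParseUint(fields[9], 10, 64)
		readOps += reads
		readBytes += readSectors * 512
		writeOps += writes
		writeBytes += writeSectors * 512
	}
	fmt.Printf("reads=%d readBytes=%d writes=%d writeBytes=%d\n",
		readOps, readBytes, writeOps, writeBytes)
}
```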


Side note related to roachprod / RAID: there's no md in that filter list, so I would have expected the two timeseries (node exporter and CRDB) to be the same (i.e. both over-counting). Turns out that for whatever reason the RAID device is not being created here:

$ lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0     7:0    0  55.4M  1 loop /snap/core18/2066
loop1     7:1    0 230.1M  1 loop /snap/google-cloud-sdk/184
loop2     7:2    0  67.6M  1 loop /snap/lxd/20326
loop3     7:3    0  32.1M  1 loop /snap/snapd/12057
sda       8:0    0   500G  0 disk
├─sda1    8:1    0 499.9G  0 part /
├─sda14   8:14   0     4M  0 part
└─sda15   8:15   0   106M  0 part /boot/efi
nvme0n1 259:0    0   375G  0 disk
nvme0n2 259:1    0   375G  0 disk
nvme0n3 259:2    0   375G  0 disk
nvme0n4 259:3    0   375G  0 disk
nvme0n5 259:4    0   375G  0 disk
nvme0n6 259:5    0   375G  0 disk
nvme0n7 259:6    0   375G  0 disk
nvme0n8 259:7    0   375G  0 disk

After creating it manually and restarting the cluster, I see the results I expect (i.e. both series are the same, over-counting).

nicktrav (Collaborator) commented:

An uncontroversial first step here could be to switch out the library we're using to fetch these metrics. For example, if the prometheus/procfs library were used, the stats would line up with those exported by node_exporter (since it uses the same library).

Alternatively, we could have some mechanism for filtering the devices that are included in the metrics.
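
A rough sketch of that first option, using the prometheus/procfs blockdevice package (the function and field names here reflect my reading of that package, so treat them as assumptions rather than a drop-in change):

```go
// Sketch: fetch disk stats via prometheus/procfs instead of the current
// library, applying the same ignore pattern node_exporter uses.
package main

import (
	"fmt"
	"regexp"

	"github.com/prometheus/procfs/blockdevice"
)

var ignored = regexp.MustCompile(`^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$`)

func main() {
	fs, err := blockdevice.NewFS("/proc", "/sys")
	if err != nil {
		panic(err)
	}
	stats, err := fs.ProcDiskstats()
	if err != nil {
		panic(err)
	}
	var readBytes, writeBytes uint64
	for _, s := range stats {
		if ignored.MatchString(s.DeviceName) {
			continue
		}
		// Sector counts are in 512-byte units.
		readBytes += s.ReadSectors * 512
		writeBytes += s.WriteSectors * 512
	}
	fmt.Println("readBytes:", readBytes, "writeBytes:", writeBytes)
}
```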

I'm going to move this into our backlog for now.

I've also tagged Obs Inf, as this straddles the border between the two teams.

kvoli (Collaborator, Author) commented Mar 22, 2023

This also affects IOPS. I reran the repro and checked write IOPS:
[screenshot]

kvoli changed the title from "server: raided disks can cause erroneous bandwidth metrics" to "server: raided disks can cause erroneous disk metrics" Mar 22, 2023
itsbilal self-assigned this Jun 9, 2023
itsbilal added a commit to itsbilal/cockroach that referenced this issue Jun 9, 2023
Previously, we wouldn't exclude volumes from disk counters
that are likely to be double-counted such as RAID logical
volumes that are composed of physical volumes that are also
independently present in disk metrics. This change adds a
regex-based filter, overridable with env vars, that excludes
common double-counted volume patterns.

Fixes cockroachdb#97867.

Epic: none

Release note (bug fix): Avoids double-counting disk read/write bytes
in disk metrics if Cockroach observes volumes that are likely to
be duplicated in reported disk counters, such as RAID logical vs
physical volumes.
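
Roughly the shape of such a filter, for illustration only (the env var name and default pattern below are hypothetical and not necessarily what the merged change uses):

```go
// Sketch of a regex-based device filter whose pattern can be overridden via
// an environment variable. Names here are hypothetical.
package main

import (
	"fmt"
	"os"
	"regexp"
)

// Default: skip devices whose I/O is likely already accounted for elsewhere
// (e.g. partitions of a disk that is itself reported).
const defaultIgnoredDevices = `^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$`

func ignoredDevicesRE() *regexp.Regexp {
	pattern := defaultIgnoredDevices
	// Hypothetical override knob; the real env var name may differ.
	if v := os.Getenv("COCKROACH_DISK_METRICS_IGNORED_DEVICES"); v != "" {
		pattern = v
	}
	return regexp.MustCompile(pattern)
}

func main() {
	re := ignoredDevicesRE()
	for _, dev := range []string{"md0", "nvme0n1", "sda1", "loop3"} {
		fmt.Printf("%-8s ignored=%v\n", dev, re.MatchString(dev))
	}
}
```
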
craig bot pushed a commit that referenced this issue Jun 9, 2023
104640: server: don't double-count RAID volumes in disk metrics r=RaduBerinde a=itsbilal

Previously, we wouldn't exclude volumes from disk counters that are likely to be double-counted such as RAID logical volumes that are composed of physical volumes that are also independently present in disk metrics. This change adds a regex-based filter, overridable with env vars, that excludes common double-counted volume patterns.

Fixes #97867.

Epic: none

Release note (bug fix): Avoids double-counting disk read/write bytes in disk metrics if Cockroach observes volumes that are likely to be duplicated in reported disk counters, such as RAID logical vs physical volumes.

Co-authored-by: Bilal Akhtar <[email protected]>
craig bot closed this as completed in 479effa Jun 9, 2023
itsbilal (Contributor) commented:

Should the fix for this be backported to 23.1?

lunevalex (Collaborator) commented:

@itsbilal can it be backported to both 22.2 and 23.1? If so, please backport to both.

itsbilal added a commit to itsbilal/cockroach that referenced this issue Jun 14, 2023

itsbilal added a commit to itsbilal/cockroach that referenced this issue Jun 14, 2023
jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024