server: raided disks can cause erroneous disk metrics #97867

Closed
kvoli opened this issue Mar 1, 2023 · 9 comments · Fixed by #104640
Labels
  • A-observability-inf
  • A-storage: Relating to our storage engine (Pebble) on-disk storage.
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs.
  • T-storage: Storage Team

Comments

kvoli (Collaborator) commented Mar 1, 2023

Describe the problem

When running CRDB on RAID-backed disks, the sys.host.disk.write.bytes, sys.host.disk.read.bytes, sys.host.disk.write.count, and sys.host.disk.read.count metrics can be over-reported.

To Reproduce

Reproduction on 8x GCE Local Disks using RAID-1

  1. Apply the following patch to roachprod.
diff --git a/pkg/roachprod/vm/gce/utils.go b/pkg/roachprod/vm/gce/utils.go
index af8c343e1eb..4e5cbbd1017 100644
--- a/pkg/roachprod/vm/gce/utils.go
+++ b/pkg/roachprod/vm/gce/utils.go
@@ -105,14 +105,14 @@ elif [ "${#disks[@]}" -eq "1" ] || [ -n "$use_multiple_disks" ]; then
   done
 else
   mountpoint="${mount_prefix}1"
-  echo "${#disks[@]} disks mounted, creating ${mountpoint} using RAID 0"
+  echo "${#disks[@]} disks mounted, creating ${mountpoint} using RAID 1"
   mkdir -p ${mountpoint}
 {{ if .Zfs }}
   zpool create -f $(basename $mountpoint) -m ${mountpoint} ${disks[@]}
   # NOTE: we don't need an /etc/fstab entry for ZFS. It will handle this itself.
 {{ else }}
   raiddisk="/dev/md0"
-  mdadm -q --create ${raiddisk} --level=0 --raid-devices=${#disks[@]} "${disks[@]}"
+  mdadm -q --create ${raiddisk} --level=1 --raid-devices=${#disks[@]} "${disks[@]}"
   mkfs.ext4 -q -F ${raiddisk}
   mount -o ${mount_opts} ${raiddisk} ${mountpoint}
   echo "${raiddisk} ${mountpoint} ext4 ${mount_opts} 1 1" | tee -a /etc/fstab
  2. Run this script:
export cluster=austen-test-raided-disks
roachprod create $cluster --gce-local-ssd-count 8 -n 2 --gce-machine-type=n2-standard-16 --os-volume-size=500
roachprod stage $cluster cockroach
roachprod start $cluster:1
roachprod grafana-start $cluster
roachprod run $cluster:2 -- "./cockroach workload run kv --drop --min-block-bytes=32000 --max-block-bytes=32000 {pgurl:1}"
  3. Observe disk write bandwidth by querying Grafana (in grafana.testeng) for:
# set node to be $cluster-0001
# crdb
rate(sys_host_disk_write_bytes{job="cockroachdb",cluster="$cluster",instance=~"$node"}[$rate_interval])
# node exporter
sum(rate(node_disk_written_bytes_total{cluster="$cluster",host=~"$node"}[$rate_interval]))

[screenshot]

Expected behavior

The above metrics are correct and reliable.

Environment:

  • CockroachDB version: v22.1.13; all versions are assumed to be affected.
  • Server OS: Ubuntu 18.04
  • Disks: 8 x GCE Local Instance Disk with RAID 1 configuration

Jira issue: CRDB-24931

kvoli added the C-bug and T-kv-observability labels Mar 1, 2023
tbg (Member) commented Mar 2, 2023

kvoli (Collaborator, Author) commented Mar 8, 2023

I wasn't able to reproduce this with 8x RAID 0 disks on GCE. The numbers match node_exporter. I didn't check iostat.

[screenshot]

Repro
export cluster=austen-test-raided-disks
roachprod create $cluster --gce-local-ssd-count 8 -n 2 --gce-machine-type=n2-standard-16
roachprod stage $cluster cockroach
roachprod start $cluster:1
roachprod grafana-start $cluster
roachprod run $cluster:2 -- "./cockroach workload run kv --drop --min-block-bytes=32000 --max-block-bytes=32000 {pgurl:1}"

kvoli (Collaborator, Author) commented Mar 8, 2023

No success reproducing this with 8x RAID 10 disks on GCE. I updated the roachprod script to use RAID 10 with the patch below (mixed confidence in its correctness) and ran the same repro as above.

diff --git a/pkg/roachprod/vm/gce/utils.go b/pkg/roachprod/vm/gce/utils.go
index af8c343e1eb..4e5cbbd1017 100644
--- a/pkg/roachprod/vm/gce/utils.go
+++ b/pkg/roachprod/vm/gce/utils.go
@@ -105,14 +105,14 @@ elif [ "${#disks[@]}" -eq "1" ] || [ -n "$use_multiple_disks" ]; then
   done
 else
   mountpoint="${mount_prefix}1"
-  echo "${#disks[@]} disks mounted, creating ${mountpoint} using RAID 0"
+  echo "${#disks[@]} disks mounted, creating ${mountpoint} using RAID 10"
   mkdir -p ${mountpoint}
 {{ if .Zfs }}
   zpool create -f $(basename $mountpoint) -m ${mountpoint} ${disks[@]}
   # NOTE: we don't need an /etc/fstab entry for ZFS. It will handle this itself.
 {{ else }}
   raiddisk="/dev/md0"
-  mdadm -q --create ${raiddisk} --level=0 --raid-devices=${#disks[@]} "${disks[@]}"
+  mdadm -q --create ${raiddisk} --level=10 --raid-devices=${#disks[@]} "${disks[@]}"
   mkfs.ext4 -q -F ${raiddisk}
   mount -o ${mount_opts} ${raiddisk} ${mountpoint}
   echo "${raiddisk} ${mountpoint} ext4 ${mount_opts} 1 1" | tee -a /etc/fstab

[screenshot]

kvoli (Collaborator, Author) commented Mar 8, 2023

Running with 8x RAID 1 disks on GCE, there is a discrepancy: CRDB reports 2x the actual disk bandwidth:

[screenshot]

kvoli added the O-support label and removed the A-kv-observability and T-kv-observability labels Mar 8, 2023
nicktrav added the T-storage label Mar 10, 2023
blathers-crl bot added the A-storage label Mar 10, 2023
nicktrav (Collaborator) commented:

I took a quick look at this, and it appears as though we are indeed susceptible to over-counting. We fetch the block device stats from /proc/diskstats and then sum over those values.

If a host has a physical device with partitions, or a logical device with multiple physical devices underneath it, we'll include everything.

For example:

$ cat /proc/diskstats
   7       0 loop0 1653 0 3898 3487 0 0 0 0 0 376 2060 0 0 0 0
   7       1 loop1 64 0 2116 95 0 0 0 0 0 84 40 0 0 0 0
   7       2 loop2 76 0 2168 129 0 0 0 0 0 108 56 0 0 0 0
   7       3 loop3 14922 0 30436 10131 0 0 0 0 0 1652 6272 0 0 0 0
   7       4 loop4 4 0 8 0 0 0 0 0 0 4 0 0 0 0 0
   7       5 loop5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   7       6 loop6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   7       7 loop7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   8       0 sda 621374 2156 77648362 455510 5680950 2200936 1930758330 15580577 0 1975256 5240940 0 0 0 0
   8       1 sda1 620273 2156 77627020 453839 5287817 2200928 1930758248 15561190 7 1975124 5240672 0 0 0 0
   8      14 sda14 139 0 1360 43 0 0 0 0 0 84 0 0 0 0 0
   8      15 sda15 904 0 17566 1589 2 0 2 1 0 124 268 0 0 0 0
 259       0 nvme0n1 415939 1797778 283282728 421471 789781 467500 462093488 2116528 0 1432468 2000588 6186 0 786167808 524963
 259       1 nvme0n2 222 0 7536 34 1202797 2267444 745350576 9075111 3 1810368 7732236 6186 0 786167808 534549
 259       2 nvme0n3 220 0 7520 40 1202786 2267455 745350576 8878143 3 1806584 7535092 6186 0 786167808 516163
 259       3 nvme0n4 220 0 7520 51 1202780 2267461 745350576 9143286 3 1807884 7785232 6186 0 786167808 528765
 259       4 nvme0n5 220 0 7520 92 1202775 2267465 745348784 9072208 4 1808120 7704140 6186 0 786167808 534315
 259       5 nvme0n6 240 0 7680 58 1202769 2267471 745348784 9329817 4 1806304 7941068 6186 0 786167808 550560
 259       6 nvme0n7 220 0 7520 37 1202768 2267473 745350576 8923372 3 1791864 7263712 6186 0 786167808 175940
 259       7 nvme0n8 224 0 7552 51 1202766 2267475 745350576 8953102 3 1793284 7300812 6186 0 786167808 183584
   9       0 md0 51 0 2112 0 0 0 0 0 0 0 0 0 0 0 0

The Prometheus node_exporter has logic that ignores certain block devices. On Linux, the default filter is:

	diskstatsDefaultIgnoredDevices = "^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\\d+n\\d+p)\\d+$"

It might be worth switching from the library we currently use to the Prometheus one (github.com/prometheus/procfs). However, that is still brittle due to the device names: anything not matched by the filter can be double-counted (e.g. md0 and its constituent devices).
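
For illustration, here is a minimal sketch (mine, not CockroachDB's actual collection code) that parses /proc/diskstats directly and applies node_exporter's default ignore pattern. Note that md0 and its nvme members all survive this filter, which is exactly the double-counting risk described above.

```go
// Sketch only: parse /proc/diskstats and aggregate I/O counters, skipping
// devices matched by node_exporter's default ignore pattern. Field positions
// follow the kernel's Documentation/admin-guide/iostats.rst; sector counts
// are in fixed 512-byte units.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
	"strings"
)

var ignored = regexp.MustCompile(`^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$`)

func main() {
	f, err := os.Open("/proc/diskstats")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var readOps, readBytes, writeOps, writeBytes uint64
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 10 {
			continue
		}
		name := fields[2]
		if ignored.MatchString(name) {
			continue // partitions, loop/ram devices, etc.
		}
		// A RAID device (md0) and its member disks both pass this filter,
		// so their traffic would be counted twice here.
		reads, _ := strconv.ParseUint(fields[3], 10, 64)
		readSectors, _ := strconv.ParseUint(fields[5], 10, 64)
		writes, _ := strconv.ParseUint(fields[7], 10, 64)
		writeSectors, _ := strconv.ParseUint(fields[9], 10, 64)
		readOps += reads
		readBytes += readSectors * 512
		writeOps += writes
		writeBytes += writeSectors * 512
	}
	fmt.Printf("reads=%d readBytes=%d writes=%d writeBytes=%d\n",
		readOps, readBytes, writeOps, writeBytes)
}
```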


Side note related to roachprod / RAID: there's no md in that filter list, so I would have expected the two timeseries (node exporter and CRDB) to be the same (i.e. both over-counting). Turns out that for whatever reason the RAID device is not being created here:

$ lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0     7:0    0  55.4M  1 loop /snap/core18/2066
loop1     7:1    0 230.1M  1 loop /snap/google-cloud-sdk/184
loop2     7:2    0  67.6M  1 loop /snap/lxd/20326
loop3     7:3    0  32.1M  1 loop /snap/snapd/12057
sda       8:0    0   500G  0 disk
├─sda1    8:1    0 499.9G  0 part /
├─sda14   8:14   0     4M  0 part
└─sda15   8:15   0   106M  0 part /boot/efi
nvme0n1 259:0    0   375G  0 disk
nvme0n2 259:1    0   375G  0 disk
nvme0n3 259:2    0   375G  0 disk
nvme0n4 259:3    0   375G  0 disk
nvme0n5 259:4    0   375G  0 disk
nvme0n6 259:5    0   375G  0 disk
nvme0n7 259:6    0   375G  0 disk
nvme0n8 259:7    0   375G  0 disk

After creating it manually and restarting the cluster, I see the results I expect (i.e. both series are the same, over-counting).

nicktrav (Collaborator) commented:

An uncontroversial first step here could be to switch out the library we're using to fetch these metrics. For example, if the prometheus/procfs library were used, the stats would line up with those exported by node_exporter (since it uses the same library).

Alternatively, we could have some mechanism for filtering the devices that are included in the metrics.
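
A rough sketch of that first option, using the prometheus/procfs blockdevice package (the function and field names here reflect my reading of that package, so treat them as assumptions rather than a drop-in change):

```go
// Sketch: fetch disk stats via prometheus/procfs instead of the current
// library, applying the same ignore pattern node_exporter uses.
package main

import (
	"fmt"
	"regexp"

	"github.com/prometheus/procfs/blockdevice"
)

var ignored = regexp.MustCompile(`^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$`)

func main() {
	fs, err := blockdevice.NewFS("/proc", "/sys")
	if err != nil {
		panic(err)
	}
	stats, err := fs.ProcDiskstats()
	if err != nil {
		panic(err)
	}
	var readBytes, writeBytes uint64
	for _, s := range stats {
		if ignored.MatchString(s.DeviceName) {
			continue
		}
		// Sector counts are in 512-byte units.
		readBytes += s.ReadSectors * 512
		writeBytes += s.WriteSectors * 512
	}
	fmt.Println("readBytes:", readBytes, "writeBytes:", writeBytes)
}
```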

I'm going to move this into our backlog for now.

I've also tagged Obs Inf, as this straddles the border between the two teams.

kvoli (Collaborator, Author) commented Mar 22, 2023

This also affects IOPS. I reran the repro and checked write IOPS:
[screenshot]

kvoli changed the title from "server: raided disks can cause erroneous bandwidth metrics" to "server: raided disks can cause erroneous disk metrics" Mar 22, 2023
itsbilal self-assigned this Jun 9, 2023
itsbilal added a commit to itsbilal/cockroach that referenced this issue Jun 9, 2023
Previously, we wouldn't exclude volumes from disk counters
that are likely to be double-counted such as RAID logical
volumes that are composed of physical volumes that are also
independently present in disk metrics. This change adds a
regex-based filter, overridable with env vars, that excludes
common double-counted volume patterns.

Fixes cockroachdb#97867.

Epic: none

Release note (bug fix): Avoids double-counting disk read/write bytes
in disk metrics if Cockroach observes volumes that are likely to
be duplicated in reported disk counters, such as RAID logical vs
physical volumes.
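
Roughly the shape of such a filter, for illustration only (the env var name and default pattern below are hypothetical and not necessarily what the merged change uses):

```go
// Sketch of a regex-based device filter whose pattern can be overridden via
// an environment variable. Names here are hypothetical.
package main

import (
	"fmt"
	"os"
	"regexp"
)

// Default: skip devices whose I/O is likely already accounted for elsewhere
// (e.g. partitions of a disk that is itself reported).
const defaultIgnoredDevices = `^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$`

func ignoredDevicesRE() *regexp.Regexp {
	pattern := defaultIgnoredDevices
	// Hypothetical override knob; the real env var name may differ.
	if v := os.Getenv("COCKROACH_DISK_METRICS_IGNORED_DEVICES"); v != "" {
		pattern = v
	}
	return regexp.MustCompile(pattern)
}

func main() {
	re := ignoredDevicesRE()
	for _, dev := range []string{"md0", "nvme0n1", "sda1", "loop3"} {
		fmt.Printf("%-8s ignored=%v\n", dev, re.MatchString(dev))
	}
}
```
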
craig bot pushed a commit that referenced this issue Jun 9, 2023
104640: server: don't double-count RAID volumes in disk metrics r=RaduBerinde a=itsbilal

Previously, we wouldn't exclude volumes from disk counters that are likely to be double-counted such as RAID logical volumes that are composed of physical volumes that are also independently present in disk metrics. This change adds a regex-based filter, overridable with env vars, that excludes common double-counted volume patterns.

Fixes #97867.

Epic: none

Release note (bug fix): Avoids double-counting disk read/write bytes in disk metrics if Cockroach observes volumes that are likely to be duplicated in reported disk counters, such as RAID logical vs physical volumes.

Co-authored-by: Bilal Akhtar <[email protected]>
craig bot closed this as completed in 479effa Jun 9, 2023
itsbilal (Contributor) commented:

Should the fix for this be backported to 23.1?

lunevalex (Collaborator) commented:

@itsbilal can it be backported to both 22.2 and 23.1? If so, please backport to both.

itsbilal added a commit to itsbilal/cockroach that referenced this issue Jun 14, 2023

itsbilal added a commit to itsbilal/cockroach that referenced this issue Jun 14, 2023
jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024