Add IO metrics #26804

Merged

bw-solana merged 8 commits into solana-labs:master from bw-solana:io_stats

Aug 2, 2022

Contributor

bw-solana commented Jul 27, 2022 (edited)

Problem

We currently have very limited insight into storage device performance.

Summary of Changes

Start tracking aggregated metrics from /proc/diskstats to understand storage device performance and bottlenecks.

bw-solana added 7 commits

July 26, 2022 20:12

Add IO metrics (d573a61)

Bug fix and formatting clean up (0f7ed29)

debug code and missing param (893a122)

Filter out empty lines (9c432a8)

Remove debug unit test (ebc1a83)

Use /proc/diskstats for metrics (090e1e3)

Update system_monitor_service.rs (c0ed47b)

Contributor Author

bw-solana commented Jul 27, 2022

Here's an example of the new metrics in action:
[2022-07-27T23:48:13.969052420Z INFO solana_metrics::metrics] datapoint: disk-stats reads_completed=1546i reads_merged=0i sectors_read=395008i time_reading_ms=1468i writes_completed=3955i writes_merged=106i sectors_written=990008i time_writing_ms=23872i io_in_progress=13i time_io_ms=1000i time_io_weighted_ms=25341i discards_completed=0i discards_merged=0i sectors_discarded=0i time_discarding=0i flushes_completed=0i time_flushing=0i num_disks=1i


Minor fixes and cleanup (e2fcff2)

bw-solana marked this pull request as ready for review

July 28, 2022 02:26

bw-solana requested review from t-nelson, brooksprumo and jbiseda

July 28, 2022 02:27

jbiseda approved these changes

Contributor

jbiseda left a comment

LGTM

core/src/system_monitor_service.rs

bw-solana merged commit f3b760d into solana-labs:master

bw-solana deleted the io_stats branch

August 2, 2022 21:30

t-nelson reviewed

Contributor

t-nelson left a comment

consider using sysfs (/sys/block/*/stat) instead of procfs to avoid atomicity issues.
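To illustrate the sysfs layout the comment above refers to, here is a minimal sketch that reads one `stat` file per device directory. It is not the code from this PR or its follow-up. The function name `read_block_stats` and the `root` parameter are assumptions made for illustration; on a real Linux host `root` would be `/sys/block`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads `<root>/<device>/stat` for every entry under `root` and returns
/// (device name, trimmed file contents) pairs, sorted by device name.
/// `root` is a parameter (an assumption of this sketch) so the function
/// can be exercised against a temporary directory instead of /sys/block.
fn read_block_stats(root: &Path) -> io::Result<Vec<(String, String)>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        // Each device directory exposes its counters in a one-line `stat` file.
        let contents = fs::read_to_string(entry.path().join("stat"))?;
        out.push((name, contents.trim().to_string()));
    }
    out.sort();
    Ok(out)
}
```

Each device is read from its own file, so a single read never mixes lines from several devices the way one pass over /proc/diskstats does.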

core/src/system_monitor_service.rs

+                  let mut num_disks = 0;
+                  for line in reader_diskstats.lines() {
+                      let line = line.map_err(|e| e.to_string())?;
+                      let values: Vec<_> = line.split_ascii_whitespace().collect();

Contributor

t-nelson Aug 2, 2022

collect() isn't strictly necessary. could next() our way to victory on the iterator instead
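A rough sketch of what walking the iterator with `next()` could look like, as suggested above. The helper name `parse_prefix` and the choice to stop after two counters are assumptions for illustration; this is not the code that was merged.

```rust
/// Hypothetical helper: pulls the device name and the first two counters
/// out of one /proc/diskstats line without collecting into a Vec.
fn parse_prefix(line: &str) -> Result<(String, u64, u64), String> {
    let mut fields = line.split_ascii_whitespace();
    // The first two fields are the major and minor device numbers.
    let _major = fields.next().ok_or("missing major number")?;
    let _minor = fields.next().ok_or("missing minor number")?;
    let name = fields.next().ok_or("missing device name")?.to_string();
    // Pull the next field off the iterator and parse it as a counter.
    let mut next_u64 = |what: &str| -> Result<u64, String> {
        fields
            .next()
            .ok_or_else(|| format!("missing {}", what))?
            .parse::<u64>()
            .map_err(|e| format!("{}: {}", what, e))
    };
    let reads_completed = next_u64("reads_completed")?;
    let reads_merged = next_u64("reads_merged")?;
    Ok((name, reads_completed, reads_merged))
}
```

The trade-off is that a strict element-count check, like the `values.len()` test in the diff, has to be done separately, for example by counting what remains in the iterator.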

core/src/system_monitor_service.rs

+                      let values: Vec<_> = line.split_ascii_whitespace().collect();
+                      if values.len() != 20 {
+                          return Err("parse error, expected exactly 20 disk stat elements".to_string());

Contributor

t-nelson Aug 2, 2022

we're probably screwed here, but would it make sense to log and continue instead?

Contributor Author

bw-solana Aug 3, 2022

I'm thinking it would be better to not get any metrics rather than potentially report incorrect metrics. Added some tolerance for all 3 kernel variations that I'm aware of (11, 15, or 17 elements)
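The "fail rather than report wrong numbers" policy with tolerance for the three element counts can be sketched as follows. This is an illustration only, assuming the input is a single line of counters with no device columns; the function name `parse_stat_counters` is hypothetical and the actual change lives in the follow-up PR.

```rust
/// Parses the whitespace-separated counters of one stat line.
/// Accepts only the three layouts mentioned above (11, 15, or 17 values)
/// and fails the whole parse otherwise, rather than returning partial data.
fn parse_stat_counters(line: &str) -> Result<Vec<u64>, String> {
    let counters = line
        .split_ascii_whitespace()
        .map(|v| v.parse::<u64>().map_err(|e| format!("{}: {}", v, e)))
        .collect::<Result<Vec<u64>, String>>()?;
    match counters.len() {
        11 | 15 | 17 => Ok(counters),
        n => Err(format!(
            "parse error, expected 11, 15, or 17 stat elements, got {}",
            n
        )),
    }
}
```

A caller that prefers the "log and continue" approach raised in the review could match on the `Err` and skip the device instead of propagating the error with `?`.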

core/src/system_monitor_service.rs

+                      if values.len() != 20 {
+                          return Err("parse error, expected exactly 20 disk stat elements".to_string());
+                      }
+                      if values[2].starts_with("loop") || values[1].ne("0") {

Contributor

t-nelson Aug 2, 2022

this will double-count at least dm-crypt volumes.

$ cat /proc/diskstats | grep dm
 253       0 dm-0 182486 0 7848082 48716 706874 0 22199432 10941388 0 610468 10990104 0 0 0 0 0 0

Contributor Author

bw-solana Aug 3, 2022

I'm thinking this is solved by using sysfs instead of procfs since we'll only look at block devices. Does that sound right to you?
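For comparison, one way to avoid the double count while still parsing /proc/diskstats would be to extend the filter from the diff with a name check for device-mapper volumes. This is a sketch under stated assumptions: the `dm-` prefix check and both function names are hypothetical, and this thread proposes moving to sysfs rather than this filter.

```rust
/// Mirrors the filter in the diff (skip loop devices and non-zero minors)
/// and additionally skips device-mapper volumes. The "dm-" check is an
/// assumption of this sketch, not part of the PR.
fn should_skip(minor: &str, name: &str) -> bool {
    name.starts_with("loop") || name.starts_with("dm-") || minor != "0"
}

/// Sums `reads_completed` (the 4th field) over /proc/diskstats-style text.
fn sum_reads_completed(diskstats: &str) -> Result<u64, String> {
    let mut total = 0u64;
    for line in diskstats.lines().filter(|l| !l.trim().is_empty()) {
        let mut fields = line.split_ascii_whitespace();
        let _major = fields.next().ok_or("missing major number")?;
        let minor = fields.next().ok_or("missing minor number")?;
        let name = fields.next().ok_or("missing device name")?;
        if should_skip(minor, name) {
            continue;
        }
        let reads = fields
            .next()
            .ok_or("missing reads_completed")?
            .parse::<u64>()
            .map_err(|e| e.to_string())?;
        total += reads;
    }
    Ok(total)
}
```

Note that the `dm-0` line quoted above has minor number 0, which is why the original `values[1].ne("0")` check alone lets it through.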

core/src/system_monitor_service.rs

+                      }
+                      num_disks += 1;
+                      stats.reads_completed += values[3].parse::<u64>().map_err(|e| e.to_string())?;

Contributor

t-nelson Aug 2, 2022

similarly, should we be totally bailing or just continuing?

Contributor Author

bw-solana Aug 3, 2022 (edited)

I'm thinking it would be better to not get any metrics rather than potentially report incorrect metrics. Hopefully parsing succeeds the next go around and we just report delta for a longer time period
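The behavior described above, where a failed sample is dropped and the next successful one reports a delta over a longer period, can be sketched like this. The `DiskCounters` and `DeltaTracker` types are hypothetical and track only two counters; they are not the types used in system_monitor_service.rs.

```rust
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct DiskCounters {
    reads_completed: u64,
    writes_completed: u64,
}

/// Remembers the last successfully parsed sample of cumulative counters.
struct DeltaTracker {
    last: Option<DiskCounters>,
}

impl DeltaTracker {
    fn new() -> Self {
        Self { last: None }
    }

    /// Returns the delta against the previous good sample, if there is one.
    /// A failed sample is ignored and the previous sample is kept, so the
    /// next successful call yields a delta covering the longer interval.
    fn update(&mut self, sample: Result<DiskCounters, String>) -> Option<DiskCounters> {
        let current = sample.ok()?;
        let delta = self.last.map(|prev| DiskCounters {
            reads_completed: current.reads_completed.saturating_sub(prev.reads_completed),
            writes_completed: current.writes_completed.saturating_sub(prev.writes_completed),
        });
        self.last = Some(current);
        delta
    }
}
```

`saturating_sub` is used here so that a counter going backwards (for instance after a device is removed) produces zero instead of an underflow; whether that is the right policy is a design choice left open by this sketch.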

bw-solana restored the io_stats branch

August 3, 2022 00:52

bw-solana mentioned this pull request

Io stats v2 #26898

Merged

Contributor Author

bw-solana commented Aug 3, 2022

@t-nelson created #26898 to address your feedback
