Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add IO metrics #26804

Merged
merged 8 commits into from
Aug 2, 2022
Merged

Add IO metrics #26804

merged 8 commits into from
Aug 2, 2022

Conversation

bw-solana
Copy link
Contributor

@bw-solana bw-solana commented Jul 27, 2022

Problem

Currently have very limited insight into storage device performance.

Summary of Changes

Start tracking aggregated metrics from /proc/diskstats to understand storage device performance and bottlenecks

@bw-solana
Copy link
Contributor Author

Here's an example of the new metrics in action:
[2022-07-27T23:48:13.969052420Z INFO solana_metrics::metrics] datapoint: disk-stats reads_completed=1546i reads_merged=0i sectors_read=395008i time_reading_ms=1468i writes_completed=3955i writes_merged=106i sectors_written=990008i time_writing_ms=23872i io_in_progress=13i time_io_ms=1000i time_io_weighted_ms=25341i discards_completed=0i discards_merged=0i sectors_discarded=0i time_discarding=0i flushes_completed=0i time_flushing=0i num_disks=1i

@bw-solana bw-solana marked this pull request as ready for review July 28, 2022 02:26
Copy link
Contributor

@jbiseda jbiseda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

core/src/system_monitor_service.rs Show resolved Hide resolved
@bw-solana bw-solana merged commit f3b760d into solana-labs:master Aug 2, 2022
@bw-solana bw-solana deleted the io_stats branch August 2, 2022 21:30
Copy link
Contributor

@t-nelson t-nelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider using sysfs (/sys/block/*/stat) instead of procfs to avoid atomicity issues.

let mut num_disks = 0;
for line in reader_diskstats.lines() {
let line = line.map_err(|e| e.to_string())?;
let values: Vec<_> = line.split_ascii_whitespace().collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collect() isn't strictly necessary. could next() our way to victory on the iterator instead

let values: Vec<_> = line.split_ascii_whitespace().collect();

if values.len() != 20 {
return Err("parse error, expected exactly 20 disk stat elements".to_string());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're probably screwed here, but would it make sense to log and continue instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking it would be better to not get any metrics rather than potentially report incorrect metrics. Added some tolerance for all 3 kernel variations that I'm aware of (11, 15, or 17 elements)

if values.len() != 20 {
return Err("parse error, expected exactly 20 disk stat elements".to_string());
}
if values[2].starts_with("loop") || values[1].ne("0") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will double-count at least dm-crypt volumes.

$ cat /proc/diskstats | grep dm
 253       0 dm-0 182486 0 7848082 48716 706874 0 22199432 10941388 0 610468 10990104 0 0 0 0 0 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking this is solved by using sysfs instead of procfs since we'll only look at block devices. Does that sound right to you?

}

num_disks += 1;
stats.reads_completed += values[3].parse::<u64>().map_err(|e| e.to_string())?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly, should we be totally bailing or just continueing

Copy link
Contributor Author

@bw-solana bw-solana Aug 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking it would be better to not get any metrics rather than potentially report incorrect metrics. Hopefully parsing succeeds the next go around and we just report delta for a longer time period

@bw-solana bw-solana restored the io_stats branch August 3, 2022 00:52
@bw-solana bw-solana mentioned this pull request Aug 3, 2022
@bw-solana
Copy link
Contributor Author

@t-nelson create #26898 to address your feedback

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants