Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add stats to track number of open files and max open file limit of validator process #28945

Merged
merged 21 commits into from
Dec 6, 2022

Conversation

HaoranYi
Copy link
Contributor

@HaoranYi HaoranYi commented Nov 23, 2022

Problem

Recently, in kin-sim, validator crashes because of reaching the max number of open file limits.

It is useful to track the actual number of open files and max open files limits for the validator process to help debugging and trouble shooting.

Summary of Changes

  • add stats to track number of open file descriptors for validator process
  • log number of open files and max open file limits when account background service thread hit the limits and panic.

Fixes #

@HaoranYi HaoranYi force-pushed the lsof branch 3 times, most recently from 00f8c29 to e87ce35 Compare November 23, 2022 16:49
@HaoranYi HaoranYi changed the title Log number of open files and max open file limit when panic Add stats to track number of open files and max open file limit of validator process Nov 23, 2022
@HaoranYi
Copy link
Contributor Author

HaoranYi commented Nov 28, 2022

It seems that procfs.fd_count didn't tracks files open by mmap.
It only tracks fopen(), i.e. rocksdb files and sockets.

sol@kin-validator-am6-1:~/src/solana$ ls -l /proc/30565/fd | wc -l

To get the number of mmaped file, use the following command.

sol@kin-validator-am6-1:~/src/solana$ wc -l /proc/30565/maps
282755 /proc/30565/maps

Comment on lines 841 to 842
let curr_num_open_fd = proc.fd_count().unwrap();
let curr_mmap_count = proc.maps().unwrap().len();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you measured how long this takes to execute with 2M open fds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. It takes about 10.520629985s for 2M open fds. Maybe we can increase the report interval from 30s to 3 minutes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may very well miss an event on a three minute interval

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all seems pretty blunt. Beyond append vecs, which seems like it would be covered by #28958, what other sources of unbounded fd/mmap growth exist and perhaps we can directly instrument those as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the dumb wc -l /proc/PID/maps is like 600ms (850ms if the file is copied aside first) on a box with 505k maps. should be linear to 2M, ~2.5s. i'm guessing it's all the parsing that's of no use to us which is accounting for the rest of that time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ time cat maps | sed -e 's:.*/\(accounts\|accounts_index\)/.*:\1:;t;c\
other' | sort | uniq -c
  416730 accounts
  82965 accounts_index
  4980 other

Copy link
Contributor Author

@HaoranYi HaoranYi Nov 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sol@kin-validator-am6-1:~$ cat /proc/30565/maps| sed -e 's:.*/\(accounts\|accounts_index\)/.*:\1:;t;c\other' | sort | uniq -c
  87885 accounts
 191468 accounts_index
   3399 other

Copy link
Contributor Author

@HaoranYi HaoranYi Nov 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! I tried @t-nelson's command. Looks like #28958 tracks the append-vecs correctly. the other large chunk of mmaps are used for accounts_index, 191K. @jeffwashington probably, we should track the number of accounts_index mmap file too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the dumb wc -l /proc/PID/maps is like 600ms (850ms if the file is copied aside first) on a box with 505k maps. should be linear to 2M, ~2.5s. i'm guessing it's all the parsing that's of no use to us which is accounting for the rest of that time?

yeah. it takes 1.51s to count 2M open mmaps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! I tried @t-nelson's command. Looks like #28958 tracks the append-vecs correctly. the other large chunk of mmaps are used for accounts_index, 191K. @jeffwashington probably, we should track the number of accounts_index mmap file too?

Here is the change to track accounts index mmap files.
#28984

validator/src/cli.rs Outdated Show resolved Hide resolved
@@ -14,6 +14,7 @@ edition = "2021"
log = { version = "0.4.17" }
memmap2 = "0.5.3"
modular-bitfield = "0.11.2"
procfs = "0.14.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this crate handle the non-atomicity of multi-page reads in procfs?

fn get_mmap_count() -> Option<usize> {
let pid = std::process::id();
let map_path = format!("/proc/{}/maps", pid);
let output = std::process::Command::new("wc")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can avoid shelling out with something like https://doc.rust-lang.org/rust-by-example/std_misc/file/read_lines.html. just toss a .count() on after .lines()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may want to copy the maps file to a temporary first, to avoid being bitten by non-atomic reads in procfs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like readlines is slower than wc -l.

Time elapsed is: 3.10014196s

I guess /proc/pid/maps are in memory kernel data structure, which is faster to read than the file on disk.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess /proc/pid/maps are in memory kernel data structure, which is faster to read than the file on disk.

yeah it is. also means it can be updated from beneath you. which is where the non-atomicity issues come in that i keep crowing on about 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But since this is used to report a metric, a small difference, caused by not being atomic, would be fine, I think.

std::fs::copy(map_path, copy_path.as_os_str()).unwrap();

let file = std::fs::File::open(copy_path).ok()?;
Some(std::io::BufReader::new(file).lines().count())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how's timing in comparison?

Copy link
Contributor Author

@HaoranYi HaoranYi Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wc-l        1.5s
wc-l-copy   1.7s
readline    3.1s

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like wc-l without copy is the fastest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's unexpected 🤔

did you try .lines() w/o copy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. .lines() w/o copy takes 2.87s
seems that wc is better optimized than rust lines() fn.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird. i wonder if it's the utf-8 handling on the rust side. wc is probably dumb and assumes ascii 🤔

@HaoranYi
Copy link
Contributor Author

HaoranYi commented Dec 5, 2022

@t-nelson @mvines
I made a few more changes and no longer used procfs crate as it slow and not available on mac os.
Can you give a quick review again? If there are no more questions, I would like to merge.
Thanks!

@HaoranYi
Copy link
Contributor Author

HaoranYi commented Dec 6, 2022

I assume there are no more questions. Merge it now. Feel free to ping me if you have any more comments. Thanks!

@HaoranYi HaoranYi merged commit bbd49ac into solana-labs:master Dec 6, 2022
@HaoranYi HaoranYi deleted the lsof branch December 6, 2022 19:32
@t-nelson
Copy link
Contributor

t-nelson commented Dec 6, 2022

sorry, i forgot to update here. i'm really not keen on shelling out. we're going to find that one stupid distro that doesn't have wc in coreutils. i was experimenting yesterday with ways to do it by hand. i got line counting of a pre-existing copy of /proc/PID/maps with ~450k maps down to ~0.045s average. for some reason reading straight from procfs seems to only support reading up to one page at a time, which adds a bunch of looping overhead. today i'm planning to dig through the wc source to see what kind of magic they're up to

@mvines
Copy link
Member

mvines commented Dec 6, 2022

err, yuck. 👎🏼 to adding that system wc dependency.
But generally I don't feel we need this PR entirely, so I'd be game to see a full revert and abandon the idea entirely. My previous comment that went unaddressed:

This all seems pretty blunt. Beyond append vecs, which seems like it would be covered by #28958, what other sources of unbounded fd/mmap growth exist and perhaps we can directly instrument those as well.

@HaoranYi
Copy link
Contributor Author

HaoranYi commented Dec 6, 2022

err, yuck. 👎🏼 to adding that system wc dependency. But generally I don't feel we need this PR entirely, so I'd be game to see a full revert and abandon the idea entirely. My previous comment that went unaddressed:

This all seems pretty blunt. Beyond append vecs, which seems like it would be covered by #28958, what other sources of unbounded fd/mmap growth exist and perhaps we can directly instrument those as well.

The other big source of mmap is accounts index mmap files, which should be addressed in this pr
#28984

HaoranYi added a commit to HaoranYi/solana that referenced this pull request Dec 6, 2022
@HaoranYi HaoranYi mentioned this pull request Dec 6, 2022
@HaoranYi
Copy link
Contributor Author

HaoranYi commented Dec 6, 2022

ok. I will revert it.

@t-nelson
Copy link
Contributor

t-nelson commented Dec 6, 2022

The other big source of mmap is accounts index mmap files, which should be addressed in this pr

only map consumers besides accountsdb and accounts-index are anon mappings (which I don't think we make explicitly), the executable file and shared-objects

@HaoranYi
Copy link
Contributor Author

HaoranYi commented Dec 6, 2022

Another big consumer of open fd is from rocksdb. Maybe there is a way to get the open fd from rocksdb?

@t-nelson
Copy link
Contributor

t-nelson commented Dec 6, 2022

Another big consumer of open fd is from rocksdb. Maybe there is a way to get the open fd from rocksdb?

are those the anon maps? i didn't see anything with a file path under the ledger dir

@t-nelson
Copy link
Contributor

t-nelson commented Dec 6, 2022

Another big consumer of open fd is from rocksdb. Maybe there is a way to get the open fd from rocksdb?

are those the anon maps? i didn't see anything with a file path under the ledger dir

ah ha!

https://github.com/facebook/rocksdb/blob/23af6786a997d3592e8a68f1a8d9e0699a6eae36/port/mmap.cc#L81-L82

so we can count anon maps out of procfs, but afaik there's no way to link them to their caller

@HaoranYi
Copy link
Contributor Author

HaoranYi commented Dec 6, 2022

yes. count procfs lines will include them. But I doubt there is an API on rocksdb to report it.

@t-nelson
Copy link
Contributor

t-nelson commented Dec 6, 2022

random sampling

   3406 anon
 410375 /accounts/
  56116 /accounts_index/
     68 other

HaoranYi added a commit that referenced this pull request Dec 6, 2022
gnapoli23 pushed a commit to gnapoli23/solana that referenced this pull request Dec 16, 2022
nickfrosty pushed a commit to nickfrosty/solana that referenced this pull request Jan 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants