Use memory map to speed up snapshot untar #24889
Conversation
runtime/src/snapshot_utils.rs
Outdated
let file = File::open(&snapshot_tar).unwrap();

// Map the file into immutable memory for better I/O performance
let mmap = unsafe { Mmap::map(&file) };
if this PR makes it out of draft, consider using an existing crate that wraps the unsafe. https://crates.io/crates/mmarinus seems like one such option
It looks like this crate depends on libc, which makes it Linux-only. It breaks the Windows build:
> cargo check
Checking mmarinus v0.4.0
error[E0433]: failed to resolve: could not find `unix` in `os`
--> C:\Users\hyi\.cargo\registry\src\github.com-1ecc6299db9ec823\mmarinus-0.4.0\src\builder.rs:11:14
|
11 | use std::os::unix::io::{AsRawFd, RawFd};
| ^^^^ could not find `unix` in `os`
Perhaps there's another crate, or we could consider upstreaming support for Windows.
Generally it's very undesirable to have raw `unsafe`s in the code like this
Alternatively fall back to the original code for Windows?
Yeah. Pushed a commit to update this.
Unfortunately, there is still one unsafe block, to convert the mmap into a slice. This is because the background thread in SharedBufferReader requires the reader to be 'static. Removing it would require a bit of rework on SharedBufferReader, which we can leave for the future?
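For readers following along, here is a minimal sketch (not the PR's actual code) of the cfg-gated approach discussed in this thread: mmap the archive on Unix and fall back to plain buffered I/O on Windows. It assumes the memmap2 crate, whose Mmap::map matches the unsafe call shown in the diff above; wrapping the Mmap in an owning Cursor is one way to hand SharedBufferReader a 'static reader without an unsafe slice conversion.

```rust
// Sketch only: open the snapshot archive as a `Read + Send + 'static` source,
// memory-mapped on Unix, plain buffered file I/O elsewhere (e.g. Windows).
use std::fs::File;
use std::io::{self, BufReader, Cursor, Read};
use std::path::Path;

#[cfg(unix)]
fn open_snapshot_reader(path: &Path) -> io::Result<Box<dyn Read + Send + 'static>> {
    use memmap2::Mmap;
    let file = File::open(path)?;
    // Safety: the archive is not expected to be modified while it is mapped.
    let mmap = unsafe { Mmap::map(&file)? };
    // Mmap implements AsRef<[u8]>, so a Cursor that owns the map is a 'static reader.
    Ok(Box::new(Cursor::new(mmap)))
}

#[cfg(not(unix))]
fn open_snapshot_reader(path: &Path) -> io::Result<Box<dyn Read + Send + 'static>> {
    // Fallback path: plain buffered reads from the file.
    Ok(Box::new(BufReader::new(File::open(path)?)))
}
```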
ledger-tool/src/main.rs
Outdated
@@ -963,6 +963,9 @@ fn main() {
.validator(is_slot)
.takes_value(true)
.help("Halt processing at the given slot");
let no_os_memory_stats_reporting_arg = Arg::with_name("no_os_memory_stats_reporting")
This is a separate change that adds the no_os_memory_stats_reporting CLI arg to ledger-tool. It is not related to the memory map.
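For context, the argument in the diff above is a clap 2.x builder whose chain is truncated in the view; the following is only a hypothetical completion (the long name and help text are guesses, not the PR's actual code):

```rust
use clap::Arg;

// Hypothetical completion of the truncated builder chain above (clap 2.x style);
// the flag name comes from the diff, the rest is illustrative.
let no_os_memory_stats_reporting_arg = Arg::with_name("no_os_memory_stats_reporting")
    .long("no-os-memory-stats-reporting")
    .takes_value(false)
    .help("Disable reporting of OS memory statistics");
```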
runtime/src/snapshot_utils.rs
Outdated
@@ -779,7 +779,7 @@ pub struct BankFromArchiveTimings {
}

// From testing, 4 seems to be a sweet spot for ranges of 60M-360M accounts and 16-64 cores. This may need to be tuned later.
update comment
what machine spec did you use to get these metrics?
How big does the slice become? Is it as big as the tar file? Will we run into physical memory limits with 1B/10B account snapshots?
GCE box: 48 cores, 250G RAM
Do you have an estimate of how big the snapshot will be for 1B/10B accounts?
Or perhaps we can switch to the old BufReader if the memory map fails?
It is compressed, but I imagine it to be basically linear. So if 20G: 170M accounts, then 200G: 1.7B accounts, 2T: 17B accounts. Other things will begin breaking.
I'm happy to have this change made. I would love the perf improvements. I load from snapshots all day long.
Yes, will try.
…On Wed, May 4, 2022 at 3:41 PM Jeff Washington (jwash) wrote:

> I'm happy to have this change made. I would love the perf improvements. I load from snapshots all day long.
>
> I am ignorant of how the slice code and the memory mapped file stuff will work in a mem env with very little virtual memory and slices mapped into memory. Does it page the file since it is file backed already and that doesn't count as vm in the same sense? So, will this just work?
>
> Can you write an app to hold all but say 15G of memory on a machine and leave it running, then try ledger-tool verify and see what happens when we try to untar a 20G snapshot?
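For reference, the kind of memory-holding app proposed above could be as simple as the following sketch (hypothetical, not part of the PR): it allocates the requested number of GiB, touches every page so the memory is actually resident, then sleeps while ledger-tool verify is run separately.

```rust
// Hypothetical helper for the experiment described above: pin down N GiB of RAM
// and keep it resident while `ledger-tool verify` is run in another terminal.
use std::{env, thread, time::Duration};

fn main() {
    let gib: usize = env::args().nth(1).and_then(|s| s.parse().ok()).unwrap_or(1);
    let len = gib * 1024 * 1024 * 1024;
    // Allocate and touch every page so the memory is resident, not just reserved.
    let mut hog = vec![0u8; len];
    for i in (0..len).step_by(4096) {
        hog[i] = 1;
    }
    println!("holding {} GiB; press Ctrl-C to release", gib);
    loop {
        thread::sleep(Duration::from_secs(60));
        // Re-touch occasionally so the pages are less likely to be swapped out.
        for i in (0..len).step_by(4096) {
            hog[i] = hog[i].wrapping_add(1);
        }
    }
}
```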
You mean physical, not virtual, right? The virtual limit is usually very high, like many TB on Linux, and you can adjust it higher if needed.
Well, this is in my ignorance. We configure our validators with very little vm (2G)? If we mmap the tar file, then map it into a slice, what happens when we have less physical memory than we need to both represent the mmapped slice AND whatever other dynamic memory the validator needs, like the accounts index? I would assume the OS would just page in the mmapped areas that are accessed from the file that backs it. I do not know if this counts in the vm limit or not. And is the slice 20G elements long? Hopefully this 'just works'. Basically, can I load an 80G snapshot on a 64G machine with 2G of virtual memory configured? If you know this will work, then I have no concerns when a 2B-account snapshot is loaded on a 128G machine.
It should work, for some definition of "work". mmap basically "swaps" out to the backing file if there's not enough RAM available. Whether this is going to have a negative effect on memory pressure for other components is the question. That is, I'm not sure if the in-memory region is treated as RES or more like VFS cache WRT eviction.
I could imagine the mmap succeeds, but it uses almost all of the available mem. Then we'd oom generating the index or something. I would like generating the index and untarring the append vecs to be happening concurrently - something @HaoranYi and I have been talking about. I'm just trying to think ahead and not lay a trap for ourselves that we hit later.
This summarizes my ignorance and concerns ;-) I also don't know. I think it also might 'work'. But I don't know enough low-level details of Linux memory management and mmapped files to 'know' with confidence. So I proposed an experiment that seemed pretty easy. Feel free to propose a different experiment!
Virtual memory can be specified. Usually it is set larger than the physical RAM. For example, with 256G RAM, you can set the vm to 512G. Then your application can map a file that is less than 512G, and it should be fine. When you access the data, if it is not in physical RAM, the OS will page the data in. But it may have bad performance because of thrashing of the available physical RAM.
I do not know the history of validators. But it appears that we set virtual memory at 2G on our machines. I do not know the history of this or the impact if we were to change it.
And @HaoranYi, if you are testing on a GCE machine, I have no idea what VM will be set at by default. I think you're testing on bare metal from the spreadsheet.
The 2G is the swap file, which is somewhat different from virtual memory and not really used when mmap'ing a file. Linux will page out to the mapped file in this case and keep recently used pages in physical memory in the page cache. The page cache itself could maybe be paged out to swap, but swap is not necessary to map a huge vm space, and the size of the swap has no bearing on how much you can map. I believe Linux keeps the page cache within the physical memory of the machine.
Yeah. The vm is set in the following file on the system. In my case, it is 700G.
That's map count, not byte size. It's the number of concurrent mmaps allowed.
Yeah. My current config is 700K. I think the default is only 64K.
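The sysctl file itself isn't named above; assuming it is /proc/sys/vm/max_map_count (which matches the "map count" description and the ~64K stock-Linux default of 65530), a quick programmatic check would look like:

```rust
use std::fs;

// Illustrative only: read the per-process mmap-count limit on Linux, assuming the
// setting discussed above is vm.max_map_count (stock default 65530, i.e. ~64K).
fn main() -> std::io::Result<()> {
    let raw = fs::read_to_string("/proc/sys/vm/max_map_count")?;
    println!("vm.max_map_count = {}", raw.trim());
    Ok(())
}
```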
I did a stress test on memory usage when the available physical memory is only …
Ok, so the mmap shouldn't fail due to low memory. Seems good to have confirmed that. It doesn't seem necessary then to fall back on the non-mem-map case. Right? When you ran this experiment, did we oom while generating the index?
No, it doesn't. I put a std::exit at the end of the load function. It hit that and terminated.
Yes, if I let it continue, it will oom when generating the index.
lgtm
This reverts commit 3367e44.
* Revert "Revert "Use memory map to speed up snapshot untar (solana-labs#24889)" (solana-labs#25174)". This reverts commit fc793de.
* not use mmarinus
* enable secondary build
* Revert "enable secondary build". This reverts commit 5aa43a9.
* macbuild
* Revert "macbuild". This reverts commit 0da9294.
Problem
#24798
Use memory map to speed up snapshot untar.
Summary of Changes
[Metrics screenshots: with mmap vs. master]
The comparison does show that read_time is reduced from 29s to 23s, but the total untar time doesn't differ much. The reason is that, although we are reading the data faster, we spend more time waiting for an available buffer: 21s vs 15s. In other words, the ~6s saved on reads is roughly offset by the ~6s of extra buffer wait.
Next, we are going to tune the chunk ratio.
So far, the best config is parallel_factor=8, buf_size=2G, chunk=50M.
Untar time is reduced from 44s to 32s.
Fixes #