Use memory map to speed up snapshot untar #24889

Merged: 5 commits into solana-labs:master on May 12, 2022

Conversation

@HaoranYi (Contributor) commented on May 2, 2022:

Problem

#24798

Use memory map to speed up snapshot untar.

Summary of Changes

  1. Use mmap to read snapshot archive files (see the sketch after this list).
  2. Retune the parallel factor and chunk size.
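
For context, a minimal sketch of the mmap read path (assuming the memmap2 crate; names are illustrative, not the exact code in this PR):

use std::fs::File;
use std::path::Path;

use memmap2::Mmap;

/// Map the snapshot archive read-only. The returned Mmap derefs to &[u8],
/// so the decompress/untar path can consume it like an in-memory buffer.
fn map_snapshot_archive(path: &Path) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Safety: the file must not be truncated or mutated while it is mapped.
    unsafe { Mmap::map(&file) }
}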

with mmap

reading entire decompressed file took: 44799890 us, bytes: 76654958592, read_us: 23072928, waiting_for_buffer_us: 21715846, largest fetch: 100000000, error: Ok(0)

master

reading entire decompressed file took: 44960804 us, bytes: 76654958592, read_us: 29821976, waiting_for_buffer_us: 15128175, largest fetch: 100000000, error: Ok(0)

The comparison does show that the read time is reduced from 29 seconds to 23 seconds, but the total untar time doesn't differ much. Although we are reading the data faster, we spend more time waiting for an available buffer, 21s vs 15s.

Next, we are going to tune the chunk ratio.

// tunable parameters:
// # bytes allocated and populated by reading ahead
const TOTAL_BUFFER_BUDGET_DEFAULT: usize = 2_000_000_000;
// data is read-ahead and saved in chunks of this many bytes
const CHUNK_SIZE_DEFAULT: usize = 100_000_000;
parallel_factor=8 buf_size=2G chunk=50M
reading entire decompressed file took: 32857423 us, bytes: 76654958592, read_us: 21869626, waiting_for_buffer_us: 10964636, largest fetch: 50000000, error: Ok(0)

parallel_factor=8 buf_size=2G chunk=25M
reading entire decompressed file took: 33978930 us, bytes: 76654958592, read_us: 26325740, waiting_for_buffer_us: 7609328, largest fetch: 25000000, error: Ok(0)

So far, the best config is parallel_factor=8 buf_size=2G chunk=50M.
Untar time is reduced from 44s to 32s.
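
For intuition about how these knobs interact, a back-of-the-envelope calculation (illustrative only; the exact meaning of parallel_factor is defined in the untar/SharedBuffer code, not shown here):

fn main() {
    // Values from the best configuration above.
    let total_buffer_budget: usize = 2_000_000_000; // 2G read-ahead budget
    let chunk_size: usize = 50_000_000; // 50M per chunk
    // The budget bounds how many read-ahead chunks can be buffered at once.
    let max_chunks_in_flight = total_buffer_budget / chunk_size;
    println!("up to {max_chunks_in_flight} chunks can be buffered"); // prints 40
}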

Fixes #

@HaoranYi changed the title from "fix typo" to "mmap" on May 2, 2022
let file = File::open(&snapshot_tar).unwrap();

// Map the file into immutable memory for better I/O performance
let mmap = unsafe { Mmap::map(&file) };
Contributor:
if this PR makes it out of draft, consider using an existing crate that wraps the unsafe. https://crates.io/crates/mmarinus seems like one such option

@HaoranYi (author) replied on May 3, 2022:

It looks like this crate depends on libc, which makes it Linux-only, not available for Windows. It breaks the Windows build.

> cargo check
    Checking mmarinus v0.4.0
error[E0433]: failed to resolve: could not find `unix` in `os`
  --> C:\Users\hyi\.cargo\registry\src\github.com-1ecc6299db9ec823\mmarinus-0.4.0\src\builder.rs:11:14
   |
11 | use std::os::unix::io::{AsRawFd, RawFd};
   |              ^^^^ could not find `unix` in `os`

Contributor:

Perhaps there's another crate, or we could consider upstreaming support for Windows.

Generally, it's very undesirable to have raw unsafe blocks in the code like this.

Contributor:

Alternatively fall back to the original code for Windows?
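
A rough shape of that fallback (a sketch only, using the memmap2 crate in the Unix branch for concreteness; the discussion above is about mmarinus, but the gating pattern is the same and the function name here is hypothetical):

use std::fs::File;
use std::io::{BufReader, Cursor, Read};
use std::path::Path;

#[cfg(unix)]
fn open_snapshot_reader(path: &Path) -> std::io::Result<Box<dyn Read + Send>> {
    // Map the archive and expose it as an in-memory byte stream.
    let file = File::open(path)?;
    let mmap = unsafe { memmap2::Mmap::map(&file)? };
    Ok(Box::new(Cursor::new(mmap)))
}

#[cfg(not(unix))]
fn open_snapshot_reader(path: &Path) -> std::io::Result<Box<dyn Read + Send>> {
    // Fall back to the original buffered-file reader on other platforms.
    Ok(Box::new(BufReader::new(File::open(path)?)))
}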

@HaoranYi (author) replied on May 3, 2022:

Yeah, pushed a commit to update this.
Unfortunately, there is still one unsafe block, which converts the mmap into a slice. This is because the background thread in SharedBufferReader requires the reader to be 'static. Removing it would require a bit of rework on SharedBufferReader, which we can leave for the future?
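
For context, a minimal sketch of the kind of lifetime extension being described (assuming a memmap2-style Mmap; illustrative only, and only sound if the mapping is kept alive for as long as the slice is used):

use memmap2::Mmap;

/// Treat the mapped bytes as a 'static slice so they can be handed to a
/// reader that must be 'static (e.g. one consumed on a background thread).
///
/// Safety: the caller must keep `mmap` alive (for example by leaking it or
/// storing it somewhere that outlives all readers) while the slice is in use.
unsafe fn mmap_as_static_slice(mmap: &Mmap) -> &'static [u8] {
    std::slice::from_raw_parts(mmap.as_ptr(), mmap.len())
}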

@HaoranYi changed the title from "mmap" to "Use memory map to speed up snapshot untar" on May 2, 2022
@HaoranYi marked this pull request as ready for review on May 3, 2022 03:30
@HaoranYi requested a review from jeffwashington on May 3, 2022 13:17
@@ -963,6 +963,9 @@ fn main() {
.validator(is_slot)
.takes_value(true)
.help("Halt processing at the given slot");
let no_os_memory_stats_reporting_arg = Arg::with_name("no_os_memory_stats_reporting")
@HaoranYi (author):

This is a separate change that adds the no_os_memory_stats_reporting CLI arg to ledger-tool. It is not related to the memory map.

@@ -779,7 +779,7 @@ pub struct BankFromArchiveTimings {
}

// From testing, 4 seems to be a sweet spot for ranges of 60M-360M accounts and 16-64 cores. This may need to be tuned later.
Contributor:

update comment

@jeffwashington (Contributor):

what machine spec did you use to get these metrics?

@jeffwashington (Contributor):

How big does the slice become? Is it as big as the tar file? Will we run into physical memory limits with 1B/10B account snapshots?

@HaoranYi (author) commented May 3, 2022:

what machine spec did you use to get these metrics?

GCE box: 48 cores, 250G RAM.

@HaoranYi (author) commented May 3, 2022:

How big does the slice become? Is it as big as the tar file? Will we run into physical memory limits with 1B/10B account snapshots?
In this test run, it is about 76G.
Yeah. As long as we have enough virtual memory, it will be fine. Otherwise, it will report an error and exit:

    let mmap = Map::load(&snapshot_tar, Private, perms::Read).unwrap_or_else(|e| {
        error!(
            "Failed to map the snapshot file: {} {}.\n\
             Please increase the virtual memory on the system.",
            snapshot_tar.as_ref().display(),
            e,
        );
        std::process::exit(1);
    });

Do you have an estimate of how big the snapshot will be for 1B/10B accounts?

@HaoranYi (author) commented May 3, 2022:

Or perhaps we can switch to the old BufReader if the memory map fails?

@jeffwashington (Contributor):

Do you have an estimate of how big the snapshot will be for 1B/10B accounts?

It is compressed, but I imagine it to be basically linear. So if 20G corresponds to 170M accounts, then 200G corresponds to 1.7B accounts and 2T to 17B accounts. Other things will begin breaking.
I sometimes test on 64G machines. Perhaps that ship has sailed.

@jeffwashington (Contributor):

I'm happy to have this change made. I would love the perf improvements. I load from snapshots all day long.
I am ignorant of how the slice code and the memory-mapped file stuff will work in a memory environment with very little virtual memory and slices mapped into memory. Does it page the file, since it is file-backed already, and does that not count as VM in the same sense? So, will this just work?
Can you write an app to hold all but say 15G of memory on a machine and leave it running, then try ledger-tool verify and see what happens when we try to untar a 20G snapshot? (A sketch of such a memory-hogging helper follows below.)
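
A minimal sketch of such a memory-hogging helper (hypothetical; not part of this PR): it allocates the requested number of GiB, touches every page so the memory is actually resident, and then sleeps.

use std::{env, thread, time::Duration};

fn main() {
    // GiB to hold, e.g. `memory-hog 49` on a 64G machine to leave roughly 15G free.
    let gib: usize = env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(1);
    let len = gib * 1024 * 1024 * 1024;
    let mut buf = vec![0u8; len];
    // Touch one byte per 4 KiB page so the pages are actually resident.
    for i in (0..len).step_by(4096) {
        buf[i] = 1;
    }
    println!("holding {gib} GiB; Ctrl-C to release");
    loop {
        thread::sleep(Duration::from_secs(60));
    }
}

With that running, ledger-tool verify can be started in another terminal against a snapshot larger than the remaining free memory to observe the paging behavior.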

@HaoranYi (author) commented May 4, 2022 via email.

@sakridge (Contributor) commented May 4, 2022:

I'm happy to have this change made. I would love the perf improvements. I load from snapshots all day long.
I am ignorant of how the slice code and the memory mapped file stuff will work in a mem env with very little virtual memory and slices mapped into memory. Does it page the file since it is file backed already and that doesn't count as vm in the same sense? So, will this just work?
Can you write an app to hold all but say 15G of memory on a machine and leave it running, then try ledger-tool verify and see what happens when we try to untar a 20G snapshot?

You mean physical not virtual right? The virtual limit is usually very high like many TB in Linux and you can adjust higher if needed.

@jeffwashington (Contributor):

Well, this is in my ignorance. We configure our validators with very little VM (2G)? If we mmap the tar file and then map it into a slice, what happens when we have less physical memory than we need to both represent the mmapped slice AND whatever other dynamic memory the validator needs, like the accounts index? I would assume the OS would just page in the mmapped areas that are accessed from the file that backs them. I do not know if this counts against the VM limit or not. And is the slice 20G elements long? Hopefully this 'just works'.

Basically, can I load a 80G snapshot on a 64G machine with 2G of virtual memory configured?
I suggested using up ram in a separate app so that we can answer the question:
Can I load a 20G snapshot on a machine with 2G of virtual memory and all but ~15G of memory used by another application?

If you know this will work, then I have no concerns when a 2B account snapshot is loaded on a 128G machine.

@t-nelson (Contributor) commented May 4, 2022:

it should work, for some definition of "work". mmap basically "swaps" out to the backing file if there's not enough RAM available. Whether this is going to have a negative effect on memory pressure for other components is the question. That is, I'm not sure if the in-memory region is treated as RES or more like VFS cache WRT eviction

@jeffwashington (Contributor):

Or perhaps we can switch to the old BufReader if the memory map fails?

I could imagine the mmap succeeds, but it uses almost all of the available mem. Then, we'd oom generating the index or something. I would like generating the index and untarring the append vecs to be happening concurrently - something @HaoranYi and I have been talking about. I'm just trying to think ahead and not lay a trap for ourselves that we hit later.

@jeffwashington (Contributor):

it should work, for some definition of "work". mmap basically "swaps" out to the backing file if there's not enough RAM available. Whether this is going to have a negative effect on memory pressure for other components is the question. That is, I'm not sure if the in-memory region is treated as RES or more like VFS cache WRT eviction

This summarizes my ignorance and concerns ;-) I also don't know. I think it also might 'work'. But I don't know enough low level details of linux memory management and mmapped files to 'know' with confidence. So, I proposed an experiment that seemed pretty easy. Feel free to propose a different experiment!

@HaoranYi (author) commented May 4, 2022:

Virtual memory can be specified; usually it is set larger than the physical RAM. For example, with 256G of RAM, you can set the VM to 512G. Then your application can map a file that is smaller than 512G and it should be fine. When you access the data, if it is not in physical RAM, the OS will page the data into memory. But it may perform badly because of thrashing of the available physical RAM.

@jeffwashington (Contributor):

Virtual memory can be specified; usually it is set larger than the physical RAM. For example, with 256G of RAM, you can set the VM to 512G. Then your application can map a file that is smaller than 512G and it should be fine. When you access the data, if it is not in physical RAM, the OS will page the data into memory. But it may perform badly because of thrashing of the available physical RAM.

I do not know the history of validators. But, it appears that we set virtual memory at 2G on our machines. I do not know the history of this or the impact if we were to change it.

@jeffwashington (Contributor):

And @HaoranYi, if you are testing on a GCE machine, I have no idea what the VM will be set to by default. I think you're testing on bare metal, from the spreadsheet.

@sakridge (Contributor) commented May 4, 2022:

The 2G is the swap file, which is somewhat different from virtual memory and not really used when mmap'ing a file. Linux will page out to the mapped file in this case and keep recently used pages in physical memory in the page cache. The page cache itself could maybe be paged out to swap, but swap is not necessary to map a huge VM space, and the size of the swap has no bearing on how much you can map. I believe Linux keeps the page cache within the physical memory of the machine.

@HaoranYi (author) commented May 4, 2022:

yeah. The vm is set in the following file on the system. In my case, it is 700G.

haoran_yi_solana_com@hyi-dev2:~/git/hy_solana$ cat /etc/sysctl.conf | grep vm.max_map_count
vm.max_map_count=700000

@t-nelson (Contributor) commented May 4, 2022:

yeah. The vm is set in the following file on the system. In my case, it is 700G.

haoran_yi_solana_com@hyi-dev2:~/git/hy_solana$ cat /etc/sysctl.conf | grep vm.max_map_count
vm.max_map_count=700000

That's the map count, not a byte size. It's the number of concurrent mmaps allowed.

@HaoranYi (author) commented May 5, 2022:

That's the map count, not a byte size. It's the number of concurrent mmaps allowed.

Yeah. My current config is 700K. I think the default is only 64K.

@HaoranYi (author) commented May 11, 2022:

@jeffwashington

I did a stress test on memory usage when the available physical memory is only 20G. On a machine with 378G of memory, I had one process consume 360G. The validator was able to memory-map an 84G snapshot file. The actual physical memory increase during the memory map was about 15G.

@jeffwashington (Contributor):

Ok, so the mmap shouldn't fail due to low memory. Seems good to have confirmed that. It doesn't seem necessary then to fall back on the non-mem map case. Right? When you ran this experiment, did we oom while generating the index?

@HaoranYi (author) commented May 11, 2022 via email.

@jeffwashington self-requested a review on May 12, 2022 16:43
@jeffwashington (Contributor) left a comment:

lgtm

@HaoranYi merged commit 3367e44 into solana-labs:master on May 12, 2022
@HaoranYi deleted the mmap branch on May 12, 2022 18:35
jeffwashington added a commit that referenced this pull request on May 12, 2022
jeffwashington added a commit that referenced this pull request on May 12, 2022
@HaoranYi restored the mmap branch on May 12, 2022 19:50
@HaoranYi deleted the mmap branch on May 12, 2022 19:52
HaoranYi added a commit to HaoranYi/solana that referenced this pull request on May 12, 2022
HaoranYi added a commit that referenced this pull request on May 16, 2022
* Revert "Revert "Use memory map to speed up snapshot untar (#24889)" (#25174)"

This reverts commit fc793de.

* not use mmarinus

* enable secondary build

* Revert "enable secondary build"

This reverts commit 5aa43a9.

* macbuild

* Revert "macbuild"

This reverts commit 0da9294.
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request May 16, 2022
* Revert "Revert "Use memory map to speed up snapshot untar (solana-labs#24889)" (solana-labs#25174)"

This reverts commit fc793de.

* not use mmarinus

* enable secondary build

* Revert "enable secondary build"

This reverts commit 5aa43a9.

* macbuild

* Revert "macbuild"

This reverts commit 0da9294.