Use memory map to speed up snapshot untar #24889

Merged: 5 commits into solana-labs:master on May 12, 2022

Conversation

@HaoranYi (Contributor) commented on May 2, 2022:

Problem

#24798

Use memory map to speed up snapshot untar.

Summary of Changes

  1. Use mmap to read snapshot archive files (see the sketch after this list).
  2. Retune the parallel factor and chunk size.
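
For context, a minimal sketch of the mmap read path (assuming the memmap2 crate; names are illustrative, not the exact code in this PR):

use std::fs::File;
use std::path::Path;

use memmap2::Mmap;

/// Map the snapshot archive read-only. The returned Mmap derefs to &[u8],
/// so the decompress/untar path can consume it like an in-memory buffer.
fn map_snapshot_archive(path: &Path) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Safety: the file must not be truncated or mutated while it is mapped.
    unsafe { Mmap::map(&file) }
}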

with mmap

reading entire decompressed file took: 44799890 us, bytes: 76654958592, read_us: 23072928, waiting_for_buffer_us: 21715846, largest fetch: 100000000, error: Ok(0)

master

reading entire decompressed file took: 44960804 us, bytes: 76654958592, read_us: 29821976, waiting_for_buffer_us: 15128175, largest fetch: 100000000, error: Ok(0)

The comparison does show that the read time is reduced from 29 seconds to 23 seconds, but the total untar time doesn't differ much. Although we are reading the data faster, we spend more time waiting for an available buffer, 21s vs 15s.

Next, we are going to tune the chunk ratio.

// tunable parameters:
// # bytes allocated and populated by reading ahead
const TOTAL_BUFFER_BUDGET_DEFAULT: usize = 2_000_000_000;
// data is read-ahead and saved in chunks of this many bytes
const CHUNK_SIZE_DEFAULT: usize = 100_000_000;
parallel_factor=8 buf_size=2G chunk=50M
reading entire decompressed file took: 32857423 us, bytes: 76654958592, read_us: 21869626, waiting_for_buffer_us: 10964636, largest fetch: 50000000, error: Ok(0)

parallel_factor=8 buf_size=2G chunk=25M
reading entire decompressed file took: 33978930 us, bytes: 76654958592, read_us: 26325740, waiting_for_buffer_us: 7609328, largest fetch: 25000000, error: Ok(0)

So far, the best config is parallel_factor=8 buf_size=2G chunk=50M.
Untar time is reduced from 44s to 32s.
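
For intuition about how these knobs interact, a back-of-the-envelope calculation (illustrative only; the exact meaning of parallel_factor is defined in the untar/SharedBuffer code, not shown here):

fn main() {
    // Values from the best configuration above.
    let total_buffer_budget: usize = 2_000_000_000; // 2G read-ahead budget
    let chunk_size: usize = 50_000_000; // 50M per chunk
    // The budget bounds how many read-ahead chunks can be buffered at once.
    let max_chunks_in_flight = total_buffer_budget / chunk_size;
    println!("up to {max_chunks_in_flight} chunks can be buffered"); // prints 40
}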

Fixes #

@HaoranYi changed the title from "fix typo" to "mmap" on May 2, 2022
let file = File::open(&snapshot_tar).unwrap();

// Map the file into immutable memory for better I/O performance
let mmap = unsafe { Mmap::map(&file) };
Contributor:
if this PR makes it out of draft, consider using an existing crate that wraps the unsafe. https://crates.io/crates/mmarinus seems like one such option

@HaoranYi (author) replied on May 3, 2022:

It looks like this crate depends on libc, which makes it Linux-only, not available for Windows. It breaks the Windows build.

> cargo check
    Checking mmarinus v0.4.0
error[E0433]: failed to resolve: could not find `unix` in `os`
  --> C:\Users\hyi\.cargo\registry\src\github.com-1ecc6299db9ec823\mmarinus-0.4.0\src\builder.rs:11:14
   |
11 | use std::os::unix::io::{AsRawFd, RawFd};
   |              ^^^^ could not find `unix` in `os`

Contributor:

Perhaps there's another crate, or we could consider upstreaming support for Windows.

Generally, it's very undesirable to have raw unsafe blocks in the code like this.

Contributor:

Alternatively fall back to the original code for Windows?
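
A rough shape of that fallback (a sketch only, using the memmap2 crate in the Unix branch for concreteness; the discussion above is about mmarinus, but the gating pattern is the same and the function name here is hypothetical):

use std::fs::File;
use std::io::{BufReader, Cursor, Read};
use std::path::Path;

#[cfg(unix)]
fn open_snapshot_reader(path: &Path) -> std::io::Result<Box<dyn Read + Send>> {
    // Map the archive and expose it as an in-memory byte stream.
    let file = File::open(path)?;
    let mmap = unsafe { memmap2::Mmap::map(&file)? };
    Ok(Box::new(Cursor::new(mmap)))
}

#[cfg(not(unix))]
fn open_snapshot_reader(path: &Path) -> std::io::Result<Box<dyn Read + Send>> {
    // Fall back to the original buffered-file reader on other platforms.
    Ok(Box::new(BufReader::new(File::open(path)?)))
}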

@HaoranYi (author) replied on May 3, 2022:

Yeah, pushed a commit to update this.
Unfortunately, there is still one unsafe block, which converts the mmap into a slice. This is because the background thread in SharedBufferReader requires the reader to be 'static. Removing it would require a bit of rework on SharedBufferReader, which we can leave for the future?
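
For context, a minimal sketch of the kind of lifetime extension being described (assuming a memmap2-style Mmap; illustrative only, and only sound if the mapping is kept alive for as long as the slice is used):

use memmap2::Mmap;

/// Treat the mapped bytes as a 'static slice so they can be handed to a
/// reader that must be 'static (e.g. one consumed on a background thread).
///
/// Safety: the caller must keep `mmap` alive (for example by leaking it or
/// storing it somewhere that outlives all readers) while the slice is in use.
unsafe fn mmap_as_static_slice(mmap: &Mmap) -> &'static [u8] {
    std::slice::from_raw_parts(mmap.as_ptr(), mmap.len())
}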

@HaoranYi changed the title from "mmap" to "Use memory map to speed up snapshot untar" on May 2, 2022
@HaoranYi marked this pull request as ready for review on May 3, 2022 03:30
@HaoranYi requested a review from jeffwashington on May 3, 2022 13:17
@@ -963,6 +963,9 @@ fn main() {
.validator(is_slot)
.takes_value(true)
.help("Halt processing at the given slot");
let no_os_memory_stats_reporting_arg = Arg::with_name("no_os_memory_stats_reporting")
@HaoranYi (author):

This is a separate change that adds the no_os_memory_stats_reporting CLI arg to ledger-tool. It is not related to the memory map.

@@ -779,7 +779,7 @@ pub struct BankFromArchiveTimings {
}

// From testing, 4 seems to be a sweet spot for ranges of 60M-360M accounts and 16-64 cores. This may need to be tuned later.
Contributor:

update comment

@jeffwashington (Contributor):

what machine spec did you use to get these metrics?

@jeffwashington (Contributor):

How big does the slice become? Is it as big as the tar file? Will we run into physical memory limits with 1B/10B account snapshots?

@HaoranYi (author) commented May 3, 2022:

what machine spec did you use to get these metrics?

GCE box: 48 cores, 250G RAM.

@HaoranYi (author) commented May 3, 2022:

How big does the slice become? Is it as big as the tar file? Will we run into physical memory limits with 1B/10B account snapshots?
In this test run, it is about 76G.
Yeah. As long as we have enough virtual memory, it will be fine. Otherwise, it will report an error and exit:

    let mmap = Map::load(&snapshot_tar, Private, perms::Read).unwrap_or_else(|e| {
        error!(
            "Failed to map the snapshot file: {} {}.\n\
             Please increase the virtual memory on the system.",
            snapshot_tar.as_ref().display(),
            e,
        );
        std::process::exit(1);
    });

Do you have an estimate of how big the snapshot will be for 1B/10B accounts?

@HaoranYi (author) commented May 3, 2022:

Or perhaps we can switch to the old BufReader if the memory map fails?

@jeffwashington (Contributor):

Do you have an estimate of how big the snapshot will be for 1B/10B accounts?

It is compressed, but I imagine it to be basically linear. So if 20G corresponds to 170M accounts, then 200G corresponds to 1.7B accounts and 2T to 17B accounts. Other things will begin breaking.
I sometimes test on 64G machines. Perhaps that ship has sailed.

@jeffwashington (Contributor):

I'm happy to have this change made. I would love the perf improvements. I load from snapshots all day long.
I am ignorant of how the slice code and the memory-mapped file stuff will work in a memory environment with very little virtual memory and slices mapped into memory. Does it page the file, since it is file-backed already, and does that not count as VM in the same sense? So, will this just work?
Can you write an app to hold all but say 15G of memory on a machine and leave it running, then try ledger-tool verify and see what happens when we try to untar a 20G snapshot? (A sketch of such a memory-hogging helper follows below.)
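
A minimal sketch of such a memory-hogging helper (hypothetical; not part of this PR): it allocates the requested number of GiB, touches every page so the memory is actually resident, and then sleeps.

use std::{env, thread, time::Duration};

fn main() {
    // GiB to hold, e.g. `memory-hog 49` on a 64G machine to leave roughly 15G free.
    let gib: usize = env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(1);
    let len = gib * 1024 * 1024 * 1024;
    let mut buf = vec![0u8; len];
    // Touch one byte per 4 KiB page so the pages are actually resident.
    for i in (0..len).step_by(4096) {
        buf[i] = 1;
    }
    println!("holding {gib} GiB; Ctrl-C to release");
    loop {
        thread::sleep(Duration::from_secs(60));
    }
}

With that running, ledger-tool verify can be started in another terminal against a snapshot larger than the remaining free memory to observe the paging behavior.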

@HaoranYi (author) commented May 4, 2022 via email.

@sakridge (Contributor) commented May 4, 2022:

I'm happy to have this change made. I would love the perf improvements. I load from snapshots all day long.
I am ignorant of how the slice code and the memory mapped file stuff will work in a mem env with very little virtual memory and slices mapped into memory. Does it page the file since it is file backed already and that doesn't count as vm in the same sense? So, will this just work?
Can you write an app to hold all but say 15G of memory on a machine and leave it running, then try ledger-tool verify and see what happens when we try to untar a 20G snapshot?

You mean physical not virtual right? The virtual limit is usually very high like many TB in Linux and you can adjust higher if needed.

@jeffwashington (Contributor):

Well, this is in my ignorance. We configure our validators with very little VM (2G)? If we mmap the tar file and then map it into a slice, what happens when we have less physical memory than we need to both represent the mmapped slice AND whatever other dynamic memory the validator needs, like the accounts index? I would assume the OS would just page in the mmapped areas that are accessed from the file that backs them. I do not know if this counts against the VM limit or not. And is the slice 20G elements long? Hopefully this 'just works'.

Basically, can I load a 80G snapshot on a 64G machine with 2G of virtual memory configured?
I suggested using up ram in a separate app so that we can answer the question:
Can I load a 20G snapshot on a machine with 2G of virtual memory and all but ~15G of memory used by another application?

If you know this will work, then I have no concerns when a 2B account snapshot is loaded on a 128G machine.

@t-nelson (Contributor) commented May 4, 2022:

it should work, for some definition of "work". mmap basically "swaps" out to the backing file if there's not enough RAM available. Whether this is going to have a negative effect on memory pressure for other components is the question. That is, I'm not sure if the in-memory region is treated as RES or more like VFS cache WRT eviction

@jeffwashington (Contributor):

Or perhaps we can switch to the old BufReader if the memory map fails?

I could imagine the mmap succeeds, but it uses almost all of the available mem. Then, we'd oom generating the index or something. I would like generating the index and untarring the append vecs to be happening concurrently - something @HaoranYi and I have been talking about. I'm just trying to think ahead and not lay a trap for ourselves that we hit later.

@jeffwashington (Contributor):

it should work, for some definition of "work". mmap basically "swaps" out to the backing file if there's not enough RAM available. Whether this is going to have a negative effect on memory pressure for other components is the question. That is, I'm not sure if the in-memory region is treated as RES or more like VFS cache WRT eviction

This summarizes my ignorance and concerns ;-) I also don't know. I think it also might 'work'. But I don't know enough low level details of linux memory management and mmapped files to 'know' with confidence. So, I proposed an experiment that seemed pretty easy. Feel free to propose a different experiment!

@HaoranYi (author) commented May 4, 2022:

Virtual memory can be specified; usually it is set larger than the physical RAM. For example, with 256G of RAM, you can set the VM to 512G. Then your application can map a file that is smaller than 512G and it should be fine. When you access the data, if it is not in physical RAM, the OS will page the data into memory. But it may perform badly because of thrashing of the available physical RAM.

@jeffwashington (Contributor):

Virtual memory can be specified; usually it is set larger than the physical RAM. For example, with 256G of RAM, you can set the VM to 512G. Then your application can map a file that is smaller than 512G and it should be fine. When you access the data, if it is not in physical RAM, the OS will page the data into memory. But it may perform badly because of thrashing of the available physical RAM.

I do not know the history of validators. But, it appears that we set virtual memory at 2G on our machines. I do not know the history of this or the impact if we were to change it.

@jeffwashington (Contributor):

And @HaoranYi, if you are testing on a GCE machine, I have no idea what the VM will be set to by default. I think you're testing on bare metal, from the spreadsheet.

@sakridge (Contributor) commented May 4, 2022:

The 2G is the swap file, which is somewhat different from virtual memory and not really used when mmap'ing a file. Linux will page out to the mapped file in this case and keep recently used pages in physical memory in the page cache. The page cache itself could maybe be paged out to swap, but swap is not necessary to map a huge VM space, and the size of the swap has no bearing on how much you can map. I believe Linux keeps the page cache within the physical memory of the machine.

@HaoranYi (author) commented May 4, 2022:

yeah. The vm is set in the following file on the system. In my case, it is 700G.

haoran_yi_solana_com@hyi-dev2:~/git/hy_solana$ cat /etc/sysctl.conf | grep vm.max_map_count
vm.max_map_count=700000

@t-nelson (Contributor) commented May 4, 2022:

yeah. The vm is set in the following file on the system. In my case, it is 700G.

haoran_yi_solana_com@hyi-dev2:~/git/hy_solana$ cat /etc/sysctl.conf | grep vm.max_map_count
vm.max_map_count=700000

That's the map count, not a byte size. It's the number of concurrent mmaps allowed.

@HaoranYi (author) commented May 5, 2022:

That's the map count, not a byte size. It's the number of concurrent mmaps allowed.

Yeah. My current config is 700K. I think the default is only 64K.

@HaoranYi (author) commented May 11, 2022:

@jeffwashington

I did a stress test on memory usage when the available physical memory is only 20G. On a machine with 378G of memory, I had one process consume 360G. The validator was able to memory-map an 84G snapshot file. The actual physical memory increase during the memory map was about 15G.

@jeffwashington (Contributor):

Ok, so the mmap shouldn't fail due to low memory. Seems good to have confirmed that. It doesn't seem necessary then to fall back on the non-mem map case. Right? When you ran this experiment, did we oom while generating the index?

@HaoranYi (author) commented May 11, 2022 via email.

@jeffwashington self-requested a review on May 12, 2022 16:43
@jeffwashington (Contributor) left a comment:

lgtm

@HaoranYi merged commit 3367e44 into solana-labs:master on May 12, 2022
@HaoranYi deleted the mmap branch on May 12, 2022 18:35
jeffwashington added a commit that referenced this pull request on May 12, 2022
jeffwashington added a commit that referenced this pull request on May 12, 2022
@HaoranYi restored the mmap branch on May 12, 2022 19:50
@HaoranYi deleted the mmap branch on May 12, 2022 19:52
HaoranYi added a commit to HaoranYi/solana that referenced this pull request on May 12, 2022
HaoranYi added a commit that referenced this pull request on May 16, 2022
* Revert "Revert "Use memory map to speed up snapshot untar (#24889)" (#25174)"

This reverts commit fc793de.

* not use mmarinus

* enable secondary build

* Revert "enable secondary build"

This reverts commit 5aa43a9.

* macbuild

* Revert "macbuild"

This reverts commit 0da9294.
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request May 16, 2022
* Revert "Revert "Use memory map to speed up snapshot untar (solana-labs#24889)" (solana-labs#25174)"

This reverts commit fc793de.

* not use mmarinus

* enable secondary build

* Revert "enable secondary build"

This reverts commit 5aa43a9.

* macbuild

* Revert "macbuild"

This reverts commit 0da9294.