Implement mmapDataFacade #1947

Closed · 4 of 6 tasks
danpat opened this issue Feb 2, 2016 · 8 comments
danpat (Member) commented Feb 2, 2016

OSRM currently supports reading data from files into heap memory (InternalDataFacade), or pre-loading data into shared memory using IPC shared memory blocks (SharedDataFacade+osrm-datastore).

We can consolidate the behaviour of both of these by using mmap. Instead of reading files into memory explicitly, we should be able to mmap the data files, and immediately begin using them.
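For reference, a minimal sketch of what "mmap the data files and immediately begin using them" could look like with plain POSIX calls (not OSRM code; the EdgeData layout and file name are made up for illustration):

```cpp
// Minimal sketch, assuming a flat binary file of fixed-size records:
// map it read-only and use it as an array without an explicit read() pass.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>

struct EdgeData // hypothetical record layout, not OSRM's
{
    std::uint32_t target;
    std::int32_t weight;
};

int main()
{
    const char *path = "edges.bin"; // hypothetical file name
    const int fd = ::open(path, O_RDONLY);
    if (fd < 0)
        throw std::runtime_error("open failed");

    struct stat sb;
    if (::fstat(fd, &sb) < 0)
        throw std::runtime_error("fstat failed");

    void *base = ::mmap(nullptr, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        throw std::runtime_error("mmap failed");
    ::close(fd); // the mapping stays valid after the descriptor is closed

    // Pages are faulted in lazily as they are touched.
    const auto *edges = static_cast<const EdgeData *>(base);
    const std::size_t count = sb.st_size / sizeof(EdgeData);
    long long total = 0;
    for (std::size_t i = 0; i < count; ++i)
        total += edges[i].weight;
    std::printf("sum of weights: %lld\n", total);

    ::munmap(base, sb.st_size);
    return 0;
}
```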

There are a few changes that need to be made to get us there:

  • Benchmark mmap'd data access vs heap - what, if any, penalty is there? How does this change when the file we mmap is on a ramdisk?
  • Identify data structures that can't be mmap'd and fix them - basically anything in osrm-datastore (src/storage/storage.cpp) that isn't just loaded into memory in one big blob. The problem here is vector<bool> and its proxy behavior; we need a contiguous container we can memcpy to (see the sketch after this list).
  • Clone the SharedDataFacade and perform similar .swap operations against mmap'd memory addresses rather than shm addresses.
  • Figure out IPC signalling for swapping out mmap'd files on-the-fly.
  • Investigate using mmap instead of explicit disk reads for leaf nodes in the StaticRTree to boost performance (coordinate lookups represent the largest part of any given routing query because of the I/O in the rtree).
  • Make sure this works on Windows too.
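On the vector<bool> point: a rough sketch of the kind of contiguous, memcpy-able bit container meant above (illustrative names only, not an actual OSRM type):

```cpp
// Sketch of a memcpy-able stand-in for std::vector<bool>: bits packed into a
// flat std::vector<std::uint64_t>, so the whole buffer can be copied into a
// shared-memory or mmap'd block with a single memcpy.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

class PackedBoolVector
{
  public:
    explicit PackedBoolVector(std::size_t size) : size_(size), words_((size + 63) / 64, 0) {}

    void set(std::size_t i, bool value)
    {
        const std::uint64_t mask = std::uint64_t{1} << (i % 64);
        if (value)
            words_[i / 64] |= mask;
        else
            words_[i / 64] &= ~mask;
    }

    bool get(std::size_t i) const { return ((words_[i / 64] >> (i % 64)) & 1u) != 0; }

    std::size_t size() const { return size_; }

    // Contiguous storage: safe to memcpy into a pre-allocated raw block.
    const void *data() const { return words_.data(); }
    std::size_t byte_size() const { return words_.size() * sizeof(std::uint64_t); }

  private:
    std::size_t size_;
    std::vector<std::uint64_t> words_;
};

// Usage: copy the packed bits into a memory block handed out by the datastore.
inline void copy_into(void *dest, const PackedBoolVector &v)
{
    std::memcpy(dest, v.data(), v.byte_size());
}
```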

The main goal here is to minimize double-reads of data. In situations where we are constantly cycling out data sets (as in the case of traffic updates), we want to minimize I/O and the number of times any single bit of data gets touched. By using mmap and tmpfs, we can emulate the current shared-memory behavior, but avoid an extra pass over the data.
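To make the "cycling out data sets" part concrete, one possible shape for swapping in a freshly written file - purely illustrative, not how OSRM implements it, and coordinating with in-flight queries is glossed over:

```cpp
// Illustrative sketch only: publish a new mmap'd data file by atomically
// swapping the mapping pointer, then unmap the old region. Waiting for
// in-flight readers (reference counting, RCU, ...) is omitted for brevity.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <atomic>
#include <cstddef>

struct Mapping
{
    void *base;
    std::size_t size;
};

std::atomic<Mapping *> current_mapping{nullptr};

bool swap_in(const char *path)
{
    const int fd = ::open(path, O_RDONLY);
    if (fd < 0)
        return false;
    struct stat sb;
    if (::fstat(fd, &sb) < 0)
    {
        ::close(fd);
        return false;
    }

    void *base = ::mmap(nullptr, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
    ::close(fd);
    if (base == MAP_FAILED)
        return false;

    auto *next = new Mapping{base, static_cast<std::size_t>(sb.st_size)};
    Mapping *old = current_mapping.exchange(next, std::memory_order_acq_rel);
    if (old != nullptr)
    {
        // In real code: wait until no query still reads from `old` before unmapping.
        ::munmap(old->base, old->size);
        delete old;
    }
    return true;
}
```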

For normal osrm-routed use, we would essentially get lazy loading of data - osrm-routed would start up faster, but queries would initially be slower, since pages are loaded from disk on demand until the data has been touched and lives in the filesystem cache. This initial slowness could be avoided by pre-seeding the data files into the filesystem cache or via MAP_POPULATE (Linux 2.5.46+), and this could be done in parallel with osrm-routed already starting up and answering queries.
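A hedged sketch of that pre-seeding idea: MAP_POPULATE is Linux-only, so it is guarded, and the fallback simply touches one byte per page (function name and structure are illustrative):

```cpp
// Sketch of pre-faulting a mapping so the first queries don't pay the
// page-fault cost. MAP_POPULATE exists on Linux (2.5.46+); elsewhere we
// fall back to touching every page once to pull it into the page cache.
#include <sys/mman.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>

void *map_prefaulted(int fd, std::size_t size)
{
    int flags = MAP_SHARED;
#ifdef MAP_POPULATE
    flags |= MAP_POPULATE; // ask the kernel to read the whole file ahead
#endif
    void *base = ::mmap(nullptr, size, PROT_READ, flags, fd, 0);
    if (base == MAP_FAILED)
        return nullptr;
#ifndef MAP_POPULATE
    // Fallback: touch one byte per page (could run on a background thread
    // while osrm-routed is already answering queries).
    const std::size_t page = static_cast<std::size_t>(::sysconf(_SC_PAGESIZE));
    const auto *bytes = static_cast<const volatile std::uint8_t *>(base);
    std::uint8_t sink = 0;
    for (std::size_t off = 0; off < size; off += page)
        sink += bytes[off];
    (void)sink;
#endif
    return base;
}
```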

/cc @daniel-j-h @TheMarex

TheMarex (Member) commented:

> Benchmark mmap'd data access vs heap - what, if any, penalty is there? How does this change when the file we mmap is on a ramdisk?

Benchmarks revealed a 10% slowdown w.r.t. internal memory. We are putting this on ice for the moment.

danpat (Member, Author) commented May 28, 2016

Did some quick benchmarking on OSX while on the plane this afternoon. Using this test:
https://gist.github.com/danpat/67e6ab63836ffbcc4d832e7db509a5b5

On an OSX ramdisk:

RAM access Run1: 21.026866s wall, 20.580000s user + 0.150000s system = 20.730000s CPU (98.6%)
RAM access Run2: 18.135529s wall, 18.070000s user + 0.040000s system = 18.110000s CPU (99.9%)
RAMdisk mmap Run1: 20.520104s wall, 19.460000s user + 0.790000s system = 20.250000s CPU (98.7%)
RAMdisk mmap Run2: 19.265660s wall, 18.490000s user + 0.730000s system = 19.220000s CPU (99.8%)

On the regular OSX filesystem:

RAM access Run1: 17.700162s wall, 17.650000s user + 0.030000s system = 17.680000s CPU (99.9%)
RAM access Run2: 17.893318s wall, 17.820000s user + 0.040000s system = 17.860000s CPU (99.8%)
Disk mmap Run1: 19.178829s wall, 18.200000s user + 0.740000s system = 18.940000s CPU (98.8%)
Disk mmap Run2: 19.359454s wall, 18.440000s user + 0.780000s system = 19.220000s CPU (99.3%)

I'm not quite sure what this is telling me; I suspect I need to run more samples. I played with a few different madvise values. MADV_RANDOM added about a 25% slowdown to the mmap calls when enabled, but had no effect on the direct RAM access.
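For context, that experiment boils down to a single hint on the mapped region (MADV_RANDOM disables read-ahead, which is a plausible reason for the slowdown here):

```cpp
// The madvise experiment is a one-liner on the mapped region; MADV_RANDOM
// tells the kernel not to bother with read-ahead for this mapping.
#include <sys/mman.h>
#include <cstddef>

void hint_random_access(void *base, std::size_t length)
{
    ::madvise(base, length, MADV_RANDOM); // default behaviour is MADV_NORMAL
}
```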

My machine has 16GB of RAM and I have plenty free, so I'm fairly confident that filesystem caching was in full effect and nothing got swapped out. OSX also performs memory compression when things get tight, but I didn't see that kick in either.

/cc @daniel-j-h

danpat (Member, Author) commented May 28, 2016

I took a look at some logs from my previous tests, and I think I might've been paging some stuff to swap after all. I halved the data size (4GB to 2GB) and shrank the ramdisk a bit.
I also removed std::rand() and just used i * BIGPRIME % ARRAYSIZE to access elements during the loop. While I was seeding with std::srand(), and std::rand() should be deterministic for a given seed, I'm not 100% clear what's happening under the covers, so I removed it as a possible variable.
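Roughly what that access pattern looks like (the actual gist may differ; the concrete BIGPRIME value and array size here are made up for illustration):

```cpp
// Rough shape of the benchmark's inner loop after dropping std::rand(): a
// deterministic stride over the array using a large prime, applied the same
// way to the heap copy and to the mmap'd file (assumes 64-bit std::size_t so
// the multiply does not overflow).
#include <cstddef>
#include <cstdint>

constexpr std::size_t ARRAYSIZE = (2ull << 30) / sizeof(std::uint32_t); // 2 GiB of uint32_t
constexpr std::size_t BIGPRIME = 2147483647ull;                         // 2^31 - 1; any large prime works

std::uint64_t sum_in_pseudo_random_order(const std::uint32_t *data)
{
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < ARRAYSIZE; ++i)
        total += data[(i * BIGPRIME) % ARRAYSIZE];
    return total;
}
```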

Results now look like this:
Tests on the ramdisk volume:

RAM access Run1: 11.017670s wall, 10.960000s user + 0.030000s system = 10.990000s CPU (99.7%)
RAM access Run2: 11.398677s wall, 11.330000s user + 0.030000s system = 11.360000s CPU (99.7%)
RAMdisk mmap Run1: 11.630367s wall, 11.240000s user + 0.360000s system = 11.600000s CPU (99.7%)
RAMdisk mmap Run2: 11.878009s wall, 11.480000s user + 0.370000s system = 11.850000s CPU (99.8%)

Tests on the regular filesystem:

RAM access Run1: 11.302447s wall, 11.080000s user + 0.050000s system = 11.130000s CPU (98.5%)
RAM access Run2: 10.781652s wall, 10.730000s user + 0.030000s system = 10.760000s CPU (99.8%)
Disk mmap Run1: 12.049692s wall, 11.430000s user + 0.460000s system = 11.890000s CPU (98.7%)
Disk mmap Run2: 12.164826s wall, 11.710000s user + 0.380000s system = 12.090000s CPU (99.4%)

Overall, ¯\_(ツ)_/¯. Seems like mmap on the regular filesystem on OSX is a bit slower (~10%). On OSX's ramdisk (e.g. `diskutil erasevolume HFS+ 'RAM Disk' $(hdiutil attach -nomount ram://8485760)` for a 4GB disk), we do see some speedup that brings it pretty close to direct RAM access.

@daniel-j-h do you have details of how you tested this on Linux?

daniel-j-h (Member) commented:

@danpat can you have a look - you refactored the data facades.

Is this ticket still relevant and actionable?

danpat (Member, Author) commented Dec 13, 2016

We could still do this - in fact, things are slowly getting easier as we refactor the I/O handling.

Let's keep this open as a feature request - one day, down the road, somebody might implement it :-) Keeping this history will be useful.

TheMarex (Member) commented:

A first step towards this was done in #4881. For further gains we would need to mmap every input file separately.

danpat (Member, Author) commented Oct 20, 2018

mmap-ing individual files has been done in #5242

The only things that PR doesn't complete from our original list are:

  • implement mmap-based hot-swapping
  • Windows support

danpat assigned danpat and unassigned TheMarex on Oct 26, 2018