Compute annotations for table plugin at runtime #4798
I've been thinking a little bit about how we could evaluate the cache hit rate for CH and MLD with the least amount of effort. The easiest thing I could come up with is using point-to-point shortest path queries. One problem with having a cache is that we need to invalidate it every time we change the weighting. This is potentially a problem for MLD due to traffic updates every few minutes. So an important factor we need to find out is how many queries it takes on average until the cache is warm (reaches the desired hit/miss ratio). Braindump on how to implement this:
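To make the invalidation concern concrete, here is a minimal sketch; the generation counter and all names are hypothetical, not OSRM's actual API. The idea is to tie the cache to a dataset version that is bumped on every traffic update:

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical cache keyed by a packed (start, target) node pair. Cached
// durations are stale as soon as the weighting changes, so we tie the cache
// to a dataset "generation" that is bumped on every traffic update.
struct AnnotationCache
{
    std::unordered_map<std::uint64_t, std::int32_t> durations;
    std::uint32_t generation = 0;

    // Call at the start of every query with the current dataset generation.
    void Validate(const std::uint32_t current_generation)
    {
        if (current_generation != generation)
        {
            durations.clear(); // weighting changed: all cached durations invalid
            generation = current_generation;
        }
    }
};
```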
My use case: computing large matrices like 2000x2000, up to 10000x10000. For such matrices MLD is impracticable, so we use CH. For the record: for us, increasing memory usage by adding other values (e.g. distance) alongside weight and time is acceptable if the speed stays the same.
@frodrigo I don't think we will drop support for CH any time soon, for the reason you pointed out. This issue is about switching the computation of these values from the pre-processing phase to the query phase. If you don't limit the cache size, that should eventually yield similar performance, since every edge will have an entry in the cache. Whether there is a penalty, and how big it is, is something we would need to find out; we could possibly keep pre-computation as an option, or do some sort of cache warming.
Okay, so I ran two sets of requests to test out how a cache would perform:
Results for Run 1
Results for Run 2

So, things to note:
The next thing I'm going to run is random requests on North America with 5 coordinates, to get a better sense of performance for the cache proposed above.
To reproduce:
I used
The raw output looks like:
@ghoshkaj Can you also track the number of entries in the cache here? This will be important, because the cache will need to be limited in size. If the cache can be infinitely large, then eventually almost all lookups will be cache hits :-)
@danpat I ran a couple of requests and the cache size is always equal to the number of misses. I'm using the cache.size() function.
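That is exactly what a get-or-compute cache without eviction does; a toy sketch (hypothetical names) shows why: each miss performs exactly one insert, so `cache.size()` always equals the miss count until entries start being evicted.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using Key = std::uint64_t;
using Duration = std::int32_t;

template <typename ComputeFn>
Duration GetOrCompute(std::unordered_map<Key, Duration> &cache,
                      const Key key, ComputeFn compute,
                      std::size_t &hits, std::size_t &misses)
{
    if (auto it = cache.find(key); it != cache.end())
    {
        ++hits;
        return it->second;
    }
    ++misses; // each miss inserts exactly one entry...
    return cache.emplace(key, compute()).first->second; // ...so size() == misses
}
```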
@ghoshkaj that makes sense, and so does the shape of the plots. Is it easy for you to regenerate this for, say, 10k or 50k requests? It would be good to confirm that the pattern holds at larger request counts.
@danpat @TheMarex here is what it looks like for 5 coordinates and 10,000 requests on the continental US dataset. And 20 coordinates and 10,000 requests on the continental US dataset: Those are both for CH. For MLD, however, it's a bit different. Here is what 20 coordinates, 100 requests in Berlin looks like: the hits eventually catch up to the misses and surpass them. It looks like the cache benefits CH more than MLD matrix calculations. I'm repeating these for 10k requests and 5k requests on the US extract and will post them ASAP.
Okay, I had a mistake in my implementation for gathering stats for MLD. Here is a collection of all the stats that I have now collected for CH and MLD:

- CH, 5 coordinates, 1,000 requests, USCA: crosses over at around 4,000 cache accesses
- MLD, 5 coordinates, 1,000 requests, USCA: crosses over at around 10 cache accesses
- CH, 5 coordinates, 10,000 requests, USCA: crosses over at around 3,700 cache accesses
- MLD, 5 coordinates, 10,000 requests, USCA: crosses over at around 5 cache accesses

As @TheMarex expected, the cache performs far better with MLD! Also, these numbers look really good!
Okay, we're running with implementing a cache for the annotations computation in PR #4876, following the steps in the original comment above for CH.
I've been reading through the steps and the code, and I think the following is my plan of action. These are in the form of questions, and I would really like to know the answers to numbers 2, 4 and 6. For the others I'm simply looking for confirmation (unless I've understood the concept incorrectly, of course!). @TheMarex would you please 👍 or 👎 and/or explain the bits I've got wrong? 😸
I think once I've implemented number 8, I'll have the answers to number 7. However, numbers 2 and 6 are still pretty puzzling to me. And I would really like to know the answer to 4 as a side curiosity!
Yep.
No, the cache works the same way as the statistics logger you already implemented on the unpacked path. The node matrix is for storing the middle node, which is the meeting point of the forward and backward searches. We need the middle node to extract the packed path: since we only store the ID of the parent in the heap, we need to walk back from the meeting point in the forward/backward direction until we hit the source/target.
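Concretely, walking back from the middle node could look like this; a sketch modelled loosely on OSRM's query heaps, where `GetData(node).parent` and the source/target-is-its-own-parent convention are assumptions:

```cpp
#include <algorithm>
#include <vector>

using NodeID = unsigned int;

// Sketch: recover the packed path source -> ... -> middle -> ... -> target by
// following the parent pointers stored in the forward and reverse query heaps.
template <typename QueryHeap>
std::vector<NodeID> RetrievePackedPath(const QueryHeap &forward_heap,
                                       const QueryHeap &reverse_heap,
                                       const NodeID middle)
{
    std::vector<NodeID> packed_path;

    // middle -> source via forward parents (the source is its own parent)
    NodeID node = middle;
    while (forward_heap.GetData(node).parent != node)
    {
        packed_path.push_back(node);
        node = forward_heap.GetData(node).parent;
    }
    packed_path.push_back(node);
    std::reverse(packed_path.begin(), packed_path.end()); // now source -> middle

    // middle -> target via reverse parents (the target is its own parent)
    node = middle;
    while (reverse_heap.GetData(node).parent != node)
    {
        node = reverse_heap.GetData(node).parent;
        packed_path.push_back(node);
    }
    return packed_path;
}
```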
Exactly.
It tries to figure out if this node can be on the fastest path. The precise reason why this works is a little complicated and intimately connected to how Contraction Hierarchies work.
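This description matches the classic "stall-on-demand" check in CH queries; a simplified sketch (placeholder names, not OSRM's exact code): if some settled neighbour `v` reaches `u` more cheaply than `u`'s own tentative weight, then `u` cannot lie on the fastest path and need not be expanded.

```cpp
#include <cstdint>

using NodeID = std::uint32_t;
using EdgeWeight = std::int32_t;

// Sketch of stall-on-demand in a forward CH search: node `u`, popped with
// tentative weight `u_weight`, can be skipped if an edge v -> u from an
// already-reached node `v` proves a shorter way to get to `u`.
template <typename Graph, typename Heap>
bool ShouldStall(const Graph &graph, const Heap &heap,
                 const NodeID u, const EdgeWeight u_weight)
{
    for (const auto edge : graph.GetAdjacentEdgeRange(u))
    {
        const auto &data = graph.GetEdgeData(edge);
        if (!data.backward) // only edges usable in the opposite direction
            continue;
        const NodeID v = graph.GetTarget(edge);
        if (heap.WasInserted(v) && heap.GetKey(v) + data.weight < u_weight)
            return true; // a provably shorter path via v exists
    }
    return false;
}
```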
Yep, exactly.
If we want to compute the entry
We first extract the packed path as mentioned above using the
Sounds good. 👍
@TheMarex recording what I understand so far. Okay so:
A question:
Had a voice chat with @TheMarex. Turns out the original approach to calculating durations is 🙅♂️, because manyToManySearch() uses one heap for both the backward and forward searches. This is a problem because we clear the heap out after each shortest path to a target is found, so we can't go back to the heap after all the paths have been found to pick out the packed_path and unpack it. To still be able to use this method we would need one heap per target, and that's inefficient memory-wise. Alternative approach that I'll prototype (maybe @TheMarex comes up with a better solution before it's done, though):
This second approach doesn't give us the memory saving from getting rid of the specialized `ManyToManyHeap`.
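To make the shared-heap problem concrete, here is a toy illustration (stand-in types, not OSRM's real code) of why the parent pointers are gone by the time the table is complete:

```cpp
#include <cstddef>
#include <vector>

// Toy stand-ins to illustrate the loop structure only (not OSRM's real types).
struct ToyQueryHeap
{
    std::vector<int> parents; // parent pointers of the current search tree
    void Clear() { parents.clear(); }
};

void ManyToManyShape(ToyQueryHeap &query_heap, const std::vector<int> &targets)
{
    for (std::size_t column = 0; column < targets.size(); ++column)
    {
        query_heap.Clear(); // parents from the previous target's search vanish here
        // ... run one search, write weight/duration into the result table ...
    }
    // After the loop the heap holds no search tree, so packed paths can only be
    // extracted if the middle nodes and paths were captured before each Clear().
}
```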
@ghoshkaj Sorry about the confusion, the original idea works, I just forgot an important "detail":
Updates: I have implemented the cache and I currently have two versions of calculating the durations working for CH. I have taken measurements, and here is how that looks right now. I have run these measurements for the following dataset and variables:
Compute duration while finding the shortest route: 4.5 ms

I am also going to collect the following baseline measurements and improvements:
Here are some request time measurements. They are random requests on the top part of California. The numbers in the table are averages over the 20,000 requests. The headings in the table indicate how the durations are being computed.
There's a 78.22/20.76 ≈ 3.8× slowdown with the cache, and a 141.18/20.76 ≈ 6.8× slowdown without the cache, for a 25x25 matrix. This means the worst-case roundtrip request time (even with the cache, supposing we never get repeat requests and therefore have to do the calculations from scratch) is 1697.10 ms, i.e. about 1.7 seconds, for a 100x100 matrix! In the best case the roundtrip time will be 0.2 s for a 100x100 matrix. Here are the osrm-runner commands that I used:

25x25 matrix, 20000 requests:
100x100 matrix, 20000 requests:
From the output from
What the measurements above mean is that we pretty much definitely need a cache. So, with the goal of improving the worst-case roundtrip time, these are the immediate next steps:
Capturing a voice conversation with @TheMarex and @oxidase about cache design considerations that need to be addressed before this cache is considered complete.
@ghoshkaj, @TheMarex I have a question about this change: I know there are people who are also interested in the distances, not only the durations.
For fast table queries we need to be able to quickly determine the duration of a shortcut (in the case of CH) or overlay (in the case of MLD) edge. This is why we save both the weight and the duration for every CH edge, and have two sets of matrices plus an additional value on the base graph for MLD.
All this data is only needed for the table plugin. All other plugins use path unpacking to compute the duration/distance values during the query. Removing this precomputed data could reduce the memory we need considerably: the overall memory consumption for CH could be reduced by about 25%, for MLD by about 20% (approximately; we might run into problems with bit packing for the forward/backward flags).
To make this work we need to change the many-to-many algorithm in the following way:

1. Don't store the `duration` in the heap, only the parent node (e.g. here). This means we can remove the special `ManyToManyHeap` from `SearchEngineData`, which also saves us some memory.
2. Likewise drop the stored duration from `SearchSpaceWithBuckets` here.
3. Save the `NodeID` of every middle node for every source/target pair alongside the `weight_table` here.
4. Unpack the paths `start, middle` and `middle, target`, which yields a list of `EdgeBasedNode` ids, for which we each extract the segment durations and sum them up.

The performance-crucial step is the last one: we need to unpack a path for every entry in the matrix and calculate the duration value from the segments of the path. These are a lot of operations that would most likely dominate the running time, which would get us something closer to running `n x m` point-to-point queries.

To get back to our original performance characteristics, we would need to implement a cache that yields amortized constant time for computing the duration of `start, target`: while recursively unpacking a path, we check for each edge whether there is an entry in the cache; if not, we compute an entry and insert it.
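A minimal sketch of that memoized unpacking, assuming CH-style shortcuts; `Facade`, `IsShortcut`, `GetMiddle`, and `GetBaseDuration` are hypothetical stand-ins, not OSRM's actual API:

```cpp
#include <cstdint>
#include <unordered_map>

using NodeID = std::uint32_t;
using Duration = std::int32_t;

// Key for a (start, target) pair of edge-based nodes, packed into 64 bits.
inline std::uint64_t MakeKey(const NodeID start, const NodeID target)
{
    return (static_cast<std::uint64_t>(start) << 32) | target;
}

// Sketch: compute the duration of the packed edge (start, target), memoizing
// every sub-edge on the way down so repeated sub-paths cost O(1) amortized.
template <typename Facade>
Duration UnpackDuration(const Facade &facade, const NodeID start, const NodeID target,
                        std::unordered_map<std::uint64_t, Duration> &cache)
{
    const auto key = MakeKey(start, target);
    if (auto it = cache.find(key); it != cache.end())
        return it->second; // cache hit

    Duration duration;
    if (facade.IsShortcut(start, target))
    {
        // A shortcut (start, target) via `middle` unpacks into two sub-edges.
        const NodeID middle = facade.GetMiddle(start, target);
        duration = UnpackDuration(facade, start, middle, cache) +
                   UnpackDuration(facade, middle, target, cache);
    }
    else
    {
        // Base-graph edge: sum the durations of its underlying segments.
        duration = facade.GetBaseDuration(start, target);
    }

    cache.emplace(key, duration);
    return duration;
}
```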
We could implement a thread-local cache that is persistent between queries, based on an `unordered_map` keyed by `start, target, exclude_index`. This cache just stores the duration of the path, but we could have multiple caches, for example one storing the distance value to enable #1353. The cache needs to be configurable in size (e.g. max `500 MB`) to actually yield any memory reduction over precomputing all annotations. As a result, we need to implement a replacement strategy such as LRU for the cache.
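A hedged sketch of what such a size-bounded, thread-local LRU cache could look like; the key layout, the entry-count capacity standing in for a byte budget, and all names are assumptions, not the eventual implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

// Hypothetical cache key: (start, target, exclude_index).
struct CacheKey
{
    std::uint32_t start, target, exclude_index;
    bool operator==(const CacheKey &o) const
    {
        return start == o.start && target == o.target && exclude_index == o.exclude_index;
    }
};

struct CacheKeyHash
{
    std::size_t operator()(const CacheKey &k) const
    {
        std::size_t h = k.start;
        h = h * 31 + k.target;
        return h * 31 + k.exclude_index;
    }
};

// LRU-bounded duration cache: a list ordered by recency plus an index into it.
class UnpackingCache
{
    using Entry = std::pair<CacheKey, std::int32_t>;
    std::list<Entry> lru_; // front = most recently used
    std::unordered_map<CacheKey, std::list<Entry>::iterator, CacheKeyHash> index_;
    std::size_t capacity_; // max entry count, derived from a byte budget (e.g. 500 MB)

  public:
    explicit UnpackingCache(const std::size_t capacity) : capacity_(capacity) {}

    bool Lookup(const CacheKey &key, std::int32_t &duration)
    {
        const auto it = index_.find(key);
        if (it == index_.end())
            return false;
        lru_.splice(lru_.begin(), lru_, it->second); // refresh recency
        duration = it->second->second;
        return true;
    }

    void Insert(const CacheKey &key, const std::int32_t duration)
    {
        if (index_.size() >= capacity_) // evict the least recently used entry
        {
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(key, duration);
        index_[key] = lru_.begin();
    }
};

// One cache per worker thread, persistent across queries.
thread_local UnpackingCache annotation_cache{/*capacity*/ 1 << 22};
```

Making the cache `thread_local` avoids any locking, at the cost of one cache (and potentially duplicated entries) per worker thread.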