fix Pose2 preambleCache bug #562
Conversation
Ah, I should put q_hat in the cache as well... will try that quickly tomorrow if I get a spare moment.
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #562      +/-   ##
==========================================
- Coverage   32.84%   32.78%   -0.07%
==========================================
  Files          48       48
  Lines        1860     1870      +10
==========================================
+ Hits          611      613       +2
- Misses       1249     1257       +8
```
Excited for the performance gains! This looks like side-by-side caching as opposed to in-line caching. Just curious about the design choice there.
I need to learn more about side-by-side (in-place) vs. in-line caching. I was searching online but didn't find a good reference yet; do you perhaps have one handy? If I'm reading your input correctly, I think this is the in-place (side-by-side) model, because CalcFactor.cache is likely going to do two related jobs. One is a strong in-place memory case. Maybe the cache should be split for the two cases; I started somewhere that seems like a sensible design. Currently only the Pose2Pose2 factor uses preambleCache so far (this PR).

The word "cache" might be a bit confusing, since it will definitely be doing in-place memory work for hot-loop computations, while the same architecture can also be used for caching important data at addFactor time. The Marine Demo is a good example: I'm looking for a good way to avoid serializing the radar sweeps into the PackedFactor data and duplicating them for each factor. My thinking is that the preambleCache step (which ScatterAlignPose2 will overload) can fetch big blobs from the data store instead. Similarly, if there is a global calibration, we can load that kind of data for each of the factors when they get added to the graph.

EDIT: Oh, I also like the simplicity of adding this for users. See the code changes for this PR (and the sketch below) -- it's really easy for a user. The in-line model might also be easy, but I'd need to learn more first.

EDIT2: Also, I have multithreading in mind. That design is still a work in progress, but so far the leading candidate for ...
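A minimal sketch of the user-facing pattern, for illustration only (the cache struct, its field, and the residual body are hypothetical; preambleCache, CalcFactor, and Pose2Pose2 follow the naming used in this PR, though the exact signatures may differ):

```julia
# Hypothetical sketch -- not this PR's literal code.
struct Pose2Pose2Cache
  q_hat::Vector{Float64}   # preallocated scratch, reused on every residual call
end

# Runs once when the factor is added to the graph; the returned object is
# later available as cf.cache inside the hot loop.
preambleCache(dfg, vars, fct::Pose2Pose2) = Pose2Pose2Cache(zeros(3))

# Hot-loop residual: write into the cached buffer instead of allocating.
function (cf::CalcFactor{<:Pose2Pose2})(meas, p, q)
  q_hat = cf.cache.q_hat   # no per-call allocation
  # ... predict q into q_hat from p and meas, then return the residual ...
end
```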
The difference is primarily in who coordinates the cache: the client or the cache itself. I'm not sure of the ubiquitous terms; I made those up. A concrete sketch of the two options:
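Something like this in Julia (ScratchCache, predict!, and the key name are made up for illustration):

```julia
# Option 1 -- side-by-side / in-place: the CLIENT coordinates the cache.
# It allocates buffers up front and threads them through every call.
struct ScratchCache
  buf::Vector{Float64}
end
predict!(c::ScratchCache, x) = (c.buf .= 2 .* x; c.buf)

cache = ScratchCache(zeros(3))
predict!(cache, [1.0, 2.0, 3.0])   # hot loop reuses cache.buf

# Option 2 -- in-line: the CACHE coordinates. The client asks by key and
# supplies a recipe for misses; storage and eviction are the cache's job.
store = Dict{String, Vector{Float64}}()
sweep = get!(store, "radar_sweep_42") do
  zeros(1000)                      # fetched/computed only on a cache miss
end
```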
Ah, got it, thanks. So the predominant use case here will be low-level, to give the user maximum control. Each factor "cache" should be maximally flexible at this level; users will want to do a whole gamut of tricks and performance tweaks. E.g., one requirement on the factor cache is being able to leverage the GPU. My sense is that the factor cache will develop with both designs being used in different parts. In-line caching makes a lot of sense for the data stores. I'm fairly sure CF.cache starting out as an in-place design is the right way to go, but inside each factor's preambleCache function, the user will likely be leveraging in-line caching features.
PS, I added this to the CJL collection of high-level design decisions: https://github.com/JuliaRobotics/Caesar.jl/wiki/High-Level-Requirements
I'm now starting to think we should consolidate that with the RequirementsManagement system.
The idea here is to try to reduce hot-loop memory consumption.
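As a toy illustration (not this PR's code) of the effect being targeted, compare per-call allocations with and without a reused buffer:

```julia
# Toy illustration: a residual that allocates a fresh array every call vs.
# one that writes into preallocated buffers.
resid_alloc(meas, p, q) = q .- (p .+ meas)   # allocates a new result each call

function resid_cached!(res, q_hat, meas, p, q)
  @. q_hat = p + meas                        # reuse the cached buffer
  @. res = q - q_hat
  return res
end

function demo()
  meas, p, q = rand(3), rand(3), rand(3)
  res, q_hat = zeros(3), zeros(3)
  resid_alloc(meas, p, q)                    # warm up JIT first
  resid_cached!(res, q_hat, meas, p, q)
  @show @allocated resid_alloc(meas, p, q)                # nonzero bytes per call
  @show @allocated resid_cached!(res, q_hat, meas, p, q)  # 0 bytes
end
demo()
```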
So, just quickly trying this on the hexagonal example...
Once upon a time, hex would take at best around 30 s to solve. After this PR, and on a computer thermally throttling at around 2.8 GHz, hex solves in under 8 seconds (using 5 processes via Distributed.jl). The usual caveats apply: all processes already JIT-compiled, etc. I haven't had time to test properly (I've been lazy with fg in global scope); a rough script is sketched below.
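Roughly the shape of that test (generateCanonicalFG_Hexagonal and solveTree! are assumed names from RoME/IncrementalInference; the exact API may differ by version):

```julia
# Rough sketch of the hexagonal timing run; function names are assumed.
using Distributed
addprocs(4)                 # 4 workers + the main process = 5 processes
@everywhere using RoME

function runhex()           # keep fg out of global scope
  fg = generateCanonicalFG_Hexagonal()
  return solveTree!(fg)
end

runhex()                    # first call JIT-compiles on all processes
@time runhex()              # ~8 s after this PR on the machine above
```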
TL;DR: I expect this PR to reduce hot-loop memory consumption. Proper testing will tell how much it helps, but it seems to be helping somewhere between a little and a lot.