-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable multi-process FMU caching and expose it through URI resolvers #388
Conversation
It is now safe to use the same cache directory from multiple threads and even multiple processes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Any reason for not using boost file_lock (https://www.boost.org/doc/libs/1_71_0/boost/interprocess/sync/file_lock.hpp)?
No reason whatsoever, other than that I wasn't aware of its existence. (The worst part is, I even googled for something like it and didn't find it. Now I see that it does turn up in my search results, but at the bottom of the page, where I never look. I'm so used to all the good stuff being among the top results. :D) Anyway, this is a very good observation, and we should of course use the Boost version instead. Fix is forthcoming. |
I've noticed some other issues that need fixing too, so I'm closing the pull request so people don't waste time reviewing it until I'm done. |
I finally got around to picking this up again – and holy cow, it sent me down a very deep rabbit hole! After @hplatou's comment above, I set out to reimplement this with So rather than ditch The next thing I discovered was that in order to properly implement a concurrency-safe FMU cache, I actually needed an RW locking mechanism. It must be possible for multiple processes/threads/fibers to read the cache, but only one should be allowed to modify it. This is supported by So it turned out to take quite a bit of time and it adds a lot of code. Is it still worth it? I think so. Experience with Coral has shown that this functionality is definitely needed, as some FMUs run into hundreds of megabytes in size, causing unzipping to become a significant part of the simulation run time. Coral has support for persistent caching, but not safe caching, and this has in fact caused real-world problems when people have tried to run several simulations in parallel. I'll repeat my request that reviewers give this extra scrutiny, as concurrency is hard – both to implement and to test. |
@@ -214,21 +276,38 @@ std::shared_ptr<fmu> importer::import(const boost::filesystem::path& fmuPath) | |||
} | |||
zip.extract_file_to(modelDescriptionIndex, tempMdDir); | |||
|
|||
// Look at the model description to figure out the FMU's GUID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some comments to this function that are not really related to the new caching code, but simply explain what's going on. The function was long and inscrutable enough already, and the extra code only made it worse.
Closing this again to fix some more things. The EDIT: Apparently I've been looking at just a small portion of the documentation. Silly me, I thought that the API reference would be complete, but no |
Third time's a charm! :) |
This PR is way to complex for me to approve/chime in, but I do have a question. Would't this be possible with less complexity (no file locks) by creating a Just my thought. Feel free to ignore it. This PR may be the way to go, but It's to complex for me to rewiew. Edit: Using the file path as key would not work if SSP etc is used, but the function could extract the guid only and use that as the key. Edit2: Access to the function would need to be synchronized of course. |
That's what happens if you use This PR aims to solve a different set of problems:
Number 2 follows from number 1, and number 1 is motivated by a real-world issue that Coral users have experienced: Some FMUs are simply very large, and just unzipping them becomes a significant part of the simulation run time. Furthermore, program crashes (not uncommon in our line of work) may prevent automatic cleanup, so disks tended to fill up rapidly before I implemented this in Coral. I've seen FMUs that are legitimately hundreds of megabytes in size, either because they contain large data files (e.g. hydrodynamic hull data), because they contain a large amount of code (e.g. embedded MATLAB runtime), or a combination of both. However, this adds far more complexity than I had anticipated when I started working on it, and I will of course accept if the change gets rejected on account of that. I had fun doing it anyway. ;) There are in fact several valid arguments to the effect that "re-usable huge FMUs" may be a marginal use case for CSE:
EDIT: I came up with lots of arguments against my own PR. |
I see 👍 |
Slowly picking my way through this, although it admittedly is some distance above my head. You've written a lot about what we need from the locking functionality, but not so much about why we need it. Would the most relevant use case be two or more processes (or threads) trying to write to/read from the cache simultaneously? For example, one is running a parallelized parameter sweep, and several of the subsimulations are simultaneously trying to unpack the same FMU to the same cache? |
We discussed this during today's status meeting, but I'll answer here too for completeness' sake.
Exactly. I have seen examples of both use cases, for example running batch simulations with multiple processes or running parallel parameter sweeps within one process. |
I've introduced the
If you wish, we can now easily drop the whole file-locking stuff from core by moving
|
The PR has changed significantly since the approval was given.
The The |
One option would be to move it to cse-cli, since that's where it will be used. Another is to move it to a new library, e.g. "cse-extras" (perhaps with lower quality and maintenance requirements than cse-core). That said, we should also consider whether caching would be useful in cse-server (or other client code) as well, in which case there is a case to be made for keeping it in core. |
I think it will be useful in other client applications as well, so I'm up for keeping it here. |
// References: | ||
// https://en.wikipedia.org/wiki/Percent-encoding | ||
// https://msdn.microsoft.com/en-us/library/aa365247.aspx | ||
std::string sanitise_path(const std::string& str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this part has been replaced with cse::uri::percent_encode()
. While the code is quite different, it still solves our problem with GUID's containing {
or }
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, sanitise_path()
was just an old, half-baked implementation of percent_encode()
anyway, so I figured it would be cleaner to use the real thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not able to give this the scrutiny you requested, but I like what I see and jenkins is happy 👍
This PR introduces functionality needed to fix open-simulation-platform/cosim-cli#11. In the process, I've fixed an old issue inherited from Coral, namely the lack of synchronisation of
cse::fmi::importer
's cache of unpacked FMUs. What I've done here is:cse::utility::lock_file
, which manages a lock file for interprocess synchronisation.cse::fmi::importer
to synchronise write access to cached FMUs.I'd appreciate extra scrutiny on point 2 (changes in
src/cpp/fmi/importer.cpp
), since synchronisation is tricky business.