
Enable multi-process FMU caching and expose it through URI resolvers #388

Merged
merged 22 commits into from
Feb 5, 2020

Conversation

kyllingstad
Member

This PR introduces functionality needed to fix open-simulation-platform/cosim-cli#11. In the process, I've fixed an old issue inherited from Coral, namely the lack of synchronisation of cse::fmi::importer's cache of unpacked FMUs. What I've done here is:

  1. Introduce a new Lockable class, cse::utility::lock_file, which manages a lock file for interprocess synchronisation.
  2. Use this class in cse::fmi::importer to synchronise write access to cached FMUs.
  3. Expose the ability to use an FMU cache through the "orchestration" APIs.

I'd appreciate extra scrutiny on point 2 (changes in src/cpp/fmi/importer.cpp), since synchronisation is tricky business.

hplatou
hplatou previously approved these changes Sep 30, 2019
Contributor

@hplatou hplatou left a comment


LGTM. Any reason for not using boost file_lock (https://www.boost.org/doc/libs/1_71_0/boost/interprocess/sync/file_lock.hpp)?

@kyllingstad
Member Author

@hplatou:

Any reason for not using boost file_lock (https://www.boost.org/doc/libs/1_71_0/boost/interprocess/sync/file_lock.hpp)?

No reason whatsoever, other than that I wasn't aware of its existence. (The worst part is, I even googled for something like it and didn't find it. Now I see that it does turn up in my search results, but at the bottom of the page, where I never look. I'm so used to all the good stuff being among the top results. :D)

Anyway, this is a very good observation, and we should of course use the Boost version instead. Fix is forthcoming.

@kyllingstad
Member Author

I've noticed some other issues that need fixing too, so I'm closing the pull request so people don't waste time reviewing it until I'm done.

@kyllingstad
Member Author

kyllingstad commented Jan 7, 2020

I finally got around to picking this up again – and holy cow, it sent me down a very deep rabbit hole!

After @hplatou's comment above, I set out to reimplement this with boost::interprocess::file_lock rather than my own home-brewed version. However, reading the documentation for it, I discovered that file locking only works between processes. Different threads inside a process always share the same lock.

So rather than ditch cse::utility::file_lock entirely, I reimplemented it in terms of the Boost one, adding support for interthread (and interfiber!) synchronisation by coupling the file lock with a mutex.

The next thing I discovered was that in order to properly implement a concurrency-safe FMU cache, I actually needed an RW locking mechanism. It must be possible for multiple processes/threads/fibers to read the cache, but only one should be allowed to modify it.

This is supported by boost::interprocess::file_lock (yay!) but not by Boost.Fiber (boo!). Hence, I also had to implement my own cse::utility::shared_mutex, modeled after std::shared_mutex.

So it turned out to take quite a bit of time and it adds a lot of code. Is it still worth it? I think so. Experience with Coral has shown that this functionality is definitely needed, as some FMUs run into hundreds of megabytes in size, causing unzipping to become a significant part of the simulation run time. Coral has support for persistent caching, but not safe caching, and this has in fact caused real-world problems when people have tried to run several simulations in parallel.

I'll repeat my request that reviewers give this extra scrutiny, as concurrency is hard – both to implement and to test.

@kyllingstad kyllingstad reopened this Jan 7, 2020
@kyllingstad kyllingstad requested a review from hplatou January 7, 2020 14:34
@@ -214,21 +276,38 @@ std::shared_ptr<fmu> importer::import(const boost::filesystem::path& fmuPath)
}
zip.extract_file_to(modelDescriptionIndex, tempMdDir);

// Look at the model description to figure out the FMU's GUID.
Member Author


I've added some comments to this function that are not really related to the new caching code, but simply explain what's going on. The function was long and inscrutable enough already, and the extra code only made it worse.

@kyllingstad
Member Author

kyllingstad commented Jan 13, 2020

Closing this again to fix some more things. The boost::interprocess::file_lock documentation isn't explicit about whether one process can have several file_lock objects referring to the same file, but it is worded in a way that indicates that this is possible. However, my experiments have shown that it is not.

EDIT: Apparently I've been looking at just a small portion of the documentation. Silly me, I thought that the API reference would be complete, but no.

@kyllingstad
Member Author

Third time's a charm! :)

@kyllingstad kyllingstad reopened this Jan 14, 2020
@markaren
Contributor

markaren commented Jan 14, 2020

This PR is way too complex for me to approve or chime in on, but I do have a question. Wouldn't this be possible with less complexity (no file locks) by creating a std::shared_ptr<fmu_cache> get_or_create_fmu_cache_dir(const fs::path& fmuPath) function that returns a reference-counted RAII class which deletes the directory in its destructor?

Just my thought. Feel free to ignore it. This PR may be the way to go, but it's too complex for me to review.

Edit: Using the file path as the key would not work if SSP etc. is used, but the function could extract just the GUID and use that as the key.

Edit2: Access to the function would of course need to be synchronised.

@kyllingstad
Member Author

kyllingstad commented Jan 14, 2020

Wouldn't this be possible with less complexity (no file locks) by creating a std::shared_ptr<fmu_cache> get_or_create_fmu_cache_dir(const fs::path& fmuPath) function that returns a reference-counted RAII class which deletes the directory in its destructor?

That's what happens if you use cse::default_model_uri_resolver() without giving it a cache path (which again calls cse::fmi::importer::create() without a path), and it's what we've done so far.

This PR aims to solve a different set of problems:

  1. Persisting the cache between different runs of the program.
  2. Synchronising access to the cache between processes that run in parallel.

Number 2 follows from number 1, and number 1 is motivated by a real-world issue that Coral users have experienced: Some FMUs are simply very large, and just unzipping them becomes a significant part of the simulation run time. Furthermore, program crashes (not uncommon in our line of work) may prevent automatic cleanup, so disks tended to fill up rapidly before I implemented this in Coral.

I've seen FMUs that are legitimately hundreds of megabytes in size, either because they contain large data files (e.g. hydrodynamic hull data), because they contain a large amount of code (e.g. embedded MATLAB runtime), or a combination of both.

However, this adds far more complexity than I had anticipated when I started working on it, and I will of course accept if the change gets rejected on account of that. I had fun doing it anyway. ;)

There are in fact several valid arguments to the effect that "re-usable huge FMUs" may be a marginal use case for CSE:

  • The primary use case may be to run each slave in a freshly-created container/VM
  • Maybe we intend to develop a different solution for runtimes than embedding them in the FMUs
  • Our simulations typically take so long that unzipping a couple of hundred megabytes at the beginning is not that big a deal.

EDIT: I came up with lots of arguments against my own PR.

@markaren
Contributor

Persisting the cache between different runs of the program.

I see 👍

@eidekrist
Member

Slowly picking my way through this, although it admittedly is some distance above my head.

You've written a lot about what we need from the locking functionality, but not so much about why we need it. Would the most relevant use case be two or more processes (or threads) trying to write to/read from the cache simultaneously? For example, one is running a parallelized parameter sweep, and several of the subsimulations are simultaneously trying to unpack the same FMU to the same cache?

@kyllingstad
Member Author

We discussed this during today's status meeting, but I'll answer here too for completeness' sake.

You've written a lot about what we need from the locking functionality, but not so much about why we need it. Would the most relevant use case be two or more processes (or threads) trying to write to/read from the cache simultaneously? For example, one is running a parallelized parameter sweep, and several of the subsimulations are simultaneously trying to unpack the same FMU to the same cache?

Exactly. I have seen examples of both use cases, for example running batch simulations with multiple processes or running parallel parameter sweeps within one process.

@kyllingstad
Member Author

I've introduced the file_cache interface I mentioned on Slack now, and implemented two versions of it:

  • temporary_file_cache, which is very simple and has no persistence or synchronisation
  • persistent_file_cache, which uses file and mutex locking to achieve both

If you wish, we can now easily drop the whole file-locking stuff from core by moving persistent_file_cache (and its dependencies in utility/concurrency.hpp) outside the library.

persistent_file_cache aside, I actually think file_cache alone brings a major improvement to the code which makes it worth including anyway: fmi::importer doesn't have to deal with caching at all anymore, not even the simple form that we currently have. It's all encapsulated behind file_cache, so fmi::importer can focus on the one thing it's supposed to be doing, namely unpacking FMUs. Personally, I think it has made the fmi::importer code a whole lot easier to read.

@kyllingstad kyllingstad dismissed hplatou’s stale review January 17, 2020 08:05

The PR has changed significantly since the approval was given.

@ljamt
Member

ljamt commented Feb 4, 2020

The file_cache interface is good, and I agree that temporary_file_cache really improves fmi::importer.

The persistent_file_cache and utility/concurrency code doesn't interfere with "running code"; it's not used anywhere except in utility_concurrency_unittest. This is what's needed in the CLI, right? If we want to move it outside the library, where would you suggest?

@kyllingstad
Member Author

One option would be to move it to cse-cli, since that's where it will be used. Another is to move it to a new library, e.g. "cse-extras" (perhaps with lower quality and maintenance requirements than cse-core).

That said, we should also consider whether caching would be useful in cse-server (or other client code) as well, in which case there is a case to be made for keeping it in core.

@ljamt
Member

ljamt commented Feb 4, 2020

That said, we should also consider whether caching would be useful in cse-server (or other client code) as well, in which case there is a case to be made for keeping it in core.

I think it will be useful in other client applications as well, so I'm up for keeping it here.

// References:
// https://en.wikipedia.org/wiki/Percent-encoding
// https://msdn.microsoft.com/en-us/library/aa365247.aspx
std::string sanitise_path(const std::string& str)
Member


So this part has been replaced with cse::uri::percent_encode(). While the code is quite different, it still solves our problem with GUIDs containing { or } 👍

Member Author


Yeah, sanitise_path() was just an old, half-baked implementation of percent_encode() anyway, so I figured it would be cleaner to use the real thing.

Member

@eidekrist eidekrist left a comment


I'm not able to give this the scrutiny you requested, but I like what I see and jenkins is happy 👍

@kyllingstad kyllingstad merged commit 19ad219 into master Feb 5, 2020
@kyllingstad kyllingstad deleted the feature/cse-cli-11-persistent-fmu-cache branch February 5, 2020 07:59
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Use a persistent FMU cache
5 participants