
Enable multi-process FMU caching and expose it through URI resolvers #388

Merged
merged 22 commits into from
Feb 5, 2020

Conversation

kyllingstad
Member

This PR introduces functionality needed to fix open-simulation-platform/cosim-cli#11. In the process, I've fixed an old issue inherited from Coral, namely the lack of synchronisation of cse::fmi::importer's cache of unpacked FMUs. What I've done here is:

  1. Introduce a new Lockable class, cse::utility::lock_file, which manages a lock file for interprocess synchronisation.
  2. Use this class in cse::fmi::importer to synchronise write access to cached FMUs.
  3. Expose the ability to use an FMU cache through the "orchestration" APIs.

I'd appreciate extra scrutiny on point 2 (changes in src/cpp/fmi/importer.cpp), since synchronisation is tricky business.

hplatou
hplatou previously approved these changes Sep 30, 2019
Contributor

@hplatou hplatou left a comment


LGTM. Any reason for not using boost file_lock (https://www.boost.org/doc/libs/1_71_0/boost/interprocess/sync/file_lock.hpp)?

@kyllingstad
Member Author

@hplatou:

Any reason for not using boost file_lock (https://www.boost.org/doc/libs/1_71_0/boost/interprocess/sync/file_lock.hpp)?

No reason whatsoever, other than that I wasn't aware of its existence. (The worst part is, I even googled for something like it and didn't find it. Now I see that it does turn up in my search results, but at the bottom of the page, where I never look. I'm so used to all the good stuff being among the top results. :D)

Anyway, this is a very good observation, and we should of course use the Boost version instead. Fix is forthcoming.

@kyllingstad
Member Author

I've noticed some other issues that need fixing too, so I'm closing the pull request so people don't waste time reviewing it until I'm done.

@kyllingstad
Member Author

kyllingstad commented Jan 7, 2020

I finally got around to picking this up again – and holy cow, it sent me down a very deep rabbit hole!

After @hplatou's comment above, I set out to reimplement this with boost::interprocess::file_lock rather than my own home-brewed version. However, reading the documentation for it, I discovered that file locking only works between processes. Different threads inside a process always share the same lock.

So rather than ditch cse::utility::file_lock entirely, I reimplemented it in terms of the Boost one, adding support for interthread (and interfiber!) synchronisation by coupling the file lock with a mutex.

The next thing I discovered was that in order to properly implement a concurrency-safe FMU cache, I actually needed an RW locking mechanism. It must be possible for multiple processes/threads/fibers to read the cache, but only one should be allowed to modify it.

This is supported by boost::interprocess::file_lock (yay!) but not by Boost.Fiber (boo!). Hence, I also had to implement my own cse::utility::shared_mutex, modeled after std::shared_mutex.

So it turned out to take quite a bit of time and it adds a lot of code. Is it still worth it? I think so. Experience with Coral has shown that this functionality is definitely needed, as some FMUs run into hundreds of megabytes in size, causing unzipping to become a significant part of the simulation run time. Coral has support for persistent caching, but not safe caching, and this has in fact caused real-world problems when people have tried to run several simulations in parallel.

I'll repeat my request that reviewers give this extra scrutiny, as concurrency is hard – both to implement and to test.

@kyllingstad kyllingstad reopened this Jan 7, 2020
@kyllingstad kyllingstad requested a review from hplatou January 7, 2020 14:34
@@ -214,21 +276,38 @@ std::shared_ptr<fmu> importer::import(const boost::filesystem::path& fmuPath)
}
zip.extract_file_to(modelDescriptionIndex, tempMdDir);

// Look at the model description to figure out the FMU's GUID.
Member Author


I've added some comments to this function that are not really related to the new caching code, but simply explain what's going on. The function was long and inscrutable enough already, and the extra code only made it worse.

@kyllingstad
Member Author

kyllingstad commented Jan 13, 2020

Closing this again to fix some more things. The boost::interprocess::file_lock documentation isn't explicit about whether one process can have several file_lock objects referring to the same file, but it is worded in a way that indicates that this is possible. However, my experiments have shown that it is not.

EDIT: Apparently I've been looking at just a small portion of the documentation. Silly me, I thought that the API reference would be complete, but no.

@kyllingstad
Member Author

Third time's a charm! :)

@kyllingstad kyllingstad reopened this Jan 14, 2020
@markaren
Contributor

markaren commented Jan 14, 2020

This PR is way too complex for me to approve or chime in on, but I do have a question. Wouldn't this be possible with less complexity (no file locks) by creating a std::shared_ptr<fmu_cache> get_or_create_fmu_cache_dir(const fs::path& fmuPath) function that returns a reference-counted RAII class which deletes the directory in its destructor?

Just my thought. Feel free to ignore it. This PR may be the way to go, but it's too complex for me to review.

Edit: Using the file path as the key would not work if SSP etc. is used, but the function could extract just the GUID and use that as the key.

Edit2: Access to the function would of course need to be synchronised.

@kyllingstad
Member Author

kyllingstad commented Jan 14, 2020

Wouldn't this be possible with less complexity (no file locks) by creating a std::shared_ptr<fmu_cache> get_or_create_fmu_cache_dir(const fs::path& fmuPath) function that returns a reference-counted RAII class which deletes the directory in its destructor?

That's what happens if you use cse::default_model_uri_resolver() without giving it a cache path (which again calls cse::fmi::importer::create() without a path), and it's what we've done so far.

This PR aims to solve a different set of problems:

  1. Persisting the cache between different runs of the program.
  2. Synchronising access to the cache between processes that run in parallel.

Number 2 follows from number 1, and number 1 is motivated by a real-world issue that Coral users have experienced: Some FMUs are simply very large, and just unzipping them becomes a significant part of the simulation run time. Furthermore, program crashes (not uncommon in our line of work) may prevent automatic cleanup, so disks tended to fill up rapidly before I implemented this in Coral.

I've seen FMUs that are legitimately hundreds of megabytes in size, either because they contain large data files (e.g. hydrodynamic hull data), because they contain a large amount of code (e.g. embedded MATLAB runtime), or a combination of both.

However, this adds far more complexity than I had anticipated when I started working on it, and I will of course accept if the change gets rejected on account of that. I had fun doing it anyway. ;)

There are in fact several valid arguments to the effect that "re-usable huge FMUs" may be a marginal use case for CSE:

  • The primary use case may be to run each slave in a freshly-created container/VM
  • Maybe we intend to develop a different solution for runtimes than embedding them in the FMUs
  • Our simulations typically take so long that unzipping a couple of hundred megabytes at the beginning is not that big a deal.

EDIT: I came up with lots of arguments against my own PR.

@markaren
Contributor

Persisting the cache between different runs of the program.

I see 👍

@eidekrist
Member

Slowly picking my way through this, although it admittedly is some distance above my head.

You've written a lot about what we need from the locking functionality, but not so much about why we need it. Would the most relevant use case be two or more processes (or threads) trying to write to/read from the cache simultaneously? For example, one is running a parallelized parameter sweep, and several of the subsimulations are simultaneously trying to unpack the same FMU to the same cache?

@kyllingstad
Member Author

We discussed this during today's status meeting, but I'll answer here too for completeness' sake.

You've written a lot about what we need from the locking functionality, but not so much about why we need it. Would the most relevant use case be two or more processes (or threads) trying to write to/read from the cache simultaneously? For example, one is running a parallelized parameter sweep, and several of the subsimulations are simultaneously trying to unpack the same FMU to the same cache?

Exactly. I have seen examples of both use cases, for example running batch simulations with multiple processes or running parallel parameter sweeps within one process.

@kyllingstad
Member Author

I've introduced the file_cache interface I mentioned on Slack now, and implemented two versions of it:

  • temporary_file_cache, which is very simple and has no persistence or synchronisation
  • persistent_file_cache, which uses file and mutex locking to achieve both

If you wish, we can now easily drop the whole file-locking stuff from core by moving persistent_file_cache (and its dependencies in utility/concurrency.hpp) outside the library.

persistent_file_cache aside, I actually think file_cache alone brings a major improvement to the code which makes it worth including anyway: fmi::importer doesn't have to deal with caching at all anymore, not even the simple form that we currently have. It's all encapsulated behind file_cache, so fmi::importer can focus on the one thing it's supposed to be doing, namely unpacking FMUs. Personally, I think it has made the fmi::importer code a whole lot easier to read.

@kyllingstad kyllingstad dismissed hplatou’s stale review January 17, 2020 08:05

The PR has changed significantly since the approval was given.

@ljamt
Member

ljamt commented Feb 4, 2020

The file_cache interface is good, and I agree that temporary_file_cache really improves fmi::importer.

The persistent_file_cache and utility/concurrency code doesn't interfere with "running code"; it's not used anywhere except in utility_concurrency_unittest. This is what's needed in the CLI, right? If we want to move it outside the library, where would you suggest?

@kyllingstad
Member Author

One option would be to move it to cse-cli, since that's where it will be used. Another is to move it to a new library, e.g. "cse-extras" (perhaps with lower quality and maintenance requirements than cse-core).

That said, we should also consider whether caching would be useful in cse-server (or other client code) as well, in which case there is a case to be made for keeping it in core.

@ljamt
Member

ljamt commented Feb 4, 2020

That said, we should also consider whether caching would be useful in cse-server (or other client code) as well, in which case there is a case to be made for keeping it in core.

I think it will be useful in other client applications as well, so I'm up for keeping it here.

// References:
// https://en.wikipedia.org/wiki/Percent-encoding
// https://msdn.microsoft.com/en-us/library/aa365247.aspx
std::string sanitise_path(const std::string& str)
Member


So this part has been replaced with cse::uri::percent_encode(). While the code is quite different, it still solves our problem with GUIDs containing { or } 👍

Member Author


Yeah, sanitise_path() was just an old, half-baked implementation of percent_encode() anyway, so I figured it would be cleaner to use the real thing.

Member

@eidekrist eidekrist left a comment


I'm not able to give this the scrutiny you requested, but I like what I see and jenkins is happy 👍

@kyllingstad kyllingstad merged commit 19ad219 into master Feb 5, 2020
@kyllingstad kyllingstad deleted the feature/cse-cli-11-persistent-fmu-cache branch February 5, 2020 07:59
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Use a persistent FMU cache
5 participants