Enable reuse of mmaps between artifacts #125
Conversation
```diff
@@ -109,12 +109,12 @@ impl CodeMemory {
             executable_section_result.push(s);
         }

-        self.start_of_nonexecutable_pages = bytes;
+        self.start_of_nonexecutable_pages = Mmap::round_up_to_page_size(bytes);
```
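A minimal sketch of the rounding involved here, assuming a fixed 4 KiB page size (the real `Mmap::round_up_to_page_size` would query the OS page size rather than hard-code it):

```rust
// Hypothetical stand-in for Mmap::round_up_to_page_size.
const PAGE_SIZE: usize = 4096; // assumption: 4 KiB pages

fn round_up_to_page_size(bytes: usize) -> usize {
    // Round `bytes` up to the next multiple of PAGE_SIZE (power of two).
    (bytes + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}

fn main() {
    assert_eq!(round_up_to_page_size(0), 0);
    assert_eq!(round_up_to_page_size(1), 4096);
    assert_eq!(round_up_to_page_size(4096), 4096);
    // 5000 bytes of executable sections start the nonexecutable
    // pages at the next page boundary:
    assert_eq!(round_up_to_page_size(5000), 8192);
}
```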
Should we zero-out the executable leftover?
I don't think it is necessary to zero out this mmap at all. Due to wasm using a Harvard-style execution model, the leftover memory isn't going to be accessible to the contract. Unless we mess up our VM implementation, in which case we'd have worse problems (e.g. contracts getting access to a source of non-determinism or to private counterparts of keys stored in memory).
Agree that this isn't necessary, but it might still be good for defense in depth -- it just feels scary to have executable memory somewhere in the process whose contents are not explicitly set. Though, this is a very weak opinion. If we do enable huge pages, we could potentially need to zero out a significant chunk of stuff...
I wonder if this makes the situation worse if someone pops us somewhere else. Like, what if there's an RCE in rocks? Seems like not zeroing is no worse even in that case -- if the attacker can get a gadget via non-zeroing of an old contract, they can probably get one via just an explicit contract?
```rust
        if total_len <= self.mmap.len() {
            self.unpublish()
        } else {
            self.mmap = Mmap::with_at_least(total_len).map_err(|e| e.to_string())?;
```
Would be cool to push this error to the caller completely, such that we don't have to worry about internal mmap failing.
For the time being this is still a prototype that uses RWX maps, but it won’t be too onerous to adopt any other mechanism. The primary value is that the user is now responsible for allocating a pool of memory maps ahead of time, and they continue being reused within the same engine, possibly resizing when necessary to fit a larger module (though this is something I’m somewhat on the fence about – it introduces a failure mode related to memory allocation deep inside …)

906.06us is the baseline/best performance we can get when running this benchmark with 10k functions. This is the same …
@Ekleog it would be a good time to review and think about the implications of the new API here.
I think this is largely functionally complete. Compared to the current master we’re looking at a module load performance somewhere along the lines of:
I’m thinking of delaying work on improving this further with memfd (which gives another significant boost) until after we get flat executable formats going.
These methods attempt to make it straightforward to load a module into a store when the engine contained therein is already a dynamic object. While there are some instances of it, for all cases where it matters for us, obtaining a specific type of engine is going to be pretty straightforward (`UniversalEngine` is the only supported engine right now anyway…).
Overall LGTM for the idea! I still have a few concerns around `unsafe` here and there though.
Also, I'm with @matklad that keeping an RX memory map hanging around with user-defined code is probably bad for defense-in-depth. Given we're going to turn that RX map into an RW map as soon as we get it out of the code memory pool, maybe it'd make sense to actually do the RW remap before putting it back into the code memory pool?
This way an attacker could control only an RW zone of memory outside of contract execution, which should be much harder to exploit from eg. a networking library ropchain than an RX zone of memory forever.
It does mean, though, that the idea about CodeMemory keeping an enum of RX/RW should actually happen, so that the RW remap doesn't happen twice for no good reason.
Does that make sense?
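A minimal sketch of the RX/RW-enum idea discussed above, with the protection state tracked in the type so the pool can hold RW maps and each transition happens only once. All names (`Protection`, `CodeMemory`, `make_rw`, `make_rx`) are illustrative, not the PR's actual API, and `Vec` plus a state field stand in for the real mmap and `mprotect` calls:

```rust
// Illustrative sketch: track map protection so that a map returned to the
// pool is remapped to RW exactly once, and published as RX exactly once.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Protection {
    ReadWrite,
    ReadExecute,
}

struct CodeMemory {
    protection: Protection,
    // real implementation: the mmap itself lives here
}

impl CodeMemory {
    fn make_rw(&mut self) {
        // real implementation: mprotect(addr, len, PROT_READ | PROT_WRITE);
        // skipping the syscall when already RW is the point of the enum.
        if self.protection != Protection::ReadWrite {
            self.protection = Protection::ReadWrite;
        }
    }

    fn make_rx(&mut self) {
        // real implementation: mprotect(addr, len, PROT_READ | PROT_EXEC)
        if self.protection != Protection::ReadExecute {
            self.protection = Protection::ReadExecute;
        }
    }
}

fn main() {
    let mut mem = CodeMemory { protection: Protection::ReadExecute };
    // Returning to the pool: remap RW so no stale executable code lingers.
    mem.make_rw();
    assert_eq!(mem.protection, Protection::ReadWrite);
    // Taking it back out to publish freshly compiled code:
    mem.make_rx();
    assert_eq!(mem.protection, Protection::ReadExecute);
}
```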
```diff
@@ -101,4 +97,41 @@ impl dyn Engine {
         None
     }
 }

+    /// Downcast a dynamic Executable object to a concrete implementation of the trait.
+    pub fn downcast_arc<T: Engine + 'static>(self: Arc<Self>) -> Result<Arc<T>, Arc<Self>> {
```
Not sure I understand, why would `Arc::downcast` not work here? AFAICT we could just add `: Any` to the definition of `trait Engine`, and we could get rid of this bit of unsafe code for free.
So the problem with `Arc::downcast` is that it is implemented for `Arc<dyn Any + Send + Sync + 'static>`. This notably doesn’t work for `Arc<dyn Engine*>` and there isn’t an easy way to go between the two, so you cannot actually call `Arc::downcast` on `Arc<dyn Engine>` AFAICT.
Hmm I think having something like this should work?

```rust
trait Engine: Any + Send + Sync + 'static {
    // ...
    fn to_any(self: Arc<Self>) -> Arc<dyn Any + Send + Sync + 'static> {
        self
    }
}
```

And then use `Arc::downcast(engine.to_any())`?
Basically I'm feeling that if there's no way something like this would work, it might mean that this bit of unsafe code, however sound it looks, is actually wrong :/
Also I actually just noticed that `Engine::type_id` is a function implemented by a non-`unsafe`, non-sealed trait, which means that the code is technically unsound because an engine could spoof a fake TypeId. That said, tracing has similar unsoundness in its traits and it's not a big deal in practice, so it's probably not too bad.
> Also I actually just noticed that Engine::type_id is a function implemented by a non-unsafe non-sealed trait, which means that the code is technically unsound because an engine could spoof a fake TypeId.
Implementers wouldn’t be able to access the `private::Internal` necessary to specify the argument type for the function.
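A sketch of the "private argument" sealing trick referred to here, under the assumption that in the real crate the `private` module is not importable by downstream users, so an implementor cannot spell out the signature of `type_id` and therefore cannot override it to return a spoofed TypeId. All names are illustrative (in this single-file sketch the module has to be `pub` to compile, which of course weakens the demonstration):

```rust
use std::any::TypeId;

// In real code this module would be crate-private (or #[doc(hidden)] with a
// crate-internal constructor), making `Internal` unnameable downstream.
pub mod private {
    pub struct Internal;
}

pub trait Engine: 'static {
    // Downstream impls cannot override this default body, because writing
    // the override requires naming `private::Internal` in the signature.
    fn type_id(&self, _: private::Internal) -> TypeId {
        TypeId::of::<Self>()
    }
}

struct UniversalEngine;
impl Engine for UniversalEngine {}

fn main() {
    let e: &dyn Engine = &UniversalEngine;
    // The default body runs with Self = UniversalEngine via the vtable:
    assert_eq!(e.type_id(private::Internal), TypeId::of::<UniversalEngine>());
}
```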
> And then use `Arc::downcast(engine.to_any())`?
Implementing `to_any` would require an unstable feature (trait upcasting) or `unsafe` code along the lines of what is seen here, and would end up returning `dyn Any` in its `Err` variant, which is not great in terms of usability of the API, but it would work.
Hmm so for to_any(), here is what I was thinking of: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=7cedaa843727797e370656ec18459e0f (found by re-looking at the craziest trait object code I ever wrote, erlust)
I had forgotten that a default impl of the function wouldn't work, but that's probably not a problem for us. The advantage being that there's zero unsafe code here. And the drawback that, as you mentioned, it returns `dyn Any` instead of `dyn Engine` upon failure, but that's probably not too bad as it's all `Arc` anyway and we can just keep a clone of the `Arc<dyn Engine>` from before the `to_any` cast.
WDYT?
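A runnable sketch of this `to_any` approach, where each impl provides the trivial upcast itself (since, as noted above, a default impl doesn't work without trait upcasting). `UniversalEngine` and the trait shape are illustrative; the real trait has many more methods:

```rust
use std::any::Any;
use std::sync::Arc;

trait Engine: Send + Sync + 'static {
    // Each concrete engine writes this one-line upcast; zero unsafe code.
    fn to_any(self: Arc<Self>) -> Arc<dyn Any + Send + Sync + 'static>;
}

struct UniversalEngine;

impl Engine for UniversalEngine {
    fn to_any(self: Arc<Self>) -> Arc<dyn Any + Send + Sync + 'static> {
        self
    }
}

fn main() {
    let engine: Arc<dyn Engine> = Arc::new(UniversalEngine);
    // Keep a clone so a failed downcast doesn't lose the `dyn Engine` handle.
    let keep: Arc<dyn Engine> = Arc::clone(&engine);

    let concrete: Arc<UniversalEngine> = engine
        .to_any()
        .downcast::<UniversalEngine>()
        .ok()
        .expect("downcast to UniversalEngine failed");
    // `keep` and `concrete` share the same allocation:
    assert_eq!(Arc::strong_count(&concrete), 2);

    // A wrong target type fails; the Err carries the Arc<dyn Any> back.
    struct NotAnEngine;
    assert!(keep.to_any().downcast::<NotAnEngine>().is_err());
}
```

The failure path returning `Arc<dyn Any + Send + Sync>` rather than `Arc<dyn Engine>` is the usability wart discussed above; keeping the pre-cast clone sidesteps it.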
```rust
impl dyn Engine + Send + Sync {
    /// Downcast a dynamic Executable object to a concrete implementation of the trait.
    pub fn downcast_ref<T: Engine + 'static>(&self) -> Option<&T> {
```
Same here
Same story with `dyn Any` vs `dyn T`.
```rust
        page_size,
    ) + data_sections.iter().fold(0, |acc, data| {
        round_up(acc + data.bytes.len(), DATA_SECTION_ALIGNMENT.into())
    });
```
In a first draft of this comment I was fooled by how the code for this computation is constructed, but it turns out that it's just doing additional needless round_up alignments, meaning it's just overestimating and is correct.
Maybe add a comment saying that this is an over-estimation that does additional aligning at the end of each section type, so other code readers don't get surprised in the same way?
FWIW I wanted to get rid of this entirely somehow, as I found this difficult to verify as well (this is pre-existing code). I think the most plausible approach would be to make “allocate” generic over the writer and call it twice: once with a dummy writer that would just collect the write sizes and spit out the final required size at the end, and a second time with a proper writer.
SGTM!
```diff
@@ -30,7 +30,6 @@ mod code_memory;
 mod engine;
 mod executable;
 mod link;
-mod unwind;
```
Does this change mean we can drop the eh_tables code handling in the previous commit too?
```rust
}

/// Remap the offset into an absolute address within a read-execute mapping.
pub fn executable_address(&self, offset: usize) -> *const u8 {
```
I don’t know. Obtaining an address to the contents of the map is not inherently an unsafe operation; it is the `publish`/`writer` that end up prodding the dragons. So I feel like it is more natural to have the unsafety there, rather than here. Analogous situation as with casting a mutable reference to a pointer vs dereferencing said pointer(s) to make a mutable reference out of it.
Instead, users can obtain a concrete type of engine and use inherent methods to compile with the specific engine.
For the memory maps themselves, for the time being, we’re using a scheme that uses `SHARED` maps, which gives us a large chunk of the speed benefit described in https://kazlauskas.me/entries/fast-page-maps-for-jit. Ideally we would use a memfd based approach, however that turns out to require some significant reworking of how the compiled functions are referenced – for example, we want to apply linking and relocation on the writable view of the functions, but the function references will contain references to the executable mapping right after the `allocate` call. I imagine we won’t get to implementing an improvement here for a little while, and definitely not before we implement flat executable file representations.
This appears to be no longer used in any meaningful way since we no longer have any backends that emit uwtables in the first place.
This makes sure that there isn’t a place with a bunch of user-defined code in memory while the code is not being used in any way.
This is a crude prototype which allows reusing the `CodeMemory` between distinct Artifacts and Instances. This can significantly reduce the overhead of allocating mmaps with the kernel. It turns out that the kernel would still spend a lot of time if requested to change the protection of a mapping from RX to RW.

This overhead is entirely mitigated by using HUGETLB, which is sadly non-portable and is a major pain to enable (it requires a kernel parameter).

With these two things in place the sys time goes down to almost zero when loading contracts into memory, and the overall time of the `deserialize/10000` benchmark is reduced from ~2ms per iteration down to 1.25ms.
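A hypothetical sketch of the pooling scheme the PR describes: code memories are allocated up front, reused across module loads within the same engine, and a reused memory is grown only when a larger module doesn't fit. `Vec<u8>` stands in for the actual mmap, and all names are illustrative, not the PR's API:

```rust
struct CodeMemory {
    buf: Vec<u8>, // real implementation: an mmap, not a Vec
}

struct CodeMemoryPool {
    free: Vec<CodeMemory>,
}

impl CodeMemoryPool {
    /// Pre-allocate `count` memories of `size` bytes each, ahead of time.
    fn new(count: usize, size: usize) -> Self {
        let free = (0..count)
            .map(|_| CodeMemory { buf: vec![0; size] })
            .collect();
        CodeMemoryPool { free }
    }

    /// Take a memory that fits `needed` bytes, growing a reused one when the
    /// module is larger than anything in the pool.
    fn acquire(&mut self, needed: usize) -> CodeMemory {
        let mut mem = self
            .free
            .pop()
            .unwrap_or(CodeMemory { buf: Vec::new() });
        if mem.buf.len() < needed {
            mem.buf.resize(needed, 0); // real implementation: remap larger
        }
        mem
    }

    /// Return a memory for reuse by later module loads in the same engine.
    fn release(&mut self, mem: CodeMemory) {
        self.free.push(mem);
    }
}

fn main() {
    let mut pool = CodeMemoryPool::new(2, 4096);

    let mem = pool.acquire(1000);
    assert_eq!(mem.buf.len(), 4096); // fits; no resize needed
    pool.release(mem);

    let big = pool.acquire(10_000); // reused and grown for a larger module
    assert_eq!(big.buf.len(), 10_000);
    assert_eq!(pool.free.len(), 1);
}
```

The growth path is the failure mode mentioned above: an allocation can now fail deep inside module loading rather than only at pool construction time.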