Enable reuse of mmaps between artifacts #125
Conversation
```diff
@@ -109,12 +109,12 @@ impl CodeMemory {
             executable_section_result.push(s);
         }

-        self.start_of_nonexecutable_pages = bytes;
+        self.start_of_nonexecutable_pages = Mmap::round_up_to_page_size(bytes);
```
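A minimal sketch of the rounding involved here, assuming a fixed 4 KiB page size (the real `Mmap::round_up_to_page_size` would query the OS page size rather than hard-code it):

```rust
// Hypothetical stand-in for Mmap::round_up_to_page_size.
const PAGE_SIZE: usize = 4096; // assumption: 4 KiB pages

fn round_up_to_page_size(bytes: usize) -> usize {
    // Round `bytes` up to the next multiple of PAGE_SIZE (power of two).
    (bytes + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}

fn main() {
    assert_eq!(round_up_to_page_size(0), 0);
    assert_eq!(round_up_to_page_size(1), 4096);
    assert_eq!(round_up_to_page_size(4096), 4096);
    // 5000 bytes of executable sections start the nonexecutable
    // pages at the next page boundary:
    assert_eq!(round_up_to_page_size(5000), 8192);
}
```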
Should we zero-out the executable leftover?
I don't think it is necessary to zero out this mmap at all. Due to wasm using a Harvard-style execution model, the leftover memory isn't going to be accessible to the contract. Unless we mess up our VM implementation, in which case we'd have worse problems (e.g. contracts getting access to a source of non-determinism or to private counterparts of keys stored in memory).
Agree that this isn't necessary, but it might still be good for defense in depth -- it just feels scary to have executable memory somewhere in the process whose contents are not explicitly set. Though, this is a very weak opinion. If we do enable huge pages, we could potentially need to zero out a significant chunk of stuff...
I wonder if this makes the situation worse if someone pops us somewhere else. Like, what if there's an RCE in rocks? Seems like not zeroing is no worse even in that case -- if the attacker can get a gadget via non-zeroing of an old contract, they can probably get one via just an explicit contract?
```rust
        if total_len <= self.mmap.len() {
            self.unpublish()
        } else {
            self.mmap = Mmap::with_at_least(total_len).map_err(|e| e.to_string())?;
```
Would be cool to push this error to the caller completely, such that we don't have to worry about internal mmap failing.
For the time being this is still a prototype that uses RWX maps, but it won’t be too onerous to adopt any other mechanism. The primary value is that the user is now responsible for allocating a pool of memory maps ahead of time, and they continue being reused within the same engine, possibly resizing when necessary to fit a larger module (though this is something I’m somewhat on the fence about – it introduces a failure mode related to memory allocation deep inside …)

906.06us is the baseline/best performance we can get when running this benchmark with 10k functions. This is the same …
@Ekleog it would be a good time to review and think about the implications of the new API here.
I think this is largely functionally complete. Compared to the current master we’re looking at a module load performance somewhere along the lines of:
I’m thinking of delaying work on improving this further with memfd (which gives another significant boost) until after we get flat executable formats going.
These methods attempt to make it straightforward to load a module into a store when the engine contained therein is already a dynamic object. While there are some instances of it, for all cases where it matters for us, obtaining a specific type of engine is going to be pretty straightforward (`UniversalEngine` is the only supported engine right now anyway…).
Overall LGTM for the idea! I still have a few concerns around `unsafe` here and there though.
Also, I'm with @matklad that keeping an RX memory map hanging around with user-defined code is probably bad for defense-in-depth. Given we're going to turn that RX map into an RW map as soon as we get it out of the code memory pool, maybe it'd make sense to actually do the RW remap before putting it back into the code memory pool?
This way an attacker could control only an RW zone of memory outside of contract execution, which should be much harder to exploit from eg. a networking library ropchain than an RX zone of memory forever.
It does mean, though, that the idea about CodeMemory keeping an enum of RX/RW should actually happen, so that the RW remap doesn't happen twice for no good reason.
Does that make sense?
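A minimal sketch of the RX/RW-enum idea discussed above, with the protection state tracked in the type so the pool can hold RW maps and each transition happens only once. All names (`Protection`, `CodeMemory`, `make_rw`, `make_rx`) are illustrative, not the PR's actual API, and `Vec` plus a state field stand in for the real mmap and `mprotect` calls:

```rust
// Illustrative sketch: track map protection so that a map returned to the
// pool is remapped to RW exactly once, and published as RX exactly once.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Protection {
    ReadWrite,
    ReadExecute,
}

struct CodeMemory {
    protection: Protection,
    // real implementation: the mmap itself lives here
}

impl CodeMemory {
    fn make_rw(&mut self) {
        // real implementation: mprotect(addr, len, PROT_READ | PROT_WRITE);
        // skipping the syscall when already RW is the point of the enum.
        if self.protection != Protection::ReadWrite {
            self.protection = Protection::ReadWrite;
        }
    }

    fn make_rx(&mut self) {
        // real implementation: mprotect(addr, len, PROT_READ | PROT_EXEC)
        if self.protection != Protection::ReadExecute {
            self.protection = Protection::ReadExecute;
        }
    }
}

fn main() {
    let mut mem = CodeMemory { protection: Protection::ReadExecute };
    // Returning to the pool: remap RW so no stale executable code lingers.
    mem.make_rw();
    assert_eq!(mem.protection, Protection::ReadWrite);
    // Taking it back out to publish freshly compiled code:
    mem.make_rx();
    assert_eq!(mem.protection, Protection::ReadExecute);
}
```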
```diff
@@ -101,4 +97,41 @@ impl dyn Engine {
         None
     }
 }

+    /// Downcast a dynamic Executable object to a concrete implementation of the trait.
+    pub fn downcast_arc<T: Engine + 'static>(self: Arc<Self>) -> Result<Arc<T>, Arc<Self>> {
```
Not sure I understand, why would `Arc::downcast` not work here? AFAICT we could just add `: Any` to the definition of `trait Engine`, and we could get rid of this bit of unsafe code for free.
So the problem with `Arc::downcast` is that it is implemented for `Arc<dyn Any + Send + Sync + 'static>`. This notably doesn’t work for `Arc<dyn Engine*>` and there isn’t an easy way to go between the two, so you cannot actually call `Arc::downcast` on `Arc<dyn Engine>` AFAICT.
Hmm I think having something like this should work?

```rust
trait Engine: Any + Send + Sync + 'static {
    // ...
    fn to_any(self: Arc<Self>) -> Arc<dyn Any + Send + Sync + 'static> {
        self
    }
}
```

And then use `Arc::downcast(engine.to_any())`?
Basically I'm feeling that if there's no way something like this would work, it might mean that this bit of unsafe code, however sound it looks, is actually wrong :/
Also I actually just noticed that `Engine::type_id` is a function implemented by a non-`unsafe`, non-sealed trait, which means that the code is technically unsound because an engine could spoof a fake TypeId. That said, tracing has similar unsoundness in its traits and it's not a big deal in practice, so it's probably not too bad.
> Also I actually just noticed that Engine::type_id is a function implemented by a non-unsafe non-sealed trait, which means that the code is technically unsound because an engine could spoof a fake TypeId.
Implementers wouldn’t be able to access the `private::Internal` necessary to specify the argument type for the function.
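A sketch of the "private argument" sealing trick referred to here, under the assumption that in the real crate the `private` module is not importable by downstream users, so an implementor cannot spell out the signature of `type_id` and therefore cannot override it to return a spoofed TypeId. All names are illustrative (in this single-file sketch the module has to be `pub` to compile, which of course weakens the demonstration):

```rust
use std::any::TypeId;

// In real code this module would be crate-private (or #[doc(hidden)] with a
// crate-internal constructor), making `Internal` unnameable downstream.
pub mod private {
    pub struct Internal;
}

pub trait Engine: 'static {
    // Downstream impls cannot override this default body, because writing
    // the override requires naming `private::Internal` in the signature.
    fn type_id(&self, _: private::Internal) -> TypeId {
        TypeId::of::<Self>()
    }
}

struct UniversalEngine;
impl Engine for UniversalEngine {}

fn main() {
    let e: &dyn Engine = &UniversalEngine;
    // The default body runs with Self = UniversalEngine via the vtable:
    assert_eq!(e.type_id(private::Internal), TypeId::of::<UniversalEngine>());
}
```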
> And then use `Arc::downcast(engine.to_any())`?
Implementing `to_any` would require an unstable feature (trait upcasting) or `unsafe` code along the lines of what is seen here, and would end up returning `dyn Any` in its `Err` variant, which is not great in terms of usability of the API, but it would work.
Hmm so for to_any(), here is what I was thinking of: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=7cedaa843727797e370656ec18459e0f (found by re-looking at the craziest trait object code I ever wrote, erlust)
I had forgotten that a default impl of the function wouldn't work, but that's probably not a problem for us. The advantage being that there's zero unsafe code here. And the drawback that, as you mentioned, it returns `dyn Any` instead of `dyn Engine` upon failure, but that's probably not too bad as it's all `Arc` anyway and we can just keep a clone of the `Arc<dyn Engine>` from before the `to_any` cast.
WDYT?
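A runnable sketch of this `to_any` approach, where each impl provides the trivial upcast itself (since, as noted above, a default impl doesn't work without trait upcasting). `UniversalEngine` and the trait shape are illustrative; the real trait has many more methods:

```rust
use std::any::Any;
use std::sync::Arc;

trait Engine: Send + Sync + 'static {
    // Each concrete engine writes this one-line upcast; zero unsafe code.
    fn to_any(self: Arc<Self>) -> Arc<dyn Any + Send + Sync + 'static>;
}

struct UniversalEngine;

impl Engine for UniversalEngine {
    fn to_any(self: Arc<Self>) -> Arc<dyn Any + Send + Sync + 'static> {
        self
    }
}

fn main() {
    let engine: Arc<dyn Engine> = Arc::new(UniversalEngine);
    // Keep a clone so a failed downcast doesn't lose the `dyn Engine` handle.
    let keep: Arc<dyn Engine> = Arc::clone(&engine);

    let concrete: Arc<UniversalEngine> = engine
        .to_any()
        .downcast::<UniversalEngine>()
        .ok()
        .expect("downcast to UniversalEngine failed");
    // `keep` and `concrete` share the same allocation:
    assert_eq!(Arc::strong_count(&concrete), 2);

    // A wrong target type fails; the Err carries the Arc<dyn Any> back.
    struct NotAnEngine;
    assert!(keep.to_any().downcast::<NotAnEngine>().is_err());
}
```

The failure path returning `Arc<dyn Any + Send + Sync>` rather than `Arc<dyn Engine>` is the usability wart discussed above; keeping the pre-cast clone sidesteps it.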
```rust
impl dyn Engine + Send + Sync {
    /// Downcast a dynamic Executable object to a concrete implementation of the trait.
    pub fn downcast_ref<T: Engine + 'static>(&self) -> Option<&T> {
```
Same here
Same story with `dyn Any` vs `dyn T`.
```rust
        page_size,
    ) + data_sections.iter().fold(0, |acc, data| {
        round_up(acc + data.bytes.len(), DATA_SECTION_ALIGNMENT.into())
    });
```
In a first draft of this comment I was fooled by how the code for this computation is constructed, but it turns out that it's just doing additional needless round_up alignments, meaning it's just overestimating and is correct.
Maybe add a comment saying that this is an over-estimation that does additional aligning at the end of each section type, so other code readers don't get surprised in the same way?
FWIW I wanted to get rid of this entirely somehow, as I found this difficult to verify as well (this is pre-existing code). I think the most plausible approach would be to make “allocate” generic over the writer and call it twice: once with a dummy writer that would just collect the write sizes and spit out the final required size at the end, and a second time with a proper writer.
SGTM!
```diff
@@ -30,7 +30,6 @@ mod code_memory;
 mod engine;
 mod executable;
 mod link;
-mod unwind;
```
Does this change mean we can drop the eh_tables code handling in the previous commit too?
```rust
}

/// Remap the offset into an absolute address within a read-execute mapping.
pub fn executable_address(&self, offset: usize) -> *const u8 {
```
I don’t know. Obtaining an address to the contents of the map is not inherently an unsafe operation; it is the `publish`/`writer` that end up prodding the dragons. So I feel like it is more natural to have the unsafety there, rather than here. Analogous situation as with casting a mutable reference to a pointer vs dereferencing said pointer(s) to make a mutable reference out of it.
Instead, users can obtain a concrete type of engine and use inherent methods to compile with the specific engine.
For the memory maps themselves, for the time being, we’re using a scheme that uses `SHARED` maps, which gives us a large chunk of the speed benefit described in https://kazlauskas.me/entries/fast-page-maps-for-jit. Ideally we would use a memfd based approach, however that turns out to require some significant reworking of how the compiled functions are referenced – for example, we want to apply linking and relocation on the writable view of the functions, but the function references will contain references to the executable mapping right after the `allocate` call. I imagine we won’t get to implementing an improvement here for a little while, and definitely not before we implement flat executable file representations.
This appears to be no longer used in any meaningful way since we no longer have any backends that emit uwtables in the first place.
This makes sure that there isn’t a place with a bunch of user-defined code in memory while the code is not being used in any way.
This is a crude prototype which allows reusing the `CodeMemory` between distinct Artifacts and Instances. This can significantly reduce the overhead of allocating mmaps with the kernel. It turns out that the kernel would still spend a lot of time if requested to change the protection of a mapping from RX to RW.

This overhead is entirely mitigated by using HUGETLB, which is sadly non-portable and is a major pain to enable (it requires a kernel parameter).

With these two things in place the sys time goes down to almost zero when loading contracts into memory, and the overall time of the `deserialize/10000` benchmark is reduced from ~2ms per iteration down to 1.25ms.
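A hypothetical sketch of the pooling scheme the PR describes: code memories are allocated up front, reused across module loads within the same engine, and a reused memory is grown only when a larger module doesn't fit. `Vec<u8>` stands in for the actual mmap, and all names are illustrative, not the PR's API:

```rust
struct CodeMemory {
    buf: Vec<u8>, // real implementation: an mmap, not a Vec
}

struct CodeMemoryPool {
    free: Vec<CodeMemory>,
}

impl CodeMemoryPool {
    /// Pre-allocate `count` memories of `size` bytes each, ahead of time.
    fn new(count: usize, size: usize) -> Self {
        let free = (0..count)
            .map(|_| CodeMemory { buf: vec![0; size] })
            .collect();
        CodeMemoryPool { free }
    }

    /// Take a memory that fits `needed` bytes, growing a reused one when the
    /// module is larger than anything in the pool.
    fn acquire(&mut self, needed: usize) -> CodeMemory {
        let mut mem = self
            .free
            .pop()
            .unwrap_or(CodeMemory { buf: Vec::new() });
        if mem.buf.len() < needed {
            mem.buf.resize(needed, 0); // real implementation: remap larger
        }
        mem
    }

    /// Return a memory for reuse by later module loads in the same engine.
    fn release(&mut self, mem: CodeMemory) {
        self.free.push(mem);
    }
}

fn main() {
    let mut pool = CodeMemoryPool::new(2, 4096);

    let mem = pool.acquire(1000);
    assert_eq!(mem.buf.len(), 4096); // fits; no resize needed
    pool.release(mem);

    let big = pool.acquire(10_000); // reused and grown for a larger module
    assert_eq!(big.buf.len(), 10_000);
    assert_eq!(pool.free.len(), 1);
}
```

The growth path is the failure mode mentioned above: an allocation can now fail deep inside module loading rather than only at pool construction time.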