Should raw pointer deref/projections have to be in-bounds? #319
Comments
There's also the option (which I'm aware you aren't a fan of) of allowing For better or worse, I think most people's intuition is that (I think losing |
I am wondering why dereferencing of |
C does the equivalent of option 3 there. |
That is IMO a bad argument -- the only reason it is a macro is that we have not found a nice syntax we wanted to commit to. And certainly macros are less magic than "built-in syntax".
In C, I hear programmers talk about lvalues quite a lot. So I think that similarly, in Rust, programmers are generally aware of the concept of a place. I think the rest can be solved with better documentation and teaching -- IMO most explanations of these concepts and how they affect program execution are rather bad, since they don't actually follow the syntactic structure of the program.
The question is how to even systematically specify the language (e.g. in a way that can be implemented in Miri) to achieve this. I think we would need to basically distinguish safe from unsafe places and then, conceptually, use a form of type inference so that at the I don't think that is a good idea. If people think |
One reason I prefer the first option is because it allows us to make this kind of code safe (as in, not requiring the addr_of!((*ptr::null::<Struct>()).field) Whether or not that's a good way to implement (ptr as *const u8).wrapping_add(offset_of!(Struct, field)) as *const FieldType The
I'm not sure if also removing autoderef from I believe the first time such an idea came up was in the context of youngspe/project-uninit#1 - that kind of thing feels like it has to "fight the language" to avoid hazards while doing something on behalf of the user. |
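The `wrapping_add`-based projection quoted in the comment above can be sketched as follows. This is a minimal illustration, not the commenter's code: the struct `Pair` and the helper `field_b_ptr` are hypothetical names, and the standard library's `offset_of!` macro (stable since Rust 1.77) stands in for whichever `offset_of!` the comment had in mind.

```rust
use std::mem::offset_of;

// Hypothetical struct, used only for illustration.
#[repr(C)]
struct Pair {
    a: u8,
    b: u32,
}

/// Computes a pointer to `Pair::b` from a possibly dangling `*const Pair`.
/// No place is ever constructed from `ptr`, and `wrapping_add` has no
/// in-bounds requirement, so this is safe code.
fn field_b_ptr(ptr: *const Pair) -> *const u32 {
    (ptr as *const u8).wrapping_add(offset_of!(Pair, b)) as *const u32
}

fn main() {
    // A null base pointer is fine here because nothing is dereferenced.
    let from_null = field_b_ptr(std::ptr::null::<Pair>());
    assert_eq!(from_null as usize, offset_of!(Pair, b));

    let pair = Pair { a: 1, b: 7 };
    let from_live = field_b_ptr(&pair);
    // SAFETY: `from_live` points to the live, aligned `b` field of `pair`.
    assert_eq!(unsafe { *from_live }, 7);
}
```

The trade-off is that the field type has to be spelled out by hand in the final cast, which a place-based projection would infer.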
@eddyb at least with the first option, "unsafe place" here is a concept that only exists for unsafety checking, but not for the actual dynamic behavior of the program or whether code has UB. I like that. :D
I am pretty sure I have seen this discussed (and discussed it myself) way earlier, in 2019 when https://github.com/Gilnaa/memoffset was moved towards "the canonical offset_of in the ecosystem". |
Ehm, sorry, my phrasing was ambiguous. For me, specifically, the idea of making some raw-pointer-based field offsetting safe (and banning the existing autoderef, or at least keeping it Maybe it's also because for |
It is my assertion that worrying about the precise semantics of
// Applied to a raw pointer value.
// It's ok if this syntax isn't transactional and we're semantically
// GEP inboundsing to both `field` and `subfield` because subfields
// advance the offset monotonically, and GEP inbounds effectively
// asserts the start and end are in the same allocation and that the
// allocation spans the range between them.
my_ptr~field~subfield.write(val);
*my_ptr~field~subfield += 1;
and
// Applied to a type
// Same as above but caching the offset, possibly as a const.
let subfield_offset = MyType::<T>~field~subfield;
my_ptr.offset(subfield_offset).read()
Once you have this ergonomic syntax that doesn't mess around with places, it's perfectly fine for Why won't they use it anymore? Because it is already syntactic salt! |
Like you can argue all you want about the precise semantics of |
Because the main reason for violating the current rules is for calculating offsets, maybe the right solution is to provide an That might be enough to allow the current rules to stay as is (as there's not really any reason that I know of to use an invalid pointer other than calculating offsets) |
I don't think the case in rust-lang/rust#93459 would be solved by an offset_of macro, I think it's a broader issue than just offset_of (although that is a component). |
I am not convinced that is true. I was confused by places for a long time but I think I have reached "place enlightenment" some time in 2018 and I think this is something that can be shared -- none of the explanations for this that I saw out there were satisfying, including the one they gave me at university in my first semester. Or maybe I am just delirious and stared into the sun too long, not sure. I certainly should actually write such a "comprehensible explanation of places" before making such big claims, which I haven't done yet, so maybe it is harder than I think. But I also think if places, which are a rather fundamental concept to Rust, are so hard to explain, then we have bigger problems. |
Places are like autoderef/autoref. You can and must specify them somewhere but no one using rust actually needs to understand them really. It's magic that the compiler does to make your code work, and it's fine that it's magic because the borrowchecker will catch anything that goes wrong (in my experience). But we don't do autoref/autoderef for raw pointers because in that context wacky-wild-magic semantics are terrifying and not something we can help you use correctly. I regard places as much the same thing. Now you can't completely eliminate places, but they should definitely be minimized. Most unsafe code has no need to ever actually use places, and just needs to offset and read/write. Code that does want to use places generally only really cares about applying them to the leaf of the access (what my strawman does) and not be the root of the access (what the current system does). Aside: Arguably you also want a postfix deref operator but once you are restricting the place to a leaf, you hardly need anything better than
Because it's more ergonomic and much more advisable because it's mindlessly correct compared to:
I really want to be able to tell unsafe rust programmers: "Here's this totally braindead and mindlessly correct way to do everything, and it's also the most ergonomic way to do it, so you'll want to do it anyway. (And now here's where you can be a clever smartass if you insist...)" |
Fully agreed on that part. :) |
Can't we just make Because people only want to use EDIT: I'm a goof, elichai actually already mostly said this 8 hours ago. |
I rather like the "magic" option. Speaking from a C perspective: I know what an lvalue is, but I intuitively think of it as sort of a 'legal fiction'. That is, formalistically speaking, int x = foo.bar; and int *x = &foo.bar; are both doing the same operation on lvalues (going from the lvalue That's in part because they generate different assembly instructions on most architectures; in part because I find it more intuitive to think of them that way than to imagine the invisible decay operation. But also, there are circumstances where a direct access is valid but taking a pointer is not:
Particularly in the latter two cases, the lvalue has to "remember" the path it was accessed through, which compromises the independence of the two parts of the operation (i.e. lvalue projection and lvalue-to-rvalue decay). |
Only the decay to rvalue vs converting to a pointer results in different assembly. The decay to rvalue would be loading from the pointer calculated by the lvalue expression. Converting to a pointer directly takes the calculated pointer and does nothing. In both cases each part of the lvalue expression results in a pointer calculation.
Rust doesn't have register variables.
Taking a raw pointer using
Doesn't apply to rust. We require |
Yes, but most architectures have a single "load from (pointer plus immediate offset)" instruction rather than needing a "add immediate offset to pointer" instruction followed by a "load from pointer" instruction. I fully realize that these are separate operations for most of the compilation pipeline, and aren't always combined into one assembly instruction. But I was specifically describing intuition as opposed to formalism.
I was talking about C. Upon second look, I guess I misread Ralf's post; for some reason I thought that "In C, I hear programmers talk about lvalues quite a lot." was suggesting that Rust's rules should try to match intuition for programmers coming from C. Instead he is just saying that Rust programmers can learn about places analogously to how C programmers learn about lvalues. That said, my anecdote that I haven't learned to think of C lvalues as 'real' still runs against that argument. And for that matter, I think raw pointers in particular should be designed to have as intuitive semantics as possible to Rust programmers coming from C. Also, one of the cases I mentioned applies to Rust as well:
But if you take a raw pointer, then loading from that pointer is still UB without explicitly using To be fair, I do think this is evidence for |
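The packed-struct case discussed above can be made concrete with a small sketch. The struct `Packed` and the helper `read_value` are hypothetical, and `read_unaligned` is used here as one standard way to load through a possibly misaligned raw pointer; whether that is the exact function the comment named is an assumption.

```rust
use std::ptr::addr_of;

// Hypothetical packed struct: `value` sits at offset 1, so it is
// generally not aligned for `u32`.
#[repr(C, packed)]
struct Packed {
    tag: u8,
    value: u32,
}

/// Loads `value` through a raw pointer to the field.
fn read_value(p: &Packed) -> u32 {
    // `&p.value` would be rejected by the compiler, because a reference
    // must be aligned. `addr_of!` creates a raw pointer without one.
    let field_ptr: *const u32 = addr_of!(p.value);
    // SAFETY: `field_ptr` points into the live `*p`, and `read_unaligned`
    // does not require the pointer to be aligned.
    unsafe { field_ptr.read_unaligned() }
}

fn main() {
    let p = Packed { tag: 1, value: 0xDEAD_BEEF };
    // A direct by-value access is fine: the place "remembers" that it came
    // from a packed struct, so the compiler emits an unaligned load.
    let direct = p.value;
    assert_eq!(direct, 0xDEAD_BEEF);
    assert_eq!(read_value(&p), 0xDEAD_BEEF);
}
```

Replacing `read_unaligned()` with a plain `*field_ptr` would be UB whenever the field happens to be misaligned, which is the asymmetry between direct access and access through a pointer that the comment points out.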
It's worth noting that in practice on most CPUs the difference from these is often very minor, both because they get fused into one by the processor even if they are separate instructions, and because the overhead from the memory access dominates. That said, it's not nothing, and I suppose this may also allow other optimizations in other points (I don't know).
Yes, I have nontrivial C experience (although it's been a while) and don't think of lvalues as real... Honestly, @RalfJung's assertion that "In C, I hear programmers talk about lvalues quite a lot." is quite different than my experience. I suspect it's just a different crowd, but I've never heard a C programmer who wasn't a compiler engineer or language designer talk about lvalues... Perhaps with the exception of someone referencing the C spec, but that's not typically how most folks think about programming IME. |
I'm straying off-topic, but to me it's not about performance but literally just "it looks different". I often have to manually match up C code to generated assembly for one reason or another, so that's relevant to me... |
That is also pretty much how lvalues were explained to me, and I think it is a terrible explanation. The full Rust program including explicit "load"s can then be explained entirely structurally, where the behavior of statements and both place and value expressions is all defined entirely in terms of evaluating their operands in a certain order, and then doing something with the result. Everything is nicely compositional, no need for "magic" or "legal fiction". I think this is also the only reasonable way that places can be treated systematically for use in (automated or interactive) formal verification.
Rust doesn't have Packed structs do complicate things, that is true. Specifically, the result of a place expression is not just a location in memory, but also an "expected alignment", and while that alignment is almost always equal to the alignment indicated by the type of the place, when computing a place to a field of a packed struct we lower this alignment to account for
I don't know basically anything about which combined offset-and-load instructions hardware provides, and I would assume that is true for most Rust programmers. I think it would be a mistake if we decided how to teach (let alone specify) Rust based on things like the details of various ISAs. |
If we treat dereferencable pointers, pointers, and places explicitly as separate objects (using
At the risk of only reiterating the earlier comment with minor rewording: imho this situation alone is confusing enough to warrant a new operator. Such a new operator is only unnecessary if we assert the diagram would commute with it. But this seems unnecessarily restrictive, and concerns three operations being coupled by some rather abstract requirement. (This is at least one root cause for the difficulty in finding good I'd rather see this diagram, with independent semantics of (the two potentially different) projection operations and only involving one of dereference and reference at a time (Without regard to syntax, though I find
(The middle operator being based on
|
In the OP I wrote this:
There is actually a twist here: what about projections into unsized types? I would prefer to avoid making the memory model dynamically compute the 'true' size of a type... but I guess to justify the |
I found this issue while trying to figure out whether the following code is UB:
fn typed_slice<T, const N: usize>(b: &[u8], _: *const [T; N]) -> &[T] {
    ...
}
const P: *const T = NonNull::dangling().as_ptr();
typed_slice(b, addr_of!((*P).field));
I take it that the current answer is that yes, it is UB, and I need to use a My use case is not just finding the offset of a struct field, but also getting its type. In this example, |
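One commonly used way to get a typed field pointer without a dangling base is to project through `MaybeUninit` storage. Whether that is what the comment above had in mind is an assumption, and the names `Header`, `array_len`, and `payload_info` below are hypothetical. The sketch recovers both the field offset and the field's array length from the pointer type.

```rust
use std::mem::MaybeUninit;
use std::ptr::addr_of;

// Hypothetical struct, used only for illustration.
#[repr(C)]
struct Header {
    len: u16,
    payload: [u32; 3],
}

/// Recovers the array length `N` purely from the type of the pointer.
fn array_len<T, const N: usize>(_: *const [T; N]) -> usize {
    N
}

/// Returns (byte offset of `payload`, number of elements in `payload`).
fn payload_info() -> (usize, usize) {
    let storage = MaybeUninit::<Header>::uninit();
    let base: *const Header = storage.as_ptr();
    // SAFETY: `base` points to real (if uninitialized) storage for a
    // `Header`, so the projection is in bounds. `addr_of!` only computes
    // an address and never reads the uninitialized memory.
    let field = unsafe { addr_of!((*base).payload) };
    let offset = (field as usize) - (base as usize);
    (offset, array_len(field))
}

fn main() {
    // With `repr(C)`, `payload` is placed at the first 4-aligned offset.
    assert_eq!(payload_info(), (4, 3));
}
```

The cost is a stack slot the size of the whole struct, which is the main practical reason people reach for a dangling or null base instead.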
Yeah, the |
Also see #350 where I argue that the only UB we should have in the realm of raw ptr deref and projections is that the offset computation must not wrap the address space. |
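As a rough model of a "must not wrap the address space" rule, the check reduces to an overflow test on the address arithmetic. The helper `projection_wraps` is hypothetical, and the precise rule argued for in #350 may differ in detail (for example in how offsets are bounded); this only illustrates the shape of the condition.

```rust
/// Returns `true` if moving `offset` bytes forward from `ptr` would wrap
/// around the end of the address space. Allocation bounds play no role.
fn projection_wraps<T>(ptr: *const T, offset: usize) -> bool {
    (ptr as usize).checked_add(offset).is_none()
}

fn main() {
    // A pointer near the top of the address space; it is never dereferenced.
    let high = (usize::MAX - 3) as *const u8;
    assert!(projection_wraps(high, 4));
    assert!(!projection_wraps(high, 3));

    // A null base with a small offset does not wrap, so offset_of-style
    // code would be fine under such a rule.
    assert!(!projection_wraps(std::ptr::null::<u64>(), 8));
}
```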
Another knob to tweak that came up in past LLVM discussions is whether |
I just want to add another value in the counter of "people who have been confused by this", as was explained to me on this zulip thread.
As Lokathor mentioned, a main motivator of me falling in this trap was doing things that are better done with |
The main issue in your mental model seems to be thinking of |
Ah, you're right, and I'll edit my comment. |
What happened to @Gankra's suggestion to use |
This issue is about our existing In fact this issue should probably be closed since the summary is outdated -- the |
Currently, when you write `addr_of!((*ptr).field)`, `ptr` has to be a sufficiently aligned and dereferencable pointer (meaning it has to point to actually allocated memory), with the size/alignment determined by the pointee type.
Formally speaking, this stems from the fact that the `*` operator, which turns a value (of pointer or reference type) into a place, requires that the pointer be aligned and dereferencable. This is true regardless of the context in which the `*` operator is used, i.e., regardless of what happens to this place after it has been constructed.
Pragmatically speaking, this reflects the fact that `addr_of!((*ptr).field)` is lowered to a `getelementptr inbounds` in LLVM. That said, the rules in the reference go further than what is required for codegen -- they require `ptr` to be dereferencable for the full size of the type (not just the `field`), and that even if we do `addr_of!(*ptr)`.
This situation frequently causes confusion, so maybe we want to change it. At least it'd be good to have a central place for discussion and reference, so I created this issue.
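To make the rule as stated in this summary concrete, here is a minimal sketch. The struct `S` and the helper `read_b_through_raw` are hypothetical. The first case satisfies the rule; the commented-out case is the kind of code the rule, as described here, declares UB even though nothing is loaded.

```rust
use std::ptr::addr_of;

// Hypothetical struct, used only for illustration.
#[repr(C)]
struct S {
    a: u32,
    b: u64,
}

/// Projects to `b` through a raw pointer that is aligned and points to a
/// live `S`, then loads the field.
fn read_b_through_raw(s: &S) -> u64 {
    let ptr: *const S = s;
    // SAFETY: `ptr` comes from a reference, so `*ptr` is an aligned,
    // dereferencable place and the projection stays in bounds.
    let pb: *const u64 = unsafe { addr_of!((*ptr).b) };
    // SAFETY: `pb` points to the initialized `b` field of `*s`.
    unsafe { *pb }
}

fn main() {
    let s = S { a: 1, b: 2 };
    assert_eq!(read_b_through_raw(&s), 2);

    // Under the rule as stated in this summary, the following would be UB,
    // because `*dangling` is not a dereferencable place:
    //
    //     let dangling = std::ptr::null::<S>();
    //     let pb = unsafe { addr_of!((*dangling).b) };
}
```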
Cc @eddyb @joshtriplett @thomcc @Lokathor @cuviper rust-lang/reference#1000
Instance(s) of confusion in practice: rust-lang/rust#93459 (comment)
So, what could we do? Let me collect some proposals that have come up.
Remove restriction from `*` entirely
The most radical option is to entirely remove the clause which imposes a rule on `*`, and instead just require that when a place is actually read from (via `move`/`copy`) or stored to (via `=`), then it has to be sufficiently aligned and dereferencable as determined by the type of the place. (We currently do not need such a rule, since as an invariant all places that are constructed are aligned and dereferencable.)
This means we have to remove `inbounds` from `getelementptr` in place projection lowering when the "source" of this place projection is a raw pointer. We can still add `inbounds` for references or other places (locals/statics) since we know those to be dereferencable for other reasons (except for this twist). (We can not add it in cases like `&(*ptr).field` since `ptr` might be e.g. 4 bytes before the beginning of the allocation and `.field` offsets the ptr by 4.)
This has the potential of making LLVM alias analysis and other optimizations less effective. In particular, `getelementptr` might not just go out-of-bounds, it would even be allowed to overflow, so there are some tricks LLVM might not be able to pull any more.
It is certainly my favorite option, because it maximally reduces the amount of UB -- the remaining rule (about actual loads/stores) is pretty much unavoidable and typically expected by programmers.
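The "4 bytes before the beginning of the allocation" scenario can be modelled in today's Rust with wrapping pointer arithmetic, which behaves like a `getelementptr` without `inbounds`. This is only a sketch: `Hdr` and `project_from_before` are hypothetical names, and the wrapping helpers stand in for what the place projection would do under this proposal.

```rust
use std::mem::offset_of;

// Hypothetical layout: `field` lives 4 bytes after the start of `Hdr`.
// No `Hdr` value is ever created; the type is only used for its layout.
#[repr(C)]
struct Hdr {
    pad: u32,
    field: u32,
}

/// Builds a `*const Hdr` that starts before `target`'s allocation, then
/// projects to `field`, which lands exactly on `target`.
fn project_from_before(target: &u32) -> *const u32 {
    let field_addr: *const u32 = target;
    // `base` is out of bounds of the allocation and must not be read from.
    let base = (field_addr as *const u8).wrapping_sub(offset_of!(Hdr, field)) as *const Hdr;
    // Wrapping arithmetic may pass through out-of-bounds addresses as long
    // as the final pointer is back in bounds before it is used for access.
    (base as *const u8).wrapping_add(offset_of!(Hdr, field)) as *const u32
}

fn main() {
    let value: u32 = 42;
    let projected = project_from_before(&value);
    assert_eq!(projected, &value as *const u32);
    // SAFETY: `projected` is back in bounds of `value` and keeps the
    // provenance of the original reference.
    assert_eq!(unsafe { *projected }, 42);
}
```

With an `inbounds` projection the intermediate `base` would already make the computation invalid, which is why the flag cannot be kept for projections that start from a raw pointer.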
Still prevent overflows
We could also say that `*ptr` on a raw pointer is UB if adding `size_of_val_raw(ptr)` to `ptr` would overflow, but allocation bounds do not matter. This would let us emit a hypothetical `getelementptr nowrap`, if LLVM were to add support for such a flag in the future. I doubt this is any less surprising than our current rules, TBH, but it is less likely to be violated by the typical kind of code people write. It is also more complicated, since we still have to explicitly add the rule which says that loading from and storing to a place requires it to be dereferencable and aligned.
Check inbounds on place projection
If we want to keep codegen unchanged, we could still relax our rules: we could allow `addr_of!(*ptr)`, and we could make it so that `addr_of!((*ptr).field)` only needs to be inbounds for the actual offset that is performed.
For example, on top of the "remove restrictions" proposal, we could have an extra rule for each place projection which says that this projection needs to be in-bounds.
This has the advantage of keeping our current codegen and quite obviously aligning with the intuition many people seem to have for place projections. It also has some disadvantages though:
- `addr_of!((*ptr).field)` continues to be UB if `ptr` is entirely dangling, so e.g. people wanting to write offset_of-style code still have to navigate around a footgun.
Magic?
#319 (comment)
New syntax
#319 (comment)