
TypeId: use a (v0) mangled type to remain sound in the face of hash collisions. #95845

Closed
wants to merge 3 commits

Conversation

eddyb
Member

@eddyb eddyb commented Apr 9, 2022

This implements the strategy described in #77125 (comment), resulting in:

  • collision resistance: TypeIds become as sound as v0 symbol names
    • that is, either can only be subverted by bypassing Cargo and linking together artifacts that share a crate identity (both name and disambiguator hash) but not the same sources
      • on some platforms we may be able to instruct the linker to error on e.g. differing data in a symbol whose name depends only on crate identity, but in which we place the SVH (which depends on the crate's whole source/AST/HIR)
      • however, improving symbol name soundness is not in the scope of this PR, only TypeId parity with the best tool we have for serializing Rust identities in the face of separate compilation (i.e. v0 symbol names)
  • breaking code transmute-ing TypeIds to u64
  • breaking code calling TypeId::of at compile-time (still unstable) then either transmute-ing it to an integer, or using it in a pattern (turns out we had a test for the latter)
  • (optionally) TypeId + rustc_demangle (which std already depends on) could offer access to the type name at runtime, which has been asked for before

This implementation adds a new field to TypeId:

struct TypeId {
    hash: u64,
    mangling: &'static TypeManglingStr,
}

The &TypeManglingStr type is like &str but "thin", with the usize length being stored before the str data (an optimization to avoid making TypeId more than double in size).
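
The "thin" layout described above can be sketched as follows. This is an illustrative model only: `ThinStr` is a hypothetical stand-in for the PR's `TypeManglingStr`, and the trailing bytes are faked with a fixed-size buffer so the example stays self-contained (the real type would be an unsized length-prefixed string).

```rust
use std::{mem, str};

// Hypothetical sketch of a length-prefixed string with a thin reference:
// the length lives behind the pointer instead of next to it.
#[repr(C)]
struct ThinStr {
    len: usize,
    data: [u8; 8], // in the real layout, the bytes would trail the header unsized
}

fn thin_as_str(s: &ThinStr) -> &str {
    // the length is read from behind the pointer, so `&ThinStr` itself is a
    // single machine word, unlike the two-word fat `&str`
    str::from_utf8(&s.data[..s.len]).unwrap()
}

static MANGLING: ThinStr = ThinStr { len: 3, data: *b"i32\0\0\0\0\0" };

fn main() {
    // a thin reference is one word; a fat `&str` is two
    assert_eq!(mem::size_of::<&ThinStr>(), mem::size_of::<usize>());
    assert_eq!(mem::size_of::<&str>(), 2 * mem::size_of::<usize>());
    assert_eq!(thin_as_str(&MANGLING), "i32");
}
```

This is why the optimization keeps TypeId at only double its old size rather than triple (hash word plus one pointer word, instead of hash plus pointer plus length).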

To avoid comparing the string contents of the mangling field every time, a == b has:

  1. a "fast reject" path for a.hash != b.hash
    • i.e. a and b are definitely different if their hashes differ
  2. a "fast accept" path for ptr::eq(a.mangling, b.mangling)
    • i.e. a and b are definitely the same if their manglings are the same in memory
    • this is just one address comparison thanks to the use of the "thin" &TypeManglingStr type, with its length being stored behind the pointer
  3. a fallback path: compare the actual mangling string contents

While 1. and 2. could come in either order (as hash and mangling equalities are intrinsically linked), for simplicity (and to be able to rely on #[derive] a bit), the order above is the order in which this PR currently checks them.

Ideally, the fallback path won't be hit outside of a hash collision, and this should be the case today whenever the same crate (or at least CGU) contains the codegen of both TypeId::of calls that produced the two TypeIds in the first place (which will cause their mangling fields to point at the same place in read-only static memory).

However, to get global deduplication (across all crates/CGUs), we'd have to start using certain linker features (discussed in e.g. #75923), which I've left out of this PR.

As it stands, this implementation is comparable to the typeinfo implementation libcxx calls NonUnique, which seems to be used at least on Windows (and its existence suggests we may never be able to guarantee deduplication, only do it on a best-effort basis).


Fixes #10389

cc @oli-obk @RalfJung @KodrAus @mystor

@rustbot rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 9, 2022
@rust-highfive
Collaborator

Some changes occurred to rustc_codegen_gcc

cc @antoyo

Some changes occurred to the CTFE / Miri engine

cc @rust-lang/miri

@rust-highfive
Collaborator

r? @lcnr

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 9, 2022
@RalfJung
Member

RalfJung commented Apr 9, 2022

That sounds awesome! Will this fix #10389?

@@ -39,6 +39,87 @@ pub(crate) fn const_caller_location(
ConstValue::Scalar(Scalar::from_pointer(loc_place.ptr.into_pointer_or_addr().unwrap(), &tcx))
}

pub(crate) fn const_type_id<'tcx>(
Member

@RalfJung RalfJung Apr 9, 2022

I think I'd prefer this helper to be in the interpret module, not const_eval -- it is needed for all interpreter instances after all; const_eval is meant specifically for the CTFE instance of the general interpreter infrastructure.

EDIT: It is needed also for regular codegen, but I think my point remains -- const_eval should be for things that are specific for CTFE. interpret should not usually import anything from const_eval.

Member

You're also creating your own ecx here, and returning this as a ConstValue -- why all that?

I guess an alternative might be to treat this like vtable_allocation?

Member

type_name::alloc_type_name also seems similar in spirit?

Member Author

Oooh I see, I was modelling it after const_caller_location but that's only here because it's a query instead of an intrinsic. Should it have its own module under interpret::intrinsics like caller_location does?

Member

Yeah, note in particular that the caller_location implementation in interpret does not call the const_eval code. It is not a nullary intrinsic so it takes slightly different paths. Though probably the const_caller_location query should be made more like the vtable query, which is not inside const_eval either.

Should it have its own module under interpret::intrinsics like caller_location does?

Yeah I guess that would make sense.
It would almost make sense to move these queries for intrinsics that are implemented via interpreter stuff but also used for codegen in a separate module rustc_const_eval::intrinsics, i.e., outside interpret. @oli-obk any thoughts?

Member

Looking at it, caller_location actually has a impl<'mir, 'tcx: 'mir, M: Machine<'mir, 'tcx>> InterpCx<'mir, 'tcx, M> block so having that inside interpret makes sense. type_name, OTOH, doesn't really do anything with the interpreter, it is very much like the vtable code (but the two are in totally different locations, which we should probably fix). "give me the vtable for type T" is not technically an intrinsic but not very different, either, is it?

Contributor

I think cleaning this up is orthogonal.

Except for mplace_field we don't even need an InterpCx and can just create and intern the Allocations directly. If we added more convenience operations for writing data of a specific struct to an Allocation, we could probably stop using InterpCx for caller_location, too.

Imo we should just merge this PR and then work on cleaning these up now that we have multiple intrinsics to clean up in a similar way.

Member

@RalfJung RalfJung Apr 11, 2022

I agree cleaning up the existing stuff is a separate task.

But we shouldn't have anything load-bearing inside the const_eval module, so I do think that this code should be moved somewhere else before landing. I don't care strongly which of the existing precedent it follows, but none of it has load-bearing code in const_eval, so let's not add yet another different style of implementing such an intrinsic. :)

(Even caller_location, which was mentioned as a template, has its load-bearing code in interpret::intrinsics::caller_location.)

Contributor

this is still relevant, should move this function.

@eddyb
Member Author

eddyb commented Apr 9, 2022

Some backstory on why this took so long: #77125 (comment) is from August last year, and by September I already had half of this implemented - but around late October I ended up with a dead NVMe in the desktop the branch was on. Only a few days ago I accidentally found a copy of it on the build server I typically use (git gc would've eaten other things, but this was checked out in a workdir). I saved the branch and moved on, but today I mentioned it in the context of the provide APIs, looked at it again, and it had way more implemented than I thought, so I decided to finish it after all.

@rust-log-analyzer

This comment was marked as resolved.

@rust-log-analyzer

This comment was marked as resolved.

@eddyb
Member Author

eddyb commented Apr 9, 2022

Will this fix #10389?

Good point, yeah, I believe it would (cc @workingjubilee) - I'll go add it to the description.

@eddyb
Member Author

eddyb commented Apr 9, 2022

Heh, code in the wild seems to be relying on hasher behavior - at least I've been meaning to revert that to what it was before (for performance reasons) so we don't need to break anyone who does that (as suspicious as it is).

impl fmt::Debug for TypeId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // FIXME(eddyb) perhaps attempt demangling `mangling`.
        f.debug_struct("TypeId").finish_non_exhaustive()
Member

This is a debug format regression. Previously it was possible to tell different TypeId apart.

Member Author

Hmm, weird, I don't remember doing this but I guess I noticed TypeId was infoleaking and reacted by doing this? Ideally we would print the mangling but I'm not sure we should commit to that without a more involved proposal.

Contributor

@lcnr lcnr Apr 25, 2022

I think showing TypeId { <mangling> } is what we want here.

I don't feel like this will add forward-compat issues we should care about, and being able to differentiate TypeIds is quite important. There won't be too many people relying on the concrete Debug string returned, and if they do, I really don't mind breaking their code ^^

@rust-log-analyzer

This comment was marked as resolved.

#[inline]
fn eq(&self, other: &Self) -> bool {
    // FIXME(eddyb) should `likely` be used around the `ptr::eq`?
    if ptr::eq(self, other) {
Member Author

Note to self: this could be CTFE-friendly if it used guaranteed_eq instead.

@eddyb
Member Author

eddyb commented Apr 9, 2022

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Apr 9, 2022
@bjorn3
Member

bjorn3 commented Jul 6, 2022

Changing from &'static [u8; N] to &'static TypeData is trivial and shouldn't break any code that wouldn't be broken already by adding extra fields to struct TypeData([u8; N]) as both have the same in-memory representation and TypeData would be a private struct.

@CAD97
Contributor

CAD97 commented Jul 6, 2022

It would be an interesting project to try to structure TypeData such that it both contained the mangled type name and strip could remove the type name data, but that's a later project.

@bjorn3
Member

bjorn3 commented Jul 6, 2022

Putting all mangled type names in a specific section and then using strip --remove-section=.rust_ty or something like that has a small chance of working. It is likely to result in the dynamic linker giving an error as relocation targets couldn't be resolved. Maybe weak symbols would work? Those are not very portable though. I don't think it is possible to convince strip to remove specific sections that are loaded at runtime without requiring extra arguments.

@eddyb
Member Author

eddyb commented Jul 7, 2022

Ok, so it looks like everyone agreed on what this crypto hash could look like. I'll assume there is no more to discuss and there's an actionable path to move this PR forward. Please feel free to switch back when ready for a review, thanks.

I don't think it makes much sense to repurpose this PR for implementing the cryptographic hash variant, and I'd rather leave the comment history here be about the mangled type variant.

Also it isn't a strong priority for me wrt limited time (as per #95845 (comment) this was stalled for over half a year in the first place), but if anyone wants to repurpose the few relevant bits, feel free to get them from this PR.

A couple of smaller things could also be done (not sure if I'll be able to help with either):

  • manual PartialEq impl for TypeId (i.e. turning off "structural match")
    • this would disallow match-ing on TypeIds (forcing the use of == instead), so as not to box ourselves into a corner if address equality is ever needed in the future
  • increasing TypeId's size (and cratering that change)
    • should help with breaking misuses (just like comments on this PR describe)
    • it doesn't have to be actual new data, could just be a dead field
      • this is library-only, e.g. TypeId(type_hash::<T>()) -> TypeId(type_hash::<T>(), 0)
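
The library-only "dead field" idea from the last bullet could look like the sketch below. `type_hash_of_i32` here is a hypothetical stand-in for the `type_hash::<T>()` intrinsic, and the hash value is made up; only the change of shape matters.

```rust
use std::mem;

// Stand-in for the compiler's `type_hash::<T>()` intrinsic (made-up value).
fn type_hash_of_i32() -> u64 {
    0x1234_5678_9abc_def0
}

#[derive(PartialEq, Eq, Debug)]
pub struct TypeId(u64, u64); // was: pub struct TypeId(u64);

fn type_id_of_i32() -> TypeId {
    // before: TypeId(type_hash::<T>())
    // after:  TypeId(type_hash::<T>(), 0) - the second field is dead data,
    // present only so size_of::<TypeId>() grows and transmutes to u64 break
    TypeId(type_hash_of_i32(), 0)
}

fn main() {
    // the dead field doubles the size, loudly breaking `transmute` to u64
    assert_eq!(mem::size_of::<TypeId>(), 16);
    assert_eq!(type_id_of_i32(), TypeId(0x1234_5678_9abc_def0, 0));
}
```

Because the extra field is always zero, equality and hashing behavior are unchanged; only the size (and hence transmute misuse) is affected.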

In the end, I'm sorry for wasting people's time with an approach that I misjudged as having stronger support (from both the formal side and the precedent of C++ RTTI) than it actually did.

@eddyb eddyb closed this Jul 7, 2022
@workingjubilee
Member

I think the lang question here has been addressed: we're happy with a full (non-truncated) cryptographic hash. The compiler already has other things that depend on non-colliding hashes.

The compiler's internal implementations of these can be fixed without having to consult users, however. The results of the TypeId hashing will be public-facing.

Many Rust PRs have been historically stalled and pushed back on for the concern of potentially breaking users, even in the absence of evidence of any users who will actually be broken, even against the weight of correctness. Even, again, for situations where there was no guarantee in the public interface, merely "someone may have come to rely on it". So I do not understand why this is so cheerfully a part of discussion.

And it feels like being able to write something in Coq that ignores that the TypeId hash may collide does not automatically make our code correct in the face of actual implementations. After all, people are talking about the implications of the contents of the binary, now, and worrying what that will imply and what people will come to depend on in those. By then, the Rust language has long since left the picture and can provide very little control in actuality because the artifact has been fashioned. And Rust is included in dynamic libraries that cannot simply compile all this away. If we're concerned about people looking at TypeId symbols, why not about looking at our TypeId hashes, also?

Meanwhile, the concern of large symbols is primarily evinced by recursive symbols, which are subject to compression techniques.

I am absolutely certain that in 2040, assuming the Rust project still stands by then, if the hash we use is broken, someone is going to open a PR for upgrading the hash and everyone is going to hem and haw about these things then. It feels doubtful to me that, in the absence of well-defined ways to upgrade std's interfaces that bypass these concerns, we should be using something as vulnerable as a hash. Any hash.

@scottmcm
Member

@workingjubilee I think there's a huge difference between "hey the transmute stopped working" and "the value in use is different now". I can't imagine there being any concern about changing the values generated here -- they can already change today when the rustc details change how the current unsound hash is generated.

@m-ou-se
Member

m-ou-se commented Jul 20, 2022

I'm not sure I understand the worry behind a hash algorithm being broken in the future. When a hash is broken, it doesn't suddenly start producing accidental collisions for small strings like type names, it just means that it's now computationally feasible to find a collision, if you try really hard. Would that even matter for this use case? If it was possible to calculate a 'malicious' typename (probably of several kilo- or megabytes) that would collide with another, would that change anything? It seems to me that if an attacker can insert a malicious type name into the code, they can do much more damage by inserting much simpler things into the code.

@CAD97
Contributor

CAD97 commented Jul 20, 2022

It is an attack vector (defense in depth, etc) when using TypeId as a plugin system.

However, any case where this would be an actual attack vector does include trusting some property about the malicious actor, either because they're a transitive native dependency, or they're a runtime plugin and you're trusting them to give you a unique TypeId even if the plugin itself is sandboxed.

(Part of the reason this is the case is that it's not sufficient to get a TypeId collision to get an unsound API; you also need to be downcasting based on the colliding TypeId.)

So yes, I believe it is completely reasonable to restrict this to a safe/unsafe protection boundary against mistakes in safe code, rather than against any sort of deliberate attack vector.

(Perhaps a collision attack would make it easier to get a malicious crate through review... but that requires not just a hash collision, but a meaningful hash collision, which is much more difficult to find even with a broken hash.)

Taking that as an axiom, a reasonably-sized hash is certainly sufficient.

(But that also said, I think it is also reasonable to choose a hash size designed to be resilient against a hash collision in the entire ecosystem rather than just within a compilation. This prevents our malicious actor from just finding a preexisting hash collision and adding a dependency on it to create UB.)

@Kixiron
Member

Kixiron commented Jul 20, 2022

It feels doubtful to me that, in the absence of well-defined ways to upgrade std's interfaces that bypass these concerns, that we should be using something as vulnerable as a hash. Any hash.

This feels dangerously close to "if we can't totally fix it there's no point in making things any better" which really sucks. If we can easily improve the state of things and leave the door open for future improvements then we 100% should

@compilerintrinsics

I would really hate for TypeId to bloat in size, since it is being used heavily in highly-dynamic code and assumed to be small. Small structs in my code often carry them around as-is. Can we have it remain a single word in some way? I understand there are rogue crates illegally transmuting TypeIds, but certainly there are better ways to deal with them that do not make the rest of us pay for it too.

Besides, those who are going to transmute illegally will likely do it again, since what they are really trying to accomplish is compile-time comparison of TypeId, which will go away once we have some legal way to sate their needs.

@RalfJung
Member

RalfJung commented Aug 1, 2022

A small TypeId is simply unsound, as we do have real-world examples of collisions. So that is not an option.

@bjorn3
Member

bjorn3 commented Aug 1, 2022

Also the issue with rogue crates illegally transmuting TypeIds is that they assume 64bit TypeIds. We have to increase the size of TypeId one way or another to avoid collisions. Rogue crates are making this harder. They are not the reason to change it in the first place.

@CAD97
Contributor

CAD97 commented Aug 1, 2022

TypeId could be just &'static TypeIdInner, and it would remain small. However, the reason for making it (u64, &'static TypeIdInner) is twofold:

  • Loudly break anyone using transmute, and
  • Have a fast reject and hash which do not do a pointer chase.

You can argue that the first isn't necessary, but the second is very desirable.

If you want to argue against size_of::<TypeId>() becoming as big as a u128 (which is the maximum anyone is proposing), #99189 is probably now the place to do it.

@eddyb
Member Author

eddyb commented Aug 2, 2022

Have a fast reject and hash which do not do a pointer chase.

Note that the fast accept also doesn't require indirection! With the right linker integration, we should be able to almost always hit one of these two:

  • a.hash != b.hash (definitely !=, "fast reject")
    • only inaccurate (hash equal, for unequal types) for hash collisions
  • ptr::eq(a.inner, b.inner) (definitely ==, "fast accept")
    • only inaccurate (inner differing in address, for equal types) when we weren't able to get the (dynamic) linker to deduplicate the backing constant data (or like this PR which doesn't do that at all)

Oh and when being passed around, such two fields will e.g. go into two registers (as they make TypeId into a "scalar pair"), without pessimizing codegen a lot.

So for the most part, you would have double the size and double the comparisons - that shouldn't be as big a deal as "O(1) operation is now O(N) on average".

@compilerintrinsics

To be frank, I think TypeId being 2-word large is still fine. But looking at the discussion, I worry the same arguments might be used to enlarge it yet again to 256 bits in the future, which certainly will be too large to put inline.

Anyhow, I don't have a better idea to argue against this effectively. Maybe pointer-tagging some bits from the hash could help in some way to keep the struct small but the number of bits available to be used makes it seem negligible.

@CAD97
Contributor

CAD97 commented Aug 7, 2022

Nobody is arguing to make TypeId any bigger than (u64, &'static TypeIdInner). Rest assured that making TypeId inline bigger again will meet significant resistance if it's ever proposed.

Labels
perf-regression: Performance regression.
perf-regression-triaged: The performance regression has been triaged.
S-waiting-on-author: Status: This is awaiting some action (such as code changes or more information) from the author.
T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.
T-lang: Relevant to the language team, which will review and decide on the PR/issue.
Development

Successfully merging this pull request may close these issues.

Collisions in type_id