Collisions in type_id #10389
Comments
I'm not entirely sure how feasible it is for a program to have We could in theory have very cheap inequality among types, and then have an expensive equality check. Something which may walk the respective Either way, I don't think that this is a super-pressing issue for now, but I'm nominating to discuss whether we want to get this done for 1.0. This could in theory have serious implications depending on how frequently |
Ah, it was already nominated! |
Why not compare an interned version of the type data string? (i.e. what is currently passed as data to be hashed, possibly SHA-256 hashed first) The linker can be used for interning by emitting a common symbol with the type data string as name and taking its address, and otherwise the same thing can be done manually in a global constructor. This way it's always a pointer comparison, and there are no collisions. |
I don't know how node id values are generated, but assuming that they are generated sequentially, this particular collision is not realistic. However, it's not hard to find collisions for more realistic node id values by picking particular values for the crate hashes: assert!(hash_struct("a2c55ca1a1f68", 4080) == hash_struct("138b8278caab5", 2804)); The key thing to consider isn't the number of node id values, though: it's the total number of type id values. Some quick (hopefully correct) math shows that there is a 0.01% chance of a collision once there are around 60 million type id values. That's still a pretty large number of type id values for a somewhat low probability of a collision, though. So, it's unclear to me how big a deal this is for the Rust 1.0 timeframe. It all depends on what the acceptable probability of a collision is. |
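The 0.01% figure above can be checked with the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n ids of b bits each. The sketch below assumes the ids behave like uniformly random values; the function name `collision_probability` is made up for this illustration.

```rust
/// Approximate probability of at least one collision among `n` values drawn
/// uniformly at random from a space of 2^bits values (birthday approximation).
fn collision_probability(n: f64, bits: i32) -> f64 {
    let space = 2f64.powi(bits);
    // Exponent of the approximation: -n(n-1) / (2 * space).
    let x = -(n * (n - 1.0)) / (2.0 * space);
    // 1 - e^x, computed via exp_m1 so that very small results keep precision.
    -x.exp_m1()
}

fn main() {
    let n = 60_000_000.0;
    println!("64-bit ids:  p = {:e}", collision_probability(n, 64));
    println!("128-bit ids: p = {:e}", collision_probability(n, 128));
}
```

With 64-bit ids and 60 million types the result is on the order of 1e-4, which matches the estimate in the comment. Widening the id to 128 bits pushes the same estimate many orders of magnitude lower.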
When I saw that @alexcrichton proposed using a hash, my first reaction was "collision!" but then I thought "...but exceedingly unlikely to occur in practice". I think this is not a matter of imminent destruction but if we can leverage the linker or some other scheme to avoid this danger, we should -- and perhaps we should just go ahead and mark the current scheme as deprecated and just plan on finding a replacement scheme. |
A cryptographic hash designed for this purpose (larger output) would be enough. Although, a larger output would be more expensive to compare (four |
We don't need to deal with this right now. P-low. |
How relevant is this issue today? I think that it's all the same, but am not sure. |
It's 64-bit so collisions are likely with enough types (consider recursive type metaprogramming) and it doesn't have any check to bail out if one occurs. Bailing out is not a very good solution anyway, because it pretty much means that there's no way to compile the program, beyond using a different random seed and hoping for the best. It's a crappy situation. |
Note that "hoping for the best" by iteratively changing the seed might work with overwhelmingly large probability after very few iterations. |
use std::any::Any;

fn main() {
    let weird : [([u8; 188250],[u8; 1381155],[u8; 558782]); 0] = [];
    let whoops = Any::downcast_ref::<[([u8; 1990233],[u8; 798602],[u8; 2074279]); 1]>(&weird);
    println!("{}",whoops.unwrap()[0].0[333333]);
}

Actually a soundness issue. playground: http://is.gd/TwBayX |
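To make the soundness concern concrete: a checked downcast decides whether a pointer cast is allowed purely by comparing type ids. The sketch below is a simplified stand-in for that logic, not the standard library's code; `checked_downcast` is a hypothetical helper.

```rust
use std::any::{Any, TypeId};

/// Hypothetical helper: reinterpret `value` as `&T` only if the type ids match.
fn checked_downcast<T: Any>(value: &dyn Any) -> Option<&T> {
    if value.type_id() == TypeId::of::<T>() {
        // SAFETY: this is only sound if equal TypeIds imply equal types.
        // If two distinct types ever share an id, this cast reinterprets
        // memory at the wrong type, all from safe calling code.
        Some(unsafe { &*(value as *const dyn Any as *const T) })
    } else {
        None
    }
}

fn main() {
    let x: u32 = 7;
    println!("{:?}", checked_downcast::<u32>(&x));
    println!("{:?}", checked_downcast::<i32>(&x));
}
```

The whole safety argument rests on the id comparison. That is why a collision between two type ids amounts to a memory-safety hole rather than just a wrong answer.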
I'd like the lang team to devote a little time to this now that we are post 1.0. Nominating |
OK, lang team discussed it, and our conclusion was that:
|
I was wondering about a design where we do something like:
compare the string pointers for equality (to give a fast equality check). If that fails, compare the hashes for inequality (to give a fast inequality check). If THAT fails, compare the strings for content (to handle dynamic linking). Although re-reading the thread I see @bill-myers may have had an even more clever solution. |
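A rough sketch of that three-step comparison follows. Everything in it is an assumption made for illustration: the `SketchTypeId` type, its layout, and the use of FNV-1a as a stand-in hash. It is not how the compiler represents `TypeId`.

```rust
/// Stand-in 64-bit hash (FNV-1a); any hash would do for this illustration.
fn fnv1a(s: &str) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in s.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hypothetical type id: a hash stored in-line plus the full type string.
#[derive(Clone, Copy, Debug)]
struct SketchTypeId {
    hash: u64,
    name: &'static str,
}

impl SketchTypeId {
    fn new(name: &'static str) -> Self {
        SketchTypeId { hash: fnv1a(name), name }
    }
}

impl PartialEq for SketchTypeId {
    fn eq(&self, other: &Self) -> bool {
        // 1. Same pointer (and length): fast equality.
        if std::ptr::eq(self.name, other.name) {
            return true;
        }
        // 2. Different hashes: fast inequality.
        if self.hash != other.hash {
            return false;
        }
        // 3. Same hash, different pointer: compare contents. This covers both
        //    duplicated strings (dynamic linking) and genuine hash collisions.
        self.name == other.name
    }
}

fn main() {
    let a = SketchTypeId::new("crate_a::Foo");
    let b = SketchTypeId::new("crate_a::Bar");
    println!("a == a: {}, a == b: {}", a == a, a == b);
}
```

Under this design a hash collision only costs a string comparison. It can never make two different types compare equal.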
@nikomatsakis putting the hash of the data at the start is a good idea, to increase the probability that we catch unequal things quickly. It seems to me like @bill-myers' approach composes fine with that strategy. |
I doubt the "problem" is limited to Any. You can probably confuse the compiler just as effectively by colliding hashes for symbol mangling, or many other things. What is the objective here? Since Rust is not a sandbox language, I don't think "protect memory from malicious programmers" should be one of our goals (we should document the types of undefined behavior that can be hit in safe code, and fix the ones that are possible to hit by accident; if someone is determined to break the type system, they can already write an unsafe block, or use std::process to launch a subprocess that ptraces its parent and corrupts memory). |
Thanks to: https://www.reddit.com/r/rust/comments/5pfwjr/mitigating_underhandedness_clippy/dcrew0k/ This example works on Beta and Nightly. |
@nikomatsakis Should this be marked as I-unsound? I've done so for now, since that seems to be the general conclusion a couple of times by different people, but please unmark if I'm wrong. |
You'd probably want a scheme where a part of the hash is stored in-line, to avoid having to always read from the pointer. But yes that's one of a bunch of possibilities. |
For variant with TypeId being a |
It's my understanding that TypeIds are allocated lazily but I could be wrong. |
From the linked comment it's not clear whether that was a necessary or sufficient condition. So weaker functions might not have been rejected. Also, my understanding is that siphash claims to be suitable for cryptographic use when the key is secret. In other words it is a cryptographic hash under the assumption that any process generating collisions is oblivious to the key. Which depending on the threat model could be interpreted as "it behaves as a cryptographic hash, for our threat model". So, as I have argued previously, it would be great if we could get a statement of what the threat model is, not just on a particular solution. Additional aspects that should be considered:
|
For variant with TypeId being a &'static str, we have the fast-equality path of reference equality, wouldn't we also have a fast-inequality path when the strings are of unequal length?
Sure, if we include the length in-line. I didn't say it was an &str though, it could use a null terminated string and a hash instead of the length to make it more likely that we detect inequality.
|
It is obviously not necessary to use a hash function at all, since one could do a perfect comparison based on type_name. So I think "this is the lower bound for the guarantees provided" is the only reasonable interpretation. (The alternative is to believe that the lang team intended to rule out anything that provides stronger guarantees than a cryptographic hash function, which does not seem reasonable.) The context there was considering various weaker options like a truncated hash function, which were rejected by that decision.
I have not seen such a notion of "cryptographic hash function under assumptions about the process generating the inputs" before; is there literature defining and analyzing this idea in more detail?
Those seem like a good starting point for assembling a summary for a lang team design meeting asking whether they want to revise their previous decision. |
I believe I am simply restating what a keyed PRF is when viewed from a different angle. If it's collision-resistant when the key is not known to the attacker (e.g. like chosen plaintext attacks against a remote hasher) then that's no different from the process generating the inputs being oblivious to the key. Which is another way of saying that it has cryptographic strength if your threat model only includes natural threats (people generating lots of Rust code). Which, yes, is a fairly weak threat model. |
Has there been any analysis how good of a PRF SipHash-1-3 is? |
https://eprint.iacr.org/2019/865.pdf is from 2019 that gives an overview. For the 1-x variants they cite internal collisions that have been found in https://eprint.iacr.org/2014/722.pdf (also cited in #29754 (comment)), which says
So AIUI even for those reduced-round versions they haven't found attacks that would be better than the birthday bound. |
My dumb take: the current implementation is fine—collision probability is low in theory, and there are no reports of one. The take that would satisfy lang team and probably most people: just use SHA256… |
Note that here we're talking just about TypeId, and making incremental compilation more secure against collisions is another, separate question. The perf impact of changing the hash for TypeId only should be negligible. If someone could assemble a summary of the wider use of hashes in Rust where collisions could have soundness impact, that would be a great basis for future discussion. Otherwise we'll just keep going in circles. From statements made above, my current understanding is that for non-incremental builds, type_id is the only "soundness-critical" hash. |
I think there are these three main issues, and some additional minor ones:
|
Doing an empty patch release and yanking the old version should be enough to stop the collision. The crate version is hashed into the value cargo passes to |
v0 symbol mangling losslessly encodes types and has some compression built into it already. #95845 had a working implementation of this, afaict.
This is my intuition too: I don't think performance is a real concern for the TypeId case. Binary size might be more of a concern. Could also be made configurable via a |
Strong 👍 for this -- I found the issue summary very helpful. I'd like to encourage others to avoid directly commenting and instead limit yourselves (for the moment) to comment on what is not represented in this summary (ideally with a suggested edit), so that we can close this issue and restart afresh from a better starting point. |
From the comments so far since my summary, there's multiple discussions that should likely go into separate issues:
I think we're fairly sure that (1) can be done without significant perf impact, given that TypeId hashing is not a super common operation, and there are various proposals for how to make most TypeId comparisons fast. Binary size impact has yet to be determined, this requires someone to actually implement a prototype. (There's an old PR at #95845 for one of the variants that doesn't rely on a hash at all, which is the least size-efficient option. As far as I can see, perf looked completely fine, though there were some complaints about having type names in the final binary. But there's also people that want a TypeId -> TypeName function. But anyway this is all t-compiler territory.) However, the compiler uses hashes in many places, so one question that is still unclear to me is whether just making TypeId use a stronger hash would actually fix anything, or whether it just shifts the problem to a different hash. It would be good to get input from people familiar with how our query system works on whether, for non-incremental builds, collisions in the stable hasher can lead to problems. Because if the answer is "yes", then (1) on its own seems a bit pointless, and we should evaluate the cost of making all the critical hashes more secure to really get the proper data for discussion (2). If the answer is "no", then I honestly see no good reason to not do (1); it seems like we can get protection against a stronger threat model without significant cost. At this point (2) would be about the threat model for incremental builds. The lang team has not decided on this, and existing data indicates that using a stronger hash here has a massive performance cost, but I don't know what possibilities exist to reduce that cost. So -- @michaelwoerister and anyone else who knows the query system, can you help us answer this question? :)
but it's not clear to me whether this applies to all builds or just incremental builds. |
Is a linkage capable of bringing in more types than it has bytes of machine code, or (if dynamic) types that will survive when it's unloaded? If not, we can use sequential IDs starting with the linkage address. |
Personally, I think eddyb chose a good approach in #95845 by reusing the symbol mangling scheme, as that has the same requirements of strong collision freedom. If we decide to use a strong, wide hash then I would still go
That way the implementation does not need to rely on
As said above, going through v0 mangling would basically fix this issue. The only hash that occurs in v0 is the
We already mostly do that by mixing the compiler version into
Unless I'm overlooking something all incr. comp. specific hashes only need to be stable between successive incremental builds. So we can easily harden our use of SipHash there by generating random keys and then caching these in the incr. comp. cache. |
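As a rough illustration of that idea, the sketch below generates a pair of SipHash keys once and reuses them for every fingerprint, the way keys stored in the incremental cache would be reused across sessions. All of it is assumed for illustration: `CachedKeys` is a hypothetical type, the standard library's deprecated `SipHasher` (SipHash-2-4, 64-bit output) stands in for the compiler's own hasher, and borrowing randomness from `RandomState` is a shortcut, not a vetted key-generation method.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
#[allow(deprecated)]
use std::hash::SipHasher;

/// Hypothetical keys that would be written to, and reloaded from, the cache.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct CachedKeys {
    k0: u64,
    k1: u64,
}

impl CachedKeys {
    /// Draw fresh keys for a new cache. std has no RNG API, so this borrows
    /// the per-process randomness behind `RandomState` (an assumption of
    /// this sketch, good enough only for illustration).
    fn generate() -> Self {
        let draw = || RandomState::new().build_hasher().finish();
        CachedKeys { k0: draw(), k1: draw() }
    }

    /// Keyed fingerprint of `data`; stable for as long as the keys are reused.
    #[allow(deprecated)]
    fn fingerprint(&self, data: &[u8]) -> u64 {
        let mut hasher = SipHasher::new_with_keys(self.k0, self.k1);
        hasher.write(data);
        hasher.finish()
    }
}

fn main() {
    let keys = CachedKeys::generate(); // first build: create and "cache" keys
    let reloaded = keys;               // later build: keys read back from cache
    let data = b"some query result";
    println!("stable: {}", keys.fingerprint(data) == reloaded.fingerprint(data));
}
```

Because the keys never leave the cache, code written without knowledge of them cannot be tailored to collide. The fingerprints still stay stable between successive incremental builds, which is the property the comment asks for.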
Great, thanks @michaelwoerister! I have opened #129016 for the questions around incremental compilation. If I missed an aspect of this, please bring it up in that issue. For the people arguing that for type_id, we should consider a PRF (or even a weakened one like SipHash-1-3) to be good enough since we don't expect Rust programmers to try to exploit the compiler, I would suggest you file a new issue with the arguments for that and nominate it for t-lang discussion. Personally, since it seems like #129014 can be fixed without major downsides, I don't see a good case for weakening the requirement here.
@michaelwoerister would it make sense to file this as a possible-improvement issue? Maybe it'd be better if you did this since you could point at where in the compiler that would even happen. I will then close this issue, as further discussion should happen in the existing or to-be-created successor issues. Thanks all for the discussion! |
I'll look into opening an issue. |
I opened rust-lang/unsafe-code-guidelines#525 for potential hashing concerns related to |
The implementation of type_id from #10182 uses SipHash on various parameters depending on the type. The output size of SipHash is only 64-bits, however, making it feasible to find collisions via a Birthday Attack. I believe the code below demonstrates a collision in the type_id value of two different ty_structs: