Relax validation #254
Comments
I strongly agree with this.
The spec actually doesn't even mention alignment. Also, the fix in #253 doesn't really accomplish the intended effect of "The byte buffer needs to be entirely indexed, and cannot contain holes." What that's probably trying to say is that the tensors must be arranged so there's no space in between them.

Right now the implementation sorts the tensors by element size before writing out the data. Thanks to the padding in the header, this works out so that the tensors are aligned: if the 32-bit tensors are aligned, then the 16-bit ones probably will be, and the 8-bit tensors that follow will be as well. That seems to be the thinking anyway, though possibly that expectation could be violated by things like odd numbers of elements. However, none of this is in the format specification; it's just an implementation detail of the official crate.

I don't really understand this part either: "This prevents the creation of polyglot files." The wording sounds like it's intentionally trying to forbid polyglot files, but why?
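If it helps, here is a minimal sketch of the sorting argument (not the crate's actual code; the tensor names and sizes are invented for the example). Since dtype sizes are powers of two, writing tensors from largest element size to smallest keeps every tensor's relative offset a multiple of its own element size, even with odd element counts, provided the data section itself starts at an aligned position.

```rust
/// Toy tensor description: a name, its element size in bytes, and its element count.
struct TensorDesc {
    name: &'static str,
    elem_size: usize,
    n_elems: usize,
}

fn main() {
    // Hypothetical tensors; in the real format the dtypes come from the JSON header.
    let mut tensors = vec![
        TensorDesc { name: "bias", elem_size: 1, n_elems: 3 },  // e.g. an 8-bit type
        TensorDesc { name: "embed", elem_size: 4, n_elems: 7 }, // e.g. f32, odd count
        TensorDesc { name: "ln", elem_size: 2, n_elems: 5 },    // e.g. bf16
    ];

    // Largest element size first: every tensor placed earlier has a byte length
    // (elem_size * n_elems) that is a multiple of the current tensor's element
    // size, because dtype sizes are powers of two. So relative offsets stay aligned.
    tensors.sort_by(|a, b| b.elem_size.cmp(&a.elem_size));

    let mut offset = 0usize; // relative to the start of the data section
    for t in &tensors {
        println!(
            "{:>6}: relative offset {} (multiple of {}: {})",
            t.name,
            offset,
            t.elem_size,
            offset % t.elem_size == 0
        );
        offset += t.elem_size * t.n_elems;
    }
}
```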
Because alignment is not required. Alignment just speeds things up, that's it. Also, on GPU, alignment doesn't matter as much because you're just sending buffers to the GPU anyway.
This is currently done.
Which function are you talking about? Usually, as a user, you don't have to care about alignment; at least that's the goal.
If there is a good rationale for it, I'm willing to see it.
This library was recently audited for security, and contiguity is kind of required to prevent polyglot files. Polyglot files are not a security hole on their own, but they're a vector, and I think it's good if we can close that hole. It could be a safetensors file also being a valid zip file, hitting some unzip flaw. In any case, contiguity is really easy to get currently, so I don't see any reason to remove it.
What is the threat model you are using to audit this? Why should I be worried about "polyglot files"?
Everything will be made public soon; we had an external audit. As I said, polyglot is not a threat on its own, but it opens up doors. On a non-security note, having extra chunks of useless data in a file is not really nice.
What about having extra chunks of not-useless data? Stuff like vocabulary. Also, you can put whatever you like as tensor data, including a PKZIP file. So it's still a polyglot :)
Some architectures actually require data to be aligned, as far as I know.
I'm not the one who mentioned that, but the idea is probably that if something is aligned to 64 bytes then it's going to be aligned for all reasonable smaller values. It's also unlikely there's anything that's going to require more than 64-byte alignment in the near future. So if you align everything to 64 bytes, you're aligned for most/all architectures and pretty future-proof, at the expense of wasting a trivial amount of disk or memory.
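A rough sketch of that 64-byte approach (this is how ggml-style alignment could look, not what the safetensors crate currently does; the helper name and constant are just illustrative):

```rust
/// Round `offset` up to the next multiple of `align` (which must be a power of two).
fn align_up(offset: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (offset + align - 1) & !(align - 1)
}

fn main() {
    const ALIGN: usize = 64; // comfortably covers current SIMD/arch requirements

    // Hypothetical tensor byte lengths.
    let lens = [2usize, 4, 7, 4096];

    let mut offset = 0usize;
    for len in lens {
        offset = align_up(offset, ALIGN); // pad with at most 63 bytes
        println!("tensor of {len} bytes starts at {offset}");
        offset += len;
    }
}
```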
You can't enforce that even if you want to. There's nothing stopping me from passing a slice from the middle of a file to the deserialization function. You can't even stop me from using the official crate to create those files, because I can just assemble the bytes myself. Also, like iacore mentioned, even without that kind of thing you can't stop polyglots that have their metadata at the end of the file, unless you do something really ridiculous like padding the end of the file with junk so it can't represent some other format's metadata.
Like I mentioned, you can't stop that even if you want to. The only thing you can do is make SafeTensors less convenient to use. I also really can't understand making changes based on someone deciding to... what, run an unzip tool on a safetensors file?

Maybe the thing you're worried about is the initial length bytes in the header matching the magic for some other file format? If so, the right way to deal with that is to make an actual new SafeTensors version that has an initial magic part; there's a reason why most file formats have some kind of identification at the start of the file.

Not that caution from some random person on the internet means much, but I strongly suggest not making knee-jerk changes without discussion, or changing specifications in a way that will break existing applications that implement the format to the (existing) spec. Pulling the rug out from under developers who used your published specification is going to make a lot of people very leery of adopting the standard or recommending it, myself included.
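To make the "slice from the middle of a file" point concrete, here is a hedged sketch. It assumes the Rust crate's usual SafeTensors::deserialize entry point that takes a byte slice (which is how I understand the API); the file name and offsets below are made up. If the slice itself is a well-formed safetensors blob, the surrounding bytes are never inspected.

```rust
use std::fs;

use safetensors::SafeTensors;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical container file that embeds a safetensors payload somewhere
    // in the middle, surrounded by unrelated bytes.
    let container = fs::read("bundle.bin")?;

    // Offsets chosen by whoever produced the container; the library only ever
    // sees the slice, so it cannot know or care what lies outside of it.
    let (start, end) = (4096usize, container.len() - 1024);
    let payload = &container[start..end];

    // Error printed via Debug to keep the sketch self-contained.
    let tensors = SafeTensors::deserialize(payload).map_err(|e| format!("{e:?}"))?;
    for (name, view) in tensors.tensors() {
        println!("{name}: {:?} {:?}", view.dtype(), view.shape());
    }
    Ok(())
}
```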
I think some ARM chips, and RISC-V. They trap on unaligned access.
Hey, I sense a lot of animosity here. I merely wanted to help by pointing out that what you were trying to do was not permitted by the current crate; sorry if it felt like knee-jerking. I really like the idea of having llm written in Rust, and I more than empathize with your goals (I complain multiple times a day about Python now).
Provided there are good arguments, the safetensors spec (and therefore the codebase) can relax things a bit before there is a 1.0 (for instance if it allows much better alignment/speed without any additional risk). In this particular case, you seem to want to save a vocabulary in the file.
I'm not sure what you're referring to. On disk, there is no alignment. A file on disk is a sequence of bytes, full stop. If you're doing memory mapping, the start of the mapping is page-aligned anyway.
You are fine then; neither function cares about alignment. Everything operates on byte buffers and returns &[u8] for the tensor location. For instance, here is how the casting can be done: https://github.com/coreylowman/dfdx/blob/main/src/tensor/safetensors.rs#L88-L102
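In case the link goes stale, here is a minimal, alignment-agnostic way to turn a tensor's &[u8] into f32 values: decoding per element with from_le_bytes sidesteps alignment entirely. This is the general idea of the linked dfdx code, not a copy of it.

```rust
/// Decode a little-endian f32 buffer element by element. This never assumes
/// the byte slice is 4-byte aligned, so it works on any &[u8] returned for
/// a tensor, whether it comes from an mmapped file or not.
fn bytes_to_f32(bytes: &[u8]) -> Vec<f32> {
    assert!(bytes.len() % 4 == 0, "length must be a multiple of 4");
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    // A deliberately unaligned view into a larger buffer: bytes 3..11 hold
    // the little-endian encodings of 1.0f32 and 2.0f32.
    let raw = [9u8, 9, 9, 0, 0, 128, 63, 0, 0, 0, 64, 9];
    let view = &raw[3..11];
    println!("{:?}", bytes_to_f32(view)); // [1.0, 2.0]
}
```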
Sure, but then you'll get an error.
No, you cannot prevent polyglot files for anything and everything, but you can restrict things to make them much less easy to exploit. Also, contiguity means that you know the files cannot be "too" large. I think that's a good property: the file size is roughly the size of the model weights.
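For illustration, here is a sketch of the kind of check that contiguity makes possible (simplified stand-ins, not the crate's actual types or logic): given the (begin, end) byte ranges parsed from the header, verify they tile the data section exactly, which also means the file cannot be much larger than the tensors it claims to hold.

```rust
/// Check that the tensor byte ranges cover the data section with no gaps
/// and no overlaps. `ranges` are (begin, end) offsets relative to the
/// start of the data section, as they would appear in the header.
fn check_contiguous(mut ranges: Vec<(usize, usize)>, data_len: usize) -> Result<(), String> {
    ranges.sort_by_key(|&(begin, _)| begin);
    let mut expected = 0usize;
    for (begin, end) in ranges {
        if begin != expected {
            return Err(format!("gap or overlap before offset {begin}"));
        }
        if end < begin {
            return Err(format!("invalid range {begin}..{end}"));
        }
        expected = end;
    }
    if expected != data_len {
        return Err(format!(
            "data section is {data_len} bytes but tensors only cover {expected}"
        ));
    }
    Ok(()) // file size = header + data, and the data is exactly the tensors
}

fn main() {
    assert!(check_contiguous(vec![(0, 8), (8, 20)], 20).is_ok());
    assert!(check_contiguous(vec![(0, 8), (12, 20)], 20).is_err()); // hole
}
```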
Magic numbers have their purpose, but there is no need to add a magic number anywhere. A lot of file formats don't have one either.
100% noted; lots of changes in the spec happened for exactly that reason, so that there was less ambiguity, and that was a major outcome of the audit.
No offense, but why do you want to put the vocabulary of a model in the same file? Tokenizers are generally reused across multiple models, making them good candidates for a single file shared by many models, so splitting the tokenizer into a different file makes a lot of sense to me. Different models have widely different tokenization schemes, which may not adapt to whatever saving format you have in mind currently. I know ggml stores everything in the same file, but they seem to have made breaking changes at least two times since.
In a similar spirit, the format is aimed to be relatively agnostic.
I'll write a longer response that addresses your other points, but I just wanted to get this part out as soon as possible.
(I assume you're talking about me.) Sorry you got that impression, it's definitely not what I intended to convey. I'm a direct person and I say what I mean in a direct way. There's no animosity coming from me whatsoever. This is a technical disagreement, not a personal one (if there is in fact a disagreement). No matter what you choose to do, at most I'll just be disappointed that I can't recommend the format anymore.
I might be misunderstanding, but I just want to be very clear: I am not affiliated with the rustformers project.
I wasn't saying any knee-jerk actions had occurred; I was just suggesting it would be good to allow time for discussion/consideration. Also, I wasn't (and am not) really concerned about how one specific implementation handles it. It's a simple format that is easy to implement: there's no need to rely on that crate for anyone who needs behavior it doesn't currently provide. My concern is with the format specification.
I don't think it's clear that the specification of the format is considered to be in development and can change without warning.
I'm personally just talking about it in general. The format is defined as a header with metadata that describes the offsets and lengths of the tensor data. This is pretty flexible, because after the header, as long as those offsets/lengths are valid, it doesn't matter what else is in the file. There are applications like storing vocabulary, but in general it's just something people can probably find uses for in the future. That's one of the great things about open source: people find creative ways to do stuff.
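For readers who haven't seen the layout being discussed: the file starts with an 8-byte little-endian length, followed by that many bytes of JSON metadata (dtype, shape, and data_offsets per tensor, plus an optional "__metadata__" entry), followed by the byte buffer those offsets index into. Below is a minimal reader sketch that skips all validation and uses serde_json just to look at the header; the file name is hypothetical.

```rust
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file name.
    let bytes = fs::read("model.safetensors")?;

    // First 8 bytes: little-endian length of the JSON header.
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&bytes[0..8]);
    let n = u64::from_le_bytes(len_bytes) as usize;

    // Next n bytes: JSON mapping tensor names to dtype, shape and data_offsets.
    let header: serde_json::Value = serde_json::from_slice(&bytes[8..8 + n])?;

    // Everything after that is the byte buffer the data_offsets index into,
    // relative to its start.
    let data = &bytes[8 + n..];

    println!("header: {header:#}");
    println!("data section: {} bytes", data.len());
    Ok(())
}
```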
Isn't allowing that the reason why there's logic to add padding when writing the header? And also the reason why the tensors get written out ordered by the element size, to ensure that each tensor begins at an aligned position?
The point of
I'm not sure what your point is. So assuming that
I think you might have misunderstood. Those functions just take a plain byte slice, so I can pass them a slice from the middle of a file I control.
To enforce that, you would have to change the safetensors format from an (apparently) open standard to a black box that users can only interact with through your interface. Obviously that is going to make it much, much less appealing for people to adopt.
The reason I brought that up is because I was guessing the exploit you were concerned about involved arranging for the initial length bytes to match some other file format's magic (possibly confusing other tools like file managers).
Not the person who brought that up, but there are decided advantages to having everything in the same file. That way you can ensure stuff like the vocabulary list actually matches the model that's running. A user only needs to download or manage a single file, etc.
Those two things aren't really related. You can store everything in the same file and preserve compatibility, or not. You can keep everything separate and still break compatibility. The breaking changes in GGML are mainly because they don't really care about preserving compatibility.
Someone looking at using the file format without your software is going to go by the spec, not one specific implementation though. If you change the spec later on, you risk breaking everything that was implemented to that spec. I wrote my own Safetensors format loader: I looked at the specification to do that. I didn't go through the source code of any specific implementation and then try to replicate its internal behavior. Why would I? If that was necessary, then the spec doesn't actually define the file format.
Sorry about the animosity. I think safetensors is the best format so far for storing tensor weights, being extensible and simple. If we don't fix the problems it has now, I'm afraid those problems will end up in many repositories.
CTranslate2 models are used for 1-to-1 translation. The tokenizer tokenizes an input string into strings. Each model has a vocabulary (mapping from string to index). I don't know why it's designed like that.
Let's fix those! Having part of a given file be undefined seems like a big issue to me. (What should you do with it? If it should be ignored, then why have it in the first place?)
100%, which is why by default I tend to err on the side of being too strict when writing files in this crate, and try to be over-restrictive in the spec for the same reason. It's always easier to become less strict later, but once you are less strict it's impossible to go back without breaking things.
If it wasn't explicitly written that it was allowed, it just wasn't specified. It was underspecified, and this is where doubt occurs.
Exactly how it's done. This is viewed as a performance optimization, an implementation detail.
This is actually great and 100% intended. I think there are no issues with that. Discrepancies between readers are bound to happen. The format is trivial on purpose, so that for all intents and purposes one should be able to spin up their own code pretty fast. The issue here is that the spec was lacking on some aspects of the format, which made them unspecified and free to interpretation (you thought it was OK to have arbitrary data in the buffer; this crate says it's not OK). Both are correct with regard to the previous spec, and that can cause issues down the line if writers are implemented differently with regard to that unused part. So it's good to make it more explicit.
Why? You don't need to do anything about it. Your implementation of loading/saving those files doesn't need to try to preserve that data or anything like that. The only problem is you're adding constraints that didn't exist in the specification of the format originally. And then changing the format retroactively to align with the behavior of one specific implementation. If your crate just tolerated any files that met the specification that would be 100% fine. The spec says there's a header and the metadata that specifies the positions and lengths of the tensor data. The rest is undefined: there's no constraint on alignment, whether the tensors can or can't overlap, what order they need to exist in the file, whether there can be gaps in between them.
I don't understand. The point of a specification is to define the format. If some constraint isn't specified in the specification, then there isn't a constraint. For example, suppose I publish the spec for a file format described as something like: two numeric values, a first value followed by a second value.
Someone else comes along, finds the spec, and decides it meets their use case. They develop an app that can load and save those files. Then later on you say, "Oh, it was underspecified. You can actually only put prime numbers in the second value." Now everything that was written to the spec is instantly invalidated. The first time, people might just sigh, adjust their code, regenerate files if necessary, etc. Then after a while you say, "We found some random tools might have issues dealing with files that have even numbers in the first value. Now the first value can only contain odd numbers. Hey, our implementation of the format always did it that way." Wouldn't that be weird? Either the spec defines the format or it doesn't.
It's physically possible, but you aren't "allowed" to write something that has undefined behavior. It's an error if you do so. There's even a pretty well known (and pretty trivial) example of how to create undefined behavior even from safe code.
But not if you retroactively add constraints and invalidate other implementations. In that case, someone could have written their code to the spec, published files that meet that spec, and later have those files/implementations become useless.
If a writer has a problem with that, then it's not operating according to the spec. Right? That's a bug in the writer. If a reader has a problem with files that were written according to the spec, then that's a bug in the reader. That's an issue the developer of the reader/writer needs to fix, not a problem with the spec.
Original discussion
The current implementation requires tensors to be stored without gaps and aligned. I don't think the contiguity requirement is necessary. Furthermore, I don't think the alignment is always mathematically possible. See code here.
In the following example, A is [1]bf16 and B is [1]f32. As you can see, B is not properly aligned. You can put B before A, but that requires some packing algorithm to rearrange the data. ggml's approach is to align everything by 64 bytes (max alignment on amd64 and whatnot).
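To spell the arithmetic out (a toy calculation, not crate code): with A stored first, B's relative offset is 2, which is not a multiple of 4; swapping the order fixes this case, which is the packing/sorting question from the discussion above. Whether relative offsets also translate into aligned absolute file positions depends on where the data section starts, hence the header padding mentioned earlier.

```rust
fn main() {
    // The example above: A is one bf16 (2 bytes), B is one f32 (4 bytes).
    let a_len = 2; // 1 element * 2 bytes
    let b_len = 4; // 1 element * 4 bytes

    // Storing A first: B begins right after A, at relative offset 2.
    let b_offset = a_len;
    println!("A first: B at {b_offset}, 4-byte aligned: {}", b_offset % 4 == 0); // false

    // Storing B first (largest element size first): both stay aligned.
    let a_offset = b_len;
    println!("B first: A at {a_offset}, 2-byte aligned: {}", a_offset % 2 == 0); // true
}
```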