
Relax validation #254

Closed · iacore opened this issue May 22, 2023 · 14 comments

@iacore
Contributor

iacore commented May 22, 2023

Note
To anyone who has had trouble with this official parser: write your own parser! The "safetensors" format is very simple, and you can add your own extensions with your own parser.

Original discussion

The current implementation requires tensors to be stored without gaps and aligned. I don't think the contiguous requirement is necessary. Furthermore, I don't think both requirements can always be satisfied at the same time. See code here.

In the following example

  • A is [1]bf16
  • B is [1]f32
1 char = 1 byte

--------
AABBBB

As you can see, B is not properly aligned: its start offset (2) is not a multiple of 4. You can put B before A, but that requires some packing algorithm to rearrange the data.
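
A quick sketch of that arithmetic (illustrative code, not from the safetensors crate):

```rust
// Tensors packed back-to-back with no gaps; check whether each start
// offset is a multiple of its element size.
fn main() {
    // (name, element size in bytes, element count): A is [1]bf16, B is [1]f32
    let tensors = [("A", 2usize, 1usize), ("B", 4, 1)];

    let mut offset = 0;
    for (name, elem_size, count) in tensors {
        println!(
            "{name}: starts at byte {offset}, {}aligned to {elem_size} bytes",
            if offset % elem_size == 0 { "" } else { "NOT " }
        );
        offset += elem_size * count;
    }
    // A starts at byte 0 (aligned); B starts at byte 2, which is not a multiple of 4.
}
```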

ggml's approach is to align everything by 64 bytes (max alignment on amd64 and whatnot).

@KerfuffleV2

I strongly agree with this.

and aligned.

The spec actually doesn't even mention alignment. However, since being mmapable is at least an implied feature (and the existing interface is hard to use with mmapping, at least with the Rust crate), it's probably reasonable to assume that the tensor start offsets should be aligned. It would be better if the specification defined this explicitly, though, if it is in fact a requirement.

Also, the fix in #253 doesn't really accomplish the intended effect:

"The byte buffer needs to be entirely indexed, and cannot contain holes." — I have to pass the function a &[u8]. Even if I wanted to, how could it contain holes or not be indexable? I guess if I was on embedded hardware or running on metal the space of memory might be able to be set up in a way it wasn't contiguously addressable but other than that...

What it's probably trying to say is that the tensors must be arranged so there's no space in between them. Right now the implementation sorts them by element size before writing out the data. Due to the padding in the header, this works out so that the tensors are aligned. I.e. if 32-bit tensors are aligned, then I guess 16-bit ones probably will be too, and 8-bit tensors that follow would be as well? That seems to be the thinking anyway, but possibly that expectation could be violated by stuff like odd numbers of elements.

However all of this is not specified in the format specification, it's just an implementation detail in the official crate.
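
For reference, the packing order described above amounts to roughly this (an illustrative sketch, not the crate's actual code):

```rust
// Write the widest-element tensors first so that, if the data section itself
// starts at an aligned address, every later tensor also starts at a multiple
// of its own element size.
fn main() {
    // (name, element size in bytes, element count)
    let mut tensors = vec![("q", 4usize, 3usize), ("v", 1, 7), ("k", 2, 5)];
    tensors.sort_by(|a, b| b.1.cmp(&a.1)); // descending element size

    let mut offset = 0;
    for (name, elem_size, count) in &tensors {
        assert_eq!(offset % elem_size, 0, "{name} would start misaligned");
        println!("{name}: offset {offset} ({elem_size}-byte elements)");
        offset += elem_size * count;
    }
}
```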

I don't really understand this part either: "This prevents the creation of polyglot files."

The wording sounds like it's intentionally trying to forbid polyglot files, but why?

@Narsil
Collaborator

Narsil commented May 22, 2023

The spec actually doesn't even mention alignment.

Because alignment is not required. Alignment speeds things up, that's it.
Files already exist without alignment being forced.
Also, if alignment needs to be done differently on different machines, it's going to become tricky.

Also, on GPU, alignment doesn't matter as much because you're just sending buffers to the GPU anyway.

requires some packing algorithm to rearrange the data.

This is currently done.

"The byte buffer needs to be entirely indexed, and cannot contain holes." — I have to pass the function a &[u8]. Even if I wanted to, how could it contain holes or not be indexable? I guess if I was on embedded hardware or running on metal the space of memory might be able to be set up in a way it wasn't contiguously addressable but other than that...

Which function are you talking about ? Usually, as a user, you don't have to care about alignment, at least that's the goal.

ggml's approach is to align everything by 64 bytes (max alignment on amd64 and whatnot).

If there is a good rationale for it, I'm willing to see it.

I don't think the contiguous requirement is necessary.

This library was recently audited for security, and contiguity is kind of required to remove polyglot files. Polyglot files are not a security hole on their own, but they are a vector, and I think it's good if we can close that hole.
Basically, it allows attackers to send a polyglot file which, if crafted correctly, could start to hit a secondary reader's flaws.

It could be a safetensors file also being a valid zip file, hitting some unzip flaw.

In any case, the contiguity is really easy to get currently, so I don't see any reason to remove it.

@iacore
Contributor Author

iacore commented May 22, 2023

What is the threat model you are using to audit this? Why should I be worried about "polyglot files"?

@Narsil
Collaborator

Narsil commented May 22, 2023

What is the threat model you are using to audit this? Why should I be worried about "polyglot files"?

Everything will be made public soon, we had an external audit.
The threat vector is quite minute in my very personal opinion (compared to code execution), but it exists nonetheless. But the fix is simple enough that I don't see why we should keep the hole open.

As I said, polyglot is not a threat on its own, but it opens up doors.

On a non-security note, having extra chunks of useless data in a file is not really nice.
If there are alignment or performance aspects where small relaxations like that (like 64-byte alignment everywhere) could help, then by all means we could relax a bit to include those, but so far it hasn't been needed.

@iacore
Contributor Author

iacore commented May 22, 2023

What about having extra chunks of not useless data? Stuff like vocabulary.

Also, you can put whatever you want as tensor data, including a PKZIP file. So it's still polyglot :)

@KerfuffleV2

@Narsil

Alignment speeds things up, that's it.

Some architectures actually require data to be aligned, as far as I know.

Which function are you talking about ?

read_metadata and also deserialize there.

If there is a good rationale for it, I'm willing to see it.

I'm not the one who mentioned that, but the idea is probably that if something is aligned to 64 bytes then it's going to be aligned for all reasonable smaller values. It's also unlikely there's anything that's going to require >64 byte alignment in the near future.

So if you align everything to 64 bytes you're aligned for most/all architectures and pretty future-proof, at the expense of wasting a trivial amount of disk or memory.
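
For illustration, the padding that 64-byte alignment costs is just a round-up per tensor (hypothetical helper, not part of safetensors):

```rust
// Round a start offset up to the next multiple of `alignment`.
fn align_up(offset: usize, alignment: usize) -> usize {
    (offset + alignment - 1) / alignment * alignment
}

fn main() {
    assert_eq!(align_up(0, 64), 0);
    assert_eq!(align_up(1, 64), 64);
    assert_eq!(align_up(130, 64), 192);
    // Worst case: 63 bytes of padding per tensor.
}
```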

This library was recently audited for security, and contiguity is kind of required to remove polyglot files.

You can't enforce that even if you want to. There's nothing stopping me from passing a slice from the middle of a file to the read_metadata or deserialize functions I mentioned. I can put whatever I want before and after that section of the file.

You can't even stop me from using the official crate to create those files because I can just use serialize and write the data wherever I want, including into the middle of some other file.

Also, like iacore mentioned even without that kind of thing you can't stop polyglots that have their metadata at the end of the file unless you do something really ridiculous like padding the end of the file with junk so it can't represent some other format's metadata.

It could be a safetensors file also being a valid zip file, hitting some unzip flaw.

Like I mentioned, you can't stop that even if you want to. The only thing you can do is make SafeTensors less convenient to use.

Also, I really can't understand making changes based on someone deciding to... what, call unzip on a .safetensors file? Trying to load a .zip file with SafeTensors?

Maybe the thing you're worried about is if the initial length bytes in the header represent the magic for some other file format? If so, the right way to deal with that is to make an actual new SafeTensors version that has an initial magic part — there's a reason why most file formats have some kind of identification at the start of the file.

Not that caution from some random person on the internet means much but I strongly suggest not making knee-jerk changes without discussion or changing specifications in a way that will break existing applications that are implementing the format to the (existing) spec. Pulling the rug out from under developers that used your published specification is going to make a lot of people very leery of adopting the standard or recommending it, myself included.

@iacore
Contributor Author

iacore commented May 22, 2023

Some architectures actually require data to be aligned, as far as I know.

I think some ARM chips, and RISC-V. They trap on unaligned access.

@Narsil
Collaborator

Narsil commented May 22, 2023

Hey I sense a lot of animosity here. I merely wanted to help by pointing out that what you were trying to do was not permitted by the current crate, sorry if it felt like knee jerking.
The spec wasn't explicit and therefore I added it in the README. It's still not perfect (like any spec), and we can only try to make it clearer.

I really like the idea of having llm written in Rust, I more than empathize with your goals (I complain multiple times a day about Python now).

safetensors is not in 1.0 so breaking anything is still possible (I would like very much NOT to have to do that).
The restriction was always there in the code but was missing from the spec.

Given good arguments, the safetensors spec (and therefore the codebase) can relax things a bit (for instance if it allows much better alignment/speed without any additional risk) until there is a 1.0.
Relaxing is much easier than breaking, which is why, by default, everything is as tight as possible in the current form.

In this particular case, you seem to want to save a vocabulary in the file.
I have doubts about it being a good idea, but if you can convince us, I'm pretty sure we can make something better than storing arbitrary bytes at the end of a file.
More on that later.

Some architectures actually require data to be aligned, as far as I know.

I'm not sure what you're referring to.
Alignment is a CPU thing for interpreting bytes (like 4 bytes being one float32, but this only holds if the 4 bytes are aligned on an f32 boundary).

On disk, there is no alignment. A file on disk is a sequence of bytes, full stop.

If you're doing memory mapping you're getting &[u8] from the file. And you can cast them to f32 for free if and only if the byte pointers are f32 aligned.
When you're doing memory mapping, the file is (afaik) necessarily mapped to a page address (so super aligned). Alignment in the file simply means that you arrange the bytes in the file such that the memory address of the tensor data returned by mmap is f32-aligned (for safetensors this depends on the size of the header plus the u64 length prefix).
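
A small illustration of that last point, assuming a page-aligned mmap base: whether a tensor's bytes end up f32-aligned in memory depends only on its position within the file.

```rust
// 8-byte little-endian length prefix + JSON header + the tensor's offset
// inside the data section determine the in-memory alignment.
fn is_f32_aligned(header_len: usize, data_offset: usize) -> bool {
    (8 + header_len + data_offset) % std::mem::align_of::<f32>() == 0
}

fn main() {
    assert!(is_f32_aligned(56, 0));  // 8 + 56 + 0 = 64: aligned
    assert!(!is_f32_aligned(57, 0)); // 8 + 57 + 0 = 65: not aligned
}
```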

read_metadata and also deserialize there.

You are fine then, neither function cares about alignment. Everything operates on byte buffers and returns &[u8] for the tensor locations.
This is perfectly fine.
Alignment is only necessary when a user wants to do casting, which requires unsafe in Rust (since there's no alignment guarantee the compiler can verify at compile time).
AFAIK there is no safe way to do zero-copy [u8;4] into f32
https://users.rust-lang.org/t/re-interpret-slice-of-bytes-e-g-u8-as-slice-of-f32/34551

For instance here is how casting can be done https://github.com/coreylowman/dfdx/blob/main/src/tensor/safetensors.rs#L88-L102
It will try to do unsafe (and fast) casting if the file is properly aligned, but fall back if necessary.
The fallback means doing some copies, which is indeed slower than just casting, but it should be a similar speed to plainly reading the file.
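
Roughly, the cast-or-copy idea looks like this (a sketch in the same spirit, not the dfdx code; it assumes little-endian f32 data):

```rust
use std::borrow::Cow;

fn load_f32(bytes: &[u8]) -> Cow<'_, [f32]> {
    // `align_to` splits off any misaligned prefix/suffix; when both are empty
    // the whole slice can be viewed as f32 with zero copies.
    let (prefix, floats, suffix) = unsafe { bytes.align_to::<f32>() };
    if prefix.is_empty() && suffix.is_empty() {
        Cow::Borrowed(floats)
    } else {
        // Fallback: copy four bytes at a time. Slower, but roughly the cost
        // of reading the file once.
        Cow::Owned(
            bytes
                .chunks_exact(4)
                .map(|c| f32::from_le_bytes(c.try_into().unwrap()))
                .collect(),
        )
    }
}

fn main() {
    let buf = 1.0f32.to_le_bytes();
    assert_eq!(load_f32(&buf)[0], 1.0);
}
```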

You can't enforce that even if you want to. There's nothing stopping me from passing a slice from the middle of a file to the read_metadata or deserialize functions I mentioned. I can put whatever I want before and after that section of the file.

sure, but then you'll get an error.
https://github.com/huggingface/safetensors/blob/main/safetensors/src/tensor.rs#L277

Also, like iacore mentioned even without that kind of thing you can't stop polyglots that have their metadata at the end of the file unless you do something really ridiculous like padding the end of the file with junk so it can't represent some other format's metadata.

No, you cannot prevent polyglot files for anything and everything, but you can restrict things to make them much less easy to exploit. Also contiguity means that you know the files cannot be "too" large. I think it's a good property: the file size is roughly the size of the model weights.

Maybe the thing you're worried about is if the initial length bytes in the header represent the magic for some other file format? If so, the right way to deal with that is to make an actual new SafeTensors version that has an initial magic part — there's a reason why most file formats have some kind of identification at the start of the file.

Magic numbers have their purpose but there is no need to add any magic number anywhere. A lot of files don't have it either.
Magic numbers don't save you from end-of-file parsing anyway.

Not that caution from some random person on the internet means much but I strongly suggest not making knee-jerk changes without discussion or changing specifications in a way that will break existing applications that are implementing the format to the (existing) spec. Pulling the rug out from under developers that used your published specification is going to make a lot of people very leery of adopting the standard or recommending it, myself included.

100% noted; lots of changes in the spec happened for exactly that reason, so that there was less ambiguity, and that was a major outcome of the audit.
I don't feel like I'm pulling the rug in this case: everything was already checked for in the lib, the spec was just lacking here.
It may still be lacking in some areas.

Stuff like vocabulary.

No offense, but why do you want to put the vocabulary of a model in the same file ?

tokenizers are generally reused across multiple models, making them good candidates to use a single file for many models. Meaning splitting the tokenizer into a different file makes a lot of sense to me.

Different models have widely different tokenization schemes which may not adapt to whatever saving format you have in mind currently.
tokenizers has quite a complex structure in order to support many different tokenizers.

I know ggml stores everything in the same file, but they seem to have made breaking changes at least 2 times since.
Which I perfectly understand since it's aimed at being very efficient on a particular task and they don't intend to support every single model out there.

safetensors on the contrary aims to never break anything (alignment is a bonus thing, not something that is crucial). Again, the spec was lacking in details, but the checks were always there.

In a similar spirit, because it's aimed to be relatively agnostic, transformers can currently load safetensors files in both PyTorch and TensorFlow seamlessly. I would hope that we can add Rust engines to the mix.

On a similar note, safetensors doesn't store a graph of the model, and that's because the graph can evolve when you're iterating on performance and want a different computation scheme. We do that regularly in https://github.com/huggingface/text-generation-inference.

@KerfuffleV2

@Narsil

I'll write a longer response that addresses your other points, but I just wanted to get this part out as soon as possible.

Hey I sense a lot of animosity here.

(I assume you're talking about me.)

Sorry you got that impression, it's definitely not what I intended to convey. I'm a direct person and I say what I mean in a direct way. There's no animosity coming from me whatsoever.

This is a technical disagreement, not a personal one (if there is in fact a disagreement). No matter what you choose to do, at most I'll just be disappointed that I can't recommend the format anymore.

I really like the idea of having llm written in Rust, I more than empathize with your goals

I might be misunderstanding, but I just want to be very clear:

I am not affiliated with the rustformers llm project. I don't speak for them and they have no ability to control what I do. Even if I'm clearly the biggest, rudest jerk on the internet, that shouldn't reflect on anyone's perception of the llm project in any way, shape or form.

@KerfuffleV2

I merely wanted to help by pointing out that what you were trying to do was not permitted by the current crate, sorry if it felt like knee jerking.

I wasn't saying any knee-jerk actions had occurred, I was just suggesting it would be good to allow time for discussion/consideration.

Also, I wasn't (and am not) really concerned about how one specific implementation handles it. It's a simple format that is easy to implement: there's no need to rely on that crate for anyone who needs behavior it doesn't currently provide. My concern is with the format specification.

safetensors is not in 1.0 so breaking anything is still possible (I would like very much NOT to have to do that).

I don't think it's clear that the specification of the format is considered to be in development and can change without warning.

In this particular case, you seem to want to save a vocabulary in the file.

I'm personally just talking about it in general. The format is defined as a header, then metadata that describes the offsets and lengths of the tensor data. This is pretty flexible, because after the header, as long as those offsets/lengths are valid, it doesn't matter what else is in the file.

There are applications like storing the vocabulary, but it's generally just something that people can probably find uses for in the future. That's one of the great things about open source: people find creative ways to do stuff.
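
To make that concrete, a minimal reader of this layout only takes a few lines (an illustrative sketch assuming the serde_json crate and a hypothetical file name, not any particular implementation):

```rust
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Layout: 8-byte little-endian header length, JSON header mapping tensor
    // names to dtype/shape/data_offsets, then the raw byte buffer.
    let bytes = fs::read("model.safetensors")?;

    let header_len = u64::from_le_bytes(bytes[..8].try_into()?) as usize;
    let header: serde_json::Value = serde_json::from_slice(&bytes[8..8 + header_len])?;
    let data = &bytes[8 + header_len..];

    for (name, info) in header.as_object().unwrap() {
        if name == "__metadata__" {
            continue; // optional free-form string-to-string metadata
        }
        let off = info["data_offsets"].as_array().unwrap();
        let (begin, end) = (
            off[0].as_u64().unwrap() as usize,
            off[1].as_u64().unwrap() as usize,
        );
        println!("{name}: dtype {}, {} bytes", info["dtype"], end - begin);
        let _tensor_bytes = &data[begin..end]; // what a reader hands back
    }
    Ok(())
}
```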

If you're doing memory mapping you're getting &[u8] from the file. And you can cast them to f32 for free if and only if the byte pointers are f32 aligned.

Isn't allowing that the reason why there's logic to add padding when writing the header? And also the reason why the tensors get written out ordered by the element size, to ensure that each tensor begins at an aligned position?

On disk, there is no alignment. A file on disk is a sequence of bytes, full stop.

The point of mmap is to treat the data on disk as if it was memory. So the layout on disk affects the layout of data in memory, when using mmap.

AFAIK there is no safe way to do zero-copy [u8;4] into f32

I'm not sure what your point is. unsafe in Rust just means "trust me, I will make sure the stuff I do doesn't result in undefined behavior". unsafe doesn't allow you to do anything that's actually wrong/undefined, it's just an escape hatch for situations that the compiler can't verify.

So assuming that [u8; 4] consists of valid f32 data, even though you might need to use unsafe, it is still perfectly safe to cast it that way.
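
For what it's worth, the by-value conversion is already plain safe Rust (it copies the bytes); it's only the zero-copy reinterpretation of a whole &[u8] as &[f32] that needs unsafe or a checked helper crate such as bytemuck:

```rust
fn main() {
    // 0x3f800000 is the IEEE-754 encoding of 1.0f32.
    let x = f32::from_le_bytes([0x00, 0x00, 0x80, 0x3f]);
    assert_eq!(x, 1.0);
}
```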

There's nothing stopping me from passing a slice from the middle of a file [...]

sure, but then you'll get an error.

I think you might have misunderstood. You just have a &[u8]. There's no way for you to tell if that's a slice from a larger buffer or not: it's just a sequence of bytes.

So I can pass those functions a slice from the middle of a file I mmaped and it will work just fine. (Obviously what's in that slice has to be valid, but you can't know about any leading or trailing data.)

Also contiguity means that you know the files cannot be "too" large. I think it's a good property: the file size is roughly the size of the model weights.

To enforce that, you would have to change the safetensors format from an (apparently) open standard to a black box that users can only interact with through your interface. Obviously that is going to make it much, much less appealing for people to adopt.

Magic numbers have their purpose but there is no need to add any magic number anywhere. A lot of files don't have it either.

The reason I brought that up is because I was guessing the exploit you were concerned about involved arranging for the initial length bytes to match some other file format's magic (possibly confusing other tools like file managers).

No offense, but why do you want to put the vocabulary of a model in the same file ?

Not the person who brought that up, but there are decided advantages to having everything in the same file. That way you can ensure stuff like the vocabulary list actually matches the model that's running. A user only needs to download or manage a single file, etc.

I know ggml stores everything in the same file, but they seem to have made breaking changes at least 2 times since.

Those two things aren't really related. You can store everything in the same file and preserve compatibility, or not. You can have everything separate and break compatibility.

The breaking changes with GGML are mainly because they don't really care about preserving compatibility.

Again, the spec was lacking in details, but the checks were always there.

Someone looking at using the file format without your software is going to go by the spec, not one specific implementation though. If you change the spec later on, you risk breaking everything that was implemented to that spec.

I wrote my own Safetensors format loader: I looked at the specification to do that. I didn't go through the source code of any specific implementation and then try to replicate its internal behavior. Why would I? If that was necessary, then the spec doesn't actually define the file format.

@iacore
Contributor Author

iacore commented May 22, 2023

Hey I sense a lot of animosity here. I merely wanted to help by pointing out that what you were trying to do was not permitted by the current crate, sorry if it felt like knee jerking.
The spec wasn't explicit and therefore I added it in the README. It's still not perfect (like any spec), and we can only try to make it clearer.

Sorry about the animosity.

I think safetensors is the best format so far for storing tensor weights, being extensible and simple. If we don't fix the problems it has now, I'm afraid said problems will end up in many repositories.

No offense, but why do you want to put the vocabulary of a model in the same file ?

tokenizers are generally reused across multiple models, making them good candidates to use a single file for many models. Meaning splitting the tokenizer into a different file makes a lot of sense to me.

Ctranslate2 models are used for 1-to-1 translation. The tokenizer tokenizes an input string into strings (tokens). Each model has a vocabulary (a mapping from string to index). I don't know why it's designed like that.

@Narsil
Collaborator

Narsil commented May 23, 2023

If we don't fix the problems it has now

Let's fix those! Having part of a given file being undefined seems like a big issue to me. (What should you do with it? If it should be ignored, then why have it in the first place?)

If we don't fix the problems it has now, I'm afraid said problems will end up in many repositories.

100%, which is why by default I tend to err on the side of being too strict when writing files in this crate, and try to be over-restrictive in the spec for the same reason. It's always easier to become less strict later, but once you are less strict it's impossible to come back without breaking anything.

I don't think it's clear that the specification of the format is considered to be in development and can change without warning.

If it wasn't explicitly written that it was allowed, then it just wasn't specified. It was underspecified. This is where doubt occurs.
The same goes for the JSON, which, as I've learned, is actually not a uniformly specified format (most issues are in non-trivial documents, but we still inherit them here, hence the specification has to mention serde, which is the de-facto standard here).

Isn't allowing that the reason why there's logic to add padding when writing the header? And also the reason why the tensors get written out ordered by the element size, to ensure that each tensor begins at an aligned position?

Exactly, that's how it's done. This is viewed as a performance optimization in the implementation details.

unsafe doesn't allow you to do anything that's actually wrong/undefined, it's just an escape hatch for situations that the compiler can't verify.

unsafe 100% allows you to create UB and plain wrong code. https://runrust.miraheze.org/wiki/Undefined_Behavior
In fact, any unsound unsafe code in any dependency within a given project might open up UB.
Which is why people tend to use unsafe sparingly and comment it.
This crate has zero unsafe keywords for that reason, and mmap is left for users to apply properly.
The Python code has one unsafe for the mmap call, which should be OK.

I wrote my own Safetensors format loader: I looked at the specification to do that. I didn't go through the source code of any specific implementation and then try to replicate its internal behavior. Why would I? If that was necessary, then the spec doesn't actually define the file format.

This is actually great and 100% intended. I think there are no issues with that. Discrepancies between readers are bound to happen. The format is trivial on purpose so that, for all intents and purposes, one should be able to spin up their own code pretty fast.

The issue here is that the spec was lacking on some aspects of the format, which makes them unspecified and open to interpretation (you thought it was OK to have arbitrary data in the buffer; this crate says it's not OK). Both are correct with regard to the previous spec, and that can cause issues down the line if writers are implemented differently with regard to that unused part. So it's good to make it more explicit.

@KerfuffleV2

Having part of a given file being undefined seems like a big issue to me.

Why? You don't need to do anything about it. Your implementation of loading/saving those files doesn't need to try to preserve that data or anything like that.

The only problem is you're adding constraints that didn't exist in the specification of the format originally. And then changing the format retroactively to align with the behavior of one specific implementation.

If your crate just tolerated any files that met the specification, that would be 100% fine. The spec says there's a header and then metadata that specifies the positions and lengths of the tensor data. The rest is undefined: there's no constraint on alignment, whether the tensors can or can't overlap, what order they need to exist in the file, or whether there can be gaps in between them.

If it wasn't explicitly written that it was allowed, then it just wasn't specified. It was underspecified.

I don't understand. The point of a specification is to define the format. If some constraint isn't specified in the specification, then there isn't a constraint.

For example, suppose I publish the spec for a file format described as:

  1. A little endian 32bit integer
  2. A little endian 16bit integer

Someone else comes along, finds the spec, and decides it meets their use case. They develop an app that has a loader and the ability to save those files as well. Then later on you say "Oh, it was underspecified. You can actually only put prime numbers in the second value".

Now everything that was written to the spec can instantly be invalidated. The first time, people might just sigh and adjust their code, regenerate files if necessary etc.

Then after a while you say "We found some random tools might have issues dealing with files that have even numbers in the first value. Now the first value can only contain odd numbers. Hey, our implementation of the format always did it that way."

Wouldn't that be weird? Either the spec defines the format or it doesn't.

unsafe 100% allows you to create UB and plain wrong code.

It's physically possible, but you aren't "allowed" to write something that has undefined behavior. It's an error if you do so.

There's even a pretty well known (and pretty trivial) example of how to create undefined behavior even from safe code.

The format is trivial on purpose so that, for all intents and purposes, one should be able to spin up their own code pretty fast.

But not if you retroactively add constraints and invalidate other implementations. In that case, someone could have written their code to the spec and published files that meet that spec, only to later have those files/implementations become useless.

and that can cause issues down the line if writers are implemented differently with regard to that unused part.

If a writer has a problem with that, then it's not operating according to the spec. Right? That's a bug in the writer. If a reader has a problem with files that were written according to the spec, then that's a bug in the reader.

That's an issue the developer of the reader/writer needs to fix, not a problem with the spec.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Dec 13, 2023
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 19, 2023