-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add .npy support to halide_image_io #8175
Conversation
The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that support floating-point data more robustly than any of the others that we current support. This adds load/save support for a useful subset: - We support the int/uint/float types common in Halide (except for f16/bf16 for now) - We don't support reading or writing files that are in `fortran_order` - We don't support any object/struct/etc files, only numeric primitives - We only support loading files that are in the host's endianness (typically little-endian) Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty. The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully. TODO: we could probably add this as an option for `debug_to_file()` without too much pain in a followup PR.
tools/halide_image_io.h
Outdated
} | ||
shape += ")"; | ||
|
||
// TODO: is it safe to assume that Halide will never write in fortran_order? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not, and for_each_value below doesn't guarantee a planar order. I think you want write_planar_payload
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. So, to be clear, if we use write_planar_payload
, we can definitely assert that fortran_order=false for things we write?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe so.
tools/halide_image_io.h
Outdated
if (pos == std::string::npos) { | ||
return false; // missing a required key | ||
} | ||
positions.emplace_back(pos, k); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pos + k.size()? That would make the code below robust to keys with ':' in them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would needs to be k.size() + 2 to account for the enclosing quotes... but then the enclosing logic later fails because it assumes that we stop at the following key. That said, the worst case is that if someone produces a file that has extra keys (or keys with unlikely chars), we just refuse to read the file.
tools/halide_image_io.h
Outdated
|
||
// Here we are going to slurp everything from the start of a key | ||
// to just before the start of the next key (or end of input), | ||
// then split on : with everything to the right as the presumed value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if there's an unusual key in there, this will be wrong. Does the spec guarantee that only those three keys exist for the versions we support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear; the spec says The dictionary contains three keys
but doesn't say it contains only these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(BTW, I have verified that a .npy
file produced by NumPy in Python on my Mac loads fine here... sample size = 1 :-)
tools/halide_image_io.h
Outdated
std::vector<std::pair<size_t, std::string>> positions; | ||
|
||
constexpr int kKeyCount = 3; | ||
std::string keys[kKeyCount] = {"descr", "fortran_order", "shape"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess another way this can fail is if any of these strings appear in the key values somewhere. Seems impossible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A 'real' .npy file wouldn't do that (it would be malformed), but a malicious one could. Let me think about how I can write a more robust parser here. (Sigh... if only the spec had just required a specific layout that was easier to parse in non-Python)
Built on top of #8175, this adds .npy as an option. This is actually pretty great because it's easy to do something like ``` ss = numpy.load("my_file.npy") print(ss) ``` in Python and get nicely-formatted output, which can sometimes be a lot easier for debugging that inserting lots of print() statements (see #8176) Did a drive-by change to the correctness test to use this format instead of .mat.
Added support for float16, since NumPy supports it just fine (which makes this the only format I know of that we support that can handle it OK) |
So I think the only real blocker here is how robust we need to be when parsing the Python dict; from a quick survey of other C/C++ parsing libraries I could find via google, most of are even more naive than this code; also, based on looking at the NumPy source code that generates canonical files, it never inserts anything weird into it, so it seems likely to assume that most files will be 'well-formed'. I'm fine with failing to read a non-well-formed file.... the real question is whether it's possible to malform a file in a way that can cause malicious exploits, of course. |
The spec looks like we can expect the header to be very consistent. If we want to be rigid about it, how about writing a regex that the header must match:
|
Are we reading the same spec? I didn't perceive that guarantee in the language (though I'd be happy if it was present) |
Sadly, the spec explicitly says the keys aren't required to be in order (wtf?)... |
huh, the document I found describing it says that they are guaranteed to be in alphabetical order. I guess it's not really a spec, but it is the most precise description I found: https://paulbourke.net/dataformats/npy/#:~:text=The%20npy%20files%20contain%20a,string%20with%20datatype%20and%20dimensions. |
That link says:
which is indeed what the NumPy source does, but in https://numpy.org/devdocs/reference/generated/numpy.lib.format.html they say
|
That said: we could just flout the SHOULD part and say that we're gonna require them sorted; that would be safer and would probably eliminate ~zero real-world files (and if we find counterexamples we can address it then) |
How about this old-school version:
|
trying now |
Looks good and deals with pathologies reasonably, PTAL |
* Add .npy support to halide_image_io The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that support floating-point data more robustly than any of the others that we current support. This adds load/save support for a useful subset: - We support the int/uint/float types common in Halide (except for f16/bf16 for now) - We don't support reading or writing files that are in `fortran_order` - We don't support any object/struct/etc files, only numeric primitives - We only support loading files that are in the host's endianness (typically little-endian) Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty. The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully. TODO: we could probably add this as an option for `debug_to_file()` without too much pain in a followup PR. * clang-tidy * clang-tidy * Address review comments * Allow for "keys" as well as 'keys' * Add .npy support to debug_to_file() Built on top of #8175, this adds .npy as an option. This is actually pretty great because it's easy to do something like ``` ss = numpy.load("my_file.npy") print(ss) ``` in Python and get nicely-formatted output, which can sometimes be a lot easier for debugging that inserting lots of print() statements (see #8176) Did a drive-by change to the correctness test to use this format instead of .mat. * Add float16 support * Add support for Float16 images in npy * Assume little-endian * Remove redundant halide_error() * naming convention * naming convention * Test both mat and npy * Don't call halide_error() * Use old-school parser * clang-tidy
* Add .npy support to halide_image_io The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that support floating-point data more robustly than any of the others that we current support. This adds load/save support for a useful subset: - We support the int/uint/float types common in Halide (except for f16/bf16 for now) - We don't support reading or writing files that are in `fortran_order` - We don't support any object/struct/etc files, only numeric primitives - We only support loading files that are in the host's endianness (typically little-endian) Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty. The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully. TODO: we could probably add this as an option for `debug_to_file()` without too much pain in a followup PR. * clang-tidy * clang-tidy * Address review comments * Allow for "keys" as well as 'keys' * Add .npy support to debug_to_file() Built on top of #8175, this adds .npy as an option. This is actually pretty great because it's easy to do something like ``` ss = numpy.load("my_file.npy") print(ss) ``` in Python and get nicely-formatted output, which can sometimes be a lot easier for debugging that inserting lots of print() statements (see #8176) Did a drive-by change to the correctness test to use this format instead of .mat. * Add float16 support * Add support for Float16 images in npy * Assume little-endian * Remove redundant halide_error() * naming convention * naming convention * Test both mat and npy * Don't call halide_error() * Use old-school parser * clang-tidy * Update debug_to_file API to remove type_code * Clean up into single table * Update CodeGen_LLVM.cpp * Fix tmp codes * Update InjectHostDevBufferCopies.cpp * Update InjectHostDevBufferCopies.cpp * trigger buildbots
The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that support floating-point data more robustly than any of the others that we current support.
This adds load/save support for a useful subset:
fortran_order
Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty.
The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully.
TODO: we could probably add this as an option forsee #8177debug_to_file()
without too much pain in a followup PR.