Add .npy support to halide_image_io #8175

steven-johnson · 2024-04-02T23:42:33Z

The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that supports floating-point data more robustly than any of the others that we currently support.

This adds load/save support for a useful subset:

- We support the int/uint/float types common in Halide (except for f16/bf16 for now)
- We don't support reading or writing files that are in `fortran_order`
- We don't support any object/struct/etc files, only numeric primitives
- We only support loading files that are in the host's endianness (typically little-endian)

Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty.

The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully.
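For readers unfamiliar with the container, here is a minimal sketch of locating that dict text inside a file. It relies on the layout described in the NumPy format documentation (a 6-byte magic string, two version bytes, then a little-endian header length that is 2 bytes wide in version 1.x and 4 bytes wide in 2.x/3.x). `NpyPreamble` and `parse_npy_preamble` are hypothetical names for illustration and are not taken from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper type: records where the header dict text lives.
struct NpyPreamble {
    int major = 0;
    int minor = 0;
    size_t header_offset = 0;  // byte offset of the dict text
    size_t header_len = 0;     // length of the dict text, padding included
};

// Returns false if `bytes` does not begin with a plausible .npy preamble.
inline bool parse_npy_preamble(const std::string &bytes, NpyPreamble *out) {
    assert(out != nullptr);
    static const char kMagic[6] = {'\x93', 'N', 'U', 'M', 'P', 'Y'};
    if (bytes.size() < 10 || bytes.compare(0, 6, kMagic, 6) != 0) {
        return false;
    }
    out->major = static_cast<unsigned char>(bytes[6]);
    out->minor = static_cast<unsigned char>(bytes[7]);
    if (out->major < 1 || out->major > 3) {
        return false;
    }
    // Version 1.x uses a 2-byte length field; 2.x and 3.x use 4 bytes.
    const size_t len_bytes = (out->major == 1) ? 2 : 4;
    if (bytes.size() < 8 + len_bytes) {
        return false;
    }
    // The length field is always little-endian, regardless of the payload.
    size_t len = 0;
    for (size_t i = 0; i < len_bytes; i++) {
        len |= static_cast<size_t>(static_cast<unsigned char>(bytes[8 + i])) << (8 * i);
    }
    out->header_offset = 8 + len_bytes;
    out->header_len = len;
    // Reject files whose declared header runs past the end of the data.
    return len <= bytes.size() - out->header_offset;
}
```

The substring `bytes.substr(header_offset, header_len)` is then the Python dict literal that the rest of this review discusses.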

~~TODO: we could probably add this as an option for debug_to_file() without too much pain in a followup PR.~~ see #8177

abadams · 2024-04-03T22:14:19Z

tools/halide_image_io.h

+    }
+    shape += ")";
+
+    // TODO: is it safe to assume that Halide will never write in fortran_order?


It's not, and for_each_value below doesn't guarantee a planar order. I think you want write_planar_payload

Yeah. So, to be clear, if we use write_planar_payload, we can definitely assert that fortran_order=false for things we write?

Yes, I believe so.
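To make the `fortran_order` point concrete, here is a rough sketch of building the header dict for a dense buffer. It assumes the caller lists extents innermost-first (the usual Halide convention) and that the payload is written in that dense order, so the NumPy shape is the reversed extent list with `fortran_order` set to False. `make_npy_dict` and `pad_npy_dict` are hypothetical helpers, not the functions in `halide_image_io.h`, and the 64-byte alignment follows the NumPy format documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: builds the dict text for a dense, C-ordered array.
// `descr` is a NumPy dtype string such as "<f4". `extents` is assumed to be
// innermost-first, so it is emitted in reverse to get a C-order shape.
inline std::string make_npy_dict(const std::string &descr,
                                 const std::vector<int> &extents) {
    assert(!descr.empty());
    std::string shape = "(";
    for (size_t i = extents.size(); i-- > 0;) {
        shape += std::to_string(extents[i]);
        if (i > 0) {
            shape += ", ";
        }
    }
    // Python spells a one-element tuple with a trailing comma: (5,)
    if (extents.size() == 1) {
        shape += ",";
    }
    shape += ")";
    return "{'descr': '" + descr + "', 'fortran_order': False, 'shape': " + shape + ", }";
}

// Hypothetical helper: pads with spaces and a final newline so that the
// preamble plus header is a multiple of 64 bytes. `preamble_bytes` is 10
// for a version 1.0 file (magic, version, 2-byte length).
inline std::string pad_npy_dict(const std::string &dict, size_t preamble_bytes = 10) {
    const size_t total = preamble_bytes + dict.size() + 1;  // +1 for '\n'
    const size_t pad = (64 - total % 64) % 64;
    return dict + std::string(pad, ' ') + "\n";
}
```

Writing with `fortran_order` True and an unreversed shape would describe the same bytes, but readers that reject Fortran order (including the loader in this PR) would then refuse the file.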

abadams · 2024-04-03T22:21:02Z

tools/halide_image_io.h

+        if (pos == std::string::npos) {
+            return false;  // missing a required key
+        }
+        positions.emplace_back(pos, k);


pos + k.size()? That would make the code below robust to keys with ':' in them.

It would need to be k.size() + 2 to account for the enclosing quotes... but then the enclosing logic later fails because it assumes that we stop at the following key. That said, the worst case is that if someone produces a file that has extra keys (or keys with unlikely chars), we just refuse to read the file.

abadams · 2024-04-03T22:23:03Z

tools/halide_image_io.h

+
+    // Here we are going to slurp everything from the start of a key
+    // to just before the start of the next key (or end of input),
+    // then split on : with everything to the right as the presumed value.


So if there's an unusual key in there, this will be wrong. Does the spec guarantee that only those three keys exist for the versions we support?

It's not clear; the spec says "The dictionary contains three keys" but doesn't say it contains only these.

(BTW, I have verified that a .npy file produced by NumPy in Python on my Mac loads fine here... sample size = 1 :-)

abadams · 2024-04-03T22:23:32Z

tools/halide_image_io.h

+    std::vector<std::pair<size_t, std::string>> positions;
+
+    constexpr int kKeyCount = 3;
+    std::string keys[kKeyCount] = {"descr", "fortran_order", "shape"};


I guess another way this can fail is if any of these strings appear in the key values somewhere. Seems impossible?

A 'real' .npy file wouldn't do that (it would be malformed), but a malicious one could. Let me think about how I can write a more robust parser here. (Sigh... if only the spec had just required a specific layout that was easier to parse in non-Python)

Built on top of #8175, this adds .npy as an option. This is actually pretty great because it's easy to do something like

```
ss = numpy.load("my_file.npy")
print(ss)
```

in Python and get nicely-formatted output, which can sometimes be a lot easier for debugging than inserting lots of print() statements (see #8176).

Did a drive-by change to the correctness test to use this format instead of .mat.

steven-johnson · 2024-04-04T17:22:06Z

Added support for float16, since NumPy supports it just fine (which makes this the only format I know of that we support that can handle it OK)

steven-johnson · 2024-04-04T17:49:53Z

So I think the only real blocker here is how robust we need to be when parsing the Python dict; from a quick survey of other C/C++ parsing libraries I could find via google, most are even more naive than this code; also, based on looking at the NumPy source code that generates canonical files, it never inserts anything weird into it, so it seems reasonable to assume that most files will be 'well-formed'. I'm fine with failing to read a non-well-formed file... the real question is whether it's possible to malform a file in a way that can cause malicious exploits, of course.

abadams · 2024-04-04T19:28:17Z

The spec looks like we can expect the header to be very consistent. If we want to be rigid about it, how about writing a regex that the header must match:

    // The header must match this exact layout: a little-endian ('<') or
    // byte-order-agnostic ('|') numeric descr, C order, and a tuple of extents.
    static std::regex r_header("^\\{'descr': '[<|]([ifu])(\\d+)', "
                               "'fortran_order': False, "
                               "'shape': \\(([\\d, ]+)\\), \\}$");
    static std::regex r_num("(\\d+)");
    std::smatch m;
    if (!std::regex_match(header, m, r_header)) {
        return false;
    }

    char type_code;
    int bytes;
    std::vector<int> extents;

    type_code = m[1].str()[0];
    bytes = std::stoi(m[2].str());
    std::string shape_string = m[3].str();
    // Pull each extent out of the captured shape tuple.
    for (auto it = std::sregex_token_iterator(shape_string.begin(), shape_string.end(), r_num);
         it != std::sregex_token_iterator(); ++it) {
        std::cout << it->str() << "\n";  // debug output
        extents.push_back(std::stoi(it->str()));
    }

    return true;
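One practical wrinkle with a pattern that ends at the closing brace: per the NumPy format documentation, real headers are padded with spaces and terminated by a newline, so the match needs to tolerate trailing whitespace. The following is a self-contained sketch of the same regex idea with that allowance. `ParsedHeader` and `match_npy_header` are hypothetical names, and the pattern also accepts an empty tuple, which is an assumption about wanting to read zero-dimensional arrays.

```cpp
#include <cassert>
#include <regex>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result type for illustration.
struct ParsedHeader {
    char type_code = 0;
    int type_bytes = 0;
    std::vector<int> extents;
};

// Returns true only for headers in the canonical sorted-key layout.
inline bool match_npy_header(const std::string &header, ParsedHeader *out) {
    assert(out != nullptr);
    // Trailing \s* absorbs the space padding and final newline.
    static const std::regex r_header(
        "^\\{'descr': '[<|]([ifu])(\\d+)', "
        "'fortran_order': False, "
        "'shape': \\(([\\d, ]*)\\), \\}\\s*$");
    static const std::regex r_num("\\d+");
    std::smatch m;
    if (!std::regex_match(header, m, r_header)) {
        return false;
    }
    try {
        out->type_code = m[1].str()[0];
        out->type_bytes = std::stoi(m[2].str());
        out->extents.clear();
        const std::string shape = m[3].str();
        for (std::sregex_iterator it(shape.begin(), shape.end(), r_num), end; it != end; ++it) {
            out->extents.push_back(std::stoi(it->str()));
        }
    } catch (const std::out_of_range &) {
        // std::stoi throws on values that do not fit in an int.
        return false;
    }
    return true;
}
```

Because the pattern fixes the key order, this sketch rejects files whose keys are not sorted, which is the trade-off discussed in the following comments.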

steven-johnson · 2024-04-04T19:36:27Z

> The spec looks like we can expect the header to be very consistent

Are we reading the same spec? I didn't perceive that guarantee in the language (though I'd be happy if it was present)

steven-johnson · 2024-04-04T19:38:38Z

Sadly, the spec explicitly says the keys aren't required to be in order (wtf?)...

abadams · 2024-04-04T19:41:06Z

huh, the document I found describing it says that they are guaranteed to be in alphabetical order. I guess it's not really a spec, but it is the most precise description I found: https://paulbourke.net/dataformats/npy/#:~:text=The%20npy%20files%20contain%20a,string%20with%20datatype%20and%20dimensions.

steven-johnson · 2024-04-04T19:44:29Z

That link says:

> For repeatability and readability, this dictionary is formatted using pprint.pformat() so the keys are in alphabetic order.

which is indeed what the NumPy source does, but in https://numpy.org/devdocs/reference/generated/numpy.lib.format.html they say

> For repeatability and readability, the dictionary keys are sorted in alphabetic order. This is for convenience only. A writer SHOULD implement this if possible. A reader MUST NOT depend on this.

steven-johnson · 2024-04-04T19:45:31Z

That said: we could just flout the MUST NOT part and say that we're gonna require them sorted; that would be safer and would probably eliminate ~zero real-world files (and if we find counterexamples we can address it then)

abadams · 2024-04-05T17:32:15Z

How about this old-school version:


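#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

struct NpyHeader {
    char type_code = 0;
    int type_bytes = 0;
    std::vector<int> extents;

    // Accepts the three known keys in any order; anything else is rejected.
    bool parse(const std::string &header) {
        const char *ptr = header.c_str();  // NUL-terminated, so scans stop safely
        if (*ptr++ != '{') {
            return false;
        }
        while (true) {
            char endian;
            // %n is only stored if the closing quote matched, so start at 0
            // and treat 0 as "descr did not fully match".
            int consumed = 0;
            if (std::sscanf(ptr, "'descr': '%c%c%d'%n", &endian, &type_code, &type_bytes, &consumed) == 3 &&
                consumed > 0) {
                if (endian != '<' && endian != '|') {
                    return false;
                }
                ptr += consumed;
            } else if (std::strncmp(ptr, "'fortran_order': False", 22) == 0) {
                ptr += 22;
            } else if (std::strncmp(ptr, "'shape': (", 10) == 0) {
                ptr += 10;
                int n;
                while (std::sscanf(ptr, "%d%n", &n, &consumed) == 1) {
                    extents.push_back(n);
                    ptr += consumed;
                    if (*ptr == ',') ptr++;
                    if (*ptr == ' ') ptr++;
                }
                if (*ptr++ != ')') {
                    return false;
                }
            } else if (*ptr == '}') {
                return true;
            } else {
                return false;
            }
            if (*ptr == ',') ptr++;
            if (*ptr == ' ') ptr++;
            // ptr may sit on the terminating NUL for truncated input; the
            // next iteration then returns false instead of asserting.
            assert(ptr <= header.data() + header.size());
        }
    }
};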
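As a sketch of how the same scanning approach could be tightened further, the variant below also tracks which keys have been seen, so that a header with a missing or repeated key is rejected rather than accepted with default values. It additionally rejects negative extents and a non-positive byte count. `StrictNpyHeader` is a hypothetical name and this is not the code that was merged; it only illustrates the idea under those assumptions.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stricter variant of a sscanf-based .npy dict parser.
struct StrictNpyHeader {
    char type_code = 0;
    int type_bytes = 0;
    std::vector<int> extents;

    // Call on a freshly constructed object; extents are appended.
    bool parse(const std::string &header) {
        bool seen_descr = false, seen_order = false, seen_shape = false;
        const char *ptr = header.c_str();  // NUL-terminated, so scans stop safely
        if (*ptr++ != '{') {
            return false;
        }
        while (true) {
            char endian = 0;
            int consumed = 0;  // stays 0 if the closing quote of descr is missing
            if (!seen_descr &&
                std::sscanf(ptr, "'descr': '%c%c%d'%n", &endian, &type_code, &type_bytes, &consumed) == 3 &&
                consumed > 0) {
                if ((endian != '<' && endian != '|') || type_bytes <= 0) {
                    return false;
                }
                seen_descr = true;
                ptr += consumed;
            } else if (!seen_order && std::strncmp(ptr, "'fortran_order': False", 22) == 0) {
                seen_order = true;
                ptr += 22;
            } else if (!seen_shape && std::strncmp(ptr, "'shape': (", 10) == 0) {
                seen_shape = true;
                ptr += 10;
                int n = 0;
                while (std::sscanf(ptr, "%d%n", &n, &consumed) == 1) {
                    if (n < 0) {
                        return false;  // extents cannot be negative
                    }
                    extents.push_back(n);
                    ptr += consumed;
                    if (*ptr == ',') ptr++;
                    if (*ptr == ' ') ptr++;
                }
                if (*ptr++ != ')') {
                    return false;
                }
            } else if (*ptr == '}') {
                // Only succeed if every required key appeared exactly once.
                return seen_descr && seen_order && seen_shape;
            } else {
                return false;  // unknown key, repeated key, or truncated input
            }
            if (*ptr == ',') ptr++;
            if (*ptr == ' ') ptr++;
            assert(ptr <= header.data() + header.size());
        }
    }
};
```

Keys may still arrive in any order, which keeps the reader within the "MUST NOT depend on this" wording quoted earlier in the thread.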

steven-johnson · 2024-04-05T17:46:18Z

> How about this old-school version:

trying now

steven-johnson · 2024-04-05T18:02:30Z

Looks good and deals with pathologies reasonably, PTAL

steven-johnson requested review from abadams and zvookin April 2, 2024 23:42

steven-johnson marked this pull request as ready for review April 2, 2024 23:45

clang-tidy

f36fc6c

steven-johnson added the release_notes For changes that may warrant a note in README for official releases. label Apr 2, 2024

clang-tidy

b0926ee

abadams reviewed Apr 3, 2024

abadams reviewed Apr 3, 2024

steven-johnson added 2 commits April 3, 2024 16:37

Address review comments

21fad2d

Allow for "keys" as well as 'keys'

8117cf9

steven-johnson mentioned this pull request Apr 4, 2024

Add .npy support to debug_to_file() #8177

Merged

steven-johnson added 2 commits April 4, 2024 10:19

Merge branch 'main' into srj/npy-format

84a5a2f

Add float16 support

0de5868

Merge branch 'main' into srj/npy-format

22690d1

Use old-school parser

ee94ff8

abadams approved these changes Apr 5, 2024

clang-tidy

8157913

steven-johnson merged commit 35f0c29 into main Apr 6, 2024
19 checks passed

steven-johnson deleted the srj/npy-format branch April 6, 2024 15:17

BrewTestBot mentioned this pull request Jul 17, 2024

halide 18.0.0 Homebrew/homebrew-core#177657

Closed
