Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add .npy support to halide_image_io #8175

Merged
merged 10 commits into from
Apr 6, 2024
Merged

Add .npy support to halide_image_io #8175

merged 10 commits into from
Apr 6, 2024

Conversation

steven-johnson
Copy link
Contributor

@steven-johnson steven-johnson commented Apr 2, 2024

The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that support floating-point data more robustly than any of the others that we current support.

This adds load/save support for a useful subset:

  • We support the int/uint/float types common in Halide (except for f16/bf16 for now)
  • We don't support reading or writing files that are in fortran_order
  • We don't support any object/struct/etc files, only numeric primitives
  • We only support loading files that are in the host's endianness (typically little-endian)

Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty.

The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully.

TODO: we could probably add this as an option for debug_to_file() without too much pain in a followup PR. see #8177

The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that support floating-point data more robustly than any of the others that we current support.

This adds load/save support for a useful subset:
- We support the int/uint/float types common in Halide (except for f16/bf16 for now)
- We don't support reading or writing files that are in `fortran_order`
- We don't support any object/struct/etc files, only numeric primitives
- We only support loading files that are in the host's endianness (typically little-endian)

Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty.

The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully.

TODO: we could probably add this as an option for `debug_to_file()` without too much pain in a followup PR.
@steven-johnson steven-johnson marked this pull request as ready for review April 2, 2024 23:45
@steven-johnson steven-johnson added the release_notes For changes that may warrant a note in README for official releases. label Apr 2, 2024
}
shape += ")";

// TODO: is it safe to assume that Halide will never write in fortran_order?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not, and for_each_value below doesn't guarantee a planar order. I think you want write_planar_payload

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. So, to be clear, if we use write_planar_payload, we can definitely assert that fortran_order=false for things we write?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe so.

tools/halide_image_io.h Outdated Show resolved Hide resolved
if (pos == std::string::npos) {
return false; // missing a required key
}
positions.emplace_back(pos, k);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pos + k.size()? That would make the code below robust to keys with ':' in them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would needs to be k.size() + 2 to account for the enclosing quotes... but then the enclosing logic later fails because it assumes that we stop at the following key. That said, the worst case is that if someone produces a file that has extra keys (or keys with unlikely chars), we just refuse to read the file.


// Here we are going to slurp everything from the start of a key
// to just before the start of the next key (or end of input),
// then split on : with everything to the right as the presumed value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if there's an unusual key in there, this will be wrong. Does the spec guarantee that only those three keys exist for the versions we support?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear; the spec says The dictionary contains three keys but doesn't say it contains only these.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(BTW, I have verified that a .npy file produced by NumPy in Python on my Mac loads fine here... sample size = 1 :-)

std::vector<std::pair<size_t, std::string>> positions;

constexpr int kKeyCount = 3;
std::string keys[kKeyCount] = {"descr", "fortran_order", "shape"};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess another way this can fail is if any of these strings appear in the key values somewhere. Seems impossible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A 'real' .npy file wouldn't do that (it would be malformed), but a malicious one could. Let me think about how I can write a more robust parser here. (Sigh... if only the spec had just required a specific layout that was easier to parse in non-Python)

steven-johnson added a commit that referenced this pull request Apr 4, 2024
Built on top of #8175, this adds .npy as an option. This is actually pretty great because it's easy to do something like

```
ss = numpy.load("my_file.npy")
print(ss)
```

in Python and get nicely-formatted output, which can sometimes be a lot easier for debugging that inserting lots of print() statements (see #8176)

Did a drive-by change to the correctness test to use this format instead of .mat.
@steven-johnson
Copy link
Contributor Author

Added support for float16, since NumPy supports it just fine (which makes this the only format I know of that we support that can handle it OK)

@steven-johnson
Copy link
Contributor Author

So I think the only real blocker here is how robust we need to be when parsing the Python dict; from a quick survey of other C/C++ parsing libraries I could find via google, most of are even more naive than this code; also, based on looking at the NumPy source code that generates canonical files, it never inserts anything weird into it, so it seems likely to assume that most files will be 'well-formed'. I'm fine with failing to read a non-well-formed file.... the real question is whether it's possible to malform a file in a way that can cause malicious exploits, of course.

@abadams
Copy link
Member

abadams commented Apr 4, 2024

The spec looks like we can expect the header to be very consistent. If we want to be rigid about it, how about writing a regex that the header must match:

   static std::regex r_header("^\\{'descr': '[<|]([ifu])(\\d+)', "
                               "'fortran_order': False, "
                               "'shape': \\(([\\d, ]+)\\), \\}$");
    static std::regex r_num("(\\d+)");
    std::smatch m;
    if (!std::regex_match(header, m, r_header)) {
        return false;
    }

    char type_code;
    int bytes;
    std::vector<int> extents;

    type_code = m[1].str()[0];
    bytes = std::stoi(m[2].str());
    std::string shape_string = m[3].str();
    for (auto it = std::sregex_token_iterator(shape_string.begin(), shape_string.end(), r_num);
         it != std::sregex_token_iterator(); it++) {
        std::cout << it->str() << "\n";
        extents.push_back(std::stoi(it->str()));
    }

    return true;

@steven-johnson
Copy link
Contributor Author

The spec looks like we can expect the header to be very consistent

Are we reading the same spec? I didn't perceive that guarantee in the language (though I'd be happy if it was present)

@steven-johnson
Copy link
Contributor Author

Sadly, the spec explicitly says the keys aren't required to be in order (wtf?)...

@abadams
Copy link
Member

abadams commented Apr 4, 2024

huh, the document I found describing it says that they are guaranteed to be in alphabetical order. I guess it's not really a spec, but it is the most precise description I found: https://paulbourke.net/dataformats/npy/#:~:text=The%20npy%20files%20contain%20a,string%20with%20datatype%20and%20dimensions.

@steven-johnson
Copy link
Contributor Author

That link says:

For repeatability and readability, this dictionary is formatted using pprint.pformat() so the keys are in alphabetic order.

which is indeed what the NumPy source does, but in https://numpy.org/devdocs/reference/generated/numpy.lib.format.html they say

For repeatability and readability, the dictionary keys are sorted in alphabetic order. This is for convenience only. A writer SHOULD implement this if possible. A reader MUST NOT depend on this.

@steven-johnson
Copy link
Contributor Author

That said: we could just flout the SHOULD part and say that we're gonna require them sorted; that would be safer and would probably eliminate ~zero real-world files (and if we find counterexamples we can address it then)

@abadams
Copy link
Member

abadams commented Apr 5, 2024

How about this old-school version:


struct NpyHeader {
    char type_code;
    int type_bytes;
    std::vector<int> extents;

    bool parse(const std::string &header) {
        const char *ptr = &header[0];
        if (*ptr++ != '{') {
            return false;
        }
        while (true) {
            char endian;
            int consumed;
            if (std::sscanf(ptr, "'descr': '%c%c%d'%n", &endian, &type_code, &type_bytes, &consumed) == 3) {
                if (endian != '<' && endian != '|') {
                    return false;
                }
                ptr += consumed;
            } else if (std::strncmp(ptr, "'fortran_order': False", 22) == 0) {
                ptr += 22;
            } else if (std::strncmp(ptr, "'shape': (", 10) == 0) {
                ptr += 10;
                int n;
                while (std::sscanf(ptr, "%d%n", &n, &consumed) == 1) {
                    extents.push_back(n);
                    ptr += consumed;
                    if (*ptr == ',') ptr++;
                    if (*ptr == ' ') ptr++;
                }
                if (*ptr++ != ')') {
                    return false;
                }
            } else if (*ptr == '}') {
                return true;
            } else {
                return false;
            }
            if (*ptr == ',') ptr++;
            if (*ptr == ' ') ptr++;
            assert(ptr <= &header.back());
        }
    }
};

@steven-johnson
Copy link
Contributor Author

How about this old-school version:

trying now

@steven-johnson
Copy link
Contributor Author

Looks good and deals with pathologies reasonably, PTAL

@steven-johnson steven-johnson merged commit 35f0c29 into main Apr 6, 2024
19 checks passed
@steven-johnson steven-johnson deleted the srj/npy-format branch April 6, 2024 15:17
steven-johnson added a commit that referenced this pull request Apr 11, 2024
* Add .npy support to halide_image_io

The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that support floating-point data more robustly than any of the others that we current support.

This adds load/save support for a useful subset:
- We support the int/uint/float types common in Halide (except for f16/bf16 for now)
- We don't support reading or writing files that are in `fortran_order`
- We don't support any object/struct/etc files, only numeric primitives
- We only support loading files that are in the host's endianness (typically little-endian)

Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty.

The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully.

TODO: we could probably add this as an option for `debug_to_file()` without too much pain in a followup PR.

* clang-tidy

* clang-tidy

* Address review comments

* Allow for "keys" as well as 'keys'

* Add .npy support to debug_to_file()

Built on top of #8175, this adds .npy as an option. This is actually pretty great because it's easy to do something like

```
ss = numpy.load("my_file.npy")
print(ss)
```

in Python and get nicely-formatted output, which can sometimes be a lot easier for debugging that inserting lots of print() statements (see #8176)

Did a drive-by change to the correctness test to use this format instead of .mat.

* Add float16 support

* Add support for Float16 images in npy

* Assume little-endian

* Remove redundant halide_error()

* naming convention

* naming convention

* Test both mat and npy

* Don't call halide_error()

* Use old-school parser

* clang-tidy
steven-johnson added a commit that referenced this pull request Apr 29, 2024
* Add .npy support to halide_image_io

The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that support floating-point data more robustly than any of the others that we current support.

This adds load/save support for a useful subset:
- We support the int/uint/float types common in Halide (except for f16/bf16 for now)
- We don't support reading or writing files that are in `fortran_order`
- We don't support any object/struct/etc files, only numeric primitives
- We only support loading files that are in the host's endianness (typically little-endian)

Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty.

The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully.

TODO: we could probably add this as an option for `debug_to_file()` without too much pain in a followup PR.

* clang-tidy

* clang-tidy

* Address review comments

* Allow for "keys" as well as 'keys'

* Add .npy support to debug_to_file()

Built on top of #8175, this adds .npy as an option. This is actually pretty great because it's easy to do something like

```
ss = numpy.load("my_file.npy")
print(ss)
```

in Python and get nicely-formatted output, which can sometimes be a lot easier for debugging that inserting lots of print() statements (see #8176)

Did a drive-by change to the correctness test to use this format instead of .mat.

* Add float16 support

* Add support for Float16 images in npy

* Assume little-endian

* Remove redundant halide_error()

* naming convention

* naming convention

* Test both mat and npy

* Don't call halide_error()

* Use old-school parser

* clang-tidy

* Update debug_to_file API to remove type_code

* Clean up into single table

* Update CodeGen_LLVM.cpp

* Fix tmp codes

* Update InjectHostDevBufferCopies.cpp

* Update InjectHostDevBufferCopies.cpp

* trigger buildbots
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release_notes For changes that may warrant a note in README for official releases.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants