Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

numpy numbers aren't reported as numbers #354

Closed
JaroslavTulach opened this issue Sep 7, 2023 · 15 comments
Closed

numpy numbers aren't reported as numbers #354

JaroslavTulach opened this issue Sep 7, 2023 · 15 comments
Assignees

Comments

@JaroslavTulach
Copy link
Contributor

numpy defines its own numeric types. However GraalPy doesn't report them as InteropLibrary.isNumber. That causes interoperability confusion. To reproduce store following file as Numpy.java:

import org.graalvm.polyglot.Context;

class Numpy {
  public static void main(String... args) throws Exception {
    var ctx = Context.newBuilder()
      .option("python.Executable", "numtest/bin/graalpy")
      .allowAllAccess(true)
      .build();
    var arr = ctx.eval("python", """
       import site
       import numpy
       numpy.random.normal(size=10)
       """);
    assert arr.hasArrayElements();
    var e = arr.getArrayElement(0);
    assert e.isNumber() : "e: " + e + " type: " + e.getMetaObject().getMetaQualifiedName();
  }
}

and execute following commands:

$ /graalvm-community-openjdk-17.0.7+7.1/bin/graalpy -m venv numtest
$ ./numtest/bin/graalpy -m pip install numpy
Successfully installed numpy-1.23.5
$ /graalvm-community-openjdk-17.0.7+7.1/bin/java -ea Numpy.java
Exception in thread "main" java.lang.AssertionError: e: -1.3258617587423684 type: numpy.float64
        at Numpy.main(Numpy.java:17)

this is causing us problems when sorting as the content of NumPy array isn't recognized as numbers, but treated as complex objects with members.

@JaroslavTulach
Copy link
Contributor Author

JaroslavTulach commented Sep 7, 2023

Numpy defines __int__ and __float__ on numeric type - can a presence of these methods be treated as a sign that the Python object represents a number?

@msimacek
Copy link
Contributor

msimacek commented Sep 7, 2023

Yes. I think we didn't do that because we can't tell in advance if the integer number fits into any of the fixed size type. Now that interop supports BigInteger, we could report fitsInBigInteger iff the object has __index__ or __int__ and fitsInDouble iff it has __float__. @timfel any concerns?

Note that numpy integers would still always report fitstInInt false, even if it's numpy.int32. Unfortunately numpy integers are not a subclass of python integers, so we have no way to tell if the number fits in 32 bits without calling the conversion method and looking at the result. Which we cannot do because the interop API forbids side-effects.

@eregon
Copy link
Member

eregon commented Sep 7, 2023

Could we assume/expect __int__ doesn't have side effects? It seems pretty bad to return the wrong answer for fitsInInt(). E.g. some int / numpy.int32 might give 0 because it says it doesn't fit an int, depending on how / is defined.
CC @chumer

@msimacek
Copy link
Contributor

msimacek commented Sep 7, 2023

Correction: for integers, we should only consider __index__, not __int__. That's how python internally converts objects for builtins that expect integers. __int__ is permitted to be a lossy conversion (so it also converts floats etc) and is basically only used by the int builtin.

Could we assume/expect __int__ doesn't have side effects?

Not sure. In most normal libraries I wouldn't expect visible side-effects, but there's no rule saying that there can't be any. For example unittest.mock.MagicMock records all calls to special methods, including __index__. Also it could degrade performance if the interop client calls those methods all the time expecting them to be cheap, i.e. a debugger trying them on everything it sees. For numpy ints it would be downcall to C and that's not cheap at all.

@eregon how does ruby approach this?

@JaroslavTulach
Copy link
Contributor Author

JaroslavTulach commented Sep 11, 2023

For purposes of Enso we just need to know the object is a number. Then we'll treat the object as number and not as a generic object with members/hash entries. How big/precise the number is then not important. We support long, double and BigInteger and one of them will match (I hope).

@eregon
Copy link
Member

eregon commented Sep 12, 2023

@eregon how does ruby approach this?

We don't have a way to override fitsInInt() in Ruby currently, but we have similar things for array and hash predicates such as isArrayElementReadable() which is also "Invoking this message does not cause any observable side-effects".
A user can define polyglot_array_element_readable? on any Ruby class/object and that will get called on isArrayElementReadable(). So then it is the responsibility of the one implementing polyglot_array_element_readable? to avoid side effects.

You could do the same, e.g. polyglot_fits_in_int or so, which you would either define for relevant numpy numbers (seems easier, could be extra methods defined when loading numpy) or treat numpy numbers specially in fitsInInt() (that seems harder). It doesn't need to be a method but could be some (possibly fully internal) attribute or whatever on the class probably (but that's probably not feasible e.g. for 64-bit ints which might or not fit in int, that needs a method or similar).

@eregon
Copy link
Member

eregon commented Sep 12, 2023

In general I guess __index__ is already expected to have no observable side effects in regular Python, so then it might just be fine to call it.

@JaroslavTulach
Copy link
Contributor Author

treat numpy numbers specially in fitsInInt()

Yes, having a special "numpy integration module" (and other integration modules for other popular Python libraries) would deliver more reliable results faster than trying to extract a general purpose solution based on random coding patterns that appear in such libraries and are often considered fragile ("can it side-effect or not?") to rely on.

Just my 2 Kč from a user of GraalVM that seeks solutions and ignores elegance.

mergify bot pushed a commit to enso-org/enso that referenced this issue Sep 19, 2023
Closes #7677 by eliminating the _stackoverflow execption_. In general it seems _too adventurous_ to walk members of random foreign objects. There can be anything including cycles. Rather than trying to be too smart in these cases, let's just rely on `InteropLibrary.isIdentical` message.

# Important Notes
Calling `sort` on the `numpy` array no longer yields an error, but the array isn't sorted - that needs a fix on the Python side: oracle/graalpython#354 - once it is in, the elements will be treated as numbers and the sorting happens automatically (without any changes in Enso code).
@msimacek
Copy link
Contributor

Actually, asBigInteger cannot have side-effects either 🤦‍♂️

I've been thinking about the comments here and I've come up with a proposal:
We could have a mechanism to register custom interop behvior for arbitrary types. Most interop message calls would first lookup the type in the registry and if a method is found it would be called with the object instance. It would be up to the method implementor to ensure that interop contracts, like having no visible side-effects, are met. It's quite similar to what @eregon mentioned for ruby, except in python you cannot directly add methods or anything else to existing C-defined types, so we need a different mechanism of binding the method to the type.
Example of how it could work for numpy.int32:

import polyglot
import numpy

@polyglot.register_polyglot_behavior(numpy.int32)
class NumpyInt32Behavior:
    @staticmethod
    def fits_in_int(obj):
        return True

    @staticmethod
    def as_int(obj):
        return obj.__index__()
    ...

Passing in the instance would let you do additional checks that cannot be infered based on the type, i.e. to implement fits_in_int on numpy.int64 where you would have to check the bounds. The return types would always be python builtins. We would also likely remove the current fuzzy behavior of looking up special methods to guess if an object is an array etc. So this would also address #353, pandas dataframes wouldn't be arrays nor hashes until you define they are.

We would ship our own definitions for standard library types like datetime. We could probably ship definitions for numpy, since it's so extremely common, but we probably wouldn't want to provide definitions for other libraries, that's a slippery slope and creates problems with versioning. That would be up to you, users like you could publish such definitions for other libraries as packages on PyPI for others to reuse.

The tricky part is - when exactly would the definitions would get loaded? In the example, I have a dependency on numpy, so graalpy cannot load it eagerly on startup. The simplest approach would be to require the users to import it explicitly. Is that something that would work for you? Either your users literally do import numpy_graalpy_interop after they do import numpy or you do it for them somewhere if you know they have numpy installed.

@JaroslavTulach
Copy link
Contributor Author

The tricky part is - when exactly would the definitions would get loaded?

I'd like to get a callback when certain module (like numpy) gets loaded. Then I would load my numpy (or any other) registrations.

@msimacek
Copy link
Contributor

Python's import system is very customizable, it already has extension points to do that. There's a package importhook that gives you a friendly interface to do exactly what you described.

@JaroslavTulach
Copy link
Contributor Author

Please let me know when the system is ready for experimenting. Thank you.

@cosminbasca
Copy link
Member

Hi @JaroslavTulach! We have just merged an api that allows you to specify the interop behavior for external types. It consists of the polyglog.register_interop_behavior function and polyglot.get_registered_interop_behavior (for introspection). The API is easy to use and is documented under docs/user/Interoperability.md.
Usage is rather simple: pass in the type as the first argument and pure python functions as kwargs for the messages you want to redefine. Almost all interop messages are supported (with some exceptions), and the naming convention for the kw arg is snake_casing i.e., isNumber becomes is_number etc.

import polyglot

class MyType(object):
    data = 10

polyglot.register_interop_behavior(MyType, is_string=True, as_string=lambda t: f"MyType({t.data})")

We plan on providing default behavior extensions for numpy and pandas rather soon so stay tuned :) !

@JaroslavTulach
Copy link
Contributor Author

Thanks a lot @cosminbasca for letting me know. What's the best way to test latest Python bits? We are using following Maven co-ordinates to download the GraalPy packages.

Are there any co-ordinates to download latest snapshots?

@msimacek
Copy link
Contributor

Snapshots artifacts are currenlty not published on maven, but can be downloaded from https://github.com/graalvm/graalvm-ce-dev-builds/releases/ (maven-resource-bundle-community-dev.tar.gz), you can then point maven to it as a file repository.

I believe that the new API provides you with enough flexibility to achieve what you need, so I'm going to mark this issue as fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants