[SPARK-22939] [PySpark] Support Spark UDF in registerFunction #20137

Closed
wants to merge 13 commits into master from gatorsmile/registerFunction

Conversation

@gatorsmile (Member)

What changes were proposed in this pull request?

import random
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType
random_udf = udf(lambda: int(random.random() * 100), IntegerType()).asNondeterministic()
spark.catalog.registerFunction("random_udf", random_udf, StringType())
spark.sql("SELECT random_udf()").collect()

We will get the following error.

Py4JError: An error occurred while calling o29.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)

This PR is to support it.
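
With this change, the snippet above is expected to work end to end. A minimal sketch of the intended usage (the returned value is illustrative only, since the UDF is nondeterministic):

```python
import random
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType

# Register the UserDefinedFunction object returned by udf() directly,
# instead of a plain Python callable.
random_udf = udf(lambda: int(random.random() * 100), IntegerType()).asNondeterministic()
spark.catalog.registerFunction("random_udf", random_udf, StringType())
spark.sql("SELECT random_udf()").collect()  # e.g. [Row(random_udf()=u'42')]
```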

How was this patch tested?

WIP

@SparkQA commented Jan 2, 2018

Test build #85607 has finished for PR 20137 at commit 8216b6b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 2, 2018

Test build #85608 has finished for PR 20137 at commit e8d0a4c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 3, 2018

Test build #85616 has finished for PR 20137 at commit 35e6a4a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 3, 2018

Test build #85617 has finished for PR 20137 at commit 3208136.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 3, 2018

Test build #85618 has finished for PR 20137 at commit f099261.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented Jan 3, 2018

Hey @gatorsmile, I was just looking into this now. How about we add _unwrapped to the wrapped function, so that a wrapped function returns a wrapped function and a UserDefinedFunction returns a UserDefinedFunction? For example, roughly, in udf.py:

         wrapper.returnType = self.returnType
         wrapper.evalType = self.evalType
-        wrapper.asNondeterministic = self.asNondeterministic
+        wrapper.asNondeterministic = lambda: self.asNondeterministic()._wrapped()
+        wrapper._unwrapped = lambda: self
         return wrapper

and then we do?

if hasattr(f, "_unwrapped"):
    f = f._unwrapped()
if isinstance(f, UserDefinedFunction):
    udf = UserDefinedFunction(f.func, returnType=returnType, name=name,
                              evalType=PythonEvalType.SQL_BATCHED_UDF)
    udf = udf if (f._deterministic) else udf.asNondeterministic()
else:
    # Existing logics.

Returning a UserDefinedFunction from the wrapped function via asNondeterministic actually seems to be an issue because it breaks pydoc, for example:

from pyspark.sql.functions import udf
help(udf(lambda: 1, "integer").asNondeterministic())

I haven't tested the suggestion above, but I think this is going to roughly work and resolve both issues too.

@gatorsmile (Member Author)

@HyukjinKwon We need to fix asNondeterministic:

    def asNondeterministic(self):
        """
        Updates UserDefinedFunction to nondeterministic.

        .. versionadded:: 2.3
        """
        self._deterministic = False
        return self._wrapped()

@HyukjinKwon (Member)

Ur, @gatorsmile, then we will return a wrapped function from UserDefinedFunction().asNondeterministic. Mind if I ask you to elaborate why? I thought UserDefinedFunction should still return a UserDefinedFunction.

@gatorsmile (Member Author) commented Jan 3, 2018

I am not against anything, but the outputs of the following two are inconsistent. It looks confusing to end users.

help(udf(lambda: 1, "integer").asNondeterministic())
help(udf(lambda: 1, "integer"))

BTW, this PR is not just for asNondeterministic(). We have the same issue for the deterministic UDFs.

@HyukjinKwon (Member)

but if we do

+        wrapper.asNondeterministic = lambda: self.asNondeterministic()._wrapped()

I think it will still show a proper pydoc...

@gatorsmile (Member Author)

Can you run the command?

help(udf(lambda: 1, "integer").asNondeterministic())

@HyukjinKwon (Member)

Let me test it and be back soon.

@gatorsmile (Member Author)

Take your time. I will not be online in the next two hours.

@HyukjinKwon (Member)

With this diff:

diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py
index 54b5a8656e1..24de9839e90 100644
--- a/python/pyspark/sql/udf.py
+++ b/python/pyspark/sql/udf.py
@@ -162,7 +162,8 @@ class UserDefinedFunction(object):
         wrapper.func = self.func
         wrapper.returnType = self.returnType
         wrapper.evalType = self.evalType
-        wrapper.asNondeterministic = self.asNondeterministic
+        wrapper.asNondeterministic = lambda: self.asNondeterministic()._wrapped()
+        wrapper._unwrapped = lambda: self

         return wrapper

Before

from pyspark.sql.functions import udf
help(udf(lambda: 1, "integer").asNondeterministic())
Help on UserDefinedFunction in module pyspark.sql.udf object:

class UserDefinedFunction(__builtin__.object)
 |  User defined function in Python
 |
 |  .. versionadded:: 1.3
 |
 |  Methods defined here:
 |
 |  __call__(self, *cols)
 |
 |  __init__(self, func, returnType=StringType, name=None, evalType=100)
 |
 |  asNondeterministic(self)
 |      Updates UserDefinedFunction to nondeterministic.
 |
 |      .. versionadded:: 2.3
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
:
from pyspark.sql.functions import udf
help(udf(lambda: 1, "integer"))
Help on function <lambda> in module __main__:

<lambda> lambda *args
(END)

After

from pyspark.sql.functions import udf
help(udf(lambda: 1, "integer").asNondeterministic())
Help on function <lambda> in module __main__:

<lambda> lambda *args
(END)
from pyspark.sql.functions import udf
help(udf(lambda: 1, "integer"))
Help on function <lambda> in module __main__:

<lambda> lambda *args
(END)

@HyukjinKwon (Member)

With this diff:

--- a/python/pyspark/sql/udf.py
+++ b/python/pyspark/sql/udf.py
@@ -173,4 +173,4 @@ class UserDefinedFunction(object):
         .. versionadded:: 2.3
         """
         self._deterministic = False
-        return self
+        return self._wrapped()

After

from pyspark.sql.functions import udf
help(udf(lambda: 1, "integer").asNondeterministic())
Help on function <lambda> in module __main__:

<lambda> lambda *args
(END)
from pyspark.sql.functions import udf
help(udf(lambda: 1, "integer"))
Help on function <lambda> in module __main__:

<lambda> lambda *args
(END)

@SparkQA commented Jan 3, 2018

Test build #85619 has finished for PR 20137 at commit 6ac25e6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

BTW, this PR is not just for asNondeterministic(). We have the same issue for the deterministic UDFs.

Yup, the fix for deterministic UDFs seems fine, but the change to asNondeterministic() bugs me.

If you meant the docstring of asNondeterministic itself (not the wrapped function instance as above), I think we can do the following:

       wrapper.asNondeterministic = functools.wraps(
           self.asNondeterministic)(lambda: self.asNondeterministic()._wrapped())
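
For context, a standalone illustration of what functools.wraps buys here: it copies metadata such as __name__ and __doc__ from the wrapped callable onto the wrapper, which is what keeps help() output sensible. This is plain Python, not code from this PR:

```python
import functools

def as_nondeterministic():
    """Updates UserDefinedFunction to nondeterministic."""
    return 1

# Without functools.wraps, help(wrapper) would describe the bare lambda;
# with it, the wrapper carries as_nondeterministic's name and docstring.
wrapper = functools.wraps(as_nondeterministic)(lambda: as_nondeterministic())
print(wrapper.__name__)  # as_nondeterministic
print(wrapper.__doc__)   # Updates UserDefinedFunction to nondeterministic.
```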

@HyukjinKwon (Member)

Thank you for bearing with me @gatorsmile.

@SparkQA commented Jan 3, 2018

Test build #85625 has finished for PR 20137 at commit 85f11bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile changed the title from "[SPARK-22939] [PySpark] Support Spark UDF in registerFunction" to "[WIP] [SPARK-22939] [PySpark] Support Spark UDF in registerFunction" on Jan 3, 2018
@gatorsmile (Member Author)

@HyukjinKwon Thank you for your comment!

cc @ueshin @cloud-fan

>>> from pyspark.sql.types import IntegerType, StringType
>>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()
>>> newRandom_udf = spark.catalog.registerFunction(
... "random_udf", random_udf, StringType()) # doctest: +SKIP
Contributor:

Why skip the test? We can use a fixed seed.

Member Author:

The output contains a hex value (the memory address in the function's repr), so it is not stable across runs.

self.assertEqual(row[0], "6")
[row] = self.spark.range(1).select(random_udf()).collect()
self.assertEqual(row[0], 6)
pydoc.render_doc(udf(lambda: random.randint(6, 6), IntegerType()))
Contributor:

what does it do?
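
For readers following along: pydoc.render_doc returns the text that help() would display, so invoking it in a test checks that rendering the documentation of a wrapped UDF does not raise. A small standalone sketch of that behavior (not the PR's test code):

```python
import pydoc
from pyspark.sql.functions import udf

# Rendering the doc of a wrapped UDF exercises the docstring and argument
# annotation copied over by _wrapped(); it should not raise.
doc = pydoc.render_doc(udf(lambda: 1, "integer"))
print(doc.splitlines()[0])  # first line of the help() text
```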

Member:

Can we put these tests there, or make this separate from test_non_deterministic_udf? Adding comments is also fine with me.

Member Author:

will add a comment.

@SparkQA commented Jan 3, 2018

Test build #85636 has finished for PR 20137 at commit 78e9b2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

>>> from pyspark.sql.types import IntegerType, StringType
>>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()
>>> newRandom_udf = spark.catalog.registerFunction(
... "random_udf", random_udf, StringType()) # doctest: +SKIP
Member:

BTW, I think we can remove # doctest: +SKIP for this line because this line simply assigns a value to newRandom_udf?

Member Author:

newRandom_udf is also used.
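
That is, the value returned by registerFunction is itself a callable UDF, so the doctest exercises it through the DataFrame API as well; roughly (a sketch, assuming the doctest context above):

```python
[row] = spark.range(1).select(newRandom_udf()).collect()
```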

        return judf

    def __call__(self, *cols):
        judf = self._judf
        sc = SparkContext._active_spark_context
        return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

    # This function is for improving the online help system in the interactive interpreter.
    # For example, the built-in help / pydoc.help. It wraps the UDF with the docstring and
    # argument annotation. (See: SPARK-19161)
Member:

I think we can put this in the docstring of _wrapped, between L148 and L150.

Member Author:

I do not want to expose these comments to the doc.

        udf = UserDefinedFunction(f, returnType=returnType, name=name,
                                  evalType=PythonEvalType.SQL_BATCHED_UDF)

        if hasattr(f, 'asNondeterministic'):
Member:

Actually, this one is what made me suggest the wrapper._unwrapped = lambda: self approach.

So, f here can be either a wrapped function or a UserDefinedFunction, and I thought it's not quite clear what we expect here by hasattr(f, 'asNondeterministic').

Could we at least leave some comments saying that this can be both the wrapped function for a UserDefinedFunction and a UserDefinedFunction itself?

Member Author:

will add a comment.

@@ -255,9 +255,26 @@ def registerFunction(self, name, f, returnType=StringType()):
>>> _ = spark.udf.register("stringLengthInt", len, IntegerType())
>>> spark.sql("SELECT stringLengthInt('test')").collect()
[Row(stringLengthInt(test)=4)]
Member:

Let's fix the doc for this too. It says :param f: python function, but we could describe that it takes a Python native function, a wrapped function, and a UserDefinedFunction too.

Member Author:

ok

@@ -162,7 +168,8 @@ def wrapper(*args):
         wrapper.func = self.func
         wrapper.returnType = self.returnType
         wrapper.evalType = self.evalType
-        wrapper.asNondeterministic = self.asNondeterministic
+        wrapper.deterministic = self.deterministic
+        wrapper.asNondeterministic = lambda: self.asNondeterministic()._wrapped()
Member:

Can we do:

       wrapper.asNondeterministic = functools.wraps(
           self.asNondeterministic)(lambda: self.asNondeterministic()._wrapped())

So that it can produce a proper pydoc when we do help(udf(lambda: 1, "integer").asNondeterministic) (not help(udf(lambda: 1, "integer").asNondeterministic())).

Member Author:

good to know the difference

Member Author:

I will leave this unchanged. Maybe you can submit a follow-up PR to address it?

Member:

Definitely. Will give it a try within the following week, though...

@@ -172,5 +179,5 @@ def asNondeterministic(self):
 
         .. versionadded:: 2.3
         """
-        self._deterministic = False
+        self.deterministic = False
@HyukjinKwon (Member), Jan 3, 2018:

Can we call it udfDeterministic to be consistent with the Scala side?

The opposite works fine for me too, if that's possible in any way.

Member Author:

deterministic is used in UserDefinedFunction.scala. Users can use it to check whether this UDF is deterministic or not.
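
A small sketch of that usage, with the attribute as renamed in this diff (the printed values follow from the PR's semantics):

```python
from pyspark.sql.functions import udf

f = udf(lambda: 1, "integer")
print(f.deterministic)                       # True: UDFs are deterministic by default
print(f.asNondeterministic().deterministic)  # False once marked nondeterministic
```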

@HyukjinKwon (Member)

cc @mgaido91 since you touched related code lately.

@HyukjinKwon (Member)

Looks fine to me otherwise BTW.

@@ -227,15 +227,15 @@ def dropGlobalTempView(self, viewName):
     @ignore_unicode_prefix
     @since(2.0)
     def registerFunction(self, name, f, returnType=StringType()):
-        """Registers a python function (including lambda function) as a UDF
+        """Registers a Python function (including lambda function) or a wrapped/native UDF
@cloud-fan (Contributor), Jan 4, 2018:

I find this documentation really confusing; it would be much clearer to me if we could just say:

Registers a Python function (including lambda function) or a :class:`UserDefinedFunction` as a UDF

This wrapping logic was added in #16534; is it really worth it?

Member:

It indeed added some complexity. However, I believe nothing is blocked by #16534 now, if I understand correctly.

The change in #16534 is quite nice because IMHO Python users probably use help() and dir() more frequently than reading the API docs on the website. For sets of UDFs provided as a library, I think that's well worth keeping.

How about leaving this wrapper logic as-is for now and bringing this discussion back when something is actually blocked (or becomes too complicated) by it?

Member:

Another idea just in case it helps:

Registers a Python function as a UDF or a user defined function.

@HyukjinKwon (Member), Jan 4, 2018:

BTW, to be honest, I remember giving several quick tries at the time to get rid of the wrapper while keeping the docstring correct, but I failed to find a good alternative.

Might be good to see if there is a clever way to get rid of the wrapper but keep the doc.

Contributor:

SGTM

@SparkQA commented Jan 4, 2018

Test build #85655 has finished for PR 20137 at commit 09a1b89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented Jan 4, 2018

LGTM except #20137 (comment), but I will make a follow-up soon. Fixing it here works fine for me too.

@SparkQA commented Jan 4, 2018

Test build #85657 has finished for PR 20137 at commit 2482e6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

        # This is to check whether the input function is a wrapped/native UserDefinedFunction
        if hasattr(f, 'asNondeterministic'):
            udf = UserDefinedFunction(f.func, returnType=returnType, name=name,
                                      evalType=PythonEvalType.SQL_BATCHED_UDF,
Contributor:

cc @ueshin @icexelloss, shall we support registering pandas UDFs here too?

Contributor:

Seems we can support it by just changing evalType=PythonEvalType.SQL_BATCHED_UDF to evalType=f.evalType.

Member:

+1, but I think there's no way to use a group map UDF in SQL syntax, if I understood correctly. I think we can safely fail fast for now as well.
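
A rough sketch of that fail-fast idea (the PythonEvalType constant names are assumptions based on the 2.3-era codebase; this is the suggestion, not code from this PR):

```python
# Hypothetical: propagate the UDF's eval type instead of hard-coding
# SQL_BATCHED_UDF, but reject group map pandas UDFs since they cannot
# be invoked from SQL syntax.
if f.evalType == PythonEvalType.SQL_PANDAS_GROUP_MAP_UDF:
    raise ValueError("group map pandas UDFs cannot be registered for SQL use")
udf = UserDefinedFunction(f.func, returnType=returnType, name=name,
                          evalType=f.evalType)
```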

Contributor:

SGTM

Member Author:

Will support pandas UDFs in a separate PR.

Contributor:

+1 too

@gatorsmile (Member Author)

Thanks! Merged to master and 2.3

asfgit pushed a commit that referenced this pull request on Jan 4, 2018:
## What changes were proposed in this pull request?
```Python
import random
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType
random_udf = udf(lambda: int(random.random() * 100), IntegerType()).asNondeterministic()
spark.catalog.registerFunction("random_udf", random_udf, StringType())
spark.sql("SELECT random_udf()").collect()
```

We will get the following error.
```
Py4JError: An error occurred while calling o29.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
```

This PR is to support it.

## How was this patch tested?
WIP

Author: gatorsmile <[email protected]>

Closes #20137 from gatorsmile/registerFunction.

(cherry picked from commit 5aadbc9)
Signed-off-by: gatorsmile <[email protected]>
@asfgit closed this in 5aadbc9 on Jan 4, 2018