[SPARK-22939] [PySpark] Support Spark UDF in registerFunction #20137

Closed
wants to merge 13 commits into master from gatorsmile/registerFunction

Conversation

@gatorsmile (Member)

What changes were proposed in this pull request?

import random
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType
random_udf = udf(lambda: int(random.random() * 100), IntegerType()).asNondeterministic()
spark.catalog.registerFunction("random_udf", random_udf, StringType())
spark.sql("SELECT random_udf()").collect()

We will get the following error.

Py4JError: An error occurred while calling o29.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)

This PR is to support it.
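
With this change, the snippet above is expected to work end to end. A minimal sketch of the intended usage (the returned value is illustrative only, since the UDF is nondeterministic):

```python
import random
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType

# Register the UserDefinedFunction object returned by udf() directly,
# instead of a plain Python callable.
random_udf = udf(lambda: int(random.random() * 100), IntegerType()).asNondeterministic()
spark.catalog.registerFunction("random_udf", random_udf, StringType())
spark.sql("SELECT random_udf()").collect()  # e.g. [Row(random_udf()=u'42')]
```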

How was this patch tested?

WIP

@SparkQA commented Jan 2, 2018

Test build #85607 has finished for PR 20137 at commit 8216b6b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 2, 2018

Test build #85608 has finished for PR 20137 at commit e8d0a4c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 3, 2018

Test build #85616 has finished for PR 20137 at commit 35e6a4a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 3, 2018

Test build #85617 has finished for PR 20137 at commit 3208136.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 3, 2018

Test build #85618 has finished for PR 20137 at commit f099261.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented Jan 3, 2018

Hey @gatorsmile, I was just looking into this now. How about we add _unwrapped to the wrapped function, so that a wrapped function returns a wrapped function and a UserDefinedFunction returns a UserDefinedFunction? For example, roughly, in udf.py:

         wrapper.returnType = self.returnType
         wrapper.evalType = self.evalType
-        wrapper.asNondeterministic = self.asNondeterministic
+        wrapper.asNondeterministic = lambda: self.asNondeterministic()._wrapped()
+        wrapper._unwrapped = lambda: self
         return wrapper

and then we do?

if hasattr(f, "_unwrapped"):
    f = f._unwrapped()
if isinstance(f, UserDefinedFunction):
    udf = UserDefinedFunction(f.func, returnType=returnType, name=name,
                              evalType=PythonEvalType.SQL_BATCHED_UDF)
    udf = udf if (f._deterministic) else udf.asNondeterministic()
else:
    # Existing logics.

Returning a UserDefinedFunction from the wrapped function via asNondeterministic actually seems to be an issue because it breaks pydoc, for example:

from pyspark.sql.functions import udf
help(udf(lambda: 1, "integer").asNondeterministic())

I haven't tested the suggestion above, but I think this is going to roughly work and resolve both issues too.

@gatorsmile (Member Author)

@HyukjinKwon We need to fix asNondeterministic:

    def asNondeterministic(self):
        """
        Updates UserDefinedFunction to nondeterministic.

        .. versionadded:: 2.3
        """
        self._deterministic = False
        return self._wrapped()

@HyukjinKwon (Member)

Ur, @gatorsmile, then we will return a wrapped function from UserDefinedFunction().asNondeterministic. Mind if I ask you to elaborate why? I thought UserDefinedFunction should still return a UserDefinedFunction.

@gatorsmile (Member Author) commented Jan 3, 2018

I am not against anything, but the outputs of the following two are inconsistent. It looks confusing to end users.

help(udf(lambda: 1, "integer").asNondeterministic())
help(udf(lambda: 1, "integer"))

BTW, this PR is not just for asNondeterministic(). We have the same issue for the deterministic UDFs.

@HyukjinKwon (Member)

but if we do

+        wrapper.asNondeterministic = lambda: self.asNondeterministic()._wrapped()

I think it will still show a proper pydoc...

@gatorsmile (Member Author)

Can you run the command?

help(udf(lambda: 1, "integer").asNondeterministic())

@HyukjinKwon (Member)

Let me test it and be back soon.

@gatorsmile (Member Author)

Take your time. I will not be online in the next two hours.

@HyukjinKwon (Member)

With this diff:

diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py
index 54b5a8656e1..24de9839e90 100644
--- a/python/pyspark/sql/udf.py
+++ b/python/pyspark/sql/udf.py
@@ -162,7 +162,8 @@ class UserDefinedFunction(object):
         wrapper.func = self.func
         wrapper.returnType = self.returnType
         wrapper.evalType = self.evalType
-        wrapper.asNondeterministic = self.asNondeterministic
+        wrapper.asNondeterministic = lambda: self.asNondeterministic()._wrapped()
+        wrapper._unwrapped = lambda: self

         return wrapper

Before

from pyspark.sql.functions import udf
help(udf(lambda: 1, "integer").asNondeterministic())
Help on UserDefinedFunction in module pyspark.sql.udf object:

class UserDefinedFunction(__builtin__.object)
 |  User defined function in Python
 |
 |  .. versionadded:: 1.3
 |
 |  Methods defined here:
 |
 |  __call__(self, *cols)
 |
 |  __init__(self, func, returnType=StringType, name=None, evalType=100)
 |
 |  asNondeterministic(self)
 |      Updates UserDefinedFunction to nondeterministic.
 |
 |      .. versionadded:: 2.3
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
:
from pyspark.sql.functions import udf
help(udf(lambda: 1, "integer"))
Help on function <lambda> in module __main__:

<lambda> lambda *args
(END)

After

from pyspark.sql.functions import udf
help(udf(lambda: 1, "integer").asNondeterministic())
Help on function <lambda> in module __main__:

<lambda> lambda *args
(END)
from pyspark.sql.functions import udf
help(udf(lambda: 1, "integer"))
Help on function <lambda> in module __main__:

<lambda> lambda *args
(END)

@HyukjinKwon (Member)

With this diff:

--- a/python/pyspark/sql/udf.py
+++ b/python/pyspark/sql/udf.py
@@ -173,4 +173,4 @@ class UserDefinedFunction(object):
         .. versionadded:: 2.3
         """
         self._deterministic = False
-        return self
+        return self._wrapped()

After

from pyspark.sql.functions import udf
help(udf(lambda: 1, "integer").asNondeterministic())
Help on function <lambda> in module __main__:

<lambda> lambda *args
(END)
from pyspark.sql.functions import udf
help(udf(lambda: 1, "integer"))
Help on function <lambda> in module __main__:

<lambda> lambda *args
(END)

@SparkQA commented Jan 3, 2018

Test build #85619 has finished for PR 20137 at commit 6ac25e6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

BTW, this PR is not just for asNondeterministic(). We have the same issue for the deterministic UDFs.

Yup, the fix for deterministic UDFs seems fine, but the change to asNondeterministic() bugs me.

If you meant the docstring of asNondeterministic itself (not the wrapped function instance as above), I think we can do the following:

       wrapper.asNondeterministic = functools.wraps(
           self.asNondeterministic)(lambda: self.asNondeterministic()._wrapped())
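
For context, a standalone illustration of what functools.wraps buys here: it copies metadata such as __name__ and __doc__ from the wrapped callable onto the wrapper, which is what keeps help() output sensible. This is plain Python, not code from this PR:

```python
import functools

def as_nondeterministic():
    """Updates UserDefinedFunction to nondeterministic."""
    return 1

# Without functools.wraps, help(wrapper) would describe the bare lambda;
# with it, the wrapper carries as_nondeterministic's name and docstring.
wrapper = functools.wraps(as_nondeterministic)(lambda: as_nondeterministic())
print(wrapper.__name__)  # as_nondeterministic
print(wrapper.__doc__)   # Updates UserDefinedFunction to nondeterministic.
```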

@HyukjinKwon (Member)

Thank you for bearing with me @gatorsmile.

@SparkQA commented Jan 3, 2018

Test build #85625 has finished for PR 20137 at commit 85f11bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile changed the title from "[SPARK-22939] [PySpark] Support Spark UDF in registerFunction" to "[WIP] [SPARK-22939] [PySpark] Support Spark UDF in registerFunction" on Jan 3, 2018
@gatorsmile (Member Author)

@HyukjinKwon Thank you for your comment!

cc @ueshin @cloud-fan

>>> from pyspark.sql.types import IntegerType, StringType
>>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()
>>> newRandom_udf = spark.catalog.registerFunction(
... "random_udf", random_udf, StringType()) # doctest: +SKIP
Contributor:

Why skip the test? We can use a fixed seed.

Member Author:

The output contains a hex value (the memory address in the function's repr), so it is not stable across runs.

self.assertEqual(row[0], "6")
[row] = self.spark.range(1).select(random_udf()).collect()
self.assertEqual(row[0], 6)
pydoc.render_doc(udf(lambda: random.randint(6, 6), IntegerType()))
Contributor:

what does it do?
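
For readers following along: pydoc.render_doc returns the text that help() would display, so invoking it in a test checks that rendering the documentation of a wrapped UDF does not raise. A small standalone sketch of that behavior (not the PR's test code):

```python
import pydoc
from pyspark.sql.functions import udf

# Rendering the doc of a wrapped UDF exercises the docstring and argument
# annotation copied over by _wrapped(); it should not raise.
doc = pydoc.render_doc(udf(lambda: 1, "integer"))
print(doc.splitlines()[0])  # first line of the help() text
```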

Member:

Can we put these tests there, or make this separate from test_non_deterministic_udf? Adding comments is also fine with me.

Member Author:

will add a comment.

@SparkQA commented Jan 3, 2018

Test build #85636 has finished for PR 20137 at commit 78e9b2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

>>> from pyspark.sql.types import IntegerType, StringType
>>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()
>>> newRandom_udf = spark.catalog.registerFunction(
... "random_udf", random_udf, StringType()) # doctest: +SKIP
Member:

BTW, I think we can remove # doctest: +SKIP for this line because this line simply assigns a value to newRandom_udf?

Member Author:

newRandom_udf is also used.
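
That is, the value returned by registerFunction is itself a callable UDF, so the doctest exercises it through the DataFrame API as well; roughly (a sketch, assuming the doctest context above):

```python
[row] = spark.range(1).select(newRandom_udf()).collect()
```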

        return judf

    def __call__(self, *cols):
        judf = self._judf
        sc = SparkContext._active_spark_context
        return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

    # This function is for improving the online help system in the interactive interpreter.
    # For example, the built-in help / pydoc.help. It wraps the UDF with the docstring and
    # argument annotation. (See: SPARK-19161)
Member:

I think we can put this in the docstring of _wrapped, between L148 and L150.

Member Author:

I do not want to expose these comments to the doc.

        udf = UserDefinedFunction(f, returnType=returnType, name=name,
                                  evalType=PythonEvalType.SQL_BATCHED_UDF)

        if hasattr(f, 'asNondeterministic'):
Member:

Actually, this one is what made me suggest the wrapper._unwrapped = lambda: self approach.

So, f here can be either a wrapped function or a UserDefinedFunction, and I thought it's not quite clear what we expect here by hasattr(f, 'asNondeterministic').

Could we at least leave some comments saying that this can be both the wrapped function for a UserDefinedFunction and a UserDefinedFunction itself?

Member Author:

will add a comment.

@@ -255,9 +255,26 @@ def registerFunction(self, name, f, returnType=StringType()):
>>> _ = spark.udf.register("stringLengthInt", len, IntegerType())
>>> spark.sql("SELECT stringLengthInt('test')").collect()
[Row(stringLengthInt(test)=4)]
Member:

Let's fix the doc for this too. It says :param f: python function, but we could describe that it takes a Python native function, a wrapped function, and a UserDefinedFunction too.

Member Author:

ok

@@ -162,7 +168,8 @@ def wrapper(*args):
         wrapper.func = self.func
         wrapper.returnType = self.returnType
         wrapper.evalType = self.evalType
-        wrapper.asNondeterministic = self.asNondeterministic
+        wrapper.deterministic = self.deterministic
+        wrapper.asNondeterministic = lambda: self.asNondeterministic()._wrapped()
Member:

Can we do:

       wrapper.asNondeterministic = functools.wraps(
           self.asNondeterministic)(lambda: self.asNondeterministic()._wrapped())

So that it can produce a proper pydoc when we do help(udf(lambda: 1, "integer").asNondeterministic) (not help(udf(lambda: 1, "integer").asNondeterministic())).

Member Author:

good to know the difference

Member Author:

I will leave this unchanged. Maybe you can submit a follow-up PR to address it?

Member:

Definitely. Will give it a try within the following week, though...

@@ -172,5 +179,5 @@ def asNondeterministic(self):
 
         .. versionadded:: 2.3
         """
-        self._deterministic = False
+        self.deterministic = False
@HyukjinKwon (Member), Jan 3, 2018:

Can we call it udfDeterministic to be consistent with the Scala side?

The opposite works fine for me too, if that's possible in any way.

Member Author:

deterministic is used in UserDefinedFunction.scala. Users can use it to check whether this UDF is deterministic or not.
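
A small sketch of that usage, with the attribute as renamed in this diff (the printed values follow from the PR's semantics):

```python
from pyspark.sql.functions import udf

f = udf(lambda: 1, "integer")
print(f.deterministic)                       # True: UDFs are deterministic by default
print(f.asNondeterministic().deterministic)  # False once marked nondeterministic
```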

@HyukjinKwon (Member)

cc @mgaido91 since you touched related code lately.

@HyukjinKwon (Member)

Looks fine to me otherwise BTW.

@@ -227,15 +227,15 @@ def dropGlobalTempView(self, viewName):
     @ignore_unicode_prefix
     @since(2.0)
     def registerFunction(self, name, f, returnType=StringType()):
-        """Registers a python function (including lambda function) as a UDF
+        """Registers a Python function (including lambda function) or a wrapped/native UDF
@cloud-fan (Contributor), Jan 4, 2018:

I find this documentation really confusing; it would be much clearer to me if we could just say:

Registers a Python function (including lambda function) or a :class:`UserDefinedFunction` as a UDF

This wrapping logic was added in #16534; is it really worth it?

Member:

It indeed added some complexity. However, I believe nothing is blocked by #16534 now, if I understand correctly.

The change in #16534 is quite nice because IMHO Python users probably use help() and dir() more frequently than reading the API docs on the website. For sets of UDFs provided as a library, I think that's well worth keeping.

How about leaving this wrapper logic as-is for now and bringing this discussion back when something is actually blocked (or becomes too complicated) by it?

Member:

Another idea just in case it helps:

Registers a Python function as a UDF or a user defined function.

@HyukjinKwon (Member), Jan 4, 2018:

BTW, to be honest, I remember giving several quick tries at the time to get rid of the wrapper while keeping the docstring correct, but I failed to find a good alternative.

Might be good to see if there is a clever way to get rid of the wrapper but keep the doc.

Contributor:

SGTM

@SparkQA commented Jan 4, 2018

Test build #85655 has finished for PR 20137 at commit 09a1b89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented Jan 4, 2018

LGTM except #20137 (comment), but I will make a follow-up soon. Fixing it here works fine for me too.

@SparkQA commented Jan 4, 2018

Test build #85657 has finished for PR 20137 at commit 2482e6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

        # This is to check whether the input function is a wrapped/native UserDefinedFunction
        if hasattr(f, 'asNondeterministic'):
            udf = UserDefinedFunction(f.func, returnType=returnType, name=name,
                                      evalType=PythonEvalType.SQL_BATCHED_UDF,
Contributor:

cc @ueshin @icexelloss, shall we support registering pandas UDFs here too?

Contributor:

Seems we can support it by just changing evalType=PythonEvalType.SQL_BATCHED_UDF to evalType=f.evalType.

Member:

+1, but I think there's no way to use a group map UDF in SQL syntax, if I understood correctly. I think we can safely fail fast for now as well.
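
A rough sketch of that fail-fast idea (the PythonEvalType constant names are assumptions based on the 2.3-era codebase; this is the suggestion, not code from this PR):

```python
# Hypothetical: propagate the UDF's eval type instead of hard-coding
# SQL_BATCHED_UDF, but reject group map pandas UDFs since they cannot
# be invoked from SQL syntax.
if f.evalType == PythonEvalType.SQL_PANDAS_GROUP_MAP_UDF:
    raise ValueError("group map pandas UDFs cannot be registered for SQL use")
udf = UserDefinedFunction(f.func, returnType=returnType, name=name,
                          evalType=f.evalType)
```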

Contributor:

SGTM

Member Author:

Will support pandas UDFs in a separate PR.

Contributor:

+1 too

@gatorsmile (Member Author)

Thanks! Merged to master and 2.3

asfgit pushed a commit that referenced this pull request on Jan 4, 2018:
## What changes were proposed in this pull request?
```Python
import random
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType
random_udf = udf(lambda: int(random.random() * 100), IntegerType()).asNondeterministic()
spark.catalog.registerFunction("random_udf", random_udf, StringType())
spark.sql("SELECT random_udf()").collect()
```

We will get the following error.
```
Py4JError: An error occurred while calling o29.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
```

This PR is to support it.

## How was this patch tested?
WIP

Author: gatorsmile <[email protected]>

Closes #20137 from gatorsmile/registerFunction.

(cherry picked from commit 5aadbc9)
Signed-off-by: gatorsmile <[email protected]>
@asfgit closed this in 5aadbc9 on Jan 4, 2018