[SPARK-43298][PYTHON][ML] predict_batch_udf with scalar input fails with batch size of one #40967
Conversation
Mind taking a look at https://github.com/apache/spark/pull/40967/checks?check_run_id=13051337101?
Hmm, looks like I might be stuck with this issue. Any ideas? I've configured my "actions permissions" to match this comment, but I'm still getting this error:
Never mind, got past this issue... waiting on the rest of the build results (but already failed the Kubernetes Integration Test).
Merged to master.
Oops, I missed that the linter failed. Reverting and reopening the PR.
Merged to master again.
What changes were proposed in this pull request?
This is a follow-up to #39817 that handles another error condition: the case where the input batch is a single scalar value (the previous fix addressed a single scalar value output).
Why are the changes needed?
Using `predict_batch_udf` fails when the input batch size is one. The following reproduction:

```
import numpy as np
from pyspark.ml.functions import predict_batch_udf
from pyspark.sql.types import DoubleType

df = spark.createDataFrame([[1.0], [2.0]], schema=["a"])

def make_predict_fn():
    def predict(inputs):
        return inputs
    return predict

identity = predict_batch_udf(make_predict_fn, return_type=DoubleType(), batch_size=1)
preds = df.withColumn("preds", identity("a")).show()
```

fails with:

```
File "/.../spark/python/pyspark/worker.py", line 869, in main
    process()
  File "/.../spark/python/pyspark/worker.py", line 861, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../spark/python/pyspark/sql/pandas/serializers.py", line 354, in dump_stream
    return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
  File "/.../spark/python/pyspark/sql/pandas/serializers.py", line 86, in dump_stream
    for batch in iterator:
  File "/.../spark/python/pyspark/sql/pandas/serializers.py", line 347, in init_stream_yield_batches
    for series in iterator:
  File "/.../spark/python/pyspark/worker.py", line 555, in func
    for result_batch, result_type in result_iter:
  File "/.../spark/python/pyspark/ml/functions.py", line 818, in predict
    yield _validate_and_transform_prediction_result(
  File "/.../spark/python/pyspark/ml/functions.py", line 339, in _validate_and_transform_prediction_result
    if len(preds_array) != num_input_rows:
TypeError: len() of unsized object
```

After the fix:

```
+---+-----+
|  a|preds|
+---+-----+
|1.0|  1.0|
|2.0|  2.0|
+---+-----+
```
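For context on the traceback above: NumPy represents a length-one batch that has been squeezed down to a scalar as a 0-d array, and `len()` is undefined for 0-d arrays. The sketch below is a minimal illustration of the failure mode and of the kind of guard the validation path needs; using `np.atleast_1d` here is an assumption about the general technique, not a quote of the actual patch.

```
import numpy as np

# A 0-d array has no length; this is the exact error from the traceback.
preds = np.asarray(np.float64(1.0))
try:
    len(preds)
except TypeError as e:
    print(e)  # len() of unsized object

# Hypothetical guard: promote scalars/0-d arrays to 1-d before length checks.
preds_array = np.atleast_1d(preds)
assert len(preds_array) == 1
```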
Does this PR introduce any user-facing change?
This fixes a bug in the feature that was released in Spark 3.4.0.
How was this patch tested?
A unit test was added.
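For illustration, a regression test for this case might look like the following sketch; the test name, session setup, and assertions are assumptions, not the actual test added in this PR.

```
from pyspark.ml.functions import predict_batch_udf
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType


def test_predict_batch_udf_scalar_input_batch_size_one():
    # Hypothetical test; the PR's actual unit test may differ.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([[1.0], [2.0]], schema=["a"])

    def make_predict_fn():
        def predict(inputs):
            # Identity model: echo each input batch back as predictions.
            return inputs

        return predict

    # batch_size=1 forces single-scalar input batches, the previously failing case.
    identity = predict_batch_udf(make_predict_fn, return_type=DoubleType(), batch_size=1)
    rows = df.withColumn("preds", identity("a")).collect()
    assert [r.preds for r in rows] == [1.0, 2.0]
```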