[SPARK-43298][PYTHON][ML] predict_batch_udf with scalar input fails with batch size of one #40967
Conversation
Mind taking a look at https://github.com/apache/spark/pull/40967/checks?check_run_id=13051337101?
Hmm, looks like I might be stuck with this issue. Any ideas? I've configured my "actions permissions" to match this comment, but I'm still getting this error:
Never mind, got past this issue... waiting on the rest of the build results (but already failed the Kubernetes Integration Test).
Merged to master.
Oops, I missed that the linter failed. Reverting and reopening the PR.
Merged to master again.
What changes were proposed in this pull request?
This is a follow-up to #39817 that handles another error condition: the case where the input batch is a single scalar value (the previous fix addressed a single scalar value output).
Why are the changes needed?
Using `predict_batch_udf` fails when the input batch size is one. The following reproduction:

```
import numpy as np
from pyspark.ml.functions import predict_batch_udf
from pyspark.sql.types import DoubleType

df = spark.createDataFrame([[1.0], [2.0]], schema=["a"])

def make_predict_fn():
    def predict(inputs):
        return inputs
    return predict

identity = predict_batch_udf(make_predict_fn, return_type=DoubleType(), batch_size=1)
preds = df.withColumn("preds", identity("a")).show()
```

fails with:

```
File "/.../spark/python/pyspark/worker.py", line 869, in main
    process()
  File "/.../spark/python/pyspark/worker.py", line 861, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../spark/python/pyspark/sql/pandas/serializers.py", line 354, in dump_stream
    return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
  File "/.../spark/python/pyspark/sql/pandas/serializers.py", line 86, in dump_stream
    for batch in iterator:
  File "/.../spark/python/pyspark/sql/pandas/serializers.py", line 347, in init_stream_yield_batches
    for series in iterator:
  File "/.../spark/python/pyspark/worker.py", line 555, in func
    for result_batch, result_type in result_iter:
  File "/.../spark/python/pyspark/ml/functions.py", line 818, in predict
    yield _validate_and_transform_prediction_result(
  File "/.../spark/python/pyspark/ml/functions.py", line 339, in _validate_and_transform_prediction_result
    if len(preds_array) != num_input_rows:
TypeError: len() of unsized object
```

After the fix:

```
+---+-----+
|  a|preds|
+---+-----+
|1.0|  1.0|
|2.0|  2.0|
+---+-----+
```
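For context on the traceback above: NumPy represents a length-one batch that has been squeezed down to a scalar as a 0-d array, and `len()` is undefined for 0-d arrays. The sketch below is a minimal illustration of the failure mode and of the kind of guard the validation path needs; using `np.atleast_1d` here is an assumption about the general technique, not a quote of the actual patch.

```
import numpy as np

# A 0-d array has no length; this is the exact error from the traceback.
preds = np.asarray(np.float64(1.0))
try:
    len(preds)
except TypeError as e:
    print(e)  # len() of unsized object

# Hypothetical guard: promote scalars/0-d arrays to 1-d before length checks.
preds_array = np.atleast_1d(preds)
assert len(preds_array) == 1
```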
Does this PR introduce any user-facing change?
This fixes a bug in the feature that was released in Spark 3.4.0.
How was this patch tested?
A unit test was added.
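For illustration, a regression test for this case might look like the following sketch; the test name, session setup, and assertions are assumptions, not the actual test added in this PR.

```
from pyspark.ml.functions import predict_batch_udf
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType


def test_predict_batch_udf_scalar_input_batch_size_one():
    # Hypothetical test; the PR's actual unit test may differ.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([[1.0], [2.0]], schema=["a"])

    def make_predict_fn():
        def predict(inputs):
            # Identity model: echo each input batch back as predictions.
            return inputs

        return predict

    # batch_size=1 forces single-scalar input batches, the previously failing case.
    identity = predict_batch_udf(make_predict_fn, return_type=DoubleType(), batch_size=1)
    rows = df.withColumn("preds", identity("a")).collect()
    assert [r.preds for r in rows] == [1.0, 2.0]
```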