Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update upstream #375

Merged
merged 1 commit into from
Jun 11, 2018
Merged

Conversation

GulajavaMinistudio
Copy link
Owner

… driver to executor

What changes were proposed in this pull request?

SPARK-23754 was fixed in apache#21383 by changing the UDF code to wrap the user function, but this required a hack to save its argspec. This PR reverts this change and fixes the StopIteration bug in the worker

How does this work?

The root of the problem is that when an user-supplied function raises a StopIteration, pyspark might stop processing data, if this function is used in a for-loop. The solution is to catch StopIterations exceptions and re-raise them as RuntimeErrors, so that the execution fails and the error is reported to the user. This is done using the fail_on_stopiteration wrapper, in different ways depending on where the function is used:

  • In RDDs, the user function is wrapped in the driver, because this function is also called in the driver itself.
  • In SQL UDFs, the function is wrapped in the worker, since all processing happens there. Moreover, the worker needs the signature of the user function, which is lost when wrapping it, but passing this signature to the worker requires a not so nice hack.

How was this patch tested?

Same tests, plus tests for pandas UDFs

Author: edorigatti [email protected]

Closes apache#21467 from e-dorigatti/fix_udf_hack.

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

… driver to executor

## What changes were proposed in this pull request?
SPARK-23754 was fixed in #21383 by changing the UDF code to wrap the user function, but this required a hack to save its argspec. This PR reverts this change and fixes the `StopIteration` bug in the worker

## How does this work?

The root of the problem is that when an user-supplied function raises a `StopIteration`, pyspark might stop processing data, if this function is used in a for-loop. The solution is to catch `StopIteration`s exceptions and re-raise them as `RuntimeError`s, so that the execution fails and the error is reported to the user. This is done using the `fail_on_stopiteration` wrapper, in different ways depending on where the function is used:
 - In RDDs, the user function is wrapped in the driver, because this function is also called in the driver itself.
 - In SQL UDFs, the function is wrapped in the worker, since all processing happens there. Moreover, the worker needs the signature of the user function, which is lost when wrapping it, but passing this signature to the worker requires a not so nice hack.

## How was this patch tested?

Same tests, plus tests for pandas UDFs

Author: edorigatti <[email protected]>

Closes #21467 from e-dorigatti/fix_udf_hack.
@GulajavaMinistudio GulajavaMinistudio merged commit faa27ed into GulajavaMinistudio:master Jun 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants