[SPARK-12511][PySpark][Streaming]Make sure PythonDStream.registerSerializer is called only once #10514
CC @davies |
LGTM |
Test build #48429 has finished for PR 10514 at commit
retest this please |
Test build #48438 has finished for PR 10514 at commit
Test build #48506 has finished for PR 10514 at commit
@davies could you take another look? Actually, |
Is this targeted for 1.6, or should we just wait for this to be fixed in py4j? The current change looks overcomplicated to me. |
I'm not sure when py4j will release the next version that fixes this one. It looks like py4j doesn't release frequently. |
retest this please |
Test build #48523 has finished for PR 10514 at commit
I submitted a new commit to fix a similar issue when restarting from checkpoint. Unlike Scala, PySpark reads the checkpoint twice. Then the My latest commit eliminated the unnecessary reading to avoid the |
Also ping @tdas since you wrote the |
Test build #48512 has finished for PR 10514 at commit
retest this please |
Test build #48544 has finished for PR 10514 at commit
LGTM, do we need to merge this into 1.6? |
Yes, since it affects everyone using PySpark Streaming checkpointing. |
…erializer is called only once

There is an issue that Py4J's PythonProxyHandler.finalize blocks forever (py4j/py4j#184).

Py4J creates a PythonProxyHandler in Java for "transformer_serializer" when "registerSerializer" is called. If "registerSerializer" is called twice, the second PythonProxyHandler overrides the first one, and the first one is then GCed, which triggers "PythonProxyHandler.finalize". To avoid that, we should not call "registerSerializer" more than once, so that the "PythonProxyHandler" on the Java side won't be GCed.

Author: Shixiong Zhu <[email protected]>

Closes #10514 from zsxwing/SPARK-12511.

(cherry picked from commit 6cfe341)
Signed-off-by: Davies Liu <[email protected]>
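The "call it only once" idea from the commit message can be illustrated with a small, self-contained sketch. This is not the code from the PR: `SerializerRegistry`, `OnceGuard`, and `ensure_registered` are hypothetical names invented for illustration, and the registry here is a plain Python stand-in rather than a real Py4J gateway. It only shows the guard pattern, assuming the first registration is the one that should stay alive.

```python
import threading


class SerializerRegistry:
    """Hypothetical stand-in for the object that exposes registerSerializer."""

    def __init__(self):
        self.calls = 0       # how many times the underlying registration ran
        self.current = None  # the serializer currently registered

    def register_serializer(self, serializer):
        # In the scenario described above, each call would create a new
        # proxy object on the Java side and replace the previous one.
        self.calls += 1
        self.current = serializer


class OnceGuard:
    """Hypothetical helper that forwards only the first registration."""

    def __init__(self, registry):
        self._registry = registry
        self._registered = False
        self._lock = threading.Lock()

    def ensure_registered(self, serializer):
        # Returns True if this call performed the registration,
        # False if a serializer had already been registered.
        with self._lock:
            if self._registered:
                return False
            self._registry.register_serializer(serializer)
            self._registered = True
            return True
```

With a guard like this, a second attempt (for example when a context is recreated from a checkpoint) becomes a no-op, so the first registered object is never replaced and never becomes eligible for garbage collection.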
Merged into master and the 1.6 branch. Could you create a JIRA to clean this up once the bugs are fixed in py4j (and released)? |
Created subtasks in https://issues.apache.org/jira/browse/SPARK-12652 |