Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-12511][PySpark][Streaming]Make sure PythonDStream.registerSerializer is called only once #10514

Closed
wants to merge 3 commits into from

Conversation

zsxwing
Copy link
Member

@zsxwing zsxwing commented Dec 29, 2015

There is an issue that Py4J's PythonProxyHandler.finalize blocks forever. (py4j/py4j#184)

Py4j will create a PythonProxyHandler in Java for "transformer_serializer" when calling "registerSerializer". If we call "registerSerializer" twice, the second PythonProxyHandler will override the first one, then the first one will be GCed and trigger "PythonProxyHandler.finalize". To avoid that, we should not call"registerSerializer" more than once, so that "PythonProxyHandler" in Java side won't be GCed.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 29, 2015

CC @davies

@davies
Copy link
Contributor

davies commented Dec 30, 2015

LGTM

@SparkQA
Copy link

SparkQA commented Dec 30, 2015

Test build #48429 has finished for PR 10514 at commit 9be0d0a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 30, 2015

retest this please

@SparkQA
Copy link

SparkQA commented Dec 30, 2015

Test build #48438 has finished for PR 10514 at commit 9be0d0a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing zsxwing changed the title [SPARK-12511][PySpark][Streaming]Make sure TransformFunctionSerializer is created only once [SPARK-12511][PySpark][Streaming]Make sure PythonDStream.registerSerializer is called only once Dec 30, 2015
@SparkQA
Copy link

SparkQA commented Dec 30, 2015

Test build #48506 has finished for PR 10514 at commit 3c3e164.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 30, 2015

@davies could you take another look? Actually, cls._transformerSerializer's fields need to be updated with the parameters. So my previous approach doesn't work. I updated the PR description for the new approach.

@davies
Copy link
Contributor

davies commented Dec 30, 2015

Is this targeted for 1.6, or we just wait this to be fixed in py4j? The current change looks over complicated to me.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 30, 2015

Not sure when py4j will release the next version to fix this one. Looks py4j doesn't release frequently.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 31, 2015

retest this please

@SparkQA
Copy link

SparkQA commented Dec 31, 2015

Test build #48523 has finished for PR 10514 at commit 340b034.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 31, 2015

I submitted a new commit to fix a similar issue when restarting from checkpoint.

Unlike Scala, PySpark reads the checkpoint twice. Then the PythonProxyHandlers created by the first checkpoint reading will be abandoned and trigger PythonProxyHandler.finalize.

My latest commit eliminated the unnecessary reading to avoid the PythonProxyHandler.finalize issue.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 31, 2015

Also ping @tdas since you wrote the getOrCreate method.

@SparkQA
Copy link

SparkQA commented Dec 31, 2015

Test build #48512 has finished for PR 10514 at commit 340b034.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 31, 2015

retest this please

@SparkQA
Copy link

SparkQA commented Dec 31, 2015

Test build #48544 has finished for PR 10514 at commit 340b034.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor

davies commented Jan 5, 2016

LGTM, do we need to merge this into 1.6?

@zsxwing
Copy link
Member Author

zsxwing commented Jan 5, 2016

do we need to merge this into 1.6?

Yes since it affects all people using PySpark Streaming checkpoint.

asfgit pushed a commit that referenced this pull request Jan 5, 2016
…erializer is called only once

There is an issue that Py4J's PythonProxyHandler.finalize blocks forever. (py4j/py4j#184)

Py4j will create a PythonProxyHandler in Java for "transformer_serializer" when calling "registerSerializer". If we call "registerSerializer" twice, the second PythonProxyHandler will override the first one, then the first one will be GCed and trigger "PythonProxyHandler.finalize". To avoid that, we should not call"registerSerializer" more than once, so that "PythonProxyHandler" in Java side won't be GCed.

Author: Shixiong Zhu <[email protected]>

Closes #10514 from zsxwing/SPARK-12511.

(cherry picked from commit 6cfe341)
Signed-off-by: Davies Liu <[email protected]>
@asfgit asfgit closed this in 6cfe341 Jan 5, 2016
@davies
Copy link
Contributor

davies commented Jan 5, 2016

Merged into master and 1.6 branch, could you create a JIRA to clean this once the bugs are fixed in py4j (and released)?

@zsxwing
Copy link
Member Author

zsxwing commented Jan 5, 2016

Merged into master and 1.6 branch, could you create a JIRA to clean this once the bugs are fixed in py4j (and released)?

Created sub tasks in https://issues.apache.org/jira/browse/SPARK-12652

@zsxwing zsxwing deleted the SPARK-12511 branch January 5, 2016 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants