Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. #2920

Closed
wants to merge 13 commits into from

Conversation

davies
Copy link
Contributor

@davies davies commented Oct 24, 2014

This PR simplify serializer, always use batched serializer (AutoBatchedSerializer as default), even batch size is 1.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22117 has started for PR 2920 at commit 8d77ef2.

  • This patch merges cleanly.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22115/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22120 has started for PR 2920 at commit eb3938d.

  • This patch merges cleanly.

@davies davies changed the title simplify serializer, use AutoBatchedSerializer by default. [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. Oct 24, 2014
@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22117 has finished for PR 2920 at commit 8d77ef2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22117/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22129 has started for PR 2920 at commit be37ece.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22120 has finished for PR 2920 at commit eb3938d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22120/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22129 has finished for PR 2920 at commit be37ece.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22129/
Test FAILed.

@pwendell
Copy link
Contributor

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22156 has started for PR 2920 at commit be37ece.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22156 has finished for PR 2920 at commit be37ece.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22156/
Test FAILed.

@JoshRosen
Copy link
Contributor

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #22187 has started for PR 2920 at commit be37ece.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #22187 timed out for PR 2920 at commit be37ece after a configured wait of 120m.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22187/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #426 has started for PR 2920 at commit be37ece.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #428 has started for PR 2920 at commit be37ece.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #22200 has started for PR 2920 at commit d79744c.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #22200 has finished for PR 2920 at commit d79744c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22200/
Test FAILed.

@@ -1216,51 +1216,6 @@ def test_reserialization(self):
result5 = sorted(self.sc.sequenceFile(basepath + "/reserialize/newdataset").collect())
self.assertEqual(result5, data)

def test_unbatched_save_and_read(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not leave this test in place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this refactor, we never use unbatched serializer, so remove this test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With or without batching, it looks like the old code ended up flattening out all of the batches when writing them to a SequenceFile, so the end result / data was the same. It looks like we already have other tests for reading Hadoop files, so I agree that this is probably safe to remove.

@SparkQA
Copy link

SparkQA commented Oct 31, 2014

Test build #22625 timed out for PR 2920 at commit 53fa60b after a configured wait of 120m.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22625/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 31, 2014

Test build #22655 has started for PR 2920 at commit 8180907.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 1, 2014

Test build #22655 has finished for PR 2920 at commit 8180907.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22655/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Nov 1, 2014

Test build #22668 has started for PR 2920 at commit 1d557fc.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 1, 2014

Test build #22668 has finished for PR 2920 at commit 1d557fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22668/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Nov 1, 2014

Test build #502 has started for PR 2920 at commit 1d557fc.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 1, 2014

Test build #502 has finished for PR 2920 at commit 1d557fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor

Could you fix up the merge conflicts here? Barring that, this LGTM.

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala
@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #22826 has started for PR 2920 at commit 6880b14.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #22827 has started for PR 2920 at commit e544ef9.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #22826 has finished for PR 2920 at commit 6880b14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NullType(PrimitiveType):
    • case class ScalaUdfBuilder[T: TypeTag](f: AnyRef)

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22826/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #22827 has finished for PR 2920 at commit e544ef9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NullType(PrimitiveType):
    • case class ScalaUdfBuilder[T: TypeTag](f: AnyRef)

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22827/
Test PASSed.

@JoshRosen
Copy link
Contributor

I'm going to merge this into 1.2 in order to avoid merge conflicts when backporting future bugfixes to that branch. Thanks!

@asfgit asfgit closed this in e4f4263 Nov 4, 2014
asfgit pushed a commit that referenced this pull request Nov 4, 2014
… by default.

This PR simplify serializer, always use batched serializer (AutoBatchedSerializer as default), even batch size is 1.

Author: Davies Liu <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Josh Rosen <[email protected]>

Closes #2920 from davies/fix_autobatch and squashes the following commits:

e544ef9 [Davies Liu] revert unrelated change
6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
1d557fc [Davies Liu] fix tests
8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
76abdce [Davies Liu] clean up
53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
b4292ce [Davies Liu] fix bug in master
d79744c [Davies Liu] recover hive tests
be37ece [Davies Liu] refactor
eb3938d [Davies Liu] refactor serializer in scala
8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.

(cherry picked from commit e4f4263)
Signed-off-by: Josh Rosen <[email protected]>
@JoshRosen
Copy link
Contributor

There was a minor conflict with an MLlib change (it added two new lines of additional code before your change to sql.py), but I fixed it up myself on merge and ran the tests to make sure that everything still worked. Merged to master and branch-1.2.

@davies
Copy link
Contributor Author

davies commented Nov 4, 2014

Thanks, I kept fixing the conflicts, but missed this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants