[SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. #2920

davies · 2014-10-24T05:30:21Z

This PR simplifies the serializers: PySpark now always uses a batched serializer (AutoBatchedSerializer by default), even when the batch size is 1.
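For readers unfamiliar with the idea, here is a minimal sketch of what "auto-batching" can look like. It is not the code from this PR: the function names, the length-prefixed framing, and the `best_size` threshold are assumptions made for illustration. The writer starts with a batch of one item, grows the batch while the pickled payload stays small, and shrinks it when the payload becomes much larger than the target.

```python
import io
import pickle
import struct
from itertools import islice


def dump_stream_auto_batched(items, stream, best_size=1 << 16):
    """Write length-prefixed pickled batches, adapting the batch size.

    Hypothetical helper for illustration; best_size is an assumed target
    payload size in bytes, not a value taken from this PR.
    """
    batch = 1
    it = iter(items)
    while True:
        chunk = list(islice(it, batch))
        if not chunk:
            break
        payload = pickle.dumps(chunk, protocol=2)
        # Frame each batch with a 4-byte big-endian length header.
        stream.write(struct.pack("!i", len(payload)))
        stream.write(payload)
        if len(payload) < best_size:
            batch *= 2            # payload is small: try a bigger batch
        elif len(payload) > best_size * 10 and batch > 1:
            batch //= 2           # payload is far too big: back off


def load_stream(stream):
    """Read framed batches back and yield the individual items."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack("!i", header)
        for item in pickle.loads(stream.read(length)):
            yield item


buf = io.BytesIO()
dump_stream_auto_batched(range(1000), buf)
buf.seek(0)
roundtrip = list(load_stream(buf))
```

Because the reader always unpacks a list, a batch of size 1 needs no special case, which is the simplification the description refers to.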

SparkQA · 2014-10-24T05:40:06Z

Test build #22117 has started for PR 2920 at commit 8d77ef2.

This patch merges cleanly.

AmplabJenkins · 2014-10-24T05:47:19Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22115/
Test FAILed.

SparkQA · 2014-10-24T06:20:04Z

Test build #22120 has started for PR 2920 at commit eb3938d.

This patch merges cleanly.

SparkQA · 2014-10-24T06:57:26Z

Test build #22117 has finished for PR 2920 at commit 8d77ef2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-24T06:57:29Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22117/
Test PASSed.

SparkQA · 2014-10-24T07:30:02Z

Test build #22129 has started for PR 2920 at commit be37ece.

This patch merges cleanly.

SparkQA · 2014-10-24T07:36:09Z

Test build #22120 has finished for PR 2920 at commit eb3938d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-24T07:36:12Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22120/
Test FAILed.

SparkQA · 2014-10-24T08:47:57Z

Test build #22129 has finished for PR 2920 at commit be37ece.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-24T08:48:00Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22129/
Test FAILed.

pwendell · 2014-10-24T19:27:59Z

Jenkins, test this please.

SparkQA · 2014-10-24T19:32:32Z

Test build #22156 has started for PR 2920 at commit be37ece.

This patch merges cleanly.

SparkQA · 2014-10-24T20:45:05Z

Test build #22156 has finished for PR 2920 at commit be37ece.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-24T20:45:08Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22156/
Test FAILed.

JoshRosen · 2014-10-25T02:40:44Z

Jenkins, retest this please.

SparkQA · 2014-10-25T02:44:57Z

Test build #22187 has started for PR 2920 at commit be37ece.

This patch merges cleanly.

SparkQA · 2014-10-25T04:44:58Z

Test build #22187 timed out for PR 2920 at commit be37ece after a configured wait of 120m.

AmplabJenkins · 2014-10-25T04:45:00Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22187/
Test FAILed.

SparkQA · 2014-10-25T05:31:14Z

Test build #426 has started for PR 2920 at commit be37ece.

This patch merges cleanly.

SparkQA · 2014-10-25T05:32:50Z

Test build #428 has started for PR 2920 at commit be37ece.

This patch merges cleanly.

SparkQA · 2014-10-25T05:50:00Z

Test build #22200 has started for PR 2920 at commit d79744c.

This patch merges cleanly.

SparkQA · 2014-10-25T07:09:10Z

Test build #22200 has finished for PR 2920 at commit d79744c.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-25T07:09:14Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22200/
Test FAILed.

JoshRosen · 2014-10-31T19:36:24Z

python/pyspark/tests.py

@@ -1216,51 +1216,6 @@ def test_reserialization(self):
        result5 = sorted(self.sc.sequenceFile(basepath + "/reserialize/newdataset").collect())
        self.assertEqual(result5, data)

-    def test_unbatched_save_and_read(self):


JoshRosen · Oct 31, 2014

Why not leave this test in place?

davies · Nov 2, 2014

After this refactor, we never use the unbatched serializer, so I removed this test.

JoshRosen · Nov 3, 2014

With or without batching, it looks like the old code ended up flattening out all of the batches when writing them to a SequenceFile, so the end result / data was the same. It looks like we already have other tests for reading Hadoop files, so I agree that this is probably safe to remove.
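The flattening argument can be illustrated with a small, self-contained sketch. The helpers `to_batches` and `flatten` are hypothetical and are not PySpark functions; they only show that once batches are flattened back into records, the batch size used on the way in no longer affects the data.

```python
import pickle
from itertools import islice


def to_batches(items, batch_size):
    """Pickle items in lists of at most batch_size (hypothetical helper)."""
    it = iter(items)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield pickle.dumps(chunk)


def flatten(batches):
    """Unpickle each batch and yield its records one by one."""
    for blob in batches:
        for item in pickle.loads(blob):
            yield item


data = [(1, "aa"), (2, "bb"), (3, "cc")]
unbatched = list(flatten(to_batches(data, 1)))   # one record per batch
batched = list(flatten(to_batches(data, 2)))     # two records per batch
assert unbatched == batched == data
```

Under that assumption, a test that differs only in batch size exercises the same output path as the batched tests that remain.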

SparkQA · 2014-10-31T21:07:30Z

Test build #22625 timed out for PR 2920 at commit 53fa60b after a configured wait of 120m.

AmplabJenkins · 2014-10-31T21:07:33Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22625/
Test FAILed.

SparkQA · 2014-10-31T22:39:53Z

Test build #22655 has started for PR 2920 at commit 8180907.

This patch merges cleanly.

SparkQA · 2014-11-01T00:00:33Z

Test build #22655 has finished for PR 2920 at commit 8180907.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-01T00:00:36Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22655/
Test FAILed.

SparkQA · 2014-11-01T00:50:00Z

Test build #22668 has started for PR 2920 at commit 1d557fc.

This patch merges cleanly.

SparkQA · 2014-11-01T02:11:01Z

Test build #22668 has finished for PR 2920 at commit 1d557fc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-01T02:11:04Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22668/
Test FAILed.

SparkQA · 2014-11-01T03:31:58Z

Test build #502 has started for PR 2920 at commit 1d557fc.

This patch merges cleanly.

SparkQA · 2014-11-01T05:06:09Z

Test build #502 has finished for PR 2920 at commit 1d557fc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2014-11-03T05:30:06Z

Could you fix up the merge conflicts here? Barring that, this LGTM.

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala

SparkQA · 2014-11-03T21:32:33Z

Test build #22826 has started for PR 2920 at commit 6880b14.

This patch merges cleanly.

SparkQA · 2014-11-03T21:50:15Z

Test build #22827 has started for PR 2920 at commit e544ef9.

This patch merges cleanly.

SparkQA · 2014-11-03T23:18:32Z

Test build #22826 has finished for PR 2920 at commit 6880b14.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class NullType(PrimitiveType):
- case class ScalaUdfBuilder[T: TypeTag](f: AnyRef)

AmplabJenkins · 2014-11-03T23:18:35Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22826/
Test PASSed.

SparkQA · 2014-11-03T23:35:32Z

Test build #22827 has finished for PR 2920 at commit e544ef9.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class NullType(PrimitiveType):
- case class ScalaUdfBuilder[T: TypeTag](f: AnyRef)

AmplabJenkins · 2014-11-03T23:35:36Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22827/
Test PASSed.

JoshRosen · 2014-11-04T03:22:30Z

I'm going to merge this into 1.2 in order to avoid merge conflicts when backporting future bugfixes to that branch. Thanks!

… by default.

This PR simplify serializer, always use batched serializer (AutoBatchedSerializer as default), even batch size is 1.

Author: Davies Liu <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Josh Rosen <[email protected]>

Closes #2920 from davies/fix_autobatch and squashes the following commits:

e544ef9 [Davies Liu] revert unrelated change
6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
1d557fc [Davies Liu] fix tests
8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
76abdce [Davies Liu] clean up
53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
b4292ce [Davies Liu] fix bug in master
d79744c [Davies Liu] recover hive tests
be37ece [Davies Liu] refactor
eb3938d [Davies Liu] refactor serializer in scala
8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.

(cherry picked from commit e4f4263)
Signed-off-by: Josh Rosen <[email protected]>

JoshRosen · 2014-11-04T07:58:34Z

There was a minor conflict with an MLlib change (it added two new lines of additional code before your change to sql.py), but I fixed it up myself on merge and ran the tests to make sure that everything still worked. Merged to master and branch-1.2.

davies · 2014-11-04T08:01:13Z

Thanks, I kept fixing the conflicts, but missed this one.

simplify serializer, use AutoBatchedSerializer by default.

8d77ef2

davies force-pushed the fix_autobatch branch from 3178077 to 8d77ef2 Compare October 24, 2014 05:33

refactor serializer in scala

eb3938d

davies changed the title ~~simplify serializer, use AutoBatchedSerializer by default.~~ [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. Oct 24, 2014

refactor

be37ece

recover hive tests

d79744c

fix bug in master

b4292ce

JoshRosen reviewed Oct 31, 2014

Davies Liu added 2 commits October 31, 2014 15:32

clean up

76abdce

Merge branch 'master' of github.com:apache/spark into fix_autobatch

8180907

fix tests

1d557fc

Merge branch 'master' of github.com:apache/spark into fix_autobatch

6880b14

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala

revert unrelated change

e544ef9

asfgit closed this in e4f4263 Nov 4, 2014

davies mentioned this pull request Dec 16, 2014

[SPARK-4841] fix zip with textFile() #3706

Closed
