[SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. #2920
Test build #22117 has started for PR 2920 at commit
Test FAILed.
Test build #22120 has started for PR 2920 at commit
Test build #22117 has finished for PR 2920 at commit
Test PASSed.
Test build #22129 has started for PR 2920 at commit
Test build #22120 has finished for PR 2920 at commit
Test FAILed.
Test build #22129 has finished for PR 2920 at commit
Test FAILed.
Jenkins, test this please.
Test build #22156 has started for PR 2920 at commit
Test build #22156 has finished for PR 2920 at commit
Test FAILed.
Jenkins, retest this please.
Test build #22187 has started for PR 2920 at commit
Test build #22187 timed out for PR 2920 at commit
Test FAILed.
Test build #426 has started for PR 2920 at commit
Test build #428 has started for PR 2920 at commit
Test build #22200 has started for PR 2920 at commit
Test build #22200 has finished for PR 2920 at commit
Test FAILed.
@@ -1216,51 +1216,6 @@ def test_reserialization(self):
         result5 = sorted(self.sc.sequenceFile(basepath + "/reserialize/newdataset").collect())
         self.assertEqual(result5, data)

-    def test_unbatched_save_and_read(self):
Why not leave this test in place?
After this refactor, we never use an unbatched serializer, so I removed this test.
With or without batching, it looks like the old code ended up flattening out all of the batches when writing them to a SequenceFile, so the end result / data was the same. It looks like we already have other tests for reading Hadoop files, so I agree that this is probably safe to remove.
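To illustrate the point about flattening, here is a minimal sketch in plain Python (not PySpark code). The helpers `batched` and `flatten` are hypothetical names introduced only for this example; they assume that a writer which unpacks every batch before saving sees the same record sequence regardless of batch size.

```python
from itertools import chain, islice


def batched(items, batch_size):
    """Group items into lists of at most batch_size elements (hypothetical helper)."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def flatten(batches):
    """Unpack every batch back into a flat list of records (hypothetical helper)."""
    return list(chain.from_iterable(batches))


data = [(i, str(i)) for i in range(7)]

# A batch size of 1 stands in for the old "unbatched" path.
unbatched_view = flatten(batched(data, 1))
batched_view = flatten(batched(data, 3))
```

Under these assumptions both views equal the original `data`, which is why a separate unbatched save-and-read test adds little once the writer always flattens.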
Test build #22625 timed out for PR 2920 at commit
Test FAILed.
Test build #22655 has started for PR 2920 at commit
Test build #22655 has finished for PR 2920 at commit
Test FAILed.
Test build #22668 has started for PR 2920 at commit
Test build #22668 has finished for PR 2920 at commit
Test FAILed.
Test build #502 has started for PR 2920 at commit
Test build #502 has finished for PR 2920 at commit
Could you fix up the merge conflicts here? Barring that, this LGTM.
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala
Test build #22826 has started for PR 2920 at commit
Test build #22827 has started for PR 2920 at commit
Test build #22826 has finished for PR 2920 at commit
Test PASSed.
Test build #22827 has finished for PR 2920 at commit
Test PASSed.
I'm going to merge this into 1.2 in order to avoid merge conflicts when backporting future bugfixes to that branch. Thanks!
… by default.

This PR simplify serializer, always use batched serializer (AutoBatchedSerializer as default), even batch size is 1.

Author: Davies Liu <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Josh Rosen <[email protected]>

Closes #2920 from davies/fix_autobatch and squashes the following commits:

e544ef9 [Davies Liu] revert unrelated change
6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
1d557fc [Davies Liu] fix tests
8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
76abdce [Davies Liu] clean up
53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
b4292ce [Davies Liu] fix bug in master
d79744c [Davies Liu] recover hive tests
be37ece [Davies Liu] refactor
eb3938d [Davies Liu] refactor serializer in scala
8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.

(cherry picked from commit e4f4263)
Signed-off-by: Josh Rosen <[email protected]>
There was a minor conflict with an MLlib change (it added two new lines of additional code before your change to
Thanks, I kept fixing the conflicts, but missed this one.
This PR simplifies the serializer: it always uses a batched serializer (AutoBatchedSerializer by default), even when the batch size is 1.
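For readers unfamiliar with the idea, the sketch below shows one way an auto-batching serializer can work. It is a standalone illustration, not the code from this PR: the function names `dump_auto_batched` and `load_batched`, the `target_bytes` parameter, the length-prefixed framing, and the doubling/halving policy are all assumptions made for the example.

```python
import io
import itertools
import pickle
import struct


def dump_auto_batched(items, stream, target_bytes=1 << 16):
    """Write items as length-prefixed pickled batches, adapting the batch size.

    Assumed policy: start with one item per batch, double the batch size while
    the pickled batch is smaller than target_bytes, and halve it when a batch
    grows far beyond the target.
    """
    batch_size = 1
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            break
        payload = pickle.dumps(batch, protocol=2)
        # Frame each batch with a 4-byte big-endian length.
        stream.write(struct.pack("!i", len(payload)))
        stream.write(payload)
        if len(payload) < target_bytes:
            batch_size *= 2
        elif len(payload) > target_bytes * 10 and batch_size > 1:
            batch_size //= 2


def load_batched(stream):
    """Read length-prefixed batches and yield the items one by one."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack("!i", header)
        for item in pickle.loads(stream.read(length)):
            yield item


buf = io.BytesIO()
dump_auto_batched(range(1000), buf)
buf.seek(0)
round_tripped = list(load_batched(buf))
```

Because the reader always expects batches, a batch of size 1 needs no special case: it is simply a one-element list, which is the simplification the description refers to.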