[SPARK-7311] Introduce internal Serializer API for determining if serializers support object relocation #5924

Closed
wants to merge 9 commits into from

JoshRosen
Contributor

This patch extends the Serializer interface with a new @Private API which allows serializers to indicate whether they support relocation of serialized objects in serializer stream output.

This relocatability property is described in more detail in Serializer.scala, but in a nutshell, a serializer supports relocation if reordering the bytes of serialized objects in the serialization stream output is equivalent to having reordered those elements prior to serializing them. The optimized shuffle paths introduced in #4450 and #5868 both rely on serializers having this property; this patch just centralizes the logic for determining whether a serializer has it. I also added tests and comments clarifying when this works for KryoSerializer.
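As a rough illustration of the property (not code from this patch), the sketch below uses a hypothetical toy format in which every record is a self-contained, length-prefixed string. The names `RelocationSketch`, `serializeWithOffsets`, `reorder`, and `deserialize` are made up for this example; only JDK classes are used.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object RelocationSketch {
  // Writes each record with writeUTF (length prefix + bytes) and remembers
  // the [start, end) byte range that each record occupies in the output.
  def serializeWithOffsets(records: Seq[String]): (Array[Byte], Seq[(Int, Int)]) = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    val ranges = records.toVector.map { r =>
      val start = bytes.size()
      out.writeUTF(r)
      out.flush()
      (start, bytes.size())
    }
    (bytes.toByteArray, ranges)
  }

  // Rearranges the serialized bytes record by record, without deserializing.
  def reorder(bytes: Array[Byte], ranges: Seq[(Int, Int)], order: Seq[Int]): Array[Byte] =
    order.flatMap { i =>
      val (start, end) = ranges(i)
      bytes.slice(start, end).toSeq
    }.toArray

  def deserialize(bytes: Array[Byte], count: Int): Seq[String] = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    Vector.fill(count)(in.readUTF())
  }

  def main(args: Array[String]): Unit = {
    val records = Vector("a", "bb", "ccc")
    val order = Vector(2, 0, 1)
    val (bytes, ranges) = serializeWithOffsets(records)
    // Reordering the bytes gives the same result as reordering the records first.
    val relocated = deserialize(reorder(bytes, ranges, order), records.size)
    assert(relocated == order.map(records))
  }
}
```

The toy format is relocatable because no record's bytes depend on anything written before it; a format with a stream header or cross-record back-references would not pass the same check.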

This change allows the optimizations in #4450 to be applied for shuffles that use SqlSerializer2.

…nsafeShuffle path is used.

Conflicts:
	core/src/main/scala/org/apache/spark/shuffle/unsafe/UnsafeShuffleManager.scala
I verified that the Kryo tests will fail if we remove the auto-reset
check in KryoSerializer. I also checked that this test fails if we
mistakenly enable this flag for JavaSerializer. This demonstrates that
the test case is actually capable of detecting the types of bugs that it's
trying to prevent.

Of course, it's possible that certain bugs will only surface when serializing
specific data types, so we'll still have to be cautious when overriding
`supportsRelocationOfSerializedObjects` for new serializers.
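One plausible way to see why a Java-serialization-based stream cannot claim this property is sketched below. It uses plain `java.io.ObjectOutputStream` rather than Spark's `JavaSerializer`, and the object and method names are hypothetical. The second object written to the stream refers back to the class descriptor emitted with the first, so swapping their bytes should leave a dangling reference.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.util.Try

object JavaStreamRelocationSketch {
  // Serializes all values through one ObjectOutputStream and returns
  // (bytes, stream header length, per-object [start, end) byte ranges).
  def serialize(values: Seq[AnyRef]): (Array[Byte], Int, Seq[(Int, Int)]) = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.flush() // the stream header is written by the constructor
    val headerLength = bytes.size()
    val ranges = values.toVector.map { v =>
      val start = bytes.size()
      out.writeObject(v)
      out.flush()
      (start, bytes.size())
    }
    (bytes.toByteArray, headerLength, ranges)
  }

  // Keeps the stream header in place and swaps the bytes of the first two objects.
  def swapFirstTwo(bytes: Array[Byte], headerLength: Int, ranges: Seq[(Int, Int)]): Array[Byte] = {
    val header = bytes.take(headerLength)
    val first = bytes.slice(ranges(0)._1, ranges(0)._2)
    val second = bytes.slice(ranges(1)._1, ranges(1)._2)
    header ++ second ++ first
  }

  def readAll(bytes: Array[Byte], count: Int): Try[Seq[AnyRef]] = Try {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    Vector.fill(count)(in.readObject())
  }

  def main(args: Array[String]): Unit = {
    val values = Vector[AnyRef](Int.box(1), Int.box(2))
    val (bytes, headerLength, ranges) = serialize(values)
    assert(readAll(bytes, 2).get == values) // the original order round-trips
    // After relocation the stream can no longer be read back.
    assert(readAll(swapFirstTwo(bytes, headerLength, ranges), 2).isFailure)
  }
}
```

This covers only one data type; as the commit message notes, other bugs may surface only for specific types, so a check like this is a smoke test rather than a proof.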
This lays some groundwork for re-using this test logic for serializers defined
in other subprojects (those projects can just declare a test-jar dependency
on Spark core).
*
* See SPARK-7311 for more details.
*/
@Private
Contributor Author

I'm not sure that we want to commit to this as a stable public API, which is why I've chosen to mark this as private and leave comments warning users that this API is private and subject to change. If someone can think of a better way to restrict use / implementation of this method, I'd be happy to incorporate that change.

@JoshRosen
Contributor Author

/cc @rxin @sryza

* See SPARK-7311 for more details.
*/
@Private
private[spark] def supportsRelocationOfSerializedObjects: Boolean = false
Contributor Author

I'm not set on this name, by the way; happy to change if someone thinks of a less verbose name that's not misleading.
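Because the flag defaults to `false`, a serializer has to opt in explicitly. The following is a minimal sketch of how the opt-in and a caller-side check might fit together. `SketchSerializer`, `SketchJavaSerializer`, `SketchKryoSerializer`, and `canUseOptimizedShuffle` are hypothetical stand-ins rather than Spark classes, and tying the Kryo-like answer to an auto-reset setting is an assumption drawn from the commit note about the auto-reset check.

```scala
object RelocationFlagSketch {
  // Hypothetical stand-in for the Serializer base class; only the flag is modeled.
  abstract class SketchSerializer {
    // Conservative default: a serializer must opt in explicitly.
    def supportsRelocationOfSerializedObjects: Boolean = false
  }

  // Keeps the default, so it never claims relocation support.
  class SketchJavaSerializer extends SketchSerializer

  // Assumption: relocation is only safe while auto-reset is enabled, because
  // otherwise later objects may refer back to earlier ones in the stream.
  class SketchKryoSerializer(autoReset: Boolean) extends SketchSerializer {
    override def supportsRelocationOfSerializedObjects: Boolean = autoReset
  }

  // A caller such as a shuffle manager could consult the flag in one place.
  // A real check would likely involve further conditions beyond this flag.
  def canUseOptimizedShuffle(serializer: SketchSerializer): Boolean =
    serializer.supportsRelocationOfSerializedObjects

  def main(args: Array[String]): Unit = {
    assert(!canUseOptimizedShuffle(new SketchJavaSerializer))
    assert(canUseOptimizedShuffle(new SketchKryoSerializer(autoReset = true)))
    assert(!canUseOptimizedShuffle(new SketchKryoSerializer(autoReset = false)))
  }
}
```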

@rxin
Contributor

rxin commented May 5, 2015

LGTM.

@@ -23,7 +23,7 @@ import java.nio.ByteBuffer
import scala.reflect.ClassTag

import org.apache.spark.{SparkConf, SparkEnv}
-import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.annotation.{Private, Experimental, DeveloperApi}
Contributor

nit: alphabetize

Contributor Author

Argh, I really need to fix my IntelliJ settings. I switched versions and didn't port all of my import sorting settings over, so stuff like this keeps happening :(

@sryza
Contributor

sryza commented May 5, 2015

LGTM other than a couple tiny nits

@SparkQA

SparkQA commented May 6, 2015

Test build #31910 has finished for PR 5924 at commit 0a7ebd7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class EnumUtil
    • class BinaryClassificationEvaluator(JavaEvaluator, HasLabelCol, HasRawPredictionCol):
    • class HasRawPredictionCol(Params):
    • class Evaluator(object):
    • class JavaEvaluator(Evaluator, JavaWrapper):

@JoshRosen
Contributor Author

I've addressed those tiny nits, so I'm now going to merge this into master and branch-1.4.

@JoshRosen JoshRosen changed the title [SPARK-7311] Introduce internal Serializer API method for determining if serializers support object relocation [SPARK-7311] Introduce internal Serializer API for determining if serializers support object relocation May 6, 2015
asfgit pushed a commit that referenced this pull request May 6, 2015
…ializers support object relocation

This patch extends the `Serializer` interface with a new `Private` API which allows serializers to indicate whether they support relocation of serialized objects in serializer stream output.

This relocatability property is described in more detail in `Serializer.scala`, but in a nutshell, a serializer supports relocation if reordering the bytes of serialized objects in the serialization stream output is equivalent to having reordered those elements prior to serializing them. The optimized shuffle paths introduced in #4450 and #5868 both rely on serializers having this property; this patch just centralizes the logic for determining whether a serializer has it. I also added tests and comments clarifying when this works for KryoSerializer.

This change allows the optimizations in #4450 to be applied for shuffles that use `SqlSerializer2`.

Author: Josh Rosen <[email protected]>

Closes #5924 from JoshRosen/SPARK-7311 and squashes the following commits:

50a68ca [Josh Rosen] Address minor nits
0a7ebd7 [Josh Rosen] Clarify reason why SqlSerializer2 supports this serializer
123b992 [Josh Rosen] Cleanup for submitting as standalone patch.
4aa61b2 [Josh Rosen] Add missing newline
2c1233a [Josh Rosen] Small refactoring of SerializerPropertiesSuite to enable test re-use:
0ba75e6 [Josh Rosen] Add tests for serializer relocation property.
450fa21 [Josh Rosen] Back out accidental log4j.properties change
86d4dcd [Josh Rosen] Flag that SparkSqlSerializer2 supports relocation
b9624ee [Josh Rosen] Expand serializer API and use new function to help control when new UnsafeShuffle path is used.

(cherry picked from commit 002c123)
Signed-off-by: Josh Rosen <[email protected]>
@asfgit asfgit closed this in 002c123 May 6, 2015
@JoshRosen JoshRosen deleted the SPARK-7311 branch May 6, 2015 17:54
@SparkQA

SparkQA commented May 6, 2015

Test build #32002 has finished for PR 5924 at commit 50a68ca.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32002/
Test FAILed.

@JoshRosen
Contributor Author

Whoops, looks like this caused a build break because the patch that introduced the @Private annotation was reverted. I'll push a hotfix to re-add that annotation.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015

5 participants