[SPARK-29081][CORE] Replace calls to SerializationUtils.clone on properties with a faster implementation #25787
Conversation
cc @jiangxb1987
ok to test
test this please
```scala
/** Create a new properties object with the same values as `props` */
def cloneProperties(props: Properties): Properties = {
  val resultProps = new Properties()
  resultProps.putAll(props)
```
```
[ERROR] [Error] /home/runner/work/spark/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:2957: ambiguous reference to overloaded definition,
both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
and method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
```
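For context: on JDK 9+, `Properties` overrides `putAll`, so the call above is ambiguous when compiled from Scala. One way to sidestep the overload is to copy entries explicitly; this is a sketch of that approach, not necessarily the exact code the PR landed:

```scala
import java.util.Properties

/** Create a new Properties object with the same values as `props`.
 *  Copying entry-by-entry avoids the putAll call that is ambiguous
 *  between Properties.putAll and Hashtable.putAll on newer JDKs. */
def cloneProperties(props: Properties): Properties = {
  val resultProps = new Properties()
  props.forEach((k, v) => resultProps.put(k, v))
  resultProps
}
```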
Thank you for preventing the JDK11 failure!
The change looks good from my side. Also cc @cloud-fan @zsxwing
Test build #110577 has finished for PR 25787 at commit
LGTM
A benchmark is easy enough to write that we should verify this is actually faster first. I put together a quick and dirty one that tries to serialize a (copy of) System.getProperties locally (57 entries). Indeed, SerializationUtils is about 100x slower, although it's not exactly taking a long time: 157,000ns vs 1,200ns. I think it's fine.
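For illustration, a quick-and-dirty timing along those lines might look like the following. This is a sketch, not the benchmark actually committed in this PR: it does no JIT warm-up, and it assumes commons-lang3 is on the classpath and that it runs inside Spark's codebase, where the `private[spark]` `Utils` object is visible.

```scala
import java.util.Properties
import org.apache.commons.lang3.SerializationUtils
import org.apache.spark.util.Utils

object QuickCloneTiming {
  // Average nanoseconds per invocation over `iters` runs (no warm-up).
  private def avgNanos(iters: Int)(f: => Unit): Long = {
    val start = System.nanoTime()
    var i = 0
    while (i < iters) { f; i += 1 }
    (System.nanoTime() - start) / iters
  }

  def main(args: Array[String]): Unit = {
    // Local copy of System.getProperties, mirroring the test described above.
    val props = new Properties()
    System.getProperties.forEach((k, v) => props.put(k, v))

    println(s"SerializationUtils.clone: ${avgNanos(10000)(SerializationUtils.clone(props))} ns/op")
    println(s"Utils.cloneProperties:    ${avgNanos(10000)(Utils.cloneProperties(props))} ns/op")
  }
}
```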
removing unused import
I made what I hope is a proper benchmark!
Thank you for adding the benchmark. The result is great.
For the benchmark:
- You can see KryoBenchmark.scala and follow its style (a rough sketch of that style follows below).
- You need to add the generated result file together.
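A rough sketch of that style, using Spark's internal Benchmark/BenchmarkBase harness; the object name and structure here are illustrative, so see KryoBenchmark.scala for the real conventions:

```scala
import java.util.Properties

import org.apache.commons.lang3.SerializationUtils

import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
import org.apache.spark.util.Utils

object PropertiesCloneBenchmark extends BenchmarkBase {
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    runBenchmark("Properties Cloning") {
      val props = new Properties()
      System.getProperties.forEach((k, v) => props.put(k, v))

      // `output` comes from BenchmarkBase and routes results to the
      // generated benchmark result file.
      val benchmark = new Benchmark("Properties Cloning", 1, output = output)
      benchmark.addCase("SerializationUtils.clone") { _ =>
        SerializationUtils.clone(props)
      }
      benchmark.addCase("Utils.cloneProperties") { _ =>
        Utils.cloneProperties(props)
      }
      benchmark.run()
    }
  }
}
```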
I don't think we must add the benchmark for this small change, though it wouldn't hurt. The result is convincing already.
Agreed with @srowen. A convincing benchmark result is enough for this change.
```diff
  var i = 0
- while (i < minIters || runTimes.sum < minDuration) {
+ while (i < minIters || (System.nanoTime() - startTime) < minDuration) {
```
Is this change related?
I meant to call this out. My empty-properties test case was taking way too long, and I found that the overhead of the loop was orders of magnitude larger than the function itself, especially when the array buffer needed to resize a lot.
So instead of summing the time for each run, I changed it to keep track of the total elapsed time (sketched below).
I doubt that it will affect many other benchmarks, but it is something to note.
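A minimal sketch of the two loop shapes being compared; the method names are mine for illustration, not the Benchmark class's actual internals:

```scala
import scala.collection.mutable.ArrayBuffer

// Before: per-run times are summed on every loop check, so the check
// itself costs O(n) and the buffer resizes repeatedly as runs accumulate.
def measureBySumming(minIters: Int, minDurationNs: Long)(f: => Unit): Seq[Long] = {
  val runTimes = ArrayBuffer[Long]()
  var i = 0
  while (i < minIters || runTimes.sum < minDurationNs) {
    val t0 = System.nanoTime()
    f
    runTimes += System.nanoTime() - t0
    i += 1
  }
  runTimes
}

// After: total elapsed time is tracked directly, so the loop condition
// is O(1) no matter how many runs have accumulated.
def measureByElapsed(minIters: Int, minDurationNs: Long)(f: => Unit): Seq[Long] = {
  val runTimes = ArrayBuffer[Long]()
  val startTime = System.nanoTime()
  var i = 0
  while (i < minIters || (System.nanoTime() - startTime) < minDurationNs) {
    val t0 = System.nanoTime()
    f
    runTimes += System.nanoTime() - t0
    i += 1
  }
  runTimes
}
```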
@viirya I undid that functionality change in favor of tracking the total time as it passes instead of summing over the array on each loop iteration. This should be the same behavior but strictly more efficient.
@dongjoon-hyun I'm unclear on the style you meant, so I just added the comments.
@databricks-david-lewis. The current one is correct and what I wanted. :)
As a final piece of this PR, could you bring the PR description up to date? For example, the following has become invalid.
BTW, @srowen and @viirya seem to suggest removing the following from this PR in their comments (here and here).
If we need to remove it, could you tell @databricks-david-lewis once more? I'm fine with both ways (adding or removing them).
I think reporting a convincing result is good enough for such a change. But since they are already added and wouldn't hurt, I don't strongly want to remove them.
+1, LGTM. Merged to master.
Thank you, @databricks-david-lewis, @gatorsmile, @jiangxb1987, @zsxwing, @srowen, @viirya.
- Replace use of SerializationUtils.clone with the new Utils.cloneProperties method.
- Add benchmark + results showing a dramatic speedup for effectively equivalent functionality.
What changes were proposed in this pull request?
While I am not sure that SerializationUtils.clone is a performance issue in production, I am sure that it is overkill for the task it is doing (providing a distinct copy of a Properties object). This PR provides a benchmark showing the dramatic improvement over the clone operation and replaces uses of SerializationUtils.clone on Properties with the more specialized Utils.cloneProperties.
.Does this PR introduce any user-facing change?
Strings are immutable, so there is no reason to serialize and deserialize them; it just creates extra garbage.
The only functionality that would change is the unsupported insertion of non-String objects into the Spark local properties.
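As an illustrative sketch of that point (my example, not code from the PR): the entry-by-entry copy is shallow, which is safe precisely because String keys and values are immutable.

```scala
import java.util.Properties

// The clone shares the immutable String values with the original, so no
// serialization round-trip is needed; mutating one map never affects the other.
val original = new Properties()
original.setProperty("spark.app.name", "demo")

val copy = new Properties()
original.forEach((k, v) => copy.put(k, v))

copy.setProperty("spark.app.name", "changed")
assert(original.getProperty("spark.app.name") == "demo")
```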
How was this patch tested?