[SPARK-29081][CORE] Replace calls to SerializationUtils.clone on properties with a faster implementation #25787

databricks-david-lewis · 2019-09-13T23:09:50Z

Replace use of SerializationUtils.clone with new Utils.cloneProperties method
Add benchmark + results showing dramatic speed up for effectively equivalent functionality.

What changes were proposed in this pull request?

While I am not sure that SerializationUtils.clone is a performance issue in production, I am sure that it is overkill for the task it is doing (providing a distinct copy of a Properties object).
This PR provides a benchmark showing the dramatic improvement over the clone operation and replaces uses of SerializationUtils.clone on Properties with the more specialized Utils.cloneProperties.

Does this PR introduce any user-facing change?

Strings are immutable, so there is no reason to serialize and deserialize them; doing so just creates extra garbage.
The only functionality that would change is the unsupported insertion of non-String objects into the Spark local properties.
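
To illustrate the behavioral difference mentioned above, here is a minimal sketch (not the PR's code). It assumes a caller bypasses setProperty and stores a mutable, non-String value with put; the names SharedValueSketch and shallowCopy are hypothetical. An entry-by-entry copy shares that value with the original, whereas a serialization round trip would have produced an independent copy of it.

```scala
import java.util.{ArrayList, Properties}
import java.util.function.BiConsumer

object SharedValueSketch {
  // Hypothetical helper: copies entries one by one, so values are shared by reference.
  def shallowCopy(props: Properties): Properties = {
    val result = new Properties()
    props.forEach(new BiConsumer[Object, Object] {
      override def accept(k: Object, v: Object): Unit = result.put(k, v)
    })
    result
  }

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    val mutableValue = new ArrayList[String]()
    props.put("unsupported.key", mutableValue) // non-String value, the unsupported case
    val copy = shallowCopy(props)
    mutableValue.add("changed after copy")
    // The copy observes the mutation because both maps hold the same list instance.
    println(copy.get("unsupported.key"))
  }
}
```

For String values, which are immutable, sharing the reference is indistinguishable from copying it.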

How was this patch tested?

Pass Jenkins with the existing tests.
Since this is a performance improvement PR, the benchmark was run manually.


gatorsmile · 2019-09-13T23:11:47Z

cc @jiangxb1987

gatorsmile · 2019-09-13T23:12:08Z

ok to test

gatorsmile · 2019-09-13T23:14:12Z

test this please

gatorsmile · 2019-09-13T23:16:23Z

core/src/main/scala/org/apache/spark/util/Utils.scala

+  /** Create a new properties object with the same values as `props` */
+  def cloneProperties(props: Properties): Properties = {
+    val resultProps = new Properties()
+    resultProps.putAll(props)


[ERROR] [Error] /home/runner/work/spark/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:2957: ambiguous reference to overloaded definition, both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit and method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit

Thank you for preventing JDK11 failure!
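
For context, a later commit in this PR is described as "fixing jdk11 compilation bug by calling .forEach(...) instead of .putAll". A minimal sketch of that approach follows; it is an approximation, not necessarily the exact code that was merged, and the enclosing object name is hypothetical. It uses an explicit BiConsumer so it does not depend on Scala's lambda-to-SAM conversion.

```scala
import java.util.Properties
import java.util.function.BiConsumer

object ClonePropertiesSketch {
  /** Create a new Properties object with the same entries as `props`. */
  def cloneProperties(props: Properties): Properties = {
    val resultProps = new Properties()
    // forEach has a single applicable signature, so the overload ambiguity
    // reported for putAll on JDK 11 does not arise here.
    props.forEach(new BiConsumer[Object, Object] {
      override def accept(k: Object, v: Object): Unit = resultProps.put(k, v)
    })
    resultProps
  }
}
```

Note that, like putAll, this copies only the entries held directly by `props`; any defaults passed to the Properties constructor are not carried over.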

jiangxb1987

The change looks good from my side. Also cc @cloud-fan @zsxwing

SparkQA · 2019-09-14T01:14:53Z

Test build #110577 has finished for PR 25787 at commit b7357c1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2019-09-14T06:19:42Z

The only functionality that would be changed is the unsupported insertion of non-String objects into the spark local properties.

~~I remember this was the reason we decided to use SerializationUtils.clone at the beginning. See #8710 (comment)~~ NVM. I just realized we don't have any public API to set non-string values. Totally don't remember why we didn't use this simpler way.

LGTM

core/src/main/scala/org/apache/spark/SparkContext.scala

srowen

A benchmark is so easy that we should verify this is actually faster first. I put together a quick and dirty one that tries to clone a (copy of) System.getProperties locally (57 entries). Indeed, SerializationUtils is about 100x slower, although it's not exactly taking a long time: 157,000ns vs 1200ns. I think it's fine.
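
A quick comparison along these lines could look like the sketch below. It is not the benchmark that was actually run: to stay self-contained it stands in for SerializationUtils.clone with a plain java.io serialization round trip, and the names CloneTimingSketch, serializationClone, directClone and avgNanos are made up for illustration. Timings from such a loop are rough and vary by machine and JVM.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.Properties
import java.util.function.BiConsumer

object CloneTimingSketch {
  // Serialize-then-deserialize copy, used here as a stand-in for SerializationUtils.clone.
  def serializationClone(props: Properties): Properties = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(props)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[Properties] finally in.close()
  }

  // Entry-by-entry copy into a fresh Properties object.
  def directClone(props: Properties): Properties = {
    val result = new Properties()
    props.forEach(new BiConsumer[Object, Object] {
      override def accept(k: Object, v: Object): Unit = result.put(k, v)
    })
    result
  }

  // Average nanoseconds per call over `iters` calls, after an equally long warm-up.
  def avgNanos(iters: Int)(f: => Any): Long = {
    var i = 0
    while (i < iters) { f; i += 1 }
    val start = System.nanoTime()
    i = 0
    while (i < iters) { f; i += 1 }
    (System.nanoTime() - start) / iters
  }

  def main(args: Array[String]): Unit = {
    val source = directClone(System.getProperties)
    println(s"serialization round trip: ${avgNanos(1000)(serializationClone(source))} ns/op")
    println(s"direct copy:              ${avgNanos(1000)(directClone(source))} ns/op")
  }
}
```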

removing unused import

databricks-david-lewis · 2019-09-14T18:24:33Z

I made what I hope is a proper benchmark!
Here are the results:

[info] Running org.apache.spark.util.PropertiesCloneBenchmark
[info] Running benchmark: Empty Properties
[info]   Running case: SerializationUtils.clone
[info]   Stopped after 322992 iterations, 1965 ms
[info]   Running case: Utils.cloneProperties
[info]   Stopped after 14292083 iterations, 999 ms
[info]
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.14.6
[info] Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
[info] Empty Properties:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] SerializationUtils.clone                              0              0           0          0.2        4216.0       1.0X
[info] Utils.cloneProperties                                 0              0           0         90.9          11.0     383.3X
[info]
[info] Running benchmark: System Properties
[info]   Running case: SerializationUtils.clone
[info]   Stopped after 12051 iterations, 1998 ms
[info]   Running case: Utils.cloneProperties
[info]   Stopped after 1534682 iterations, 1891 ms
[info]
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.14.6
[info] Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
[info] System Properties:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] SerializationUtils.clone                              0              0           0          0.0      121122.0       1.0X
[info] Utils.cloneProperties                                 0              0           0          1.1         947.0     127.9X
[info]
[info] Running benchmark: Small Properties
[info]   Running case: SerializationUtils.clone
[info]   Stopped after 5149 iterations, 1999 ms
[info]   Running case: Utils.cloneProperties
[info]   Stopped after 1428075 iterations, 1902 ms
[info]
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.14.6
[info] Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
[info] Small Properties:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] SerializationUtils.clone                              0              0           0          0.0      315474.0       1.0X
[info] Utils.cloneProperties                                 0              0           0          0.9        1106.0     285.2X
[info]
[info] Running benchmark: Medium Properties
[info]   Running case: SerializationUtils.clone
[info]   Stopped after 1304 iterations, 2000 ms
[info]   Running case: Utils.cloneProperties
[info]   Stopped after 296214 iterations, 1979 ms
[info]
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.14.6
[info] Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
[info] Medium Properties:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] SerializationUtils.clone                              1              2           0          0.0     1296861.0       1.0X
[info] Utils.cloneProperties                                 0              0           0          0.2        5450.0     238.0X
[info]
[info] Running benchmark: Large Properties
[info]   Running case: SerializationUtils.clone
[info]   Stopped after 675 iterations, 2001 ms
[info]   Running case: Utils.cloneProperties
[info]   Stopped after 154225 iterations, 1989 ms
[info]
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.14.6
[info] Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
[info] Large Properties:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] SerializationUtils.clone                              3              3           0          0.0     2587096.0       1.0X
[info] Utils.cloneProperties                                 0              0           0          0.1       11077.0     233.6X
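
As a sanity check on the tables above, the Relative column appears to be the baseline's per-row time divided by each case's per-row time. The sketch below recomputes it from the rounded Per Row(ns) figures reported above, so agreement with the printed values is approximate in principle; the object and method names are hypothetical.

```scala
object RelativeSpeedupSketch {
  // (benchmark, SerializationUtils.clone ns per row, Utils.cloneProperties ns per row)
  val perRowNs: Seq[(String, Double, Double)] = Seq(
    ("Empty Properties", 4216.0, 11.0),
    ("System Properties", 121122.0, 947.0),
    ("Small Properties", 315474.0, 1106.0),
    ("Medium Properties", 1296861.0, 5450.0),
    ("Large Properties", 2587096.0, 11077.0)
  )

  // Ratio of baseline time to candidate time, rounded to one decimal place.
  def relative(baselineNs: Double, candidateNs: Double): Double =
    math.round(baselineNs / candidateNs * 10) / 10.0

  def main(args: Array[String]): Unit =
    perRowNs.foreach { case (name, base, cand) =>
      println(s"$name: ${relative(base, cand)}X")
    }
}
```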

dongjoon-hyun

Thank you for adding the benchmark. The result is great.

For the benchmark,

You can look at KryoBenchmark.scala and follow its style.
You need to add the generated result file as well.

srowen · 2019-09-14T19:41:35Z

I don't think we must add the benchmark for this small change, though it wouldn't hurt. The result is convincing already.

viirya

Agreed with @srowen. A convincing benchmark result is enough for this change.

viirya · 2019-09-14T20:35:41Z

core/src/test/scala/org/apache/spark/benchmark/Benchmark.scala

    var i = 0
-    while (i < minIters || runTimes.sum < minDuration) {
+    while (i < minIters || (System.nanoTime() - startTime) < minDuration) {


Is this change related?

I meant to call this out. My empty properties test case was taking way too long. I found that the overhead of the loop was orders of magnitude larger than the function itself, especially when the array buffer needed to resize a lot.
So instead of summing the time for each run I changed it to keep track of the total time elapsed.
I doubt that it will affect many other benchmarks, but it is something to note.

adding benchmark file

databricks-david-lewis · 2019-09-14T20:49:51Z

@viirya I undid that functionality change in favor of keeping track of the total time as it passes instead of summing over the array each loop. This should be the same behavior but strictly more efficient.
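
The idea described here can be sketched as follows. This is an illustrative reduction, not the actual Benchmark.scala code: the names TimedLoopSketch, measure and totalTime are assumptions. The loop condition reads a running total that is updated once per iteration, instead of calling runTimes.sum (which walks the whole buffer) on every check.

```scala
import scala.collection.mutable.ArrayBuffer

object TimedLoopSketch {
  /**
   * Run `f` at least `minIters` times and until at least `minDurationNs`
   * of measured time has accumulated. Returns the individual run times.
   */
  def measure(minIters: Int, minDurationNs: Long)(f: () => Unit): ArrayBuffer[Long] = {
    val runTimes = ArrayBuffer.empty[Long]
    var totalTime = 0L // always equal to runTimes.sum, maintained incrementally
    var i = 0
    while (i < minIters || totalTime < minDurationNs) {
      val start = System.nanoTime()
      f()
      val elapsed = System.nanoTime() - start
      runTimes += elapsed
      totalTime += elapsed
      i += 1
    }
    runTimes
  }
}
```

Because only time spent inside `f` is added to the total, the stopping rule stays the same as with `runTimes.sum < minDuration`, while each loop check becomes constant-time.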

databricks-david-lewis · 2019-09-14T20:50:43Z

@dongjoon-hyun I'm unclear on the style you meant, so I just added the comments.

dongjoon-hyun · 2019-09-14T20:56:38Z

@databricks-david-lewis . The current one is correct and what I wanted. :)

dongjoon-hyun · 2019-09-14T20:58:41Z

As a final piece of this PR, could you bring the PR description up to date? For example, the following has become invalid.

I don't have empirical evidence that SerializationUtils.clone is a problem, but it is certainly over-kill. I see it often in stacktraces I take for debugging and want to get rid of it.

dongjoon-hyun · 2019-09-14T22:42:59Z

BTW, @srowen and @viirya seem to suggest removing the following from this PR in their comments (here and here)

PropertiesCloneBenchmark.scala
PropertiesCloneBenchmark-results.txt
Benchmark.scala

If we need to remove them, could you tell @databricks-david-lewis once more? I'm fine either way (adding or removing them).

viirya · 2019-09-14T22:49:16Z

I think reporting a convincing result is good enough for such a change. But as they are added and wouldn't hurt, I don't strongly want to remove them.

dongjoon-hyun

+1, LGTM. Merged to master.
Thank you, @databricks-david-lewis , @gatorsmile , @jiangxb1987 , @zsxwing, @srowen , @viirya .

Replace use of SerializationUtils.clone with new Utils.cloneProperties method

b7357c1

gatorsmile reviewed Sep 13, 2019

fixing jdk11 compilation bug by calling .forEach(...) instead of .putAll

578675d

maropu changed the title ~~[SPARK-29081] Replace calls to SerializationUtils.clone on properties with a faster implementation~~ [SPARK-29081][CORE] Replace calls to SerializationUtils.clone on properties with a faster implementation Sep 13, 2019

jiangxb1987 approved these changes Sep 13, 2019

dongjoon-hyun added the SPARK CORE label Sep 14, 2019

dongjoon-hyun reviewed Sep 14, 2019

core/src/main/scala/org/apache/spark/SparkContext.scala

srowen approved these changes Sep 14, 2019

adding benchmark!

b659939

removing unused import

dongjoon-hyun requested changes Sep 14, 2019

viirya approved these changes Sep 14, 2019

viirya reviewed Sep 14, 2019

databricks-david-lewis added 2 commits September 14, 2019 14:45

Adding comment and running instructions

0d2c5bf

adding benchmark file

revert back to previous functionality but optimize time keeping

fa4b77a

dongjoon-hyun approved these changes Sep 15, 2019

dongjoon-hyun closed this in 8c0e961 Sep 15, 2019

dongjoon-hyun mentioned this pull request Nov 25, 2019

[SPARK-30030][INFRA] Use RegexChecker instead of TokenChecker to check org.apache.commons.lang. #26666

Closed
