[SPARK-4611][MLlib] Implement the efficient vector norm #3462

dbtsai · 2014-11-26T01:06:02Z

The vector norm in Breeze is implemented with activeIterator, which is known to be very slow.
This PR implements an efficient vector norm. With this API, Normalizer and k-means both
see a significant performance improvement.

Here is the benchmark on the mnist8m dataset.

a) Normalizer
Before
DenseVector: 68.25secs
SparseVector: 17.01secs

With this PR
DenseVector: 12.71secs
SparseVector: 2.73secs

b) k-means
Before
DenseVector: 83.46secs
SparseVector: 61.60secs

With this PR
DenseVector: 70.04secs
SparseVector: 59.05secs
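
To make the timings above easier to compare, the sketch below turns each before/after pair into a speedup ratio. The `BenchmarkSpeedup` object and `Timing` case class are illustrative names, not part of the PR; only the numbers come from the benchmark.

```scala
object BenchmarkSpeedup {
  // One benchmark row: timings in seconds, copied from the figures above.
  final case class Timing(label: String, before: Double, after: Double) {
    // Speedup is the old time divided by the new time.
    def speedup: Double = before / after
  }

  val timings: Seq[Timing] = Seq(
    Timing("Normalizer, DenseVector", 68.25, 12.71),
    Timing("Normalizer, SparseVector", 17.01, 2.73),
    Timing("k-means, DenseVector", 83.46, 70.04),
    Timing("k-means, SparseVector", 61.60, 59.05)
  )

  def main(args: Array[String]): Unit =
    timings.foreach(t => println(f"${t.label}%-26s ${t.speedup}%.2fx"))
}
```

Normalizer comes out at roughly 5.4x (dense) and 6.2x (sparse), while k-means gains only about 1.19x and 1.04x, presumably because the norm is a smaller share of its total work.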

SparkQA · 2014-11-26T01:09:56Z

Test build #23851 has started for PR 3462 at commit c1a877c.

This patch merges cleanly.

SparkQA · 2014-11-26T01:15:17Z

Test build #23853 has started for PR 3462 at commit dbed124.

This patch merges cleanly.

dbtsai · 2014-11-26T02:17:41Z

Timings when using foreachActive instead of a while loop:

DenseVector: 12.95secs
SparseVector: 2.89secs

  private[spark] def norm(p: Double): Double = {
    require(p >= 1.0)
    if (p == 1) {
      var sum = 0.0
      this.foreachActive { (_, value) =>
        sum += math.abs(value)
      }
      sum
    } else if (p == 2) {
      var sum = 0.0
      this.foreachActive { (_, value) =>
        sum += value * value
      }
      math.sqrt(sum)
    } else if (p == Double.PositiveInfinity) {
      var max = 0.0
      this.foreachActive { (_, value) =>
        val absValue = math.abs(value)
        if (absValue > max) max = absValue
      }
      max
    } else {
      var sum = 0.0
      this.foreachActive { (_, value) =>
        sum += math.pow(math.abs(value), p)
      }
      math.pow(sum, 1.0 / p)
    }
  }
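
For comparison, here is a standalone sketch of the same four-branch norm written with index-based while loops over a plain `Array[Double]`. The object name `NormSketch` is illustrative, and this is an approximation of the while-loop variant mentioned above, not the code merged in this PR.

```scala
object NormSketch {
  /** p-norm of `values` for p >= 1, using while loops instead of closures. */
  def norm(values: Array[Double], p: Double): Double = {
    require(p >= 1.0, s"p must be >= 1.0, got $p")
    val size = values.length
    var i = 0
    if (p == 1.0) {
      // L1 norm: sum of absolute values.
      var sum = 0.0
      while (i < size) { sum += math.abs(values(i)); i += 1 }
      sum
    } else if (p == 2.0) {
      // L2 norm: square root of the sum of squares.
      var sum = 0.0
      while (i < size) { sum += values(i) * values(i); i += 1 }
      math.sqrt(sum)
    } else if (p == Double.PositiveInfinity) {
      // L-infinity norm: largest absolute value.
      var max = 0.0
      while (i < size) {
        val absValue = math.abs(values(i))
        if (absValue > max) max = absValue
        i += 1
      }
      max
    } else {
      // General case: (sum |x|^p)^(1/p).
      var sum = 0.0
      while (i < size) { sum += math.pow(math.abs(values(i)), p); i += 1 }
      math.pow(sum, 1.0 / p)
    }
  }
}
```

Because zeros contribute nothing to any of these sums or to the maximum, passing only the stored values of a sparse vector gives the same result as passing the full dense array.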

SparkQA · 2014-11-26T02:35:28Z

Test build #23851 has finished for PR 3462 at commit c1a877c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-26T02:35:31Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23851/

SparkQA · 2014-11-26T02:38:30Z

Test build #23853 has finished for PR 3462 at commit dbed124.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-26T02:38:33Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23853/

SparkQA · 2014-11-26T03:22:52Z

Test build #23869 has started for PR 3462 at commit 0b632e6.

This patch merges cleanly.

mengxr · 2014-11-26T04:19:19Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

+   */
+  private[spark] def norm(p: Double): Double
+
+  protected def norm(p: Double, values: Array[Double]): Double = {


I think we should move this method to Vectors because it is a static method, like the following:

  object Vectors {
    private def norm(values: Array[Double], p: Double): Double = { ... }

    private[mllib] def norm(v: Vector, p: Double): Double = {
      norm(v.values, p)
    }
  }

dbtsai · 2014-11-26

The parent Vector class doesn't have a values member, so this will not work unless we do another pattern match. I think it's okay to keep norm as a member function of Vector. What do you think?
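
One way to reconcile the two positions is a static method that pattern matches on the concrete vector type to reach its values array. The sketch below uses minimal stand-in types (`Vec`, `Dense`, `Sparse`) that are hypothetical and are not the real MLlib classes. The private helper is also simplified to a generic formula rather than the specialized branches discussed in this PR.

```scala
object StaticNormSketch {
  // Minimal stand-ins for the vector hierarchy; NOT the real Spark classes.
  sealed trait Vec
  final case class Dense(values: Array[Double]) extends Vec
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

  // Static helper on the raw values; simplified generic formula for brevity.
  private def norm(values: Array[Double], p: Double): Double = {
    require(p >= 1.0, s"p must be >= 1.0, got $p")
    if (p == Double.PositiveInfinity) {
      values.foldLeft(0.0)((m, x) => math.max(m, math.abs(x)))
    } else {
      math.pow(values.foldLeft(0.0)((s, x) => s + math.pow(math.abs(x), p)), 1.0 / p)
    }
  }

  // The parent trait has no `values` member, so match on the subtype to get it.
  def norm(v: Vec, p: Double): Double = v match {
    case Dense(vs)        => norm(vs, p)
    case Sparse(_, _, vs) => norm(vs, p) // stored values only; zeros add nothing
  }
}
```

The cost of the extra match is one type test per call, which is negligible next to the loop over the values.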

SparkQA · 2014-11-26T04:47:41Z

Test build #23869 has finished for PR 3462 at commit 0b632e6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-26T04:47:45Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23869/

SparkQA · 2014-11-26T07:35:04Z

Test build #23881 has started for PR 3462 at commit 9b7cb56.

This patch merges cleanly.

SparkQA · 2014-11-26T07:55:35Z

Test build #23885 has started for PR 3462 at commit 6fa616c.

This patch merges cleanly.

SparkQA · 2014-11-26T08:10:39Z

Test build #23886 has started for PR 3462 at commit 0c3637f.

This patch merges cleanly.

srowen · 2014-11-26T08:34:05Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

+    }
+    val size = values.size
+
+    if (p == 1) {


How about p match { ... } here, with @switch to ensure it's just a lookup? It should be faster even than the ifs.

dbtsai · 2014-11-26

Interesting, I will try it tomorrow, but I don't expect much difference.

dbtsai · 2014-11-26

In bytecode, there is no direct switch operation. As a result, a switch or pattern match is compiled into if statements in the bytecode. See the following example:

  def fun1(p: Double) = {
    p match {
      case 1.0 => 1.0
      case 2.0 => 2.0
      case _ => p
    }
  }

  def fun2(p: Double) = {
    if (p == 1.0) 1.0
    else if (p == 2.0) 2.0
    else p
  }

will be compiled to

  // access flags 0x1
  public fun1(D)D
   L0
    LINENUMBER 145 L0
    DLOAD 1
    DSTORE 3
   L1
    LINENUMBER 146 L1
    DCONST_1
    DLOAD 3
    DCMPL
    IFNE L2
    DCONST_1
    DSTORE 5
    GOTO L3
   L2
    LINENUMBER 147 L2
   FRAME APPEND [D]
    LDC 2.0
    DLOAD 3
    DCMPL
    IFNE L4
    LDC 2.0
    DSTORE 5
    GOTO L3
   L4
    LINENUMBER 148 L4
   FRAME SAME
    DLOAD 1
    DSTORE 5
   L3
    LINENUMBER 145 L3
   FRAME APPEND [D]
    DLOAD 5
    DRETURN
   L5
    LOCALVARIABLE this Lorg/apache/spark/mllib/stat/Test$; L0 L5 0
    LOCALVARIABLE p D L0 L5 1
    MAXSTACK = 4
    MAXLOCALS = 7

  // access flags 0x1
  public fun2(D)D
   L0
    LINENUMBER 153 L0
    DLOAD 1
    DCONST_1
    DCMPL
    IFNE L1
    DCONST_1
    GOTO L2
   L1
    LINENUMBER 154 L1
   FRAME SAME
    DLOAD 1
    LDC 2.0
    DCMPL
    IFNE L3
    LDC 2.0
    GOTO L2
   L3
    LINENUMBER 155 L3
   FRAME SAME
    DLOAD 1
   L2
    LINENUMBER 153 L2
   FRAME SAME1 D
    DRETURN
   L4
    LOCALVARIABLE this Lorg/apache/spark/mllib/stat/Test$; L0 L4 0
    LOCALVARIABLE p D L0 L4 1
    MAXSTACK = 4
    MAXLOCALS = 3

srowen · 2014-11-26

This is an interesting tangent. What happens if you add @switch? http://www.scala-lang.org/api/current/index.html#scala.annotation.switch
Bytecode should have instructions for switch statements that aren't just conditionals, like tableswitch: https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10

dbtsai · 2014-11-26

Ha, it only works if I change the type from Double to Int. See the Oracle doc you referenced:

"The Java Virtual Machine's tableswitch and lookupswitch instructions operate only on int data. Because operations on byte, char, or short values are internally promoted to int, a switch whose expression evaluates to one of those types is compiled as though it evaluated to type int."

With

  def fun1(p: Int) = {
    (p: @switch) match {
      case 1 => 1
      case 2 => 2
      case _ => p
    }
  }

I got

  public fun1(I)I
   L0
    LINENUMBER 147 L0
    ILOAD 1
    ISTORE 2
    ILOAD 2
    TABLESWITCH
      1: L1
      2: L2
      default: L3
   L3
    LINENUMBER 150 L3
   FRAME APPEND [I]
    ILOAD 1
    GOTO L4
   L2
    LINENUMBER 149 L2
   FRAME SAME
    ICONST_2
    GOTO L4
   L1
    LINENUMBER 148 L1
   FRAME SAME
    ICONST_1
   L4
    LINENUMBER 147 L4
   FRAME SAME1 I
    IRETURN
   L5
    LOCALVARIABLE this Lorg/apache/spark/mllib/stat/Test$; L0 L5 0
    LOCALVARIABLE p I L0 L5 1
    MAXSTACK = 1
    MAXLOCALS = 3

mengxr · 2014-12-01

It is an interesting discussion ~ :) But maybe more people are familiar with the if ... else if ... else statement, and this is not on the critical path.

dbtsai · 2014-12-01

Yeah, but even with @switch here, the code will not be optimized unless p has type Int.
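
The outcome of this thread can be condensed into a small, self-contained sketch. `SwitchSketch`, `onInt`, and `onDouble` are illustrative names, not code from this PR. The `@switch` annotation asks the compiler to report when it cannot emit a `tableswitch` or `lookupswitch`, which is only possible for scrutinees that are treated as `int` on the JVM.

```scala
import scala.annotation.switch

object SwitchSketch {
  // Int scrutinee with literal cases: eligible for a JVM switch instruction.
  // @switch makes the compiler complain if it has to fall back to comparisons.
  def onInt(p: Int): Int = (p: @switch) match {
    case 1 => 10
    case 2 => 20
    case 3 => 30
    case _ => p
  }

  // Double scrutinee: the JVM switch instructions operate only on int data,
  // so this match is compiled as a chain of comparisons, much like if/else.
  // The annotation is left off here because it could not be satisfied.
  def onDouble(p: Double): Double = p match {
    case 1.0 => 1.0
    case 2.0 => 2.0
    case _   => p
  }
}
```

Since the norm's `p` is a `Double`, the if/else chain in the PR and a pattern match would compile to essentially the same comparisons, so the choice between them is a matter of readability.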

SparkQA · 2014-11-26T08:57:58Z

Test build #23881 has finished for PR 3462 at commit 9b7cb56.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-26T08:58:01Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23881/

SparkQA · 2014-11-26T09:24:17Z

Test build #23885 has finished for PR 3462 at commit 6fa616c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-26T09:24:21Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23885/

SparkQA · 2014-11-26T09:35:29Z

Test build #23886 has finished for PR 3462 at commit 0c3637f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-26T09:35:33Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23886/

mengxr · 2014-12-01T08:26:12Z

mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala

+
+    assert(Vectors.norm(dv, 3.7) ~== math.pow(dv.toArray.foldLeft(0.0)((a, v) =>
+      a + math.pow(math.abs(v), 3.7)), 1.0 / 3.7) relTol 1E-8)
+    assert(Vectors.norm(sv, 3.7) ~== math.pow(dv.toArray.foldLeft(0.0)((a, v) =>


dv -> sv
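
The point of this fix is that the expected value for the sparse assertion should be derived from the sparse vector's own data. The sketch below shows the idea without any Spark dependency. The arrays, `pNorm`, and `relClose` are hypothetical stand-ins for the suite's `dv`, `sv`, and the `~==` / `relTol` helpers, and the example data is made up.

```scala
object NormTestSketch {
  // Reference p-norm via foldLeft, mirroring the expression used in the test.
  def pNorm(values: Array[Double], p: Double): Double =
    math.pow(values.foldLeft(0.0)((a, v) => a + math.pow(math.abs(v), p)), 1.0 / p)

  // Relative-tolerance comparison, a stand-in for `~== ... relTol`.
  def relClose(x: Double, y: Double, relTol: Double): Boolean = {
    val scale = math.max(math.abs(x), math.abs(y))
    scale == 0.0 || math.abs(x - y) <= relTol * scale
  }

  // Made-up data: a dense vector, and a sparse vector with different entries.
  val dvArr: Array[Double] = Array(0.0, -1.2, 3.1, 0.0, -4.5, 1.9)
  val svArr: Array[Double] = Array(0.0, 0.0, 3.1, 0.0, 0.0, 1.9) // dense view of sv
  val svStored: Array[Double] = Array(3.1, 1.9)                  // stored values of sv

  def main(args: Array[String]): Unit = {
    // Correct: the sparse result is checked against a reference built from sv.
    assert(relClose(pNorm(svStored, 3.7), pNorm(svArr, 3.7), 1e-8))
    // Checking against dv's reference only passes if dv and sv hold the same data.
    assert(!relClose(pNorm(svStored, 3.7), pNorm(dvArr, 3.7), 1e-8))
  }
}
```

If the two fixtures happen to hold identical values, the mistyped assertion still passes, which is why this kind of typo is easy to miss.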

SparkQA · 2014-12-01T08:55:11Z

Test build #23984 has started for PR 3462 at commit 63c7165.

This patch merges cleanly.

SparkQA · 2014-12-01T10:10:17Z

Test build #23984 has finished for PR 3462 at commit 63c7165.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-01T10:10:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23984/

The vector norm in breeze is implemented by `activeIterator` which is known to be very slow.
In this PR, an efficient vector norm is implemented, and with this API, `Normalizer` and
`k-means` have big performance improvement.

Here is the benchmark against mnist8m dataset.

a) `Normalizer`
Before
DenseVector: 68.25secs
SparseVector: 17.01secs

With this PR
DenseVector: 12.71secs
SparseVector: 2.73secs

b) `k-means`
Before
DenseVector: 83.46secs
SparseVector: 61.60secs

With this PR
DenseVector: 70.04secs
SparseVector: 59.05secs

Author: DB Tsai <[email protected]>

Closes #3462 from dbtsai/norm and squashes the following commits:

63c7165 [DB Tsai] typo
0c3637f [DB Tsai] add import org.apache.spark.SparkContext._ back
6fa616c [DB Tsai] address feedback
9b7cb56 [DB Tsai] move norm to static method
0b632e6 [DB Tsai] kmeans
dbed124 [DB Tsai] style
c1a877c [DB Tsai] first commit

(cherry picked from commit 64f3175)
Signed-off-by: Xiangrui Meng <[email protected]>

mengxr · 2014-12-02T03:41:39Z

LGTM. Merged into master and branch-1.2. Thanks!

first commit

c1a877c

style

dbed124

kmeans

0b632e6

dbtsai changed the title ~~Implement the efficient vector norm~~ [SPARK-4611][MLlib] Implement the efficient vector norm Nov 26, 2014

mengxr reviewed Nov 26, 2014

move norm to static method

9b7cb56

address feedback

6fa616c

add import org.apache.spark.SparkContext._ back

0c3637f

srowen reviewed Nov 26, 2014

mengxr reviewed Dec 1, 2014

typo

63c7165

asfgit closed this in 64f3175 Dec 2, 2014

dbtsai deleted the norm branch December 3, 2014 19:21
