Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4611][MLlib] Implement the efficient vector norm #3462

Closed
wants to merge 7 commits into from
Closed

[SPARK-4611][MLlib] Implement the efficient vector norm #3462

wants to merge 7 commits into from

Conversation

dbtsai
Copy link
Member

@dbtsai dbtsai commented Nov 26, 2014

The vector norm in breeze is implemented by activeIterator which is known to be very slow.
In this PR, an efficient vector norm is implemented, and with this API, Normalizer and
k-means have big performance improvement.

Here is the benchmark against mnist8m dataset.

a) Normalizer
Before
DenseVector: 68.25secs
SparseVector: 17.01secs

With this PR
DenseVector: 12.71secs
SparseVector: 2.73secs

b) k-means
Before
DenseVector: 83.46secs
SparseVector: 61.60secs

With this PR
DenseVector: 70.04secs
SparseVector: 59.05secs

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23851 has started for PR 3462 at commit c1a877c.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23853 has started for PR 3462 at commit dbed124.

  • This patch merges cleanly.

@dbtsai
Copy link
Member Author

dbtsai commented Nov 26, 2014

Using foreachActive instead of while loop

DenseVector: 12.95secs
SparseVector: 2.89secs

  private[spark] def norm(p: Double): Double = {
    require(p >= 1.0)
    if (p == 1) {
      var sum = 0.0
      this.foreachActive { (_, value) =>
        sum += math.abs(value)
      }
      sum
    } else if (p == 2) {
      var sum = 0.0
      this.foreachActive { (_, value) =>
        sum += value * value
      }
      math.sqrt(sum)
    } else if (p == Double.PositiveInfinity) {
      var max = 0.0
      this.foreachActive { (_, value) =>
        val absValue = math.abs(value)
        if (absValue > max) max = absValue
      }
      max
    } else {
      var sum = 0.0
      this.foreachActive { (_, value) =>
        sum += math.pow(math.abs(value), p)
      }
      math.pow(sum, 1.0 / p)
    }
  }

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23851 has finished for PR 3462 at commit c1a877c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23851/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23853 has finished for PR 3462 at commit dbed124.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23853/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23869 has started for PR 3462 at commit 0b632e6.

  • This patch merges cleanly.

@dbtsai dbtsai changed the title Implement the efficient vector norm [SPARK-4611][MLlib] Implement the efficient vector norm Nov 26, 2014
*/
private[spark] def norm(p: Double): Double

protected def norm(p: Double, values: Array[Double]): Double = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should move this method to Vectors because it is a static method, like the following:

object Vectors {
  private def norm(values: Array[Double], p: Double): Double = {
    ...
  }

  private[mllib] def norm(v: Vector, p: Double): Double = {
    norm(v.values, p)
  }
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parent Vector class doesn't have values member variable. Unless we do another pattern matching, otherwise it will not work. I think it's okay to norm as the member function of Vector. What do you think?

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23869 has finished for PR 3462 at commit 0b632e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23869/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23881 has started for PR 3462 at commit 9b7cb56.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23885 has started for PR 3462 at commit 6fa616c.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23886 has started for PR 3462 at commit 0c3637f.

  • This patch merges cleanly.

}
val size = values.size

if (p == 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about p match { ... here? with @switch to ensure it's just a lookup? should be faster even than ifs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, will try tomorrow. But I don't expect too much difference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In bytecode, there is no direct switch operation. As a result, the swtich or pattern matching will be compiled into if statement in the bytecode. See the following example

  def fun1(p: Double) = {
    p match {
      case 1.0 => 1.0
      case 2.0 => 2.0
      case _ => p
    }
  }

  def fun2(p: Double) = {
    if (p == 1.0) 1.0
    else if (p == 2.0) 2.0
    else p
  }

will be compiled to

  // access flags 0x1
  public fun1(D)D
   L0
    LINENUMBER 145 L0
    DLOAD 1
    DSTORE 3
   L1
    LINENUMBER 146 L1
    DCONST_1
    DLOAD 3
    DCMPL
    IFNE L2
    DCONST_1
    DSTORE 5
    GOTO L3
   L2
    LINENUMBER 147 L2
   FRAME APPEND [D]
    LDC 2.0
    DLOAD 3
    DCMPL
    IFNE L4
    LDC 2.0
    DSTORE 5
    GOTO L3
   L4
    LINENUMBER 148 L4
   FRAME SAME
    DLOAD 1
    DSTORE 5
   L3
    LINENUMBER 145 L3
   FRAME APPEND [D]
    DLOAD 5
    DRETURN
   L5
    LOCALVARIABLE this Lorg/apache/spark/mllib/stat/Test$; L0 L5 0
    LOCALVARIABLE p D L0 L5 1
    MAXSTACK = 4
    MAXLOCALS = 7

  // access flags 0x1
  public fun2(D)D
   L0
    LINENUMBER 153 L0
    DLOAD 1
    DCONST_1
    DCMPL
    IFNE L1
    DCONST_1
    GOTO L2
   L1
    LINENUMBER 154 L1
   FRAME SAME
    DLOAD 1
    LDC 2.0
    DCMPL
    IFNE L3
    LDC 2.0
    GOTO L2
   L3
    LINENUMBER 155 L3
   FRAME SAME
    DLOAD 1
   L2
    LINENUMBER 153 L2
   FRAME SAME1 D
    DRETURN
   L4
    LOCALVARIABLE this Lorg/apache/spark/mllib/stat/Test$; L0 L4 0
    LOCALVARIABLE p D L0 L4 1
    MAXSTACK = 4
    MAXLOCALS = 3

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting tangent. What happens if you add @switch? http://www.scala-lang.org/api/current/index.html#scala.annotation.switch Bytecode should have instructions for switch statements that aren't just conditionals, like tableswitch: https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha~ It only works if I change type from Double to Int. See the oracle doc you referenced The Java Virtual Machine's tableswitch and lookupswitch instructions operate only on int data. Because operations on byte, char, or short values are internally promoted to int, a switch whose expression evaluates to one of those types is compiled as though it evaluated to type int.

With

  def fun1(p: Int) = {
    (p: @switch) match {
      case 1 => 1
      case 2 => 2
      case _ => p
    }
  }

I got

  public fun1(I)I
   L0
    LINENUMBER 147 L0
    ILOAD 1
    ISTORE 2
    ILOAD 2
    TABLESWITCH
      1: L1
      2: L2
      default: L3
   L3
    LINENUMBER 150 L3
   FRAME APPEND [I]
    ILOAD 1
    GOTO L4
   L2
    LINENUMBER 149 L2
   FRAME SAME
    ICONST_2
    GOTO L4
   L1
    LINENUMBER 148 L1
   FRAME SAME
    ICONST_1
   L4
    LINENUMBER 147 L4
   FRAME SAME1 I
    IRETURN
   L5
    LOCALVARIABLE this Lorg/apache/spark/mllib/stat/Test$; L0 L5 0
    LOCALVARIABLE p I L0 L5 1
    MAXSTACK = 1
    MAXLOCALS = 3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is an interesting discussion ~ :) But maybe more people are familiar with the if ... else if ... else statement. And this is not on the critical path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. but even with @switch here, the code will not be optimized unless p has type of Int.

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23881 has finished for PR 3462 at commit 9b7cb56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23881/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23885 has finished for PR 3462 at commit 6fa616c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23885/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23886 has finished for PR 3462 at commit 0c3637f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23886/
Test PASSed.


assert(Vectors.norm(dv, 3.7) ~== math.pow(dv.toArray.foldLeft(0.0)((a, v) =>
a + math.pow(math.abs(v), 3.7)), 1.0 / 3.7) relTol 1E-8)
assert(Vectors.norm(sv, 3.7) ~== math.pow(dv.toArray.foldLeft(0.0)((a, v) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dv -> sv

@SparkQA
Copy link

SparkQA commented Dec 1, 2014

Test build #23984 has started for PR 3462 at commit 63c7165.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 1, 2014

Test build #23984 has finished for PR 3462 at commit 63c7165.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23984/
Test PASSed.

asfgit pushed a commit that referenced this pull request Dec 2, 2014
The vector norm in breeze is implemented by `activeIterator` which is known to be very slow.
In this PR, an efficient vector norm is implemented, and with this API, `Normalizer` and
`k-means` have big performance improvement.

Here is the benchmark against mnist8m dataset.

a) `Normalizer`
Before
DenseVector: 68.25secs
SparseVector: 17.01secs

With this PR
DenseVector: 12.71secs
SparseVector: 2.73secs

b) `k-means`
Before
DenseVector: 83.46secs
SparseVector: 61.60secs

With this PR
DenseVector: 70.04secs
SparseVector: 59.05secs

Author: DB Tsai <[email protected]>

Closes #3462 from dbtsai/norm and squashes the following commits:

63c7165 [DB Tsai] typo
0c3637f [DB Tsai] add import org.apache.spark.SparkContext._ back
6fa616c [DB Tsai] address feedback
9b7cb56 [DB Tsai] move norm to static method
0b632e6 [DB Tsai] kmeans
dbed124 [DB Tsai] style
c1a877c [DB Tsai] first commit

(cherry picked from commit 64f3175)
Signed-off-by: Xiangrui Meng <[email protected]>
@mengxr
Copy link
Contributor

mengxr commented Dec 2, 2014

LGTM. Merged into master and branch-1.2. Thanks!

@asfgit asfgit closed this in 64f3175 Dec 2, 2014
@dbtsai dbtsai deleted the norm branch December 3, 2014 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants