[SPARK-6065] [MLlib] Optimize word2vec.findSynonyms using blas calls #5467

MechCoder · 2015-04-11T09:50:25Z

Use blas calls to find the dot product between two vectors.
Prevent re-computing the L2 norm of the given vector for each word in model.

MechCoder · 2015-04-11T09:51:06Z

@jkbradley Was this what you had in mind?

P.S: I prefer we finish off the other PR before discussion on this.

SparkQA · 2015-04-11T11:17:07Z

Test build #30070 has finished for PR 5467 at commit 66cf62a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

jkbradley · 2015-04-13T22:37:24Z

Yep, that's pretty much what I had in mind, except that I'd recommend:

using MLlib's local Matrix type (and its BLAS call in mllib.linalg.BLAS)
computing and storing the Matrix once during initialization
- In this case, we should store a matching Array of Strings so that getVectors can be computed (rather than storing duplicate info)

SparkQA · 2015-04-14T05:48:47Z

Test build #30217 has finished for PR 5467 at commit a7237aa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

MechCoder · 2015-04-14T07:47:26Z

I've addressed your comments. I did not use the blas calls from linalg.blas initially since I thought there might be some overhead due to preprocessing.

This should be faster at least for repeated calls to findSynonyms, for obvious reasons.

SparkQA · 2015-04-14T09:16:18Z

Test build #30232 has finished for PR 5467 at commit 17210c3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

jkbradley · 2015-04-14T19:16:23Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

@@ -431,6 +431,14 @@ class Word2Vec extends Serializable with Logging {
 class Word2VecModel private[mllib] (
    private val model: Map[String, Array[Float]]) extends Serializable with Saveable {


Don't store this any more. Store wordVecMat, plus a matching collection of words; that collection should probably be a Map[String, Int] mapping word to index in wordVecMat (so that transform() is still fast).

Naming: How about "wordVectors" instead of "wordVecMat"?

You'll need to update getVectors to construct the map.

It seems to me that Map[String, Int] will just be model.keys.zip.(0 until model.size).toMap. Is it right to expect that the ordering of keys in the model Map does not change?

No, I don't think so. I'll make an inline comment where it's used about handling that.

jkbradley · 2015-04-14T19:17:04Z

Those updates might require significant changes, so I'll make another pass after updates. Thanks!

MechCoder · 2015-04-15T19:46:36Z

@jkbradley I've pushed some updates.

SparkQA · 2015-04-15T21:11:22Z

Test build #30369 has finished for PR 5467 at commit 3b0d075.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

jkbradley · 2015-04-16T21:09:23Z

@MechCoder
I just realized that Word2Vec already has a Matrix which could just be passed to Word2VecModel's constructor. That might be easier and let you remove the code which converts to the map (in Word2Vec.fit) and converts back (in Word2VecModel).

jkbradley · 2015-04-16T21:09:26Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

-    private val model: Map[String, Array[Float]]) extends Serializable with Saveable {
+    model: Map[String, Array[Float]]) extends Serializable with Saveable {
+
+  val indexedModel = model.keys.zip(0 until model.size).toMap


State explicit types

Rename: "wordIndex"

1. Calculate norms during initialization. 2. Use Blas calls from linalg.blas

SparkQA · 2015-04-17T08:27:32Z

Test build #30464 has finished for PR 5467 at commit 64575b0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

SparkQA · 2015-04-17T11:14:13Z

Test build #30477 has finished for PR 5467 at commit da1642d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

MechCoder · 2015-04-17T11:37:44Z

@jkbradley I think I have addressed all your comments except the constructor.

How about retaining the present Word2VecModel(Map: [String, Array(Float)]) and converting it internally to Word2VecModel(Map: [String, Int], Matrix) using something like

this(Map: [String, Int], Matrix) = {
   // do map to indexedModel and Matrix conversion
  this(indexedModel: Map[String, Int], Matrix)
}

Supplying a Map[String, Array[Float]] seems much more intuitive from a user's point of view, when we do decide to make it non-experimental.

jkbradley · 2015-04-17T21:24:18Z

@MechCoder I agree that supplying a Map is more intuitive. How about we support:

Private constructor: Take Matrix
Public constructor: Take Map

MechCoder · 2015-04-18T20:15:04Z

@jkbradley Thinking over it again, I'm not sure if it would offer a great advantage to do so. If you are talking about preventing this slicing (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L408)?

If yes, even if I do prevent this slicing and pass the Matrix, I would have to do the slicing again from Matrix.values to find the norms (https://github.com/apache/spark/pull/5467/files#diff-88f4b62c382b26ef8e856b23f5167ccdR446) .

What other advantage did you have in mind?

MechCoder · 2015-04-20T18:41:19Z

@jkbradley I have problems in understanding how to write the code for this. I had this design in mind.

class Word2VecModel private[mllib] (
    wordIndex: Map[String, Int],
    wordVectors: DenseMatrix) extends Serializable with Saveable {

    // Calculate wordVecNorms and wordList here

     private def this(model: Map[String, Array[Float]]) = {
        // convert to wordIndex and wordVectors here.
        this(wordIndex, wordVectors)
     }

However, this does not work, since the first line after overriding the constructor should call the constructor itself. How to go about this?

jkbradley · 2015-04-20T22:11:54Z

@MechCoder That looks correct, except you'll need to call this() immediately. I'd write helper methods for constructing wordIndex and wordVectors:

  private def this(model: Map[String, Array[Float]]) = this(buildWordIndex(model), buildWordVectors(model))

You'll need to add unit tests and stuff too. Would you actually mind if we made this another JIRA and PR? I'm starting to worry about mission creep. : )

jkbradley · 2015-04-20T22:17:49Z

It just occurred to me that we're converting from Float to Double. I'm not sure historically why Word2Vec used Float, but I'm worrying now about switching since it will double model sizes. (I'm sorry I didn't think about this earlier!)

This PR should still be doable, but you would need to store an Array[Float] instead of the Matrix type. You would also need to use mllib.linalg.BLAS.nativeBLAS to make the BLAS calls.

What do you think?

MechCoder · 2015-04-21T03:49:18Z

Alright, we can move those to another PR.

This PR should still be doable, but you would need to store an Array[Float] instead of the Matrix type. You would also need to use mllib.linalg.BLAS.nativeBLAS to make the BLAS calls.

That was what I did initially :(

jkbradley · 2015-04-21T05:29:57Z

Yes, I'm sorry about that. Please do push back if you think my advice is incorrect. How difficult would it be to check out an earlier version from that point, and then look at the Github commit diffs for your later commits to check through updates you made later which might still apply to the old version?

MechCoder · 2015-04-21T05:32:32Z

yes, on it. by the way, it would be great if you could give me some advice on this PR, #5455 I'm not sure how to proceed.

MechCoder · 2015-04-21T10:04:44Z

I've pushed some updates. I've made numDim and numWords a class var, so that they can also be used elsewhere.

SparkQA · 2015-04-21T11:54:48Z

Test build #30663 has finished for PR 5467 at commit 6b74c81.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

jkbradley · 2015-04-21T18:05:22Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

+    model: Map[String, Array[Float]]) extends Serializable with Saveable {
+
+  // Maintain a ordered list of words based on the index in the initial model.
+  private val wordList: Array[String] = model.keys.toArray


Since this is the first place that an ordering on keys is defined, can you use this below when creating wordVectors (to make sure the ordering is exactly the same)?

Also, please add a little doc saying what each of these 6 values are.

Where do I need to write this? As a comment?

Yes, please, a little one-line comment for each value would be fine.

jkbradley · 2015-04-21T18:06:14Z

@MechCoder I think that's it. Thanks very much for updating & putting up with the re-do.

MechCoder · 2015-04-21T19:47:01Z

@jkbradley fixed, hopefully should be it

SparkQA · 2015-04-21T21:32:03Z

Test build #30689 has finished for PR 5467 at commit ffc9240.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

MechCoder · 2015-04-21T21:38:43Z

@jkbradley I've fixed up your comment. It makes sense any way, since now the entire model is iterated across only once.

SparkQA · 2015-04-21T23:28:13Z

Test build #30700 has finished for PR 5467 at commit dd0b0b2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

jkbradley · 2015-04-21T23:42:19Z

LGTM, merging into master. Thanks very much!

MechCoder · 2015-04-22T03:57:02Z

@jkbradley Could you open a jira for the TODO?

mengxr · 2015-04-24T08:31:05Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

@@ -479,9 +508,23 @@ class Word2VecModel private[mllib] (
   */
  def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
    require(num > 0, "Number of similar words should > 0")
-    // TODO: optimize top-k


This TODO was created to use BoundedPriorityQueue to compute top k:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/BoundedPriorityQueue.scala

mengxr · 2015-04-24T08:33:47Z

@MechCoder Sorry for my late comment! I made some minor comments. It would be good if you can submit a follow-up PR to address those issues. Thanks!

MechCoder · 2015-04-24T12:02:05Z

cool, will make the changes along with sprak-7045

1. Use blas calls to find the dot product between two vectors. 2. Prevent re-computing the L2 norm of the given vector for each word in model. Author: MechCoder <[email protected]> Closes apache#5467 from MechCoder/spark-6065 and squashes the following commits: dd0b0b2 [MechCoder] Preallocate wordVectors ffc9240 [MechCoder] Minor 6b74c81 [MechCoder] Switch back to native blas calls da1642d [MechCoder] Explicit types and indexing 64575b0 [MechCoder] Save indexedmap and a wordvecmat instead of matrix fbe0108 [MechCoder] Made the following changes 1. Calculate norms during initialization. 2. Use Blas calls from linalg.blas 1350cf3 [MechCoder] [SPARK-6065] Optimize word2vec.findSynonynms using blas calls

MechCoder force-pushed the spark-6065 branch from 66cf62a to a7237aa Compare April 14, 2015 04:12

jkbradley reviewed Apr 14, 2015
View reviewed changes

jkbradley reviewed Apr 16, 2015
View reviewed changes

MechCoder added 3 commits April 17, 2015 12:08

[SPARK-6065] Optimize word2vec.findSynonynms using blas calls

1350cf3

Made the following changes

fbe0108

1. Calculate norms during initialization. 2. Use Blas calls from linalg.blas

Save indexedmap and a wordvecmat instead of matrix

64575b0

MechCoder force-pushed the spark-6065 branch from 3b0d075 to 64575b0 Compare April 17, 2015 06:39

Explicit types and indexing

da1642d

Switch back to native blas calls

6b74c81

jkbradley reviewed Apr 21, 2015
View reviewed changes

Minor

ffc9240

Preallocate wordVectors

dd0b0b2

asfgit closed this in 7fe6142 Apr 21, 2015

MechCoder deleted the spark-6065 branch April 22, 2015 03:06

mengxr reviewed Apr 24, 2015
View reviewed changes

MechCoder mentioned this pull request Jun 2, 2015

[SPARK-7045] [MLlib] Avoid intermediate representation when creating model #5748

Closed

		@@ -431,6 +431,14 @@ class Word2Vec extends Serializable with Logging {
		class Word2VecModel private[mllib] (
		private val model: Map[String, Array[Float]]) extends Serializable with Saveable {

[SPARK-6065] [MLlib] Optimize word2vec.findSynonyms using blas calls #5467

[SPARK-6065] [MLlib] Optimize word2vec.findSynonyms using blas calls #5467

Conversation

MechCoder commented Apr 11, 2015

MechCoder commented Apr 11, 2015

SparkQA commented Apr 11, 2015

jkbradley commented Apr 13, 2015

SparkQA commented Apr 14, 2015

MechCoder commented Apr 14, 2015

SparkQA commented Apr 14, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkbradley commented Apr 14, 2015

MechCoder commented Apr 15, 2015

SparkQA commented Apr 15, 2015

jkbradley commented Apr 16, 2015

Choose a reason for hiding this comment

SparkQA commented Apr 17, 2015

SparkQA commented Apr 17, 2015

MechCoder commented Apr 17, 2015

jkbradley commented Apr 17, 2015

MechCoder commented Apr 18, 2015

MechCoder commented Apr 20, 2015

jkbradley commented Apr 20, 2015

jkbradley commented Apr 20, 2015

MechCoder commented Apr 21, 2015

jkbradley commented Apr 21, 2015

MechCoder commented Apr 21, 2015

MechCoder commented Apr 21, 2015

SparkQA commented Apr 21, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkbradley commented Apr 21, 2015

MechCoder commented Apr 21, 2015

SparkQA commented Apr 21, 2015

MechCoder commented Apr 21, 2015

SparkQA commented Apr 21, 2015

jkbradley commented Apr 21, 2015

MechCoder commented Apr 22, 2015

Choose a reason for hiding this comment

mengxr commented Apr 24, 2015

MechCoder commented Apr 24, 2015