Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-6065] [MLlib] Optimize word2vec.findSynonyms using blas calls #5467

Closed
wants to merge 7 commits into from

Conversation

MechCoder
Copy link
Contributor

  1. Use blas calls to find the dot product between two vectors.
  2. Prevent re-computing the L2 norm of the given vector for each word in model.

@MechCoder
Copy link
Contributor Author

@jkbradley Was this what you had in mind?

P.S: I prefer we finish off the other PR before discussion on this.

@SparkQA
Copy link

SparkQA commented Apr 11, 2015

Test build #30070 has finished for PR 5467 at commit 66cf62a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@jkbradley
Copy link
Member

Yep, that's pretty much what I had in mind, except that I'd recommend:

  • using MLlib's local Matrix type (and its BLAS call in mllib.linalg.BLAS)
  • computing and storing the Matrix once during initialization
    • In this case, we should store a matching Array of Strings so that getVectors can be computed (rather than storing duplicate info)

@SparkQA
Copy link

SparkQA commented Apr 14, 2015

Test build #30217 has finished for PR 5467 at commit a7237aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@MechCoder
Copy link
Contributor Author

I've addressed your comments. I did not use the blas calls from linalg.blas initially since I thought there might be some overhead due to preprocessing.

This should be faster at least for repeated calls to findSynonyms, for obvious reasons.

@SparkQA
Copy link

SparkQA commented Apr 14, 2015

Test build #30232 has finished for PR 5467 at commit 17210c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@@ -431,6 +431,14 @@ class Word2Vec extends Serializable with Logging {
class Word2VecModel private[mllib] (
private val model: Map[String, Array[Float]]) extends Serializable with Saveable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't store this any more. Store wordVecMat, plus a matching collection of words; that collection should probably be a Map[String, Int] mapping word to index in wordVecMat (so that transform() is still fast).

Naming: How about "wordVectors" instead of "wordVecMat"?

You'll need to update getVectors to construct the map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that Map[String, Int] will just be model.keys.zip.(0 until model.size).toMap. Is it right to expect that the ordering of keys in the model Map does not change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't think so. I'll make an inline comment where it's used about handling that.

@jkbradley
Copy link
Member

Those updates might require significant changes, so I'll make another pass after updates. Thanks!

@MechCoder
Copy link
Contributor Author

@jkbradley I've pushed some updates.

@SparkQA
Copy link

SparkQA commented Apr 15, 2015

Test build #30369 has finished for PR 5467 at commit 3b0d075.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@jkbradley
Copy link
Member

@MechCoder
I just realized that Word2Vec already has a Matrix which could just be passed to Word2VecModel's constructor. That might be easier and let you remove the code which converts to the map (in Word2Vec.fit) and converts back (in Word2VecModel).

private val model: Map[String, Array[Float]]) extends Serializable with Saveable {
model: Map[String, Array[Float]]) extends Serializable with Saveable {

val indexedModel = model.keys.zip(0 until model.size).toMap
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

State explicit types

Rename: "wordIndex"

@SparkQA
Copy link

SparkQA commented Apr 17, 2015

Test build #30464 has finished for PR 5467 at commit 64575b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA
Copy link

SparkQA commented Apr 17, 2015

Test build #30477 has finished for PR 5467 at commit da1642d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@MechCoder
Copy link
Contributor Author

@jkbradley I think I have addressed all your comments except the constructor.

How about retaining the present Word2VecModel(Map: [String, Array(Float)]) and converting it internally to Word2VecModel(Map: [String, Int], Matrix) using something like

this(Map: [String, Int], Matrix) = {
   // do map to indexedModel and Matrix conversion
  this(indexedModel: Map[String, Int], Matrix)
}

Supplying a Map[String, Array[Float]] seems much more intuitive from a user's point of view, when we do decide to make it non-experimental.

@jkbradley
Copy link
Member

@MechCoder I agree that supplying a Map is more intuitive. How about we support:

  • Private constructor: Take Matrix
  • Public constructor: Take Map

@MechCoder
Copy link
Contributor Author

@jkbradley Thinking over it again, I'm not sure if it would offer a great advantage to do so. If you are talking about preventing this slicing (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L408)?

If yes, even if I do prevent this slicing and pass the Matrix, I would have to do the slicing again from Matrix.values to find the norms (https://github.com/apache/spark/pull/5467/files#diff-88f4b62c382b26ef8e856b23f5167ccdR446) .

What other advantage did you have in mind?

@MechCoder
Copy link
Contributor Author

@jkbradley I have problems in understanding how to write the code for this. I had this design in mind.

class Word2VecModel private[mllib] (
    wordIndex: Map[String, Int],
    wordVectors: DenseMatrix) extends Serializable with Saveable {

    // Calculate wordVecNorms and wordList here

     private def this(model: Map[String, Array[Float]]) = {
        // convert to wordIndex and wordVectors here.
        this(wordIndex, wordVectors)
     }

However, this does not work, since the first line after overriding the constructor should call the constructor itself. How to go about this?

@jkbradley
Copy link
Member

@MechCoder That looks correct, except you'll need to call this() immediately. I'd write helper methods for constructing wordIndex and wordVectors:

  private def this(model: Map[String, Array[Float]]) = this(buildWordIndex(model), buildWordVectors(model))

You'll need to add unit tests and stuff too. Would you actually mind if we made this another JIRA and PR? I'm starting to worry about mission creep. : )

@jkbradley
Copy link
Member

It just occurred to me that we're converting from Float to Double. I'm not sure historically why Word2Vec used Float, but I'm worrying now about switching since it will double model sizes. (I'm sorry I didn't think about this earlier!)

This PR should still be doable, but you would need to store an Array[Float] instead of the Matrix type. You would also need to use mllib.linalg.BLAS.nativeBLAS to make the BLAS calls.

What do you think?

@MechCoder
Copy link
Contributor Author

Alright, we can move those to another PR.

This PR should still be doable, but you would need to store an Array[Float] instead of the Matrix type. You would also need to use mllib.linalg.BLAS.nativeBLAS to make the BLAS calls.

That was what I did initially :(

@jkbradley
Copy link
Member

Yes, I'm sorry about that. Please do push back if you think my advice is incorrect. How difficult would it be to check out an earlier version from that point, and then look at the Github commit diffs for your later commits to check through updates you made later which might still apply to the old version?

@MechCoder
Copy link
Contributor Author

yes, on it. by the way, it would be great if you could give me some advice on this PR, #5455 I'm not sure how to proceed.

@MechCoder
Copy link
Contributor Author

I've pushed some updates. I've made numDim and numWords a class var, so that they can also be used elsewhere.

@SparkQA
Copy link

SparkQA commented Apr 21, 2015

Test build #30663 has finished for PR 5467 at commit 6b74c81.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

model: Map[String, Array[Float]]) extends Serializable with Saveable {

// Maintain a ordered list of words based on the index in the initial model.
private val wordList: Array[String] = model.keys.toArray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is the first place that an ordering on keys is defined, can you use this below when creating wordVectors (to make sure the ordering is exactly the same)?

Also, please add a little doc saying what each of these 6 values are.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do I need to write this? As a comment?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please, a little one-line comment for each value would be fine.

@jkbradley
Copy link
Member

@MechCoder I think that's it. Thanks very much for updating & putting up with the re-do.

@MechCoder
Copy link
Contributor Author

@jkbradley fixed, hopefully should be it

@SparkQA
Copy link

SparkQA commented Apr 21, 2015

Test build #30689 has finished for PR 5467 at commit ffc9240.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@MechCoder
Copy link
Contributor Author

@jkbradley I've fixed up your comment. It makes sense any way, since now the entire model is iterated across only once.

@SparkQA
Copy link

SparkQA commented Apr 21, 2015

Test build #30700 has finished for PR 5467 at commit dd0b0b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@jkbradley
Copy link
Member

LGTM, merging into master. Thanks very much!

@asfgit asfgit closed this in 7fe6142 Apr 21, 2015
@MechCoder MechCoder deleted the spark-6065 branch April 22, 2015 03:06
@MechCoder
Copy link
Contributor Author

@jkbradley Could you open a jira for the TODO?

@@ -479,9 +508,23 @@ class Word2VecModel private[mllib] (
*/
def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
require(num > 0, "Number of similar words should > 0")
// TODO: optimize top-k
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mengxr
Copy link
Contributor

mengxr commented Apr 24, 2015

@MechCoder Sorry for my late comment! I made some minor comments. It would be good if you can submit a follow-up PR to address those issues. Thanks!

@MechCoder
Copy link
Contributor Author

cool, will make the changes along with sprak-7045

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
1. Use blas calls to find the dot product between two vectors.
2. Prevent re-computing the L2 norm of the given vector for each word in model.

Author: MechCoder <[email protected]>

Closes apache#5467 from MechCoder/spark-6065 and squashes the following commits:

dd0b0b2 [MechCoder] Preallocate wordVectors
ffc9240 [MechCoder] Minor
6b74c81 [MechCoder] Switch back to native blas calls
da1642d [MechCoder] Explicit types and indexing
64575b0 [MechCoder] Save indexedmap and a wordvecmat instead of matrix
fbe0108 [MechCoder] Made the following changes 1. Calculate norms during initialization. 2. Use Blas calls from linalg.blas
1350cf3 [MechCoder] [SPARK-6065] Optimize word2vec.findSynonynms using blas calls
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants