[SPARK-6065] [MLlib] Optimize word2vec.findSynonyms using blas calls #5467
MechCoder commented Apr 11, 2015
- Use blas calls to find the dot product between two vectors.
- Prevent re-computing the L2 norm of the given vector for each word in model.
@jkbradley Was this what you had in mind? P.S: I prefer we finish off the other PR before discussion on this. |
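As a rough illustration of the two points in the description, here is a minimal sketch in plain Scala. It is not the code from this PR: `SynonymSketch`, `norm`, and `dot` are hypothetical names, and the hand-written `dot` loop only stands in for the BLAS call the PR proposes. The point it shows is that the query vector's L2 norm is computed once, outside the per-word loop.

```scala
object SynonymSketch {
  // Euclidean (L2) norm of a vector.
  def norm(v: Array[Float]): Double =
    math.sqrt(v.map(x => x.toDouble * x).sum)

  // Plain-Scala dot product; this loop is a stand-in for a BLAS call.
  def dot(a: Array[Float], b: Array[Float]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i).toDouble * b(i); i += 1 }
    s
  }

  // Returns the `num` words with the highest cosine similarity to `query`.
  def findSynonyms(
      model: Map[String, Array[Float]],
      query: Array[Float],
      num: Int): Array[(String, Double)] = {
    require(num > 0, "Number of similar words should > 0")
    val queryNorm = norm(query) // computed once, not once per word
    model.toArray
      .map { case (word, vec) =>
        val denom = queryNorm * norm(vec)
        (word, if (denom == 0.0) 0.0 else dot(query, vec) / denom)
      }
      .sortBy(-_._2)
      .take(num)
  }
}
```

The per-word norms are still recomputed on every call in this sketch; caching them at model construction is the further step discussed later in the thread.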
Test build #30070 has finished for PR 5467 at commit
|
Yep, that's pretty much what I had in mind, except that I'd recommend:
|
Test build #30217 has finished for PR 5467 at commit
|
I've addressed your comments. I did not use the blas calls from linalg.blas initially since I thought there might be some overhead due to preprocessing. This should be faster at least for repeated calls to |
Test build #30232 has finished for PR 5467 at commit
|
@@ -431,6 +431,14 @@ class Word2Vec extends Serializable with Logging { | |||
class Word2VecModel private[mllib] ( | |||
private val model: Map[String, Array[Float]]) extends Serializable with Saveable { |
Don't store this any more. Store wordVecMat, plus a matching collection of words; that collection should probably be a Map[String, Int] mapping word to index in wordVecMat (so that transform() is still fast).
Naming: How about "wordVectors" instead of "wordVecMat"?
You'll need to update getVectors to construct the map.
It seems to me that Map[String, Int] will just be model.keys.zip(0 until model.size).toMap. Is it right to expect that the ordering of keys in the model Map does not change?
No, I don't think so. I'll make an inline comment where it's used about handling that.
Those updates might require significant changes, so I'll make another pass after updates. Thanks! |
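The layout suggested in this review thread (a word-to-index map plus one flat store of vectors) could look roughly like the sketch below. It is an assumption-laden illustration, not the PR's code: `WordIndexSketch`, `Indexed`, and `build` are hypothetical names, and the names `wordList`, `wordIndex`, and `wordVectors` are borrowed from elsewhere in this conversation. The key ordering is fixed once in `wordList`, and both the index and the flat array are derived from it, so the ordering question above does not arise.

```scala
object WordIndexSketch {
  final case class Indexed(
      wordList: Array[String],     // word at position i owns slots [i * vectorSize, (i + 1) * vectorSize)
      wordIndex: Map[String, Int], // word -> position in wordList
      wordVectors: Array[Float],   // all vectors concatenated in wordList order
      vectorSize: Int)

  def build(model: Map[String, Array[Float]]): Indexed = {
    require(model.nonEmpty, "model must contain at least one word")
    val wordList = model.keys.toArray // the single place where an ordering is fixed
    val vectorSize = model.head._2.length
    require(model.values.forall(_.length == vectorSize), "all vectors must have the same size")
    val wordVectors = new Array[Float](wordList.length * vectorSize)
    wordList.zipWithIndex.foreach { case (word, i) =>
      Array.copy(model(word), 0, wordVectors, i * vectorSize, vectorSize)
    }
    Indexed(wordList, wordList.zipWithIndex.toMap, wordVectors, vectorSize)
  }

  // Lookup stays a single map access followed by a slice of the flat array.
  def transform(ix: Indexed, word: String): Option[Array[Float]] =
    ix.wordIndex.get(word).map { i =>
      ix.wordVectors.slice(i * ix.vectorSize, (i + 1) * ix.vectorSize)
    }
}
```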
@jkbradley I've pushed some updates. |
Test build #30369 has finished for PR 5467 at commit
|
@MechCoder |
private val model: Map[String, Array[Float]]) extends Serializable with Saveable { | ||
model: Map[String, Array[Float]]) extends Serializable with Saveable { | ||
|
||
val indexedModel = model.keys.zip(0 until model.size).toMap |
State explicit types
Rename: "wordIndex"
1. Calculate norms during initialization. 2. Use Blas calls from linalg.blas
Test build #30464 has finished for PR 5467 at commit
|
Test build #30477 has finished for PR 5467 at commit
|
@jkbradley I think I have addressed all your comments except the constructor. How about retaining the present Word2VecModel(Map[String, Array[Float]]) and converting it internally to Word2VecModel(Map[String, Int], Matrix) using something like
Supplying a Map[String, Array[Float]] seems much more intuitive from a user's point of view, when we do decide to make it non-experimental. |
@MechCoder I agree that supplying a Map is more intuitive. How about we support:
|
@jkbradley Thinking over it again, I'm not sure if it would offer a great advantage to do so. If you are talking about preventing this slicing (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L408)? If yes, even if I do prevent this slicing and pass the Matrix, I would have to do the slicing again from What other advantage did you have in mind? |
@jkbradley I have problems in understanding how to write the code for this. I had this design in mind.
However, this does not work, since the first line after overriding the constructor should call the constructor itself. How to go about this? |
@MechCoder That looks correct, except you'll need to call this() immediately. I'd write helper methods for constructing wordIndex and wordVectors:
You'll need to add unit tests and stuff too. Would you actually mind if we made this another JIRA and PR? I'm starting to worry about mission creep. : ) |
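The snippets exchanged at this point did not survive on this page. As a hypothetical illustration of the constraint being discussed, the sketch below shows the general Scala pattern: an auxiliary constructor must call `this(...)` as its first statement, so any conversion work is moved into companion-object helpers that are evaluated as arguments to that call. `ModelSketch`, `orderedWords`, `buildWordIndex`, and `buildWordVectors` are illustrative names, and sorting the keys to get a deterministic order is an assumption of this sketch, not something the thread prescribes.

```scala
class ModelSketch private (
    val wordIndex: Map[String, Int],
    val wordVectors: Array[Float]) {

  // Auxiliary constructor: the call to this(...) has to come first,
  // so the conversions are done by helpers in the companion object.
  def this(model: Map[String, Array[Float]]) =
    this(ModelSketch.buildWordIndex(model), ModelSketch.buildWordVectors(model))
}

object ModelSketch {
  // Both helpers share one deterministic ordering (sorted keys, an assumption of this sketch).
  private def orderedWords(model: Map[String, Array[Float]]): Array[String] =
    model.keys.toArray.sorted

  private def buildWordIndex(model: Map[String, Array[Float]]): Map[String, Int] =
    orderedWords(model).zipWithIndex.toMap

  private def buildWordVectors(model: Map[String, Array[Float]]): Array[Float] =
    orderedWords(model).flatMap(w => model(w).toSeq)
}
```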
It just occurred to me that we're converting from Float to Double. I'm not sure historically why Word2Vec used Float, but I'm worrying now about switching since it will double model sizes. (I'm sorry I didn't think about this earlier!) This PR should still be doable, but you would need to store an Array[Float] instead of the Matrix type. You would also need to use What do you think? |
Alright, we can move those to another PR.
That was what I did initially :( |
Yes, I'm sorry about that. Please do push back if you think my advice is incorrect. How difficult would it be to check out an earlier version from that point, and then look at the Github commit diffs for your later commits to check through updates you made later which might still apply to the old version? |
Yes, on it. By the way, it would be great if you could give me some advice on PR #5455; I'm not sure how to proceed. |
I've pushed some updates. I've made numDim and numWords a class var, so that they can also be used elsewhere. |
Test build #30663 has finished for PR 5467 at commit
|
model: Map[String, Array[Float]]) extends Serializable with Saveable { | ||
|
||
// Maintain a ordered list of words based on the index in the initial model. | ||
private val wordList: Array[String] = model.keys.toArray |
Since this is the first place that an ordering on keys is defined, can you use this below when creating wordVectors (to make sure the ordering is exactly the same)?
Also, please add a little doc saying what each of these 6 values are.
Where do I need to write this? As a comment?
Yes, please, a little one-line comment for each value would be fine.
@MechCoder I think that's it. Thanks very much for updating & putting up with the re-do. |
@jkbradley fixed, hopefully should be it |
Test build #30689 has finished for PR 5467 at commit
|
@jkbradley I've addressed your comment. It makes sense anyway, since the entire model is now iterated over only once. |
Test build #30700 has finished for PR 5467 at commit
|
LGTM, merging into master. Thanks very much! |
@jkbradley Could you open a jira for the TODO? |
@@ -479,9 +508,23 @@ class Word2VecModel private[mllib] ( | |||
*/ | |||
def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = { | |||
require(num > 0, "Number of similar words should > 0") | |||
// TODO: optimize top-k |
This TODO was created to use BoundedPriorityQueue to compute top k:
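The reviewer's snippet is not preserved here. As a hedged sketch of the general idea, the code below keeps only the k best scores in a size-bounded heap instead of sorting every word. It uses the standard library's `scala.collection.mutable.PriorityQueue` as a stand-in, since Spark's `BoundedPriorityQueue` is internal to Spark; `TopKSketch` and `topK` are hypothetical names.

```scala
import scala.collection.mutable

object TopKSketch {
  // Inverted comparison: the queue's head is the smallest score currently kept.
  private val lowestFirst: Ordering[(String, Double)] =
    new Ordering[(String, Double)] {
      def compare(a: (String, Double), b: (String, Double)): Int =
        java.lang.Double.compare(b._2, a._2)
    }

  // Returns the k highest-scoring entries in descending score order.
  def topK(scores: Iterator[(String, Double)], k: Int): Array[(String, Double)] = {
    require(k > 0, "k must be positive")
    val heap = mutable.PriorityQueue.empty[(String, Double)](lowestFirst)
    scores.foreach { s =>
      if (heap.size < k) heap.enqueue(s)
      else if (s._2 > heap.head._2) { heap.dequeue(); heap.enqueue(s) }
    }
    // Dequeuing yields ascending scores, so reverse for best-first output.
    Array.fill(heap.size)(heap.dequeue()).reverse
  }
}
```

With n words, this does O(n log k) work rather than the O(n log n) of a full sort, which matters when k is small relative to the vocabulary.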
@MechCoder Sorry for my late comment! I made some minor comments. It would be good if you can submit a follow-up PR to address those issues. Thanks! |
cool, will make the changes along with SPARK-7045 |
1. Use blas calls to find the dot product between two vectors.
2. Prevent re-computing the L2 norm of the given vector for each word in model.
Author: MechCoder <[email protected]>
Closes apache#5467 from MechCoder/spark-6065 and squashes the following commits:
dd0b0b2 [MechCoder] Preallocate wordVectors
ffc9240 [MechCoder] Minor
6b74c81 [MechCoder] Switch back to native blas calls
da1642d [MechCoder] Explicit types and indexing
64575b0 [MechCoder] Save indexedmap and a wordvecmat instead of matrix
fbe0108 [MechCoder] Made the following changes 1. Calculate norms during initialization. 2. Use Blas calls from linalg.blas
1350cf3 [MechCoder] [SPARK-6065] Optimize word2vec.findSynonynms using blas calls