SPARK-4156 [MLLIB] EM algorithm for GMMs #3022
Conversation
Can one of the admins verify this patch?
Jenkins, test this please.
@tgaloppo Thanks for the PR, and congratulations on your first contribution. Apologies for the lack of feedback thus far -- I guess everyone is busy with the 1.2 release deadline on Nov 1. I will take a look at the PR in the next few days. Please make sure you get the JIRA assigned to yourself next time before working; it's the only way to avoid duplicate work. cc: @jkbradley, @mengxr
Test FAILed.
This test appeared to fail due to some form of timeout during the pull; is there any action I need to take?
Test build #514 has started for PR 3022 at commit
Test build #514 has finished for PR 3022 at commit
```scala
/** Sum the values in an array of doubles */
private def sum(x: Array[Double]): Double = {
  var s: Double = 0.0
  x.foreach(u => s += u)
  s
}
```
You might not care about this at all, but calling `foreach` on an `Array` is actually notably slower than using a while loop over the indices. `foreach` over a `Range` is actually pretty close to a while loop (i.e. `(0 until x.length).foreach{idx => s += x(idx)}`). Or if you don't care about runtimes, then you can always just call `array.sum` (it actually comes from an implicit conversion to `WrappedArray`):

```scala
scala> ((0 to 100).map{_ / 100.0}.toArray).sum
res2: Double = 50.5
```
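For reference, the while-loop variant the reviewer alludes to might look like this (a hypothetical sketch with a made-up name, `sumWhile` -- not the code that was ultimately committed):

```scala
// Hypothetical sketch: summing an Array[Double] with an explicit while loop.
// Indexing directly avoids the per-element closure call of Array.foreach.
def sumWhile(x: Array[Double]): Double = {
  var s = 0.0
  var i = 0
  while (i < x.length) {
    s += x(i)
    i += 1
  }
  s
}

val xs = (0 to 100).map(_ / 100.0).toArray
println(sumWhile(xs)) // agrees with xs.sum
```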
Please advise how to resolve merge issues.
Modified sum function for better performance
Thanks, @squito ... while I expect the array to only have a few elements, I have made changes according to your advice.
…ges. Improved cluster initialization strategy.
Merged with the latest master branch to hopefully fix any merge issues.
… and tolerance parameters. Modified cluster initialization strategy to use an initial covariance matrix derived from the sample points used to initialize the mean.
```scala
package org.apache.spark.examples.mllib

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.GaussianMixtureModel
```
no need for this import
```scala
val mu = vectorMean(x)
val ss = BreezeVector.zeros[Double](x(0).length)
val cov = BreezeMatrix.eye[Double](ss.length)
x.map(xi => (xi - mu) :^ 2.0).foreach(u => ss += u)
```
Breeze has `squaredDistance`.
squaredDistance returns a scalar... I want the squared entry values.
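To illustrate the distinction: `squaredDistance` collapses the difference of two vectors into a single scalar, while the covariance initialization above needs the squared deviation in every dimension. A plain-Scala sketch of that per-dimension accumulation (no Breeze; `elementwiseSquaredDeviations` is a hypothetical helper name, not code from the PR):

```scala
// Hypothetical sketch: accumulate, per dimension, the squared deviation of
// each sample point from the mean. squaredDistance would instead collapse
// each (xi - mu) to a single scalar, losing the per-dimension values.
def elementwiseSquaredDeviations(x: Array[Array[Double]], mu: Array[Double]): Array[Double] = {
  val ss = new Array[Double](mu.length)
  for (xi <- x; d <- xi.indices) {
    val diff = xi(d) - mu(d)
    ss(d) += diff * diff
  }
  ss
}

val pts = Array(Array(1.0, 2.0), Array(3.0, 6.0))
val mu  = Array(2.0, 4.0)
println(elementwiseSquaredDeviations(pts, mu).mkString(", ")) // 2.0, 8.0
```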
Changed ExpectationSum to a private class
I've performed most of the requested changes. I do not see the BLAS function mentioned (dsyr), so I left this as a TODO. Also, I could not find EPSILON in MLUtils. I left predictMembership public and changed predict to predictLabels, providing soft and hard label assignments, respectively. I know there are some other thoughts around improving these, but I am not clear on what I should do. cc: @mengxr @jkbradley
Sorry for the late reply. predictLabels() and predictMembership() look fine. But what about moving computeSoftAssignments() to the GaussianMixtureModelEM class? (In KMeans, findClosest() is defined in KMeans rather than in KMeansModel.) It would also be good if the name of the class GaussianMixtureModelEM were changed as @mengxr suggested.
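For context, the relationship between the soft and hard assignment methods discussed here can be sketched in a few lines (hypothetical standalone code, not the actual GaussianMixtureModel API): a hard label is just the argmax of the soft membership probabilities.

```scala
// Hypothetical sketch: a hard cluster label is the index of the largest
// soft membership probability for a point.
def hardLabel(softMembership: Array[Double]): Int =
  softMembership.indices.maxBy(softMembership(_))

println(hardLabel(Array(0.1, 0.7, 0.2))) // 1
```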
```scala
  }
}

private def run(inputFile: String, k: Int, convergenceTol: Double) {
```
Can we take maxIterations as an optional input parameter?
Made maximum iterations an optional parameter to DenseGmmEM
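A sketch of what an optional maxIterations parameter could look like (the function name, the default of 100, and the returned string here are illustrative assumptions, not the PR's actual code):

```scala
// Hypothetical sketch: maxIterations as an optional parameter with a
// default value, so existing call sites keep working unchanged.
def runConfig(inputFile: String, k: Int, convergenceTol: Double, maxIterations: Int = 100): String =
  s"k=$k tol=$convergenceTol maxIter=$maxIterations file=$inputFile"

runConfig("data.txt", 2, 0.01)      // uses the default maxIterations
runConfig("data.txt", 2, 0.01, 250) // overrides it
```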
```scala
/** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
def this() = this(2, 0.01, 100)
```
Remove extra newlines
@tgaloppo MLUtils.EPSILON is actually private[util]. I think it would be fine to change it to be private[mllib]. CC: @mengxr
@tgaloppo I strongly recommend predict() instead of predictLabels(), to be consistent with KMeansModel.
@FlytxtRnD computeSoftAssignments() is a function of the model, not the learning algorithm, so I think it belongs in the model. IMO, findClosest() should be in KMeansModel instead of KMeans, but that should be fixed in another PR. (It is not too important, though, since it is a private[mllib] API.)
GaussianMixtureEM: Renamed from GaussianMixtureModelEM; corrected formatting issues GaussianMixtureModel: Renamed predictLabels() to predict() Others: Modifications based on rename of GaussianMixtureEM
Ok. I changed the privacy of EPSILON and am now using it in this code.
Test build #555 has started for PR 3022 at commit
Test build #555 has finished for PR 3022 at commit
@tgaloppo Thanks for the updates, and thanks for all of your work in getting this ready! LGTM CC: @mengxr After this is merged, I'll make some JIRAs for the various items we've discussed along the way + a few more. Let me know if I've missed anything here:
@jkbradley Thank you for your help and feedback along the way. Please assign some (or all) of those tickets to me and I will continue to improve the implementation. In particular, you mentioned that there are a number of PRs with code for common distributions... I would be happy to help formalize a common interface and make these a public part of the library.
@tgaloppo I've merged this into master. Thanks for contributing GMM!
@tgaloppo @FlytxtRnD I made some JIRAs for the to-do items above. I'd say the most important are:
It would be great to do: Some less critical ones are:
I removed the NaN JIRAs, but we should investigate numerical stability at some point. Please let me know if you'd like any assigned to you, and thanks in advance for your work on this! If I'm able to work on one of the JIRAs, I'll make a note on the JIRA page.
@jkbradley Please assign 5017, 5018, 5019, and 5020 to me. Regarding 5018, can you refer me to other PRs that are bringing in common distributions? I can work toward formalizing an API to make all of them public. I also indicated that I would be happy to provide the Python wrappers for the algorithm (ticket 5012); @FlytxtRnD provided an initial Python implementation of the algorithm... if they would like to provide the wrappers instead, that would be cool (but I am still definitely happy to do it if not). CC: @mengxr
@tgaloppo It's ideal if we assign & fix one JIRA at a time (as separate PRs). Can I start by assigning one of your choosing? For 5018, there is only one other such PR I know of, and it uses a Dirichlet distribution. But for API examples, I would recommend checking out popular libraries, such as R, Matlab, numpy, etc.
@jkbradley No problem. Let's start with 5020, and I'll move on from there.
@jkbradley Please assign me SPARK-5017, and I will take care of this in preparation for 5018 and 5019.
Done :)
Implementation of Expectation-Maximization for Gaussian Mixture Models.
This is my maiden contribution to Apache Spark, so I apologize now if I have done anything incorrectly; having said that, this work is my own, and I offer it to the project under the project's open source license.