[SPARK-14661] [MLlib] trim PCAModel by required explained variance #12419
Conversation
Can one of the admins verify this patch?
```diff
@@ -109,4 +111,21 @@ class PCAModel private[spark] (
         s"SparseVector or DenseVector. Instead got: ${vector.getClass}")
     }
   }
+
+  def minimalByVarianceExplained(requiredVarianceRetained: Double): PCAModel = {
+    val minFeaturesNum = explainedVariance
```
How about `explainedVariance.values.scanLeft(0.0)(_ + _).indexWhere(_ >= requiredVarianceRetained) + 1`? OK, to make that robust you'd have to handle the case where that returns 0 (which means you need to keep all the PCs, so you could just return `this`), and also arg-check the required variance to be in [0, 1].
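For illustration, a minimal sketch of that robust version, assuming `explainedVariance` is the model's `DenseVector` of per-component variance fractions; the helper name `minimalK` is made up here:

```scala
import org.apache.spark.mllib.linalg.DenseVector

// Hypothetical helper: the smallest number of leading principal
// components whose cumulative explained variance reaches the
// requested fraction.
def minimalK(explainedVariance: DenseVector, requiredVarianceRetained: Double): Int = {
  require(requiredVarianceRetained > 0.0 && requiredVarianceRetained <= 1.0,
    s"Required variance must be in (0, 1] but got $requiredVarianceRetained")
  // scanLeft(0.0) yields cumulative sums: element i is the variance
  // explained by the first i components.
  val cumulative = explainedVariance.values.scanLeft(0.0)(_ + _)
  val idx = cumulative.indexWhere(_ >= requiredVarianceRetained)
  // indexWhere returns -1 when even all components fall short of the
  // threshold (e.g. an already-trimmed model); keep everything then.
  if (idx < 0) explainedVariance.size else idx
}
```

Note that with the leading 0.0 from `scanLeft`, the index found is already the component count, so no `+ 1` is needed; only the `-1` (not-found) case has to be mapped to keeping all components.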
It looks like this could just as well be implemented in ML instead of MLlib. It is my understanding that we should avoid adding new features to MLlib unless it's blocking an improvement in ML. That doesn't seem to be the case here.
… refactored code to be more robust)
@psuszyns I have some high-level comments. To me, it does not make sense to train a PCA model keeping k components and then trim it by variance explained. If I have data with 10 features and I train a PCA model with k = 6 components, I retain some fraction of the variance. If I then request to trim the model by some fraction greater than what was originally retained, it will be impossible. I think this should be implemented by having two parameters.
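One way to picture that two-parameter design on the spark.ml side; the param name `requiredVariance`, its range, and the trait itself are purely illustrative:

```scala
import org.apache.spark.ml.param.{DoubleParam, ParamValidators, Params}

// Illustrative params trait: alongside the existing `k`, an optional
// fraction of variance that the kept components must explain. fit()
// would compute up to k components and then trim to the smallest
// prefix reaching this fraction.
trait HasRequiredVariance extends Params {
  final val requiredVariance: DoubleParam = new DoubleParam(this,
    "requiredVariance",
    "minimum fraction of variance the kept principal components must explain",
    ParamValidators.inRange(0.0, 1.0, lowerInclusive = false, upperInclusive = true))

  final def getRequiredVariance: Double = $(requiredVariance)
}
```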
Well, I wanted to add this option without breaking or amending the current API. In my app I use it by first training the PCA with k = number of features and then calling the method I added. But I agree that it would be nicer to have the variance retained as an input parameter of the PCA. I'll add appropriate setters to the 'ml' version and another constructor to the 'mllib' version, OK?
I don't believe this will break the API. You can get away without even changing the MLlib API by adding a private constructor or a private call to the fit method that passes in a retained variance parameter. Also, it looks like it would be good to update the …
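A sketch of that non-breaking route, assuming a `private[spark]` overload added inside `mllib.feature.PCA`; it reuses the hypothetical `minimalK` helper from above, and the trimming relies on `pc` being stored column-major:

```scala
import org.apache.spark.mllib.linalg.{DenseMatrix, DenseVector, Vector}
import org.apache.spark.rdd.RDD

// Illustrative overload inside mllib.feature.PCA: the released public
// API is untouched; internal callers pass a variance threshold instead
// of a fixed k.
private[spark] def fit(sources: RDD[Vector], requiredVariance: Double): PCAModel = {
  // Train with every component, then trim to the smallest prefix that
  // reaches the requested variance.
  val numFeatures = sources.first().size
  val full = new PCA(numFeatures).fit(sources)
  val minK = minimalK(full.explainedVariance, requiredVariance)
  // pc is column-major, so its first minK columns occupy the first
  // numRows * minK values.
  val trimmedPc = new DenseMatrix(full.pc.numRows, minK,
    full.pc.values.take(full.pc.numRows * minK))
  val trimmedVariance = new DenseVector(full.explainedVariance.values.take(minK))
  new PCAModel(minK, trimmedPc, trimmedVariance)
}
```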
…incipalComponentsAndExplainedVariance instead of PCAModel mutation function)
@sethah please review my latest commit. Is it anywhere close to what you had in mind?
@psuszyns This introduces a breaking change to the MLlib API, which we should avoid since it is not strictly necessary. Looking at this more carefully, the simplest way to do this seems to be to add it only for spark.ML, by requesting the full PCA from MLlib and then trimming according to retained variance in the spark.ML fit method. I'm not sure we ought to make this available in MLlib, given that we could avoid some of the complexity. If we do, we need to do it in a way that does not break the APIs. Also, please do run the style checker, and see Contributing to Spark for Spark-specific style guidelines.
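Condensed, that spark.ML-only route could look like the following inside `ml.feature.PCA.fit`, mirroring the existing fit plumbing; `$(requiredVariance)` is the hypothetical param and the two-argument `fit` the hypothetical MLlib overload sketched above (which ignores the constructor `k` and trains with all components):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.feature
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.sql.{Dataset, Row}

override def fit(dataset: Dataset[_]): PCAModel = {
  transformSchema(dataset.schema, logging = true)
  val input = dataset.select($(inputCol)).rdd.map {
    case Row(v: Vector) => OldVectors.fromML(v)
  }
  // Request the full decomposition, then trim by retained variance.
  val trimmed = new feature.PCA(k = input.first().size)
    .fit(input, $(requiredVariance))
  copyValues(new PCAModel(uid, trimmed.pc.asML, trimmed.explainedVariance.asML)
    .setParent(this))
}
```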
```diff
    * @return a matrix of size n-by-k, whose columns are principal components, and
    *         a vector of values which indicate how much variance each principal
    *         component explains
    */
   @Since("1.6.0")
-  def computePrincipalComponentsAndExplainedVariance(k: Int): (Matrix, Vector) = {
+  def computePrincipalComponentsAndExplainedVariance(filter: Either[Int, Double])
```
I'm no expert in the ML domain, but from a user perspective this breaks API backwards compatibility. An alternative could be to create a new method and factor out the behaviour shared with the current `computePrincipalComponentsAndExplainedVariance` into a private utility method.
@sethah @jodersky It looks like the `Since("1.6.0")` annotation is false, because this method is not available in Spark 1.6; the change was merged to master instead of the 1.6 branch. Do you still consider this change API-breaking, given that it modifies an API that was never released? If yes, I'll do as @jodersky said and introduce a new method, moving the common code into a new private one. I'd really like to have this feature in the MLlib version because I use it.
Not sure about the breakage; nevertheless, I would recommend implementing a new method regardless. I find the method's parameter type `Either[Int, Double]` quite confusing.
Yeah, having one method mean two things via an `Either` is too strange. At the least, you would provide two overloads. And then there's no reason to overload versus giving them distinct and descriptive names.
I don't understand the question about unreleased APIs: 1.6.0 was released a while ago, and this method takes an `Int` parameter there. We certainly want to keep the ability to set a fixed number of principal components.
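In that spirit, a sketch of two distinctly-named `RowMatrix` methods; `computePrincipalComponentsImpl` stands in for the shared private helper factored out of the current code (assumed to return the concrete `(DenseMatrix, DenseVector)` pair), and `minimalK` is the helper sketched earlier in this thread:

```scala
import org.apache.spark.mllib.linalg.{DenseMatrix, DenseVector, Matrix, Vector}

// The released signature stays as-is: fixed number of components.
def computePrincipalComponentsAndExplainedVariance(k: Int): (Matrix, Vector) =
  computePrincipalComponentsImpl(k)

// Illustrative companion: choose the number of components from a
// variance threshold instead.
def computePrincipalComponentsByVarianceRetained(
    requiredVariance: Double): (Matrix, Vector) = {
  val (pc, variance) = computePrincipalComponentsImpl(numCols().toInt)
  val k = minimalK(variance, requiredVariance)
  // Column-major layout: the first k columns are the first numRows * k values.
  (new DenseMatrix(pc.numRows, k, pc.values.take(pc.numRows * k)),
    new DenseVector(variance.values.take(k)))
}
```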
This is `RowMatrix` as of the 1.6.1 release: https://github.com/apache/spark/blob/15de51c238a7340fa81cb0b80d029a05d97bfc5c/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala, am I correct? If so, can you find a method named `computePrincipalComponentsAndExplainedVariance` there? I can't, yet on master it is annotated with `Since("1.6.0")`. Isn't that false?
Aha, you're right, it wasn't in 1.6. This is my fault: 21b3d2a
It was never added to branch-1.6, despite the apparent intention. At this point I think it should be considered 2.0+, and you can fix that annotation here. So yes, this method was never released. Still, I think we want to do something different with the argument anyway.
Hi @psuszyns, I'm just wondering if this is still active.
What changes were proposed in this pull request?
A new method on `PCAModel` that auto-trims the model to the minimal number of features needed to retain a required fraction of explained variance.
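For context, the intended usage looks roughly like this, assuming the method lands as proposed in the diff above and an existing SparkContext `sc`; the threshold value is arbitrary:

```scala
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(2.0, 0.0, 3.0, 4.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0),
  Vectors.dense(6.0, 1.0, 0.0, 8.0)))

// Train with all components, then trim to the smallest model that still
// explains 95% of the variance.
val fullModel = new PCA(k = 4).fit(data)
val trimmed = fullModel.minimalByVarianceExplained(0.95)
val projected = trimmed.transform(data)
```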
How was this patch tested?
unit tests