[SPARK-18385][ML] Make the transformer's natively in ml framework to avoid extra conversion #15831
Conversation
Test build #68410 has finished for PR 15831 at commit
Test build #68411 has finished for PR 15831 at commit
I see this patch was created as a result of the PR that separated the ml/mllib linalg packages, to avoid some inefficiencies in conversion. However, it is also a partial step toward feature parity. Typically, we would port full algorithms all at once, instead of just porting the transformer functionality as is done here, but I understand that this is not just about parity. I would suggest one of the following:

As an example of option 2:

```scala
private[spark] def compressDense(
    selectedFeatures: Array[Int],
    values: Array[Double]): Array[Double] = {
  selectedFeatures.map(i => values(i))
}

private[spark] def compressSparse(
    compressedSize: Int,
    selectedFeatures: Array[Int],
    indices: Array[Int],
    values: Array[Double]): (Array[Int], Array[Double]) = {
  ...
}
```

Then in the actual model classes we can just do something like:

```scala
private def compress(features: Vector): Vector = {
  features match {
    case SparseVector(_, indices, values) =>
      val newSize = selectedFeatures.length
      val (newIndices, newValues) =
        ChiSqSelectorModel.compressSparse(newSize, selectedFeatures, indices, values)
      Vectors.sparse(newSize, newIndices, newValues)
    case DenseVector(values) =>
      Vectors.dense(ChiSqSelectorModel.compressDense(selectedFeatures, values))
  }
}
```

This approach would allow us to avoid copying a lot of code until we do full feature ports. What are others' opinions? I lean towards the second option since it keeps the scope reasonable.
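The body of `compressSparse` is elided in the snippet above. As a self-contained sketch of the index remapping it would need to do (using plain arrays instead of Spark's `Vector` types; the object name `CompressSketch` and the simplified signature are hypothetical, not Spark's actual API):

```scala
// Hypothetical, Spark-free sketch of the feature-selection logic discussed above.
object CompressSketch {
  // Dense case: keep only the values at the selected positions.
  def compressDense(selectedFeatures: Array[Int], values: Array[Double]): Array[Double] =
    selectedFeatures.map(i => values(i))

  // Sparse case: map each stored index to its position among the selected
  // features, dropping entries whose index was not selected.
  def compressSparse(
      selectedFeatures: Array[Int],
      indices: Array[Int],
      values: Array[Double]): (Array[Int], Array[Double]) = {
    val posByFeature = selectedFeatures.zipWithIndex.toMap
    val kept = indices.zip(values).collect {
      case (i, v) if posByFeature.contains(i) => (posByFeature(i), v)
    }
    (kept.map(_._1), kept.map(_._2))
  }
}
```

For instance, with `selectedFeatures = Array(1, 3)`, a sparse input storing indices `Array(0, 1, 3)` would drop feature 0 and remap features 1 and 3 to new indices 0 and 1.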
```scala
case DenseVector(values) =>
  val values = features.toArray
  Vectors.dense(selectedFeatures.map(i => values(i)))
case other =>
```
btw there is no reason to have this case since `Vector` is a sealed trait
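For context on why the catch-all case is unnecessary: a sealed trait can only be extended within its own file, so the compiler knows every subtype and checks match exhaustiveness. A minimal standalone illustration (using a hypothetical `Shape` hierarchy, not Spark's `Vector`):

```scala
// A sealed trait can only be extended in the same source file, so the
// compiler knows the complete set of subtypes.
sealed trait Shape
final case class Circle(r: Double) extends Shape
final case class Square(side: Double) extends Shape

object SealedMatch {
  def area(shape: Shape): Double = shape match {
    case Circle(r)    => math.Pi * r * r
    case Square(side) => side * side
    // No `case other =>` needed: if a subtype were missing here, the
    // compiler would warn about a non-exhaustive match.
  }
}
```

The same reasoning applies to matching on `SparseVector` and `DenseVector`: together they cover every `Vector`, so a default case is dead code.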
@sethah I agree, the 2nd approach is much more reasonable.
@techaddict @sethah I prefer option 1, since we would like to remove the spark.mllib package in a future release (maybe 3.0) and we wouldn't like to make any changes to it except bug fixes. Could you make this improvement separately for the relevant algorithms? Thanks.
@sethah @yanboliang I've started with migrating
I'm also generally supportive of (1): porting the code. I guess we can start doing this for some transformers like these, but ideally we should focus on porting stuff that's still missing. I'd prefer that we create a top-level JIRA to track all the components that need to be done, and link everything appropriately. We also need to decide on priority; we may realistically be working on it over a 1–1.5 year time frame (though hopefully it will take a lot less).
@MLnick I will create an umbrella JIRA and start adding JIRAs for things I'm aware of, and you can start prioritising 👍 Sounds like a plan?
The same TODO also appears in
I think we decided to go in a different direction than what is proposed here? Actually, I still think there's merit in fixing the problem without having to do full feature ports. Either way, I'm not sure anyone is still taking on this task, so @zhengruifeng or @techaddict, it would be great if you wanted to either revive this PR / help review, or start working on the larger umbrella JIRA and subtasks...
@sethah I will revive this PR, thanks 👍
@techaddict @sethah I have some time to work on the porting, but I can't find the umbrella JIRA.
Hi @techaddict, how is this PR going?
@HyukjinKwon I was busy, will restart this week.
Test build #77448 has finished for PR 15831 at commit
What changes were proposed in this pull request?
Follow-up of SPARK-14615.
Transformers are added natively in the ml framework to avoid extra conversion for:
- ChiSqSelector
- IDF
- StandardScaler
- PCA
How was this patch tested?
Existing tests.