Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18385][ML] Make the transformer's natively in ml framework to avoid extra conversion #15831

Closed
wants to merge 5 commits into from

Conversation

techaddict
Copy link
Contributor

@techaddict techaddict commented Nov 9, 2016

What changes were proposed in this pull request?

Follow Up of SPARK-14615
Transformer's added in ml framework to avoid extra conversion for:
ChiSqSelector
IDF
StandardScaler
PCA

How was this patch tested?

Existing Tests

@techaddict
Copy link
Contributor Author

cc: @dbtsai @mengxr

@SparkQA
Copy link

SparkQA commented Nov 9, 2016

Test build #68410 has finished for PR 15831 at commit a9483ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 9, 2016

Test build #68411 has finished for PR 15831 at commit 89e6858.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Copy link
Contributor

sethah commented Nov 17, 2016

I see this patch was created as a result of the PR that separated the ml/mllib linalg packages, to avoid some inefficiencies in conversion. However, it also is a partial step toward feature parity. Typically, we would port full algorithms all at once, instead of just porting the transformer functionality as is done here, but I understand that this is not just about parity. I would suggest one of the following:

  1. Port over full feature functionality. This increases the scope and therefore the algos should probably separated out individually into PRs.
  2. Keep the scope the same, but avoid copying code.

For an example of option 2, for ChiSqSelector, we can implement new static methods in the mllib.ChiSqSelectorModel:

private[spark] def compressDense(
      selectedFeatures: Array[Int],
      values: Array[Double]): Array[Double] = {
    selectedFeatures.map(i => values(i))
  }

  private[spark] def compressSparse(
      compressedSize: Int,
      selectedFeatures: Array[Int],
      indices: Array[Int],
      values: Array[Double]): (Array[Int], Array[Double]) = {
  ...
}

then in the actual model classes we can just do something like:

private def compress(features: Vector): Vector = {
    features match {
      case SparseVector(_, indices, values) =>
        val newSize = selectedFeatures.length
        val (newIndices, newValues) =
          ChiSqSelectorModel.compressSparse(newSize, selectedFeatures, indices, values)
        Vectors.sparse(newSize, newIndices, newValues)
      case DenseVector(values) =>
        Vectors.dense(ChiSqSelectorModel.compressDense(selectedFeatures, values))
    }
}

This approach would allow us to avoid copying a lot of code until we do full feature ports. What are others opinions? I lean towards the second option since it keeps the scope reasonable.

cc @dbtsai @yanboliang

case DenseVector(values) =>
val values = features.toArray
Vectors.dense(selectedFeatures.map(i => values(i)))
case other =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw there is no reason to have this case since Vector is a sealed trait

@techaddict
Copy link
Contributor Author

@sethah I agree, 2nd approach is much more reasonable.

@yanboliang
Copy link
Contributor

@techaddict @sethah I'm more prefer option 1, since we would like to remove spark.mllib package in a future release(may be 3.0) and we wouldn't like to make any change to it except bug fix. Could you make this improvement separately for relevant algorithms? Thanks.

@techaddict
Copy link
Contributor Author

@sethah @yanboliang I've started with migrating IDF, can you review the WIP and if i'm going in the right direction https://github.com/techaddict/spark/pull/2/files
there is some code duplication were we can make mllib code actually depend on the ml one

@MLnick
Copy link
Contributor

MLnick commented Dec 1, 2016

I'm also generally supportive of (1) - porting the code to ml and having the mllib code wrap the ml version - this is the approach for other models that have been done. Of course only once all mllib code has been ported over fully can we ultimately deprecate mllib.

I guess we can start doing this for some transformers like these - but ideally we should focus on porting stuff that's still missing in ml first.

I'd prefer that we create a top-level JIRA to track all the components that need to be done, and link everything appropriately. We also need to decide on priority - we may realistically be working on it over a 1-1.5 year time frame (of course hopefully it will take a lot shorter).

@techaddict
Copy link
Contributor Author

@MLnick I will create a umbrella jira and start adding jira's for things I'm aware of of and you can start prioritising 👍 sounds like a plan ?

@zhengruifeng
Copy link
Contributor

the same TODO also appear in HashingTF, what about include it in this PR?

@sethah
Copy link
Contributor

sethah commented Jan 10, 2017

I think we decided to go a different direction than what is proposed here? Actually, I still think there's merit in fixing the problem without having to do full feature ports. Either way, I'm not sure anyone is still taking on this task, so @zhengruifeng or @techaddict it would be great if you wanted to either revive this PR/help review, or start working on the larger umbrella JIRA and sub tasks...

@techaddict
Copy link
Contributor Author

@sethah I will revive this pr thanks 👍

@zhengruifeng
Copy link
Contributor

@techaddict @sethah I have some time to work on the porting, but I dont find the umbrella JIRA

@HyukjinKwon
Copy link
Member

Hi @@techaddict, how is this PR going?

@techaddict
Copy link
Contributor Author

@HyukjinKwon was busy, will restart this week.

@SparkQA
Copy link

SparkQA commented May 27, 2017

Test build #77448 has finished for PR 15831 at commit 89e6858.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants