[SPARK-18385][ML] Make the transformer's natively in ml framework to avoid extra conversion #15831

techaddict · 2016-11-09T16:14:15Z

What changes were proposed in this pull request?

Follow-up to SPARK-14615.
Implements the transformers natively in the ml framework, avoiding the extra conversion, for:
ChiSqSelector
IDF
StandardScaler
PCA

How was this patch tested?

Existing Tests


techaddict · 2016-11-09T16:16:42Z

cc: @dbtsai @mengxr

SparkQA · 2016-11-09T17:15:39Z

Test build #68410 has finished for PR 15831 at commit a9483ef.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-09T17:17:59Z

Test build #68411 has finished for PR 15831 at commit 89e6858.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-11-17T19:40:16Z

I see this patch was created as a result of the PR that separated the ml/mllib linalg packages, to avoid some inefficiencies in conversion. However, it also is a partial step toward feature parity. Typically, we would port full algorithms all at once, instead of just porting the transformer functionality as is done here, but I understand that this is not just about parity. I would suggest one of the following:

1. Port over full feature functionality. This increases the scope, so the algorithms should probably be separated into individual PRs.
2. Keep the scope the same, but avoid copying code.

For an example of option 2, for ChiSqSelector, we can implement new static methods in the mllib.ChiSqSelectorModel:

private[spark] def compressDense(
    selectedFeatures: Array[Int],
    values: Array[Double]): Array[Double] = {
  selectedFeatures.map(i => values(i))
}

private[spark] def compressSparse(
    compressedSize: Int,
    selectedFeatures: Array[Int],
    indices: Array[Int],
    values: Array[Double]): (Array[Int], Array[Double]) = {
  ...
}

then in the actual model classes we can just do something like:

private def compress(features: Vector): Vector = {
    features match {
      case SparseVector(_, indices, values) =>
        val newSize = selectedFeatures.length
        val (newIndices, newValues) =
          ChiSqSelectorModel.compressSparse(newSize, selectedFeatures, indices, values)
        Vectors.sparse(newSize, newIndices, newValues)
      case DenseVector(values) =>
        Vectors.dense(ChiSqSelectorModel.compressDense(selectedFeatures, values))
    }
}

This approach would allow us to avoid copying a lot of code until we do full feature ports. What are others' opinions? I lean towards the second option since it keeps the scope reasonable.
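To make the shared-helper idea concrete, here is a minimal, Spark-free sketch of how helpers operating on raw arrays could work. It is an illustration only, not the code from the patch or from mllib: the object name `CompressSketch` is hypothetical, the `compressedSize` parameter is dropped because the sketch does not need it, and it assumes both `selectedFeatures` and `indices` are sorted in ascending order.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper object; names and signatures are assumptions for illustration.
object CompressSketch {

  // Dense case: pick out the selected positions in order.
  def compressDense(selectedFeatures: Array[Int], values: Array[Double]): Array[Double] =
    selectedFeatures.map(i => values(i))

  // Sparse case: walk both sorted index arrays in lockstep and keep only the
  // entries whose index is selected. The new index of a kept entry is its
  // position within selectedFeatures.
  def compressSparse(
      selectedFeatures: Array[Int],
      indices: Array[Int],
      values: Array[Double]): (Array[Int], Array[Double]) = {
    val newIndices = ArrayBuffer.empty[Int]
    val newValues = ArrayBuffer.empty[Double]
    var i = 0 // position in the sparse vector's indices
    var j = 0 // position in selectedFeatures
    while (i < indices.length && j < selectedFeatures.length) {
      val idx = indices(i)
      val sel = selectedFeatures(j)
      if (idx == sel) {
        newIndices += j
        newValues += values(i)
        i += 1
        j += 1
      } else if (idx < sel) {
        i += 1
      } else {
        j += 1
      }
    }
    (newIndices.toArray, newValues.toArray)
  }
}
```

Because the helpers take and return plain arrays, both the ml and mllib model classes could call them and wrap the result in their own vector type, which is the point of option 2.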

cc @dbtsai @yanboliang

sethah · 2016-11-17T19:41:59Z

mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala

+      case DenseVector(values) =>
+        val values = features.toArray
+        Vectors.dense(selectedFeatures.map(i => values(i)))
+      case other =>


btw there is no reason to have this case since Vector is a sealed trait
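A small sketch of why the fallback case is redundant, using a hypothetical stand-in hierarchy (`Vec`, `Dense`, `Sparse`) rather than Spark's actual classes: when a trait is sealed, all of its subtypes are known in the defining file, so a match covering each of them is already exhaustive.

```scala
// Hypothetical stand-ins for a sealed vector hierarchy; not Spark's classes.
sealed trait Vec
final case class Dense(values: Array[Double]) extends Vec
final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

object SealedMatchSketch {
  // Both subtypes are handled, so no `case other =>` branch is needed;
  // the compiler can check exhaustiveness because Vec is sealed.
  def nonZeros(v: Vec): Int = v match {
    case Dense(values) => values.count(_ != 0.0)
    case Sparse(_, _, values) => values.count(_ != 0.0)
  }
}
```

If a third subtype were added later, the compiler would warn about the now non-exhaustive match, which a catch-all branch would hide.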

techaddict · 2016-11-17T23:53:17Z

@sethah I agree, 2nd approach is much more reasonable.

yanboliang · 2016-11-21T13:27:18Z

@techaddict @sethah I prefer option 1, since we would like to remove the spark.mllib package in a future release (maybe 3.0) and would rather not change it except for bug fixes. Could you make this improvement separately for each relevant algorithm? Thanks.

techaddict · 2016-12-01T15:39:52Z

@sethah @yanboliang I've started migrating IDF. Can you review the WIP and tell me whether I'm going in the right direction? https://github.com/techaddict/spark/pull/2/files
There is some code duplication where we can make the mllib code depend on the ml one.

MLnick · 2016-12-01T18:12:26Z

I'm also generally supportive of (1) - porting the code to ml and having the mllib code wrap the ml version - this is the approach for other models that have been done. Of course only once all mllib code has been ported over fully can we ultimately deprecate mllib.

I guess we can start doing this for some transformers like these - but ideally we should focus on porting stuff that's still missing in ml first.

I'd prefer that we create a top-level JIRA to track all the components that need to be done, and link everything appropriately. We also need to decide on priority: we may realistically be working on it over a 1-1.5 year time frame (though hopefully it will take much less time).

techaddict · 2016-12-02T01:22:01Z

@MLnick I will create an umbrella JIRA and start adding JIRAs for the things I'm aware of, and you can start prioritising 👍 Sounds like a plan?

zhengruifeng · 2017-01-10T02:33:42Z

The same TODO also appears in HashingTF. What about including it in this PR?

sethah · 2017-01-10T02:39:14Z

I think we decided to go a different direction than what is proposed here? Actually, I still think there's merit in fixing the problem without having to do full feature ports. Either way, I'm not sure anyone is still taking on this task, so @zhengruifeng or @techaddict it would be great if you wanted to either revive this PR/help review, or start working on the larger umbrella JIRA and sub tasks...

techaddict · 2017-01-10T03:40:37Z

@sethah I will revive this PR, thanks 👍

zhengruifeng · 2017-01-10T10:41:09Z

@techaddict @sethah I have some time to work on the porting, but I can't find the umbrella JIRA.

HyukjinKwon · 2017-05-11T13:54:29Z

Hi @techaddict, how is this PR going?

techaddict · 2017-05-17T12:52:05Z

@HyukjinKwon was busy, will restart this week.

SparkQA · 2017-05-27T05:05:50Z

Test build #77448 has finished for PR 15831 at commit 89e6858.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

techaddict added 4 commits November 9, 2016 20:23

ChiSqSelector: make the transformer natively in ml framework to avoid extra conversion

da36261

add transformer for IDF

733394f

add StandardScaler transform

da43731

add PCA transform

a9483ef

remove TODO:

89e6858

sethah reviewed Nov 17, 2016


HyukjinKwon mentioned this pull request Jun 7, 2017

[INFRA] Close stale PRs #18223

Closed

asfgit closed this in b771fed Jun 8, 2017
