[SPARK-3181][MLLIB]: Add Robust Regression Algorithm with Huber Estimator #8013

Closed
wants to merge 1 commit into from

@fjiang6 fjiang6 commented Aug 7, 2015

Huber Robust Regression under spark/ml/regression
Unit Tests

@SparkQA

SparkQA commented Aug 7, 2015

Test build #40111 has finished for PR 8013 at commit 2f67e63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RobustRegression(override val uid: String)

@fjiang6
Author

fjiang6 commented Aug 7, 2015

@mengxr @dbtsai @srowen I have put RobustRegression in the same codebase as LinearRegression, as requested, and included the unit tests.

@dbtsai
Member

dbtsai commented Aug 7, 2015

There is still a lot of duplication. We're adding new features to LinearRegression (LiR) now, and the duplicated code will be hard to maintain. Could you add just the objective function and use Params to switch between the different objective functions? Thanks.

@SparkQA

SparkQA commented Aug 8, 2015

Test build #40222 has finished for PR 8013 at commit 96e38a7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 8, 2015

Test build #40223 has finished for PR 8013 at commit 23e4c62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fjiang6
Author

fjiang6 commented Aug 8, 2015

@dbtsai I just added the objective function and used Params to switch between the different objective functions. Thanks!

@@ -325,4 +325,21 @@ private[ml] trait HasStepSize extends Params {
/** @group getParam */
final def getStepSize: Double = $(stepSize)
}

Member

sharedParams.scala is generated code and should not be edited directly. Please make the change in SharedParamsCodeGen.scala instead.

Member

Also, rename HasRobust to HasRobustRegression in SharedParamsCodeGen.scala.

@SparkQA

SparkQA commented Aug 11, 2015

Test build #40530 has finished for PR 8013 at commit 51e47dc.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class EqualNullSafe(attribute: String, value: Any) extends Filter

@fjiang6
Author

fjiang6 commented Aug 11, 2015

This class was not added by me. I didn't touch PySpark.

@SparkQA

SparkQA commented Aug 23, 2015

Test build #41422 has finished for PR 8013 at commit 1567635.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2015

Test build #41421 has finished for PR 8013 at commit 3bb5930.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2015

Test build #41423 has finished for PR 8013 at commit a04179b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -65,6 +65,10 @@ private[shared] object SharedParamsCodeGen {
isValid = "ParamValidators.inArray(Array(\"skip\", \"error\"))"),
ParamDesc[Boolean]("standardization", "whether to standardize the training features" +
" before fitting the model.", Some("true")),
ParamDesc[Boolean]("robustRegression", "whether to use robust Huber Cost Function",
Contributor

I think it would be better to introduce a "costFunction" param that defaults to "LeastSquares" and to pattern match on it in LinearRegression#L195, since that will force mutual exclusivity once more than two cost functions are possible.
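
To illustrate the suggestion above, here is a minimal sketch of a named cost-function setting that is resolved once and then pattern matched. It is plain Scala with no Spark dependency; `CostFunction`, `CostFunctions.fromName`, and `Loss.loss` are hypothetical names chosen for illustration and are not part of this patch or of Spark's API.

```scala
// Hypothetical sketch: these names are illustrative, not Spark's Params API.
sealed trait CostFunction
case object LeastSquares extends CostFunction
case object Huber extends CostFunction

object CostFunctions {
  // Resolve the string value of a "costFunction"-style param.
  def fromName(name: String): CostFunction = name match {
    case "LeastSquares" => LeastSquares
    case "Huber"        => Huber
    case other =>
      throw new IllegalArgumentException(s"Unknown cost function: $other")
  }
}

object Loss {
  // Per-residual loss; k is only used by the Huber branch.
  def loss(cost: CostFunction, residual: Double, k: Double = 1.345): Double =
    cost match {
      case LeastSquares => 0.5 * residual * residual
      case Huber =>
        val a = math.abs(residual)
        // Quadratic near zero, linear in the tails.
        if (a <= k) 0.5 * residual * residual else k * (a - 0.5 * k)
    }
}
```

Because the trait is sealed, the compiler can warn when a match misses a case, so adding a third cost function later flags every place that needs updating. A Boolean flag per cost function gives no such check.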

@SparkQA

SparkQA commented Aug 27, 2015

Test build #41662 has finished for PR 8013 at commit e447623.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -97,6 +97,22 @@ class LinearRegression(override val uid: String)
setDefault(standardization -> true)

/**
* Set the robust Option to determine whether to use robust Huber Cost Function
Contributor

nit: This is not really an "Option". Can we just make this say "Set whether to use robust Huber Cost Function"?

@feynmanliang
Contributor

There is a lot of code repetition between this and #2096; perhaps you can make the mllib one wrap this?

@dbtsai
Member

dbtsai commented Sep 1, 2015

Hello, the robust tuning parameter k should not be a constant, as it is in your implementation.
In the paper, http://users.stat.umn.edu/~sandy/courses/8053/handouts/robust.pdf
k = 1.345σ, where σ is the squared error of the current squared loss. However, it would be very expensive to compute the squared error of the current squared loss and then compute the Huber loss, so I think it's reasonable to approximate the squared error using the previous weights.
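
A rough sketch of that approximation is below. It assumes σ can be treated as a scale estimate of the residuals and uses the root-mean-square of the residuals from the previous weights as a stand-in. That choice of estimator, and the names `HuberSketch`, `estimateSigma`, `huberLoss`, and `totalLoss`, are assumptions for illustration rather than code from this patch.

```scala
// Hypothetical sketch: plain Scala, not the HuberAggregator in this PR.
object HuberSketch {
  // Assumption: RMS of the previous iteration's residuals stands in for sigma.
  def estimateSigma(previousResiduals: Seq[Double]): Double = {
    require(previousResiduals.nonEmpty, "need at least one residual")
    math.sqrt(previousResiduals.map(r => r * r).sum / previousResiduals.size)
  }

  // Huber loss for a single residual with threshold k.
  def huberLoss(residual: Double, k: Double): Double = {
    val a = math.abs(residual)
    if (a <= k) 0.5 * residual * residual else k * (a - 0.5 * k)
  }

  // Total loss with k = tuning * sigma, where sigma comes from the
  // previous weights' residuals instead of a second pass over the data.
  def totalLoss(
      currentResiduals: Seq[Double],
      previousResiduals: Seq[Double],
      tuning: Double = 1.345): Double = {
    val k = tuning * estimateSigma(previousResiduals)
    currentResiduals.map(huberLoss(_, k)).sum
  }
}
```

Reusing the previous residuals means k lags one iteration behind, which trades a little accuracy for avoiding an extra pass per evaluation of the objective.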

Commits:

  • add the objective function, and use Params to switch
  • edit to pass scala style tests
  • make HasRobustRegression in SharedParamsCodeGen.scala, Make the document more explicitly and make k tunable and default to 1.345 by having another param
  • UnitTests with Outliers
  • UnitTests with Outliers
  • Edit HuberAggregator
  • scala codestyle
  • Update LinearRegression.scala

@SparkQA

SparkQA commented Jan 17, 2016

Test build #49555 has finished for PR 8013 at commit 01601ee.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jun 15, 2016

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one. We can also continue the discussion on the JIRA ticket.

@dbtsai there are a few pull requests that were waiting on your review. Can you revisit them even if they are closed?

@asfgit asfgit closed this in 1a33f2e Jun 15, 2016
@dbtsai
Member

dbtsai commented Jul 4, 2016

@rxin @mengxr I'm back in the US after a leave and am going to revisit the PRs assigned to me.

I worked with @MechCoder to implement the Huber estimator in Python for scikit-learn (scikit-learn/scikit-learn#5291), and that change has been merged. @fjiang6, @MechCoder, @sethah, are you interested in porting this feature to Spark? It should be fairly straightforward.

Thanks.

@yanboliang
Contributor

@dbtsai I'm interested in porting the Huber estimator to Spark. If you have not started on it, I can send a PR in a few days. Thanks!

@dbtsai
Member

dbtsai commented Jul 5, 2016

@yanboliang Sounds great! Thanks.

@MechCoder
Contributor

I'll be happy to review it.
