[SPARK-4980] [MLlib] Add decay factors to streaming linear methods #8022

rotationsymmetry · 2015-08-07T05:03:17Z

This PR includes an implementation of decay factors in streaming linear and logistic regression. Unit tests are also included.

The algorithm and design details are described in the document: https://docs.google.com/document/d/1UfKvuaaJVQCvh-wOLLYT8l7STQFjPxE7fitZyd0tqTo/edit?usp=sharing

Your comments and suggestions are highly appreciated. I will add more tests and ScalaDoc as suggested.

Thanks!

cc @freeman-lab @mengxr

feynmanliang · 2015-08-25T01:12:59Z

mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingDecay.scala

+  def getDiscount(numNewDataPoints: Long): Double
+}
+
+private[mllib] trait StreamingDecaySetter[T <: StreamingDecaySetter[T]] extends Logging {


Why do we need F-bounded polymorphism here? Does the code not work when you replace T with self.type?

rotationsymmetry · 2015-08-26

"Does the code not work when you replace T with self.type?"

I guess not? For example,

    trait Setter { def set: this.type = this }
    class Apple extends Setter
    val a = new Apple()
    a.set

The return type of a.set is a.type, not Apple. Does that answer your question?

"Why do we need F-bounded polymorphism here?"
I agree with you that this is not needed here. Originally I included it as an extra level of type checking. But since I have self: T => on the next line, I don't think we need it anymore. I will remove it in the next push to the PR.
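
For readers comparing the two designs discussed above, here is a minimal sketch that puts an F-bounded setter (with a self-type) next to a this.type setter. All names (FBoundedSetter, ThisTypeSetter, Pear, Apple, factor) are hypothetical and are not taken from the PR. Note that a singleton type such as a.type conforms to Apple, so a this.type result can still be used wherever an Apple is expected.

```scala
// Hypothetical names throughout; this is not code from the PR.

// Variant 1: F-bounded type parameter plus a self-type, as in the first draft.
trait FBoundedSetter[T <: FBoundedSetter[T]] { self: T =>
  protected var factor: Double = 0.0
  // The self-type lets `this` be returned as T.
  def setFactor(f: Double): T = { factor = f; this }
}

// Variant 2: no type parameter; the setter returns this.type.
trait ThisTypeSetter {
  protected var factor: Double = 0.0
  def setFactor(f: Double): this.type = { factor = f; this }
}

class Pear extends FBoundedSetter[Pear] {
  def describe: String = s"Pear($factor)"
}

class Apple extends ThisTypeSetter {
  def describe: String = s"Apple($factor)"
}

object SetterDemo {
  def main(args: Array[String]): Unit = {
    // Both results are usable as the concrete class.
    val pear: Pear = new Pear().setFactor(0.5)
    val apple: Apple = new Apple().setFactor(0.5)
    println(pear.describe)
    println(apple.describe)
  }
}
```

Both variants keep the concrete type for callers; the this.type version simply needs less machinery on every implementing class.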

feynmanliang · 2015-08-27T02:37:04Z

@rotationsymmetry you also have a merge conflict, sorry 😞 do you mind resolving?

rotationsymmetry · 2015-08-27T21:19:40Z

@feynmanliang Thank you very much for your review.

I have incorporated your comments in commit a4ed2b0.

Add ScalaDoc for public API.
Add ScalaDoc to describe the forgetful algorithm in StreamingLinearAlgorithm.
Remove F-polymorphism in StreamingDecaySetter[T].
decayFactor and timeUnit in StreamingDecaySetter[T] are now private.
Remove division by zero in trainOn of StreamingLinearAlgorithm; provide comments to explain why.
Improve test cases of StreamingLogisticRegressionSuite to use a relative tolerance of 0.1.
Resolve merge conflict.

As for your comment about "having getLambda instead of getDiscount in StreamingDecay", I feel that the discount factor better conveys the mathematical idea of the algorithm. Lambda, on the other hand, is only a temporary value in the calculation. For example, the Spark docs use the discount factor to describe the algorithm. I have included a similar description in the ScalaDoc for StreamingLinearAlgorithm.

Thanks again for your review. If you have any further comments, please let me know.

feynmanliang · 2015-08-27T21:40:38Z

...rc/main/scala/org/apache/spark/mllib/classification/StreamingLogisticRegressionWithSGD.scala

@@ -32,6 +32,11 @@ import org.apache.spark.mllib.regression.StreamingLinearAlgorithm
 * of features must be constant. An initial weight
 * vector must be provided.
 *
+ * This class inherits the forgetful algorithm from StreamingLinearAlgorithm


"[[StreamingLinearAlgorithm]]" so API docs generate a link, ditto for L37

rotationsymmetry · 2015-09-02T03:23:36Z

@feynmanliang I have made another push to the PR:

Refactor StreamingDecay
Use case objects for TimeUnit (only for regression/classification; no change to StreamingKMeans until its author replies)
Clean up ScalaDoc
Add tests for half life and TimeUnit

Thanks again for your review.
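
To make the half-life and time-unit items above concrete, here is a minimal sketch of how a half-life could map to a decay factor and how the time unit could change the per-batch discount. It assumes the same semantics as StreamingKMeans (time measured in batches or in points). The names TimeUnit, Batches, Points, DecaySketch, decayFactorFromHalfLife and discount are hypothetical and not taken from the PR.

```scala
// Hypothetical sketch; names do not come from the PR.
sealed trait TimeUnit
case object Batches extends TimeUnit
case object Points extends TimeUnit

object DecaySketch {
  // A half-life h means the weight of old data halves after h time units,
  // so decayFactor^h = 0.5, i.e. decayFactor = exp(ln(0.5) / h).
  def decayFactorFromHalfLife(halfLife: Double): Double = {
    require(halfLife > 0, "halfLife must be positive")
    math.exp(math.log(0.5) / halfLife)
  }

  // Discount applied to the weight of previously seen data when a batch arrives.
  // Assumption: per-batch decay ignores the batch size, per-point decay compounds over it.
  def discount(decayFactor: Double, unit: TimeUnit, numNewDataPoints: Long): Double =
    unit match {
      case Batches => decayFactor
      case Points => math.pow(decayFactor, numNewDataPoints.toDouble)
    }

  def main(args: Array[String]): Unit = {
    val a = decayFactorFromHalfLife(5.0)
    println(discount(a, Batches, 100L)) // one batch elapsed, regardless of its size
    println(discount(a, Points, 5L))    // five points elapsed, close to 0.5
  }
}
```

Using case objects instead of strings lets the compiler check the match for exhaustiveness, which is presumably the motivation for the refactoring.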

feynmanliang · 2015-09-03T17:46:48Z

...rc/main/scala/org/apache/spark/mllib/classification/StreamingLogisticRegressionWithSGD.scala

@@ -101,4 +107,14 @@ class StreamingLogisticRegressionWithSGD private[mllib] (
    this.model = Some(algorithm.createModel(initialWeights, 0.0))
    this
  }
+
+  override def setDecayFactor(decayFactor: Double): this.type = {


This boilerplate is duplicated in streaming linear regression. I am guessing you do this to get the concrete subclass (correct me if I'm wrong), but you actually don't need to do this since the this.type in trait StreamingDecay takes care of this. A simple REPL example:

    scala> trait Superclass { def test: this.type }
    defined trait Superclass

    scala> class Subclass extends Superclass { def test = this }
    defined class Subclass

    scala> (new Subclass()).test
    res0: Subclass = Subclass@1cb4ab3e

Oh, I meant that you could remove these setters entirely

    scala> trait Superclass { def test: this.type = this }
    defined trait Superclass

    scala> class Subclass extends Superclass
    defined class Subclass

    scala> (new Subclass).test
    res1: Subclass = Subclass@b364520
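
Applied to the situation in this review, the same idea might look like the sketch below: one trait owns the decay setter, and two learners mix it in without overriding anything. DecaySettings, LinearLearner and LogisticLearner are hypothetical stand-ins, not the PR's classes, and the default values are arbitrary.

```scala
// Hypothetical stand-ins for the PR's trait and the two streaming learners.
trait DecaySettings {
  private var decayFactor: Double = 1.0

  // Defined once here; this.type keeps the caller's concrete type.
  def setDecayFactor(a: Double): this.type = {
    require(a >= 0.0 && a <= 1.0, "decay factor must be in [0, 1]")
    decayFactor = a
    this
  }

  def getDecayFactor: Double = decayFactor
}

class LinearLearner extends DecaySettings {
  private var stepSize: Double = 0.1
  def setStepSize(s: Double): this.type = { stepSize = s; this }
  def getStepSize: Double = stepSize
}

class LogisticLearner extends DecaySettings {
  private var numIterations: Int = 50
  def setNumIterations(n: Int): this.type = { numIterations = n; this }
  def getNumIterations: Int = numIterations
}

object ChainingDemo {
  def main(args: Array[String]): Unit = {
    // The inherited setter comes first, yet the subclass setter is still visible.
    val linear = new LinearLearner().setDecayFactor(0.9).setStepSize(0.05)
    val logistic = new LogisticLearner().setDecayFactor(0.5).setNumIterations(10)
    println((linear.getDecayFactor, linear.getStepSize))
    println((logistic.getDecayFactor, logistic.getNumIterations))
  }
}
```

Neither learner repeats the setter, which is the duplication the comment above asks to remove.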

feynmanliang · 2015-09-03T18:22:59Z

Made another pass

rotationsymmetry · 2015-09-04T22:33:49Z

@feynmanliang Thank you for your comments. I have revised the PR, including

Refactor: timeUnit has its own setter.
Add @Since.
Clean up ScalaDoc.

As I am rewriting the ScalaDoc, it appears that the algorithm can be more easily described and understood if we rename decayFactor to retentionFactor. What do you think?

feynmanliang · 2015-09-05T02:57:27Z

Streaming KMeans uses decayFactor and I think it's important we maintain consistency

feynmanliang · 2015-09-05T03:03:41Z

mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingDecay.scala

+ */
+@Experimental
+private[mllib] trait StreamingDecay extends Logging{
+  private[this] var decayFactor: Double = 0


just private is fine

feynmanliang · 2015-09-05T03:08:15Z

LGTM after these changes and pending tests

CC @mengxr @freeman-lab

rotationsymmetry · 2015-09-05T21:34:35Z

@feynmanliang Much appreciated. I have updated the PR per your comments.

mengxr · 2015-10-20T23:33:54Z

add to whitelist

mengxr · 2015-10-20T23:33:57Z

ok to test

SparkQA · 2015-10-21T00:29:16Z

Test build #44017 has finished for PR 8022 at commit 9ba83cb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2015-11-03T16:43:12Z

@rotationsymmetry Could you provide a simple unit test in Java to show Java compatibility?

mengxr · 2015-11-04T00:45:51Z

mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala

+        val lambda = numNewDataPoints / updatedDataWeight
+
+        BLAS.scal(lambda, newModel.weights)
+        BLAS.axpy(1-lambda, model.get.weights, newModel.weights)


Do we have some references about this merging scheme? I assume that this works for many cases, but there is no guarantee in theory.
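
For reference, the two BLAS calls in the diff compute newWeights = lambda * newWeights + (1 - lambda) * oldWeights, a convex combination of the previous model and the model fitted on the new batch. The sketch below restates that on plain arrays. The formula for updatedDataWeight is not visible in the quoted diff, so the one used here (discounted previous weight plus the new point count) is an assumption, as are the names MergeSketch, merge and previousDataWeight and the zero-weight guard.

```scala
object MergeSketch {
  // Convex combination of the previous weights and the weights fitted on the new batch.
  // Assumption (not shown in the diff):
  //   updatedDataWeight = previousDataWeight * discount + numNewDataPoints
  def merge(
      oldWeights: Array[Double],
      newWeights: Array[Double],
      previousDataWeight: Double,
      discount: Double,
      numNewDataPoints: Long): (Array[Double], Double) = {
    require(oldWeights.length == newWeights.length, "weight vectors must have equal length")
    val updatedDataWeight = previousDataWeight * discount + numNewDataPoints
    // Guard added for this sketch: with no history and an empty batch, keep the old weights.
    val lambda = if (updatedDataWeight == 0.0) 0.0 else numNewDataPoints / updatedDataWeight
    val merged = oldWeights.zip(newWeights).map { case (o, n) => lambda * n + (1 - lambda) * o }
    (merged, updatedDataWeight)
  }

  def main(args: Array[String]): Unit = {
    // 100 units of history discounted by 0.5, plus 50 new points: lambda = 50 / 100 = 0.5.
    val (w, total) = merge(Array(1.0, 2.0), Array(3.0, 6.0), 100.0, 0.5, 50L)
    println(w.mkString(", ") + " total=" + total)
  }
}
```

With those inputs lambda is 0.5, so the merged weights are (2.0, 4.0) and the carried-over data weight is 100.0. A discount of 1 makes lambda shrink as history accumulates, while a discount of 0 discards the previous model entirely.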

SparkQA · 2015-11-09T05:08:36Z

Test build #45336 has finished for PR 8022 at commit 0072400.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-06-15T22:05:45Z

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one.

rotationsymmetry and others added 11 commits November 8, 2015 14:49

Add decay to StreamingLinearAlgorithm through StreamingDecay trait.

a20e2f4

Revise test "parameter accuracy" in StreamingLinearRegressionSuite to account for decay.

Fix fluent setter API

d43c3a8

Split StreamingDecay into two traits. Update StreamingLogisticRegressionWithSGD. Update test suites.

Add unit tests.

0534328

Also make StreamingDecaySetter to be private[mllib].

minor fixes

999beba

fix Scala style

98a8a5b

incorporating further comments

16227ab

Refactor StreamingDecay Use case object for TimeUnit Clean up ScalaDoc

Add tests for half-life and TimeUnit.

8605004

Refactor: timeUnit has its own setter.

686fd2c

Add @Since. Clean up ScalaDoc.

remove duplicate setters.

3b42f96

clean up new lines and comments.

Improve Java API compatibility.

0072400

asfgit closed this in 1a33f2e Jun 15, 2016
