-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4980] [MLlib] Add decay factors to streaming linear methods #8022
[SPARK-4980] [MLlib] Add decay factors to streaming linear methods #8022
Conversation
def getDiscount(numNewDataPoints: Long): Double | ||
} | ||
|
||
private[mllib] trait StreamingDecaySetter[T <: StreamingDecaySetter[T]] extends Logging { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need F-bounded polymorphism here? Does the code not work when you replace T
with self.type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Does the code not work when you replace T with self.type?"
I guess not? For example,
trait Setter { def set: this.type = this}
class Apple extends Setter
val a = new Apple()
a.set
The return type of a.set
is a.type
, not Apple
. Do I answer your question?
"Why do we need F-bounded polymorphism here?"
I agree with you that this is not needed here. Originally I included this as an extra level of type checking. But since I have self: T=>
in the next line, I don't think we need it any more. I will remove it in the next push to the PR.
@rotationsymmetry you also have a merge conflict, sorry 😞 do you mind resolving? |
@feynmanliang Thank you very much for your review. I have incorporated your comments in commit a4ed2b0.
As for your comment of "having getLambda instead of getDiscount in StreamingDecay", I feel that the discount factor better conveys the mathematical idea of the algorithm. Lambda, on the other hand, is only a temporary value in the calculation. For example, in the spark doc, the discount factor is employed to describe the algorithm. I have included similar description in the ScalaDoc for Thanks again for your review. If you have any further comments, please let me know. |
@@ -32,6 +32,11 @@ import org.apache.spark.mllib.regression.StreamingLinearAlgorithm | |||
* of features must be constant. An initial weight | |||
* vector must be provided. | |||
* | |||
* This class inherits the forgetful algorithm from StreamingLinearAlgorithm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"[[StreamingLinearAlgorithm]]" so API docs generate a link, ditto for L37
@feynmanliang I have make another push to the PR: Refactor StreamingDecay Thank again for your review. |
@@ -101,4 +107,14 @@ class StreamingLogisticRegressionWithSGD private[mllib] ( | |||
this.model = Some(algorithm.createModel(initialWeights, 0.0)) | |||
this | |||
} | |||
|
|||
override def setDecayFactor(decayFactor: Double): this.type = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This boilerplate is duplicated in streaming linear regression. I am guessing you do this to get the concrete subclass (correct me if I'm wrong), but you actually don't need to do this since the this.type
in trait StreamingDecay
takes care of this. A simple REPL example:
scala> trait Superclass { def test: this.type }
defined trait Superclass
scala> class Subclass extends Superclass { def test = this }
defined class Subclass
scala> (new Subclass()).test
res0: Subclass = Subclass@1cb4ab3e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I meant that you could remove these setters entirely
scala> trait Superclass { def test: this.type = this }
defined trait Superclass
scala> class Subclass extends Superclass
defined class Subclass
scala> (new Subclass).test
res1: Subclass = Subclass@b364520
Made another pass |
@feynmanliang Thank you for your comments. I have revised the PR, including
As I am rewriting the ScalaDoc, it appears that the algorithm can be more easily described and understood if we rename decayFactor to retentionFactor. What do you think? |
Streaming KMeans uses |
*/ | ||
@Experimental | ||
private[mllib] trait StreamingDecay extends Logging{ | ||
private[this] var decayFactor: Double = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just private
is fine
LGTM after these changes and pending tests |
@feynmanliang Much appreciated. I have update the PR for your comments. |
add to whitelist |
ok to test |
Test build #44017 has finished for PR 8022 at commit
|
@rotationsymmetry Could you provide a simple unit test in Java to show Java compatibility? |
val lambda = numNewDataPoints / updatedDataWeight | ||
|
||
BLAS.scal(lambda, newModel.weights) | ||
BLAS.axpy(1-lambda, model.get.weights, newModel.weights) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have some references about this merging scheme? I assume that this works for many cases, but there is no guarantee in theory.
Revise test "parameter accuracy" in StreamingLinearRegressionSuite to account for decay.
Split StreamingDecay into two traits. Update StreamingLogisticRegressionWithSGD. Update test suites.
Also make StreamingDecaySetter to be private[mllib].
Add ScalaDoc for public API. Add ScalaDoc to decribe the forgetful algorithm in StreamingLinearAlgorithm. Remove F-polymorphism in StreamingDecaySetter[T]. decayFactor and timeUnit in StreamingDecaySetter[T] are now private. Remove division by zero in trainOn of StreamingLinearAlgorithm; provide comments to explains why. Improve testing cases of StreamingLogisticRegressionSuite to have rel tol=0.1.
Refactor StreamingDecay Use case object for TimeUnit Clean up ScalaDoc
Add @SInCE. Clean up ScalaDoc.
clean up new lines and comments.
Test build #45336 has finished for PR 8022 at commit
|
Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one. |
This PR includes an implementation of decay factors in streaming linear and logistic regression. Unit tests are also included.
The algorithm and design details are described in the document: https://docs.google.com/document/d/1UfKvuaaJVQCvh-wOLLYT8l7STQFjPxE7fitZyd0tqTo/edit?usp=sharing
Your comments and suggestions are highly appreciated. I will add more tests and ScalaDoc as suggested.
Thanks!
cc @freeman-lab @mengxr