[Spark-12732][ML] bug fix in linear regression train #10702
Conversation
```diff
@@ -219,33 +219,41 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
     }

     val yMean = ySummarizer.mean(0)
-    val yStd = math.sqrt(ySummarizer.variance(0))
+    var yStd = math.sqrt(ySummarizer.variance(0))
```
Can you put all this logic up here?
```scala
val rawYStd = math.sqrt(ySummarizer.variance(0))
val yStd = if (rawYStd > 0.0 || fitIntercept) {
  rawYStd
} else {
  logWarning(...)
  1.0
}
```
Also, when `rawYStd == 0.0`, `standardization == true`, and `regParam != 0.0`, the problem will be ill-defined. We may want to throw an exception.
@iyounus what do you think of these comments?
@srowen I agree with these comments. I'm working on `WeightedLeastSquares` (#10274), and there are some commonalities between these two issues. Once I've completed all tests for `WeightedLeastSquares`, I'll complete this issue. Some of the tests for `LinearRegression` use both the `normal` and `l-bfgs` solvers. It would be nice if the `WeightedLeastSquares` issue were merged first so that I can write the tests in a similar way.
Test build #49163 has finished for PR 10702 at commit …
@dbtsai Do you have time to make a pass?
Test build #49816 has finished for PR 10702 at commit …
…0, standardization=true and fitIntercept=false. Added test for this case. Modified existing test for constant label.
I've added an exception for the case when the label is constant and … I cannot test the case when … Even for the case when …
Test build #49889 has finished for PR 10702 at commit …
@iyounus Thanks.
test("regularized linear regression through origin with constant label") { | ||
// The problem is ill-defined if fitIntercept=false, regParam is non-zero and \ | ||
// standardization=true. An exception is thrown in this case. |
When `standardization=false`, the problem is still ill-defined, since GLMNET always standardizes the labels. That's why you see it in the analytical solution. Let's throw an exception when `fitIntercept=false` and `regParam != 0.0`.
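A minimal sketch of the validation being requested (illustrative only; the variable names follow the snippets above, and this may differ from the exact code that was merged):

```scala
// Fail fast on the reviewers' ill-defined case: constant label, no intercept,
// and a non-zero regularization parameter.
if (rawYStd == 0.0 && !fitIntercept && regParam != 0.0) {
  throw new IllegalArgumentException(
    "The standard deviation of the label is zero. The model cannot be " +
    "regularized with fitIntercept=false.")
}
```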
…se. Also, if yMean==0 and yStd==0, no training is done.
Test build #50450 has finished for PR 10702 at commit …
I've completed this PR. I think all the tests are there. Here, I'm going to document a couple of minor issues just for future reference.

**Issue 1**

This is with … For case (3), even though the standardization is …

**Issue 2**

Here, results (1) and (2) are identical to what we get from … Now, for case (3), the numerical values are different as compared to …

**Issue 3**

In my opinion, when …
```scala
// if y is constant (rawYStd is zero), then y cannot be scaled. In this case
// setting yStd=1.0 ensures that y is not scaled anymore in l-bfgs algorithm.
val yStd = if (rawYStd > 0) rawYStd else if (yMean != 0.0) math.abs(yMean) else 1.0
```
```scala
val yStd = if (rawYStd > 0) rawYStd else math.abs(yMean)
```
since you already check the condition before.
```scala
// zero coefficient; as a result, training is not needed.
// Also, if yMean==0 and rawYStd==0, all the coefficients are zero regardless of
// the fitIntercept
logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
```
Maybe you want to update the warning message for the second situation as well.
LGTM except minor comments. Thanks.
Commenting on your issues.

Issue 1: As a result, in your case 4, when the label is not standardized and the features are standardized, this is not defined, so the users should get the result. For case 3, can you elaborate why an analytical solution exists even when the label is standardized?

Issue 2: In my opinion, even cases 1 and 2 are ill-defined, since in GLMNET the label is standardized by default, and GLMNET will not return any result at all. It just happens that without regularization, standardizing the labels or not will not change the solution, so we just treat them as if we don't standardize the label. This can explain your case 3.

Issue 3: I think this is because your normal equation solver doesn't standardize the label, so the discrepancies occur.
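To make the scale-equivariance point concrete (my own reconstruction, not part of the original thread): for unpenalized least squares, dividing the label by $\sigma_y$ simply rescales the minimizer, so undoing the scaling recovers the same coefficients, while a fixed penalty term does not rescale with the residual term:

$$
\min_b \tfrac{1}{2n}\big\|\tfrac{y}{\sigma_y} - Xb\big\|_2^2
\quad\Longrightarrow\quad
\sigma_y\, b^\star = \hat\beta_{\mathrm{OLS}},
$$

whereas substituting $b = \beta/\sigma_y$ into $\tfrac{1}{2n}\|\tfrac{y}{\sigma_y} - Xb\|_2^2 + \lambda\|b\|_1$ yields an effective penalty of $\lambda\,\sigma_y$ on the original scale, so the regularized solutions differ. And when $\sigma_y = 0$, the internal scaling itself is undefined, which is the ill-defined case discussed above.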
Test build #50454 has finished for PR 10702 at commit …
Test build #50455 has finished for PR 10702 at commit …
For the case (3), I'm assuming that the label and features are not standardized. So, in that case, the solution exists. Here is my perspective on this. The normal equation …
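The equation block that followed this comment was lost in extraction. A plausible form of the regularized normal equation being referenced (the standard ridge form, my reconstruction) is

$$
\hat\beta = \left(X^\top X + \lambda I\right)^{-1} X^\top y,
$$

which is well-defined for $\lambda > 0$ without any standardization of $y$, even when the label is constant.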
…hen label is constant)
Test build #50512 has finished for PR 10702 at commit …
```diff
@@ -74,7 +74,8 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
   /**
    * Set the regularization parameter.
    * Default is 0.0.
```
All the indentations you just added are off.
For the case (3), I agree with your argument completely. Can you try your normal equation solution with L2 without any standardization (non-zero yStd data) and see if the results match GLMNET? If I remember correctly, this will not match, since GLMNET internally will standardize labels even when one has …
```diff
@@ -398,7 +422,8 @@ class LinearRegressionModel private[ml] (

   /**
    * Evaluates the model on a testset.
+   * @param dataset Test dataset to evaluate model on.
    *
```
ditto
I really don't know how this has happened. :) I've fixed these. All the indentation should be in order now.
GLMNET sets all coefficients to zero if yStd=0 and fitIntercept=false, regardless of standardization or regularization. That's why I cannot compare my normal equation with GLMNET.
Test build #50575 has finished for PR 10702 at commit …
I meant comparing the result with your solution when …
For …
Yes, that's what I meant. Without standardizing the labels, there is no way to match GLMNET, but this makes the problem ill-defined when …
LGTM. Merged into master. Thanks.
Fixed the bug in linear regression train for the case when the target variable is constant. The two cases `fitIntercept=true` and `fitIntercept=false` should be treated differently.
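A hedged sketch of the resulting control flow, pieced together from the diff fragments quoted in this thread (not a verbatim copy of the merged code; inputs are passed explicitly to keep it self-contained):

```scala
// Returns Some(yStd) with the scaling factor to use for the label, or None
// when the label is constant and training can be skipped entirely.
def resolveLabelScale(
    yMean: Double,
    rawYStd: Double,
    fitIntercept: Boolean,
    regParam: Double): Option[Double] = {
  if (rawYStd > 0.0) {
    Some(rawYStd) // normal case: scale y by its standard deviation
  } else if (fitIntercept || yMean == 0.0) {
    // Constant label with an intercept (or an all-zero label): the optimum is
    // intercept = yMean with zero coefficients, so no training is needed.
    None
  } else if (regParam != 0.0) {
    // Constant non-zero label, no intercept, non-zero penalty: ill-defined.
    throw new IllegalArgumentException(
      "The standard deviation of the label is zero; the model cannot be " +
      "regularized with fitIntercept=false.")
  } else {
    Some(math.abs(yMean)) // constant non-zero label, no penalty
  }
}
```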