-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spark-24024][ML] Fix poisson deviance calculations in GLM to handle y = 0 #21125
Conversation
ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only couple small comments, and we're ready to merge it once they're resolved.
Thanks.
DB Tsai | Siri Open Source Technologies | Apple, Inc
private def ylogy(y: Double, mu: Double): Double = { | ||
if (y == 0) 0.0 else y * math.log(y / mu) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another ylogy
implementation in Binomial
. Can you move this code to object GeneralizedLinearRegression
and make it private to this package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for the quick review. I have moved the ylog
implementation to object GeneralizedLinearRegression
. One quick question here: I am not sure I have fully understood why this is the right place for ylog
? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any suggestion to avoid the duplicated code? Let's followup this later if you have an idea.
@@ -495,8 +495,8 @@ class GeneralizedLinearRegressionSuite extends MLTest with DefaultReadWriteTest | |||
[1] 1.8121235 -0.1747493 -0.5815417 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the R script which generate the deviance
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. The updated script is sufficient to calculate deviance on its own.
Vectors.dense(0.0, -0.0457441, -0.6833928), | ||
Vectors.dense(1.8121235, -0.1747493, -0.5815417)) | ||
Vectors.dense(0.0, -0.0457441, -0.6833928, 3.8093), | ||
Vectors.dense(1.8121235, -0.1747493, -0.5815417, 3.7006)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding them to expected
is not consistent to the rest of the test code.
How about
val residualDeviancesR = Array(3.8093, 3.7006)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified. Thanks!
@@ -507,7 +507,8 @@ class GeneralizedLinearRegressionSuite extends MLTest with DefaultReadWriteTest | |||
val trainer = new GeneralizedLinearRegression().setFamily("poisson").setLink(link) | |||
.setFitIntercept(fitIntercept).setLinkPredictionCol("linkPrediction") | |||
val model = trainer.fit(dataset) | |||
val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1)) | |||
val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1), | |||
model.summary.deviance) | |||
assert(actual ~= expected(idx) absTol 1e-4, "Model mismatch: GLM with poisson family, " + | |||
s"$link link and fitIntercept = $fitIntercept (with zero values).") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert(model.summary.deviance ~== residualDeviancesR(idx) absTol 1E-3)
Test build #89699 has finished for PR 21125 at commit
|
Test build #89723 has finished for PR 21125 at commit
|
LGTM, merged into master. Thanks. DB Tsai | Siri Open Source Technologies | Apple, Inc |
What changes were proposed in this pull request?
It is reported by Spark users that the deviance calculation for poisson regression does not handle y = 0. Thus, the correct model summary cannot be obtained. The user has confirmed the the issue is in
The user also mentioned there are many other places he believe we should check the same thing. However, no other changes are needed, including Gamma distribution.
How was this patch tested?
Add a comparison with R deviance calculation to the existing unit test.