
[Spark-24024][ML] Fix poisson deviance calculations in GLM to handle y = 0 #21125

Closed · wants to merge 2 commits

Conversation

tengpeng
Contributor

@tengpeng tengpeng commented Apr 23, 2018

What changes were proposed in this pull request?

Spark users have reported that the deviance calculation for Poisson regression does not handle y = 0, so the correct model summary cannot be obtained. The user has confirmed that the issue is in

override def deviance(y: Double, mu: Double, weight: Double): Double = {
  2.0 * weight * (y * math.log(y / mu) - (y - mu))
}
when y = 0.
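To see why this breaks, here is a minimal sketch with hypothetical values: when y = 0, math.log(y / mu) evaluates to math.log(0.0), which is negative infinity, and 0.0 * Double.NegativeInfinity is NaN under IEEE 754, so the whole deviance becomes NaN:

```scala
object PoissonDevianceNaN extends App {
  val y = 0.0      // observed count of zero
  val mu = 0.5     // hypothetical fitted mean
  val weight = 1.0

  // math.log(0.0 / 0.5) == -Infinity, and 0.0 * -Infinity == NaN
  val term = y * math.log(y / mu)
  val deviance = 2.0 * weight * (term - (y - mu))

  println(term)      // NaN
  println(deviance)  // NaN
}
```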

The user also mentioned there are many other places he believes should be checked for the same issue. However, no other changes are needed, including for the Gamma distribution.

How was this patch tested?

Added a comparison with R's deviance calculation to the existing unit test.

@dbtsai
Member

dbtsai commented Apr 23, 2018

ok to test


@dbtsai dbtsai left a comment


Only a couple of small comments; we're ready to merge once they're resolved.

Thanks.

DB Tsai | Siri Open Source Technologies |  Apple, Inc

private def ylogy(y: Double, mu: Double): Double = {
if (y == 0) 0.0 else y * math.log(y / mu)
}
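With the guard above, the Poisson deviance can be rewritten in terms of ylogy so that the y = 0 case is well-defined. A self-contained sketch (outside the actual Spark class): at y = 0 the deviance reduces to 2 * weight * mu, which matches R's poisson family deviance residuals.

```scala
object PoissonDevianceFixed extends App {
  // Guarded helper: returns 0.0 at y == 0 instead of 0.0 * -Infinity (NaN).
  def ylogy(y: Double, mu: Double): Double =
    if (y == 0) 0.0 else y * math.log(y / mu)

  // Poisson deviance expressed via ylogy: finite for y == 0.
  def deviance(y: Double, mu: Double, weight: Double): Double =
    2.0 * weight * (ylogy(y, mu) - (y - mu))

  println(deviance(0.0, 0.5, 1.0))  // 1.0 == 2 * weight * mu
}
```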

Member


There is another ylogy implementation in Binomial. Can you move this code into object GeneralizedLinearRegression and make it private to this package?

Contributor Author


Thanks so much for the quick review. I have moved the ylogy implementation into object GeneralizedLinearRegression. One quick question: I am not sure I fully understand why this is the right place for ylogy. Thanks!

Member


Any suggestion to avoid the duplicated code? Let's follow up on this later if you have an idea.

@@ -495,8 +495,8 @@ class GeneralizedLinearRegressionSuite extends MLTest with DefaultReadWriteTest
[1] 1.8121235 -0.1747493 -0.5815417
Member


Can you update the R script that generates the deviance?

Contributor Author


Updated. The script is now sufficient to calculate the deviance on its own.

-      Vectors.dense(0.0, -0.0457441, -0.6833928),
-      Vectors.dense(1.8121235, -0.1747493, -0.5815417))
+      Vectors.dense(0.0, -0.0457441, -0.6833928, 3.8093),
+      Vectors.dense(1.8121235, -0.1747493, -0.5815417, 3.7006))

Member


Adding them to expected is not consistent with the rest of the test code.

How about

val residualDeviancesR = Array(3.8093, 3.7006)

Contributor Author


Modified. Thanks!

@@ -507,7 +507,8 @@ class GeneralizedLinearRegressionSuite extends MLTest with DefaultReadWriteTest
val trainer = new GeneralizedLinearRegression().setFamily("poisson").setLink(link)
.setFitIntercept(fitIntercept).setLinkPredictionCol("linkPrediction")
val model = trainer.fit(dataset)
-      val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
+      val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1),
+        model.summary.deviance)
assert(actual ~= expected(idx) absTol 1e-4, "Model mismatch: GLM with poisson family, " +
s"$link link and fitIntercept = $fitIntercept (with zero values).")
Member


assert(model.summary.deviance ~== residualDeviancesR(idx) absTol 1E-3)

@SparkQA

SparkQA commented Apr 23, 2018

Test build #89699 has finished for PR 21125 at commit 3c6a4da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tengpeng tengpeng changed the title [Spark-24024] Fix poisson deviance calculations in GLM to handle y = 0 [Spark-24024][ML] Fix poisson deviance calculations in GLM to handle y = 0 Apr 23, 2018
@SparkQA

SparkQA commented Apr 23, 2018

Test build #89723 has finished for PR 21125 at commit da53b1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 293a0f2 Apr 23, 2018
@dbtsai
Member

dbtsai commented Apr 23, 2018

LGTM, merged into master. Thanks.

DB Tsai | Siri Open Source Technologies |  Apple, Inc

4 participants