Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18929][ML] Add Tweedie distribution in GLM #16344

Closed
wants to merge 24 commits into from

Conversation

actuaryzhang
Copy link
Contributor

@actuaryzhang actuaryzhang commented Dec 19, 2016

What changes were proposed in this pull request?

I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as Gaussian, Poisson and Gamma families are a special case of the Tweedie https://en.wikipedia.org/wiki/Tweedie_distribution.

@yanboliang @srowen @sethah

@yanboliang
Copy link
Contributor

Jenkins, test this please.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably reasonable. Doesn't this give us support for compound Poisson et al, by supporting general Tweedie distributions?

if (variancePower > 1.0 && variancePower < 2.0) {
require(y >= 0.0, "The response variable of the specified Tweedie distribution " +
s"should be non-negative, but got $y")
math.max(y, 0.1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to use this magic 0.1 constant in many places, factor out a constant? 0.1 seems quite large as an 'epsilon' but I guess that's what R's implementation uses for whatever reason?

Copy link
Contributor Author

@actuaryzhang actuaryzhang Dec 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not seen a formal justification for the choice of 0.1 in R. This seminal paper (section 4) suggests 1/6 (about 0.17) to be the best constant. I would prefer to be consistent with R so that we can make comparison. Using a constant is a good idea.

deviance: Double,
numInstances: Double,
weightSum: Double): Double = {
0.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw a UnsupportedOperationException?

if (mu < epsilon) {
epsilon
} else if (mu.isInfinity) {
Double.MaxValue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity is this meaningful to "cap" at Double.MaxValue? By the time you get there a lot of stuff is going to be infinite or not meaningful.

Copy link
Member

@srowen srowen Dec 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, it's done that way in other implementations. OK. I'm not sure if it's going to do much.

I think there's a problem in the Gaussian project method because it uses Double.MinValue to appear to mean "the smallest double" when it's the "smallest possible double" I'll investigate and file a bug if needed. EDIT: Ignore this side comment, Double.MinValue isn't the same as Double.MIN_VALUE in Java.


val defaultLink: Link = Log

var variancePower: Double = 1.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a global shared variable -- we really can't do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please suggest a better way to set the variancePower? I want to be consistent with the existing code to have the Family objects, but I need to also pass on the input variancePower to the Tweedie object which is used to compute the variance function. Any suggestion will be highly appreciated. @srowen @yanboliang

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Tweedie implementation needs to be able to access parameters of the GLM, to read off variancePower.

As it is this is a global variable and two jobs would overwrite each others' values.

@@ -242,7 +275,7 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value)

override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = {
val familyObj = Family.fromName($(family))
val familyObj = Family.fromName($(family), $(variancePower))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can do this either. variancePower is specific to one family, not a property of all of them.


val defaultLink: Link = Log

var variancePower: Double = 1.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Tweedie implementation needs to be able to access parameters of the GLM, to read off variancePower.

As it is this is a global variable and two jobs would overwrite each others' values.

@actuaryzhang
Copy link
Contributor Author

@srowen Thanks for the comments. Really helpful. I have made a new commit that addresses the issues you raised:

  • I think the use of a global family object does not work well for the tweedie case, since we have to set the variance power. I now define a tweedie class and create the tweedie object within the train method of GeneralizedLinearRegression where the variance power is set. Does this make sense?
  • I created a constant delta = 0.1 in the Family class which is used to shift mu (or y) = 0 to avoid numerical issues.
  • I now change variancePower to varPower to save typing (and consistent with R and H2o).
  • The project method you asked about is copied from the other families, which I suppose is to bound the mean to stabilize estimation.
  • I now throw UnsupportedOperationException for tweedie AIC.

Let me know if there is any other issues.

Copying others for reviewing. @yanboliang @sethah

* @group param
*/
@Since("2.2.0")
final val varPower: Param[Double] = new Param(this, "varPower",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote to revert this back to variancePower to follow MLlib's convention.

* Used only for the tweedie family.
* (see <a href="https://en.wikipedia.org/wiki/Tweedie_distribution">
* Tweedie Distribution (Wikipedia)</a>)
* Supported value: (1, 2) and (2, Inf).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Why we don't allow 0, 1 and 2? They correspond respectively to Gaussian, Poisson and Gamma families, I think we should support fitting a poisson GLM via the tweedie family entrance and R can do it:

y <- rgamma(20,shape=5)
x <- 1:20
glm(y~x,family=tweedie(var.power=1,link.power=1))
glm(y~x,family=poisson(link=identity))

Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog,
Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt,
Gamma -> Inverse, Gamma -> Identity, Gamma -> Log
"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String is error-prone, I think we can construct a member object for Tweedie whose variancePower is the default value(1.5).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanboliang The member object (in GeneralizedLinearRegressionBase) won't be accessible in Family, right? The method Family.fromName($(family)) uses global objects like Poisson, Gamma etc. To use Family.fromName, I need to create a Tweedie global object. Then we are back to the issue that @srowen pointed out of setting variancePower of the global object. Please advise. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanboliang Could you help me understand the issue caused by using string? If I use object, then I have to create a Tweedie object that is not used anywhere else. And also I have to write two methods in Family: one returns the global Tweedie object (where the variancePower is preset) and one returns the a TweedieFamily object created using the user-specified variancePower. I hope we are fine using string since there are only a few values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you are right. It needs an extra global object to avoid error-prone which may a little expensive. I'm ok with using string.

"The power in the variance function of the Tweedie distribution which characterizes " +
"the relationship between the variance and mean of the distribution. " +
"Used only for the tweedie family. Supported value: (1, 2) and (2, Inf).",
(x: Double) => if (x > 1.0 && x != 2.0) true else false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just write => x > 1.0 && x != 2.0. if (x) true else false is redundant.

@@ -64,6 +64,27 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
def getFamily: String = $(family)

/**
* Param for the power in the variance function of the Tweedie distribution which provides
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nits: Param -> parameter, tweedie -> Tweedie (two lines below).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed tweedie. but other docs have been using Param..

@@ -397,14 +436,19 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

/** Trim the fitted value so that it will be in valid range. */
def project(mu: Double): Double = mu

/** Constant added to y = 0 for initialization or deviance to avoid numerical issues. */
val delta: Double = 0.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be defined in an object; it's a static constant. The comment isn't quite accurate; it's not added, but it's a minimum.

@@ -242,7 +275,12 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value)

override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = {
val familyObj = Family.fromName($(family))
val familyObj = if ($(family) == "tweedie") {
new Tweedie($(varPower))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, why does this parameter need to be in the Family object at all? can't the implementation of Tweedie just go get the parameter's value? it's odd to have a Family representing all but one family, because Tweedie is one of them.

Copy link
Contributor Author

@actuaryzhang actuaryzhang Dec 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen Family is not subclass of GeneralizedLinearRegression. Could you elaborate how to make it get the variancePower value?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes of course. Given how the code is structured, one straightforward solution is to generalize Family so that its operations take a reference to the model, so that implementation may access its parameters.

Another is to make the code instantiate Family subclasses instead of using single objects and give the instance a reference to the model.

*
* @param name family name: "gaussian", "binomial", "poisson" or "gamma".
* @param name family name: "gaussian", "binomial", "poisson" and "gamma".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nite: revert this, because it's really an "or"

if (mu < epsilon) {
epsilon
} else if (mu.isInfinity) {
Double.MaxValue
Copy link
Member

@srowen srowen Dec 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, it's done that way in other implementations. OK. I'm not sure if it's going to do much.

I think there's a problem in the Gaussian project method because it uses Double.MinValue to appear to mean "the smallest double" when it's the "smallest possible double" I'll investigate and file a bug if needed. EDIT: Ignore this side comment, Double.MinValue isn't the same as Double.MIN_VALUE in Java.

@actuaryzhang
Copy link
Contributor Author

@srowen @yanboliang Thanks much for the feedback. I now have a better understanding of the code and the issue. I have made new commits reflecting your suggestions. The major changes are summarized below:

  • Create new method fromModel in the Family object to get the desired family object by accessing the GeneralizedLinearRegressionBase object directly. This solves the main challenge of setting the variancePower of the Tweedie family. The old fromName method is deleted since we now need both name and variancePower.
  • Create TweedieFamily as new subclass of family. This class encompasses the special cases of Gaussian, Poisson and Gamma, which can be handled together using the general formula based on variancePower.

Please let me know if there are any other issues. Thanks.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is moving somewhere interesting, though I'm still not sure about this design. Comments inline.


/** Set of link names that GeneralizedLinearRegression supports. */
private[regression] lazy val supportedLinkNames = supportedFamilyAndLinkPairs.map(_._2.name)

private[regression] val epsilon: Double = 1E-16

/** Constant used in initialization and deviance to avoid numerical issues. */
private[regression] val delta: Double = 0.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should still be in an object IMHO; it's a constant right? epsilon really should be too. It's not a big deal but not quite right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are already in the GeneralizedLinearRegression object, aren't they? Or do you mean creating a new object say Constant that stores these two constants, and using them like Constant.delta?

Since delta is only used in the TweedieFamily class, I can also move it there. Let me know what is best. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK to put it here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not a companion object for TweedieFamily though? that just seems easy and more correct


override def initialize(y: Double, weight: Double): Double = y
override def variance(mu: Double): Double = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of case statements like this, why not just override in subclasses?

weightSum: Double): Double = {
val wt = predictions.map(x => math.log(x._3)).sum()
numInstances * (math.log(deviance / numInstances * 2.0 * math.Pi) + 1.0) + 2.0 - wt
override def aic(predictions: RDD[(Double, Double, Double)],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise there's not a lot of value in pushing 4 separate implementations into one method and using if-else. Just override this method.

private[regression] class TweedieFamily(private val variancePower: Double)
extends Family{

val name: String = variancePower match {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove the name and switch like this? you can instead adjust this so that subclasses override a name method.

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented Dec 22, 2016

@srowen Thanks for the comments. Makes lots of sense to move the switch to subclass. I did not know one could override a val.

In the new commit, I have moved the defaultLink and aic to subclasses. I also added back name in the constructor. The other methods initialize, variance, deviance and project are common and thus implemented in the TweedieFamily parent class. Hope this makes sense.

@yanboliang

Copy link
Contributor

@yanboliang yanboliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@actuaryzhang Please see my inline comments. Thanks.

final val variancePower: Param[Double] = new Param(this, "variancePower",
"The power in the variance function of the Tweedie distribution which characterizes " +
"the relationship between the variance and mean of the distribution. " +
"Used for the Tweedie family. Supported value: 0 and [1, Inf).",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used only for should be more clear?

* Valid link functions for each family is listed below. The first link function of each family
* is the default one.
* - "gaussian" : "identity", "log", "inverse"
* - "binomial" : "logit", "probit", "cloglog"
* - "poisson" : "log", "identity", "sqrt"
* - "gamma" : "inverse", "identity", "log"
* - "tweedie" : "identity", "log"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- "tweedie" : "log", "identity", see L155: the first link function of each family is the default one.

case Binomial.name => Binomial
case Poisson.name => Poisson
case Gamma.name => Gamma
def fromModel(model: GeneralizedLinearRegressionBase): Family = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to fromParams, we extract family and variancePower from the Params which is the superclass of GLR estimator and model. And actually we use this function for both estimator(L279) and model(L974).

* Tweedie exponential family distribution.
* This includes the special cases of Gaussian, Poisson and Gamma.
*/
private[regression] class TweedieFamily(private val variancePower: Double)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TweedieFamily -> Tweedie, we don't add suffix for other family class/object.

extends Family("tweedie") {

/*
The canonical link is 1 - variancePower. Except for the special cases of Gaussian,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The canonical link is 1 - variancePower, could you clarify this to make it more clear?

require(y >= 0.0, s"The response variable of the specified distribution " +
s"should be non-negative, but got $y")
} else if (variancePower >= 2.0) {
require(y > 0.0, s"The response variable of the specified distribution " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

* Gaussian exponential family distribution.
* The default link for the Gaussian family is the identity link.
*/
private[regression] object Gaussian extends TweedieFamily(0.0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep the concrete implementation of variance, deviance and aic for Gaussian, Poisson and Gamma, the main reasons are:

  • These functions were called very frequently, the concrete implementation in subclasses should be more efficient.
  • It's helpful to locate errors or bugs.

Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog,
Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt,
Gamma -> Inverse, Gamma -> Identity, Gamma -> Log
"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you are right. It needs an extra global object to avoid error-prone which may a little expensive. I'm ok with using string.


/** Set of link names that GeneralizedLinearRegression supports. */
private[regression] lazy val supportedLinkNames = supportedFamilyAndLinkPairs.map(_._2.name)

private[regression] val epsilon: Double = 1E-16

/** Constant used in initialization and deviance to avoid numerical issues. */
private[regression] val delta: Double = 0.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK to put it here.

@@ -578,6 +578,100 @@ class GeneralizedLinearRegressionSuite
}
}

test("generalized linear regression: tweedie family against glm") {
/*
R code:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

library(statmod) which can help users to reproduce this test case.

@actuaryzhang
Copy link
Contributor Author

@yanboliang Thanks much for the detailed comments. I have addressed all of them in the new commits. Please take another look.
@srowen

@actuaryzhang
Copy link
Contributor Author

@srowen @yanboliang
Any additional issues regarding this PR?

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is down to small items now, I think it's pretty good.

*
* @param name family name: "gaussian", "binomial", "poisson" or "gamma".
* @param params a GenerealizedLinearRegressionBase object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in GenerealizedLinearRegressionBase; this type of doc doesn't do anything though because the type is already documented. It should say something non-trivial


/** Set of link names that GeneralizedLinearRegression supports. */
private[regression] lazy val supportedLinkNames = supportedFamilyAndLinkPairs.map(_._2.name)

private[regression] val epsilon: Double = 1E-16

/** Constant used in initialization and deviance to avoid numerical issues. */
private[regression] val delta: Double = 0.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not a companion object for TweedieFamily though? that just seems easy and more correct

@actuaryzhang
Copy link
Contributor Author

@srowen Made a new commit according to your suggestion. Everything looking good now?
@yanboliang

@yanboliang
Copy link
Contributor

@actuaryzhang I'm on travel these days, and will make another pass in a few days. Thanks.

@actuaryzhang
Copy link
Contributor Author

@yanboliang Did you get a chance to take another look at this? Thanks.

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented Jan 18, 2017

@yanboliang Finally, the test is done. Is there anything else needed for this PR, other than the pending fresh test?

@actuaryzhang
Copy link
Contributor Author

@srowen @yanboliang @felixcheung @jkbradley Could you help kick off the new test please? Thanks.

@MLnick
Copy link
Contributor

MLnick commented Jan 19, 2017

jenkins add to whitelist

@MLnick
Copy link
Contributor

MLnick commented Jan 19, 2017

jenkins test this please

@SparkQA
Copy link

SparkQA commented Jan 19, 2017

Test build #71643 has started for PR 16344 at commit 83deee3.

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented Jan 19, 2017

Could anybody help me understand what's causing this test to fail? I see several other ML PR failing as well, with the same error message like below:

Error instrumenting class:org.apache.spark.deploy.mesos.ui.MesosClusterUI
Error instrumenting class:org.apache.spark.mllib.regression.IsotonicRegressionModel$SaveLoadV1_0$

Was this related to the recent commit on instrumentation #15671? How do I fix this? Thanks!
@zhengruifeng @yanboliang @jkbradley

@yanboliang
Copy link
Contributor

Jenkins, retest this please.

@yanboliang
Copy link
Contributor

@actuaryzhang This test failure is caused by Jenkins was not stable, you just need to retest if you encounter similar issue.

@SparkQA
Copy link

SparkQA commented Jan 20, 2017

Test build #71686 has finished for PR 16344 at commit 83deee3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@yanboliang yanboliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only some comments about documents, otherwise, looks good to me. Thanks!

require(supportedFamilyAndLinkPairs.contains(
Family.fromName($(family)) -> Link.fromName($(link))), "Generalized Linear Regression " +
s"with ${$(family)} family does not support ${$(link)} link function.")
if ($(family) == "tweedie") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$(family).toLowerCase == "tweedie", see #16516, change here and other places.

case Gamma.name => Gamma
def fromParams(params: GeneralizedLinearRegressionBase): Family = {
params.getFamily match {
case "gaussian" => Gaussian
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert to Gaussian.name here and bellow, which is less error-prone.

*/
def apply(params: GeneralizedLinearRegressionBase): FamilyAndLink = {
val familyObj = Family.fromParams(params)
val linkObj = if ((params.getFamily != "tweedie" && params.isDefined(params.link)) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isSet is more accurate than isDefined at here, since there is always no default values for both of them. I knew the original code used isDefined, but it's better we can correct them.


var idx = 0
for (fitIntercept <- Seq(false, true); linkPower <- Seq(0.0, 1.0, -1.0)) {
for (variancePower <- Seq(1.6, 2.5)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

for (fitIntercept <- Seq(false, true);
         linkPower <- Seq(0.0, 1.0, -1.0);
         variancePower <- Seq(1.6, 2.5)) {


/** @group getParam */
@Since("2.0.0")
def getFamily: String = $(family)

/**
* Param for the power in the variance function of the Tweedie distribution which provides
* the relationship between the variance and mean of the distribution.
* Used only for the Tweedie family.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Only applicable for "tweedie" family. should be better?

* Gets the [[Family]] object from its name.
* Gets the [[Family]] object based on family and variancePower.
* 1) retrieve object based on family name
* 2) if family name is tweedie, retrieve object based on variancePower
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the following document be better?

Gets the Family object based on param family and variancePower.
If param family was set with "gaussian", "binomial", "poisson" or "gamma", return the corresponding object directly; otherwise, construct a Tweedie object according to variancePower.

@@ -620,25 +779,67 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
private[regression] object Link {

/**
* Gets the [[Link]] object from its name.
* Gets the [[Link]] object based on link or linkPower.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gets the Link object based on param family, link and linkPower.
If param family was set with "tweedie", return or construct link function object according to linkPower; otherwise, return link function object according to link.

*
* @param name link name: "identity", "logit", "log",
* "inverse", "probit", "cloglog" or "sqrt".
* @param params the parameter map containing link and link power
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

link and link power -> family, link and linkPower

1.0, 3.0, 2.0, 1.0,
0.0, 4.0, 3.0, 3.0), 4, 4, byrow = TRUE))

f <- glm(V1 ~ -1 + V3 + V4, data = df, weights = V2,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change f to model, and add summary(model) at next line. It's helpful for users to reproduce the result in R.


Number of Fisher Scoring iterations: 11

residuals(model, type="pearson")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typos here? I guess you paste irrelevant results for the following residuals, they should be consistent with L1279-L1281.

@SparkQA
Copy link

SparkQA commented Jan 23, 2017

Test build #71817 has finished for PR 16344 at commit 995e88f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor Author

@yanboliang Thanks so much for your detailed review. Your suggestions make lots of sense and I have included all of them in the new commit. Let me know if there is any other change needed.

@SparkQA
Copy link

SparkQA commented Jan 23, 2017

Test build #71823 has finished for PR 16344 at commit 54da2cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor

This looks good to me. cc @jkbradley @srowen @sethah for another pass if you have time. Thanks.

@yanboliang
Copy link
Contributor

Merged into master. If there are comments from others, we can address them in follow-up work. Thanks.

@asfgit asfgit closed this in 4172ff8 Jan 27, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?
I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as Gaussian, Poisson and Gamma families are a special case of the Tweedie https://en.wikipedia.org/wiki/Tweedie_distribution.

yanboliang srowen sethah

Author: actuaryzhang <[email protected]>
Author: Wayne Zhang <[email protected]>

Closes apache#16344 from actuaryzhang/tweedie.
@actuaryzhang actuaryzhang deleted the tweedie branch February 1, 2017 00:28
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?
I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as Gaussian, Poisson and Gamma families are a special case of the Tweedie https://en.wikipedia.org/wiki/Tweedie_distribution.

yanboliang srowen sethah

Author: actuaryzhang <[email protected]>
Author: Wayne Zhang <[email protected]>

Closes apache#16344 from actuaryzhang/tweedie.
ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 26, 2017
…on.linkPower

Add Scaladoc for GeneralizedLinearRegression.linkPower default value

Follow-up to apache#16344

Author: Joseph K. Bradley <[email protected]>

Closes apache#17069 from jkbradley/tweedie-comment.
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
…on.linkPower

Add Scaladoc for GeneralizedLinearRegression.linkPower default value

Follow-up to apache#16344

Author: Joseph K. Bradley <[email protected]>

Closes apache#17069 from jkbradley/tweedie-comment.
ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 28, 2017
Update GLM documentation to include the Tweedie distribution. apache#16344

jkbradley yanboliang

Author: actuaryzhang <[email protected]>

Closes apache#17103 from actuaryzhang/doc.
asfgit pushed a commit that referenced this pull request Mar 14, 2017
## What changes were proposed in this pull request?
Port Tweedie GLM  #16344  to SparkR

felixcheung yanboliang

## How was this patch tested?
new test in SparkR

Author: actuaryzhang <[email protected]>

Closes #16729 from actuaryzhang/sparkRTweedie.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants