-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-18929][ML] Add Tweedie distribution in GLM #16344
Conversation
Jenkins, test this please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's probably reasonable. Doesn't this give us support for compound Poisson et al, by supporting general Tweedie distributions?
if (variancePower > 1.0 && variancePower < 2.0) { | ||
require(y >= 0.0, "The response variable of the specified Tweedie distribution " + | ||
s"should be non-negative, but got $y") | ||
math.max(y, 0.1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to use this magic 0.1 constant in many places, factor out a constant? 0.1 seems quite large as an 'epsilon' but I guess that's what R's implementation uses for whatever reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not seen a formal justification for the choice of 0.1 in R. This seminal paper (section 4) suggests 1/6 (about 0.17) to be the best constant. I would prefer to be consistent with R so that we can make comparison. Using a constant is a good idea.
deviance: Double, | ||
numInstances: Double, | ||
weightSum: Double): Double = { | ||
0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throw a UnsupportedOperationException?
if (mu < epsilon) { | ||
epsilon | ||
} else if (mu.isInfinity) { | ||
Double.MaxValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity is this meaningful to "cap" at Double.MaxValue? By the time you get there a lot of stuff is going to be infinite or not meaningful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, it's done that way in other implementations. OK. I'm not sure if it's going to do much.
I think there's a problem in the Gaussian project method because it uses Double.MinValue to appear to mean "the smallest double" when it's the "smallest possible double" I'll investigate and file a bug if needed. EDIT: Ignore this side comment, Double.MinValue isn't the same as Double.MIN_VALUE in Java.
|
||
val defaultLink: Link = Log | ||
|
||
var variancePower: Double = 1.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a global shared variable -- we really can't do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you please suggest a better way to set the variancePower? I want to be consistent with the existing code to have the Family
objects, but I need to also pass on the input variancePower
to the Tweedie
object which is used to compute the variance function. Any suggestion will be highly appreciated. @srowen @yanboliang
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the Tweedie implementation needs to be able to access parameters of the GLM, to read off variancePower.
As it is this is a global variable and two jobs would overwrite each others' values.
@@ -242,7 +275,7 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val | |||
def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value) | |||
|
|||
override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = { | |||
val familyObj = Family.fromName($(family)) | |||
val familyObj = Family.fromName($(family), $(variancePower)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can do this either. variancePower is specific to one family, not a property of all of them.
|
||
val defaultLink: Link = Log | ||
|
||
var variancePower: Double = 1.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the Tweedie implementation needs to be able to access parameters of the GLM, to read off variancePower.
As it is this is a global variable and two jobs would overwrite each others' values.
@srowen Thanks for the comments. Really helpful. I have made a new commit that addresses the issues you raised:
Let me know if there is any other issues. Copying others for reviewing. @yanboliang @sethah |
* @group param | ||
*/ | ||
@Since("2.2.0") | ||
final val varPower: Param[Double] = new Param(this, "varPower", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote to revert this back to variancePower
to follow MLlib's convention.
* Used only for the tweedie family. | ||
* (see <a href="https://en.wikipedia.org/wiki/Tweedie_distribution"> | ||
* Tweedie Distribution (Wikipedia)</a>) | ||
* Supported value: (1, 2) and (2, Inf). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Why we don't allow 0, 1 and 2
? They correspond respectively to Gaussian, Poisson and Gamma
families, I think we should support fitting a poisson GLM via the tweedie
family entrance and R can do it:
y <- rgamma(20,shape=5)
x <- 1:20
glm(y~x,family=tweedie(var.power=1,link.power=1))
glm(y~x,family=poisson(link=identity))
Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog, | ||
Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt, | ||
Gamma -> Inverse, Gamma -> Identity, Gamma -> Log | ||
"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String is error-prone, I think we can construct a member object for Tweedie
whose variancePower
is the default value(1.5).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yanboliang The member object (in GeneralizedLinearRegressionBase
) won't be accessible in Family
, right? The method Family.fromName($(family))
uses global objects like Poisson
, Gamma
etc. To use Family.fromName
, I need to create a Tweedie
global object. Then we are back to the issue that @srowen pointed out of setting variancePower
of the global object. Please advise. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yanboliang Could you help me understand the issue caused by using string? If I use object, then I have to create a Tweedie object that is not used anywhere else. And also I have to write two methods in Family
: one returns the global Tweedie object (where the variancePower is preset) and one returns the a TweedieFamily object created using the user-specified variancePower. I hope we are fine using string since there are only a few values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you are right. It needs an extra global object to avoid error-prone which may a little expensive. I'm ok with using string.
"The power in the variance function of the Tweedie distribution which characterizes " + | ||
"the relationship between the variance and mean of the distribution. " + | ||
"Used only for the tweedie family. Supported value: (1, 2) and (2, Inf).", | ||
(x: Double) => if (x > 1.0 && x != 2.0) true else false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just write => x > 1.0 && x != 2.0
. if (x) true else false
is redundant.
@@ -64,6 +64,27 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam | |||
def getFamily: String = $(family) | |||
|
|||
/** | |||
* Param for the power in the variance function of the Tweedie distribution which provides |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nits: Param -> parameter, tweedie -> Tweedie (two lines below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed tweedie. but other docs have been using Param..
@@ -397,14 +436,19 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine | |||
|
|||
/** Trim the fitted value so that it will be in valid range. */ | |||
def project(mu: Double): Double = mu | |||
|
|||
/** Constant added to y = 0 for initialization or deviance to avoid numerical issues. */ | |||
val delta: Double = 0.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be defined in an object
; it's a static constant. The comment isn't quite accurate; it's not added, but it's a minimum.
@@ -242,7 +275,12 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val | |||
def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value) | |||
|
|||
override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = { | |||
val familyObj = Family.fromName($(family)) | |||
val familyObj = if ($(family) == "tweedie") { | |||
new Tweedie($(varPower)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, why does this parameter need to be in the Family
object at all? can't the implementation of Tweedie just go get the parameter's value? it's odd to have a Family representing all but one family, because Tweedie is one of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srowen Family
is not subclass of GeneralizedLinearRegression
. Could you elaborate how to make it get the variancePower
value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes of course. Given how the code is structured, one straightforward solution is to generalize Family
so that its operations take a reference to the model, so that implementation may access its parameters.
Another is to make the code instantiate Family
subclasses instead of using single object
s and give the instance a reference to the model.
* | ||
* @param name family name: "gaussian", "binomial", "poisson" or "gamma". | ||
* @param name family name: "gaussian", "binomial", "poisson" and "gamma". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nite: revert this, because it's really an "or"
if (mu < epsilon) { | ||
epsilon | ||
} else if (mu.isInfinity) { | ||
Double.MaxValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, it's done that way in other implementations. OK. I'm not sure if it's going to do much.
I think there's a problem in the Gaussian project method because it uses Double.MinValue to appear to mean "the smallest double" when it's the "smallest possible double" I'll investigate and file a bug if needed. EDIT: Ignore this side comment, Double.MinValue isn't the same as Double.MIN_VALUE in Java.
@srowen @yanboliang Thanks much for the feedback. I now have a better understanding of the code and the issue. I have made new commits reflecting your suggestions. The major changes are summarized below:
Please let me know if there are any other issues. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is moving somewhere interesting, though I'm still not sure about this design. Comments inline.
|
||
/** Set of link names that GeneralizedLinearRegression supports. */ | ||
private[regression] lazy val supportedLinkNames = supportedFamilyAndLinkPairs.map(_._2.name) | ||
|
||
private[regression] val epsilon: Double = 1E-16 | ||
|
||
/** Constant used in initialization and deviance to avoid numerical issues. */ | ||
private[regression] val delta: Double = 0.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should still be in an object
IMHO; it's a constant right? epsilon
really should be too. It's not a big deal but not quite right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are already in the GeneralizedLinearRegression
object, aren't they? Or do you mean creating a new object say Constant
that stores these two constants, and using them like Constant.delta
?
Since delta
is only used in the TweedieFamily
class, I can also move it there. Let me know what is best. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK to put it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not a companion object for TweedieFamily though? that just seems easy and more correct
|
||
override def initialize(y: Double, weight: Double): Double = y | ||
override def variance(mu: Double): Double = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of case statements like this, why not just override in subclasses?
weightSum: Double): Double = { | ||
val wt = predictions.map(x => math.log(x._3)).sum() | ||
numInstances * (math.log(deviance / numInstances * 2.0 * math.Pi) + 1.0) + 2.0 - wt | ||
override def aic(predictions: RDD[(Double, Double, Double)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise there's not a lot of value in pushing 4 separate implementations into one method and using if-else. Just override this method.
private[regression] class TweedieFamily(private val variancePower: Double) | ||
extends Family{ | ||
|
||
val name: String = variancePower match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove the name and switch like this? you can instead adjust this so that subclasses override a name
method.
@srowen Thanks for the comments. Makes lots of sense to move the switch to subclass. I did not know one could override a In the new commit, I have moved the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@actuaryzhang Please see my inline comments. Thanks.
final val variancePower: Param[Double] = new Param(this, "variancePower", | ||
"The power in the variance function of the Tweedie distribution which characterizes " + | ||
"the relationship between the variance and mean of the distribution. " + | ||
"Used for the Tweedie family. Supported value: 0 and [1, Inf).", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used only for
should be more clear?
* Valid link functions for each family is listed below. The first link function of each family | ||
* is the default one. | ||
* - "gaussian" : "identity", "log", "inverse" | ||
* - "binomial" : "logit", "probit", "cloglog" | ||
* - "poisson" : "log", "identity", "sqrt" | ||
* - "gamma" : "inverse", "identity", "log" | ||
* - "tweedie" : "identity", "log" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- "tweedie" : "log", "identity"
, see L155: the first link function of each family is the default one
.
case Binomial.name => Binomial | ||
case Poisson.name => Poisson | ||
case Gamma.name => Gamma | ||
def fromModel(model: GeneralizedLinearRegressionBase): Family = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to fromParams
, we extract family
and variancePower
from the Params
which is the superclass of GLR estimator and model. And actually we use this function for both estimator(L279) and model(L974).
* Tweedie exponential family distribution. | ||
* This includes the special cases of Gaussian, Poisson and Gamma. | ||
*/ | ||
private[regression] class TweedieFamily(private val variancePower: Double) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TweedieFamily
-> Tweedie
, we don't add suffix for other family class/object.
extends Family("tweedie") { | ||
|
||
/* | ||
The canonical link is 1 - variancePower. Except for the special cases of Gaussian, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The canonical link is 1 - variancePower
, could you clarify this to make it more clear?
require(y >= 0.0, s"The response variable of the specified distribution " + | ||
s"should be non-negative, but got $y") | ||
} else if (variancePower >= 2.0) { | ||
require(y > 0.0, s"The response variable of the specified distribution " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
* Gaussian exponential family distribution. | ||
* The default link for the Gaussian family is the identity link. | ||
*/ | ||
private[regression] object Gaussian extends TweedieFamily(0.0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should keep the concrete implementation of variance, deviance and aic
for Gaussian, Poisson and Gamma, the main reasons are:
- These functions were called very frequently, the concrete implementation in subclasses should be more efficient.
- It's helpful to locate errors or bugs.
Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog, | ||
Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt, | ||
Gamma -> Inverse, Gamma -> Identity, Gamma -> Log | ||
"gaussian" -> Identity, "gaussian" -> Log, "gaussian" -> Inverse, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you are right. It needs an extra global object to avoid error-prone which may a little expensive. I'm ok with using string.
|
||
/** Set of link names that GeneralizedLinearRegression supports. */ | ||
private[regression] lazy val supportedLinkNames = supportedFamilyAndLinkPairs.map(_._2.name) | ||
|
||
private[regression] val epsilon: Double = 1E-16 | ||
|
||
/** Constant used in initialization and deviance to avoid numerical issues. */ | ||
private[regression] val delta: Double = 0.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK to put it here.
@@ -578,6 +578,100 @@ class GeneralizedLinearRegressionSuite | |||
} | |||
} | |||
|
|||
test("generalized linear regression: tweedie family against glm") { | |||
/* | |||
R code: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
library(statmod)
which can help users to reproduce this test case.
@yanboliang Thanks much for the detailed comments. I have addressed all of them in the new commits. Please take another look. |
@srowen @yanboliang |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is down to small items now, I think it's pretty good.
* | ||
* @param name family name: "gaussian", "binomial", "poisson" or "gamma". | ||
* @param params a GenerealizedLinearRegressionBase object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in GenerealizedLinearRegressionBase; this type of doc doesn't do anything though because the type is already documented. It should say something non-trivial
|
||
/** Set of link names that GeneralizedLinearRegression supports. */ | ||
private[regression] lazy val supportedLinkNames = supportedFamilyAndLinkPairs.map(_._2.name) | ||
|
||
private[regression] val epsilon: Double = 1E-16 | ||
|
||
/** Constant used in initialization and deviance to avoid numerical issues. */ | ||
private[regression] val delta: Double = 0.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not a companion object for TweedieFamily though? that just seems easy and more correct
@srowen Made a new commit according to your suggestion. Everything looking good now? |
@actuaryzhang I'm on travel these days, and will make another pass in a few days. Thanks. |
@yanboliang Did you get a chance to take another look at this? Thanks. |
@yanboliang Finally, the test is done. Is there anything else needed for this PR, other than the pending fresh test? |
@srowen @yanboliang @felixcheung @jkbradley Could you help kick off the new test please? Thanks. |
jenkins add to whitelist |
jenkins test this please |
Test build #71643 has started for PR 16344 at commit |
Could anybody help me understand what's causing this test to fail? I see several other ML PR failing as well, with the same error message like below:
Was this related to the recent commit on instrumentation #15671? How do I fix this? Thanks! |
Jenkins, retest this please. |
@actuaryzhang This test failure is caused by Jenkins was not stable, you just need to retest if you encounter similar issue. |
Test build #71686 has finished for PR 16344 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only some comments about documents, otherwise, looks good to me. Thanks!
require(supportedFamilyAndLinkPairs.contains( | ||
Family.fromName($(family)) -> Link.fromName($(link))), "Generalized Linear Regression " + | ||
s"with ${$(family)} family does not support ${$(link)} link function.") | ||
if ($(family) == "tweedie") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$(family).toLowerCase == "tweedie"
, see #16516, change here and other places.
case Gamma.name => Gamma | ||
def fromParams(params: GeneralizedLinearRegressionBase): Family = { | ||
params.getFamily match { | ||
case "gaussian" => Gaussian |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert to Gaussian.name
here and bellow, which is less error-prone.
*/ | ||
def apply(params: GeneralizedLinearRegressionBase): FamilyAndLink = { | ||
val familyObj = Family.fromParams(params) | ||
val linkObj = if ((params.getFamily != "tweedie" && params.isDefined(params.link)) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isSet
is more accurate than isDefined
at here, since there is always no default values for both of them. I knew the original code used isDefined
, but it's better we can correct them.
|
||
var idx = 0 | ||
for (fitIntercept <- Seq(false, true); linkPower <- Seq(0.0, 1.0, -1.0)) { | ||
for (variancePower <- Seq(1.6, 2.5)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
for (fitIntercept <- Seq(false, true);
linkPower <- Seq(0.0, 1.0, -1.0);
variancePower <- Seq(1.6, 2.5)) {
|
||
/** @group getParam */ | ||
@Since("2.0.0") | ||
def getFamily: String = $(family) | ||
|
||
/** | ||
* Param for the power in the variance function of the Tweedie distribution which provides | ||
* the relationship between the variance and mean of the distribution. | ||
* Used only for the Tweedie family. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Only applicable for "tweedie" family.
should be better?
* Gets the [[Family]] object from its name. | ||
* Gets the [[Family]] object based on family and variancePower. | ||
* 1) retrieve object based on family name | ||
* 2) if family name is tweedie, retrieve object based on variancePower |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the following document be better?
Gets the Family object based on param family and variancePower.
If param family was set with "gaussian", "binomial", "poisson" or "gamma", return the corresponding object directly; otherwise, construct a Tweedie object according to variancePower.
@@ -620,25 +779,67 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine | |||
private[regression] object Link { | |||
|
|||
/** | |||
* Gets the [[Link]] object from its name. | |||
* Gets the [[Link]] object based on link or linkPower. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gets the Link object based on param family, link and linkPower.
If param family was set with "tweedie", return or construct link function object according to linkPower; otherwise, return link function object according to link.
* | ||
* @param name link name: "identity", "logit", "log", | ||
* "inverse", "probit", "cloglog" or "sqrt". | ||
* @param params the parameter map containing link and link power |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
link and link power
-> family, link and linkPower
1.0, 3.0, 2.0, 1.0, | ||
0.0, 4.0, 3.0, 3.0), 4, 4, byrow = TRUE)) | ||
|
||
f <- glm(V1 ~ -1 + V3 + V4, data = df, weights = V2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change f
to model
, and add summary(model)
at next line. It's helpful for users to reproduce the result in R.
|
||
Number of Fisher Scoring iterations: 11 | ||
|
||
residuals(model, type="pearson") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typos here? I guess you paste irrelevant results for the following residuals
, they should be consistent with L1279-L1281.
Test build #71817 has finished for PR 16344 at commit
|
@yanboliang Thanks so much for your detailed review. Your suggestions make lots of sense and I have included all of them in the new commit. Let me know if there is any other change needed. |
Test build #71823 has finished for PR 16344 at commit
|
This looks good to me. cc @jkbradley @srowen @sethah for another pass if you have time. Thanks. |
Merged into master. If there are comments from others, we can address them in follow-up work. Thanks. |
## What changes were proposed in this pull request? I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as Gaussian, Poisson and Gamma families are a special case of the Tweedie https://en.wikipedia.org/wiki/Tweedie_distribution. yanboliang srowen sethah Author: actuaryzhang <[email protected]> Author: Wayne Zhang <[email protected]> Closes apache#16344 from actuaryzhang/tweedie.
## What changes were proposed in this pull request? I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as Gaussian, Poisson and Gamma families are a special case of the Tweedie https://en.wikipedia.org/wiki/Tweedie_distribution. yanboliang srowen sethah Author: actuaryzhang <[email protected]> Author: Wayne Zhang <[email protected]> Closes apache#16344 from actuaryzhang/tweedie.
…on.linkPower Add Scaladoc for GeneralizedLinearRegression.linkPower default value Follow-up to apache#16344 Author: Joseph K. Bradley <[email protected]> Closes apache#17069 from jkbradley/tweedie-comment.
…on.linkPower Add Scaladoc for GeneralizedLinearRegression.linkPower default value Follow-up to apache#16344 Author: Joseph K. Bradley <[email protected]> Closes apache#17069 from jkbradley/tweedie-comment.
Update GLM documentation to include the Tweedie distribution. apache#16344 jkbradley yanboliang Author: actuaryzhang <[email protected]> Closes apache#17103 from actuaryzhang/doc.
## What changes were proposed in this pull request? Port Tweedie GLM #16344 to SparkR felixcheung yanboliang ## How was this patch tested? new test in SparkR Author: actuaryzhang <[email protected]> Closes #16729 from actuaryzhang/sparkRTweedie.
What changes were proposed in this pull request?
I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as Gaussian, Poisson and Gamma families are a special case of the Tweedie https://en.wikipedia.org/wiki/Tweedie_distribution.
@yanboliang @srowen @sethah