
[SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR #16729

Closed
actuaryzhang wants to merge 29 commits into master from sparkRTweedie

Conversation

actuaryzhang (Contributor)

What changes were proposed in this pull request?

Port Tweedie GLM #16344 to SparkR

@felixcheung @yanboliang

How was this patch tested?

new test in SparkR

@SparkQA commented Jan 28, 2017

Test build #72111 has finished for PR 16729 at commit 852dd6e.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 28, 2017

Test build #72112 has finished for PR 16729 at commit 5aa4ae7.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 28, 2017

Test build #72114 has finished for PR 16729 at commit 3682692.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -77,6 +77,18 @@ test_that("spark.glm and predict", {
out <- capture.output(print(summary(model)))
expect_true(any(grepl("Dispersion parameter for gamma family", out)))

# tweedie family
require(statmod)
@felixcheung (Member) commented Jan 28, 2017

We can't require this as of now. We would need to update Jenkins; otherwise it will keep failing as it does right now, because the statmod package is not installed on Jenkins:

spark.glm and predict (@test_mllib_regression.R#81) - there is no package called 'statmod'

@felixcheung (Member) commented Jan 28, 2017

In fact, library is more correct here, since it fails immediately if the package isn't installed, instead of the warning we currently see followed by a failure later.
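For context, a minimal sketch of the behavioral difference being described here, using a hypothetical package name:

```r
# Hypothetical package name; not part of the patch.
# require() only warns and returns FALSE when a package is missing, so
# execution continues and fails later; library() stops immediately.
ok <- require("someMissingPkg")
# Warning message: there is no package called 'someMissingPkg'
ok
# [1] FALSE
library("someMissingPkg")
# Error in library("someMissingPkg") : there is no package called 'someMissingPkg'
```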

@@ -84,6 +84,12 @@ setClass("IsotonicRegressionModel", representation(jobj = "jobj"))
#' # can also read back the saved model and print
#' savedModel <- read.ml(path)
#' summary(savedModel)
#'
#' # fit tweedie model
#' require(statmod)
@felixcheung (Member) commented Jan 28, 2017

generally people use library instead of require

@felixcheung (Member) commented Jan 28, 2017

I did look into this... I think it's great if statmod is there and we support it, but I'm concerned that we can't enable the tweedie family without an external dependency, and that we barely really depend on it. As of now it won't work when one sets family = "tweedie" as a string, since it will still look for a function of that name.

Is there a way to expose this in the API without a hard dependency on the tweedie family defined in statmod?
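For context, spark.glm follows the usual stats::glm-style family resolution, which is why a bare "tweedie" string fails unless a function named tweedie (e.g. statmod's) is visible on the search path. A condensed, illustrative sketch of that idiom (the helper name is ours, not SparkR's):

```r
# Illustrative only; condensed from the usual R family-resolution idiom.
# With family = "tweedie", get() below errors unless a function named
# "tweedie" (e.g. statmod::tweedie) can be found on the search path.
resolveFamily <- function(family) {
  if (is.character(family)) {
    family <- get(family, mode = "function", envir = parent.frame())
  }
  if (is.function(family)) {
    family <- family()
  }
  family
}
```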

@@ -84,6 +84,12 @@ setClass("IsotonicRegressionModel", representation(jobj = "jobj"))
#' # can also read back the saved model and print
#' savedModel <- read.ml(path)
#' summary(savedModel)
#'
@felixcheung (Member)

Please also update L56 for the documentation. We should update the programming guide and vignettes too.

@@ -109,7 +125,8 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
# For known families, Gamma is upper-cased
jobj <- callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper",
"fit", formula, data@sdf, tolower(family$family), family$link,
tol, as.integer(maxIter), as.character(weightCol), regParam)
tol, as.integer(maxIter), as.character(weightCol), regParam,
as.double(variancePower), as.double(linkPower))
@felixcheung (Member) commented Jan 28, 2017

We probably don't need as.double here, since the value is either set to fixed values (L116) or comes from a calculation (L112). Instead, we should check that var.power and link.power are within the correct range; I'm not sure whether the tweedie function does that.
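As an illustration only (not code from this patch), a range check along these lines could be added; it relies on the standard Tweedie constraint that variance powers strictly between 0 and 1 are not valid:

```r
# Hypothetical argument check. Tweedie exponential-dispersion models are not
# defined for variance powers strictly between 0 and 1.
checkTweediePowers <- function(var.power, link.power) {
  stopifnot(is.numeric(var.power), length(var.power) == 1,
            is.numeric(link.power), length(link.power) == 1)
  if (var.power > 0 && var.power < 1) {
    stop("var.power must not lie in (0, 1)")
  }
  invisible(TRUE)
}
```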

prediction <- predict(model, training)
expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "double")
vals <- collect(select(prediction, "prediction"))
rVals <- suppressWarnings(predict(
@felixcheung (Member)

why do we need suppressWarnings here?

.setFitIntercept(rFormula.hasIntercept)
.setTol(tol)
.setMaxIter(maxIter)
.setWeightCol(weightCol)
.setRegParam(regParam)
.setFeaturesCol(rFormula.getFeaturesCol)
// set variancePower and linkPower if family is tweedie; otherwise, set link function
if (family.toLowerCase == "tweedie") {
glr = glr.setVariancePower(variancePower).setLinkPower(linkPower)
@felixcheung (Member)

Do we need to assign glr = here? Generally the setter methods update the instance in place, so the assignment shouldn't be necessary.

model <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
family = tweedie(var.power = 1.2, link.power = 1.0))
prediction <- predict(model, training)
expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "double")
@felixcheung (Member)

you might want to use dtypes instead?

@actuaryzhang (Contributor, Author)

Would you remind me what dtypes is and why we need to use it here? Thanks.
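For reference, dtypes in SparkR returns each column's name and declared Spark type without collecting any rows, so the column type can be checked directly. A rough sketch (assuming the prediction column is appended last):

```r
# dtypes reports declared column types, so no rows need to be collected.
prediction <- predict(model, training)
tail(dtypes(prediction), 1)  # last element expected to be c("prediction", "double")
```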

@@ -143,7 +150,12 @@ private[r] object GeneralizedLinearRegressionWrapper
val rDeviance: Double = summary.deviance
val rResidualDegreeOfFreedomNull: Long = summary.residualDegreeOfFreedomNull
val rResidualDegreeOfFreedom: Long = summary.residualDegreeOfFreedom
val rAic: Double = summary.aic
val rAic: Double = if (family.toLowerCase == "tweedie" &&
!Array(0.0, 1.0, 2.0).contains(variancePower)) {
@felixcheung (Member)

We are comparing double values here; do you know how reliable that is? Should the comparison use an epsilon?

@actuaryzhang (Contributor, Author)

Thanks for the suggestion. Changed it to comparison instead.

@SparkQA commented Jan 29, 2017

Test build #72131 has finished for PR 16729 at commit fb66ce0.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang (Contributor, Author)

@felixcheung Thanks so much for your quick and detailed review. I have made a new commit that removes the dependency on statmod and fixes the issues you pointed out. The major change is adding variancePower and linkPower as arguments to spark.glm, to avoid using the tweedie family from statmod. Let me know if this design is reasonable; then I can update the additional docs and the vignette. Thanks!

@SparkQA commented Jan 29, 2017

Test build #72132 has finished for PR 16729 at commit 0d722fd.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 30, 2017

Test build #72136 has finished for PR 16729 at commit d11fc4b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 30, 2017

Test build #72137 has finished for PR 16729 at commit 4c24158.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 6, 2017

Test build #73964 has finished for PR 16729 at commit ef65adc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang (Contributor, Author) commented Mar 6, 2017

@felixcheung Sorry for taking so long with this update.

I think your first suggestion makes the most sense, i.e., we do not expose the internal tweedie.
When statmod is loaded, users can use tweedie directly (from statmod); otherwise, they can use SparkR:::tweedie, which has the same syntax.

I have made this work. The following shows that it now works both when statmod is not loaded (using SparkR:::tweedie) and when statmod is loaded (using tweedie).

Let me know if there are any other issues. Thanks.

training <- suppressWarnings(createDataFrame(iris))
model1 <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
                     family = SparkR:::tweedie(var.power = 1.2, link.power = 1.0))
summary(model1)$coefficients

                     Estimate Std. Error    t value     Pr(>|t|)
(Intercept)         1.7009666 0.22970461   7.405017 9.638512e-12
Sepal_Length        0.3436703 0.04518882   7.605206 3.200329e-12
Species_versicolor -0.9703190 0.07090188 -13.685377 0.000000e+00
Species_virginica  -0.9852650 0.09129919 -10.791607 0.000000e+00

library(statmod)
model2 <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
                     family = tweedie(var.power = 1.2, link.power = 1.0))
summary(model2)$coefficients
                     Estimate Std. Error    t value     Pr(>|t|)
(Intercept)         1.7009666 0.22970461   7.405017 9.638512e-12
Sepal_Length        0.3436703 0.04518882   7.605206 3.200329e-12
Species_versicolor -0.9703190 0.07090188 -13.685377 0.000000e+00
Species_virginica  -0.9852650 0.09129919 -10.791607 0.000000e+00

@felixcheung (Member)

Thanks for working on this. To clarify: this only works with SparkR:::tweedie (i.e. three colons), since it would be a private implementation?

@actuaryzhang (Contributor, Author)

@felixcheung Yes, the SparkR tweedie is not exported. See below.

 model1 <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
+                     family = SparkR::tweedie(var.power = 1.2, link.power = 1.0))
Error: 'tweedie' is not an exported object from 'namespace:SparkR'

@actuaryzhang (Contributor, Author)

@felixcheung Could you take a look at this new fix when you get a chance? Thanks.

@felixcheung (Member)

Yeah, I'm sorry if it was confusing. I was referring to SparkR::tweedie (two colons), not SparkR:::tweedie (three colons), which was why I wasn't sure it could be done.

In the past there have been concerns about exposing methods privately, so I'm not sure we want to encourage accessing the tweedie function that way.

Perhaps then #3 would be the only option (and it would be like Python).

@actuaryzhang (Contributor, Author)

@felixcheung If we go with #3, do we still want compatibility with statmod::tweedie? It's confusing to have two different ways of specifying the same model.

@felixcheung (Member)

@actuaryzhang that's true, it's not ideal.

This is a somewhat unusual case for R, for several reasons.
In my head the guiding principles are:

  • we avoid depending on another package (e.g. statmod)
  • we avoid naming a method/class in a way that masks another popular package
  • we can't ask users to call into a private API
  • we could use another package for convenience if it's loaded, but generally (I'd agree) we should avoid that, to minimize confusion or unpredictable behavior.

But since this method has the odd design of taking an R function (an R glm family) and not really using it, I thought it made sense to allow the user to pass it in. That's why I suggested earlier:

add var.power and link.power as parameters? I realize that is closer to what you had originally, but in addition to the two new parameters we could also check whether tolower(family$family) == "tweedie" and, if so, get family$var.power instead. That way it would work when statmod is there, and it would still work when it is not. (In addition, for consistency, we would need to make passing "tweedie" as a string also work, but that wouldn't be much different from tweedie being a function.)

In the case where statmod is loaded and the user passes tweedie as the family param, do we not need to explicitly check for it here, since it is a valid function? If we are going down this path, perhaps we should have a whitelist instead of a blacklist?

Also, the other concern is that we have another glm signature where we "overload" the existing stats::glm one. In that case we can't add additional params like var.power or link.power, I think?

@actuaryzhang (Contributor, Author) commented Mar 9, 2017

@felixcheung OK, here is a new implementation of #3. It now works in two ways:

  1. family = "tweedie" + variancePower + linkPower
  2. When statmod is available, tweedie()

Both work for spark.glm and the overloaded glm. Please take another look. Thanks.

# 1. Use variancePower and linkPower directly
> model <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
+                      family = "tweedie", variancePower = 1.2, linkPower = 0.0)
> summary(model)$coefficients
                     Estimate Std. Error    t value     Pr(>|t|)
(Intercept)         0.6455411 0.07672839   8.413327 3.330669e-14
Sepal_Length        0.1169143 0.01508433   7.750714 1.425526e-12
Species_versicolor -0.3224752 0.02345653 -13.747781 0.000000e+00
Species_virginica  -0.3282173 0.03042303 -10.788450 0.000000e+00

# 2. Use statmod
> library(statmod)
> model <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species, family = tweedie(1.2, 0))
> summary(model)$coefficients
                     Estimate Std. Error    t value     Pr(>|t|)
(Intercept)         0.6455411 0.07672839   8.413327 3.330669e-14
Sepal_Length        0.1169143 0.01508433   7.750714 1.425526e-12
Species_versicolor -0.3224752 0.02345653 -13.747781 0.000000e+00
Species_virginica  -0.3282173 0.03042303 -10.788450 0.000000e+00

@SparkQA commented Mar 9, 2017

Test build #74242 has finished for PR 16729 at commit 5ce4c84.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 9, 2017

Test build #74243 has finished for PR 16729 at commit aeeb3f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang (Contributor, Author)

One other change I could make is to rename variancePower and linkPower to var.power and link.power, to be consistent with statmod. But I would like to get your feedback on this new design first.

@felixcheung (Member)

I like the example in this implementation, thanks!
Yes, I think we should name them var.power and link.power.

@felixcheung (Member) left a review comment

Looking good, thanks for working on this!
You can follow up with the programming guide change separately if you want.

summary(tweedieGLM1)
```
We can try other distributions in the tweedie family, for example, a compound Poisson distribution with a log link:
```{r}
tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = SparkR:::tweedie(1.2, 0.0))
tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie",
variancePower = 1.2, linkPower = 0.0)
summary(tweedieGLM2)
@felixcheung (Member)

Let's add an example with statmod too, either here or in the roxygen2 API doc (the latter might be a better place?).

#'
#' Note that there are two ways to specify the tweedie family.
#' a) Set \code{family = "tweedie"} and specify the variancePower and linkPower
#' b) When package \code{statmod} is loaded, the tweedie family is specified using the
@felixcheung (Member)

roxygen2 will collapse these two lines; I suggest separating them with a semicolon or using \item.

@felixcheung (Member)

basically roxygen2 trims all the "insignificant whitespace"
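For illustration, one way to keep the two alternatives from being collapsed is an explicit \itemize block (the wording here is only a sketch, not taken from the final patch):

```r
#' Note that there are two ways to specify the tweedie family:
#' \itemize{
#'  \item Set \code{family = "tweedie"} and specify var.power and link.power.
#'  \item When package \code{statmod} is loaded, specify the tweedie family via
#'        its \code{tweedie()} family definition.
#' }
```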

@@ -100,6 +120,12 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
print(family)
stop("'family' not recognized")
}
# Handle when family = statmod::tweedie()
if (tolower(family$family) == "tweedie" && !is.null(family$variance)) {
@felixcheung (Member)

I assume it handles the "fake" family created on L111 correctly? That one doesn't have a variance component.

@actuaryzhang (Contributor, Author)

This part only handles the case where statmod::tweedie is specified: it retrieves var.power and link.power and constructs a list with the family name and link name to be used.
The check for a non-null variance is there to skip the "fake" family; all we need when specifying family = "tweedie" is a list with the family name and link name.
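To make the retrieval concrete: for a Tweedie family, variance(mu) = mu^var.power and, for a power link, linkfun(mu) = mu^link.power, so evaluating at mu = e and taking logs recovers the powers (the log link correctly yields 0). A small sketch, assuming statmod is installed:

```r
# Recover the powers from a statmod tweedie() family object,
# as the spark.glm change in this PR does.
library(statmod)
fam <- tweedie(var.power = 1.2, link.power = 0.5)
log(fam$variance(exp(1)))  # 1.2
log(fam$linkfun(exp(1)))   # 0.5
```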

family <- get(family, mode = "function", envir = parent.frame())
# Handle when family = "tweedie"
if (tolower(family) == "tweedie") {
family <- list(family = "tweedie", link = "linkNotUsed")
@felixcheung (Member)

nit: I think you can set link = NULL

if (tolower(family$family) == "tweedie" && !is.null(family$variance)) {
variancePower <- log(family$variance(exp(1)))
linkPower <- log(family$linkfun(exp(1)))
family <- list(family = "tweedie", link = "linkNotUsed")
@felixcheung (Member)

ditto, link = NULL

#'
#' # fit tweedie model
#' model <- spark.glm(df, Freq ~ Sex + Age, family = "tweedie",
#' variancePower = 1.2, linkPower = 0)
@felixcheung (Member)

could you add an example with statmod?

@actuaryzhang (Contributor, Author)

@felixcheung Thanks for the feedback. I made a new commit that:

  1. changes variancePower and linkPower to var.power and link.power;
  2. uses link = NULL for the tweedie family;
  3. adds an example of using statmod.

Let me know if there is anything else needed.

@SparkQA commented Mar 13, 2017

Test build #74417 has finished for PR 16729 at commit 4cffc40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) left a review comment

Looking good! Just this earlier comment: #16729 (comment)

@actuaryzhang (Contributor, Author)

Sorry that I forgot to address that comment. Fixed now.

@SparkQA commented Mar 13, 2017

Test build #74423 has finished for PR 16729 at commit 0b496a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang (Contributor, Author)

@felixcheung Could you merge this please? Thanks!

@felixcheung (Member)

merged to master

@asfgit asfgit closed this in f6314ea Mar 14, 2017
@actuaryzhang actuaryzhang deleted the sparkRTweedie branch May 9, 2017 05:17