
[SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR #16729

Closed
actuaryzhang wants to merge 29 commits into master from sparkRTweedie

Conversation

actuaryzhang (Contributor)

What changes were proposed in this pull request?

Port Tweedie GLM #16344 to SparkR

@felixcheung @yanboliang

How was this patch tested?

new test in SparkR

@SparkQA commented Jan 28, 2017

Test build #72111 has finished for PR 16729 at commit 852dd6e.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 28, 2017

Test build #72112 has finished for PR 16729 at commit 5aa4ae7.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 28, 2017

Test build #72114 has finished for PR 16729 at commit 3682692.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -77,6 +77,18 @@ test_that("spark.glm and predict", {
out <- capture.output(print(summary(model)))
expect_true(any(grepl("Dispersion parameter for gamma family", out)))

# tweedie family
require(statmod)
@felixcheung (Member) commented Jan 28, 2017

We can't require this as of now. We would need to update Jenkins; otherwise it will keep failing as it does right now, because the statmod package is not installed on Jenkins:

spark.glm and predict (@test_mllib_regression.R#81) - there is no package called 'statmod'

@felixcheung (Member) commented Jan 28, 2017

In fact, library is more correct here, since it fails immediately if the package isn't installed, instead of the warning we currently see followed by a failure later.
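For context, a minimal sketch of the behavioral difference being described here, using a hypothetical package name:

```r
# Hypothetical package name; not part of the patch.
# require() only warns and returns FALSE when a package is missing, so
# execution continues and fails later; library() stops immediately.
ok <- require("someMissingPkg")
# Warning message: there is no package called 'someMissingPkg'
ok
# [1] FALSE
library("someMissingPkg")
# Error in library("someMissingPkg") : there is no package called 'someMissingPkg'
```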

@@ -84,6 +84,12 @@ setClass("IsotonicRegressionModel", representation(jobj = "jobj"))
#' # can also read back the saved model and print
#' savedModel <- read.ml(path)
#' summary(savedModel)
#'
#' # fit tweedie model
#' require(statmod)
@felixcheung (Member) commented Jan 28, 2017

generally people use library instead of require

@felixcheung (Member) commented Jan 28, 2017

I did look into this... I think it's great if statmod is there and we support it, but I'm concerned that we can't enable the tweedie family without an external dependency, and that we barely really depend on it. As of now it won't work when one sets family = "tweedie" as a string, since it will still look for a function of that name.

Is there a way to expose this in the API without a hard dependency on the tweedie family defined in statmod?
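For context, spark.glm follows the usual stats::glm-style family resolution, which is why a bare "tweedie" string fails unless a function named tweedie (e.g. statmod's) is visible on the search path. A condensed, illustrative sketch of that idiom (the helper name is ours, not SparkR's):

```r
# Illustrative only; condensed from the usual R family-resolution idiom.
# With family = "tweedie", get() below errors unless a function named
# "tweedie" (e.g. statmod::tweedie) can be found on the search path.
resolveFamily <- function(family) {
  if (is.character(family)) {
    family <- get(family, mode = "function", envir = parent.frame())
  }
  if (is.function(family)) {
    family <- family()
  }
  family
}
```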

@@ -84,6 +84,12 @@ setClass("IsotonicRegressionModel", representation(jobj = "jobj"))
#' # can also read back the saved model and print
#' savedModel <- read.ml(path)
#' summary(savedModel)
#'
@felixcheung (Member)

Please also update L56 for the documentation. We should update the programming guide and vignettes too.

@@ -109,7 +125,8 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
# For known families, Gamma is upper-cased
jobj <- callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper",
"fit", formula, data@sdf, tolower(family$family), family$link,
tol, as.integer(maxIter), as.character(weightCol), regParam)
tol, as.integer(maxIter), as.character(weightCol), regParam,
as.double(variancePower), as.double(linkPower))
@felixcheung (Member) commented Jan 28, 2017

We probably don't need as.double here, since the value is either set to fixed values (L116) or comes from a calculation (L112). Instead, we should check that var.power and link.power are within the correct range; I'm not sure whether the tweedie function does that.
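As an illustration only (not code from this patch), a range check along these lines could be added; it relies on the standard Tweedie constraint that variance powers strictly between 0 and 1 are not valid:

```r
# Hypothetical argument check. Tweedie exponential-dispersion models are not
# defined for variance powers strictly between 0 and 1.
checkTweediePowers <- function(var.power, link.power) {
  stopifnot(is.numeric(var.power), length(var.power) == 1,
            is.numeric(link.power), length(link.power) == 1)
  if (var.power > 0 && var.power < 1) {
    stop("var.power must not lie in (0, 1)")
  }
  invisible(TRUE)
}
```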

prediction <- predict(model, training)
expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "double")
vals <- collect(select(prediction, "prediction"))
rVals <- suppressWarnings(predict(
@felixcheung (Member)

why do we need suppressWarnings here?

.setFitIntercept(rFormula.hasIntercept)
.setTol(tol)
.setMaxIter(maxIter)
.setWeightCol(weightCol)
.setRegParam(regParam)
.setFeaturesCol(rFormula.getFeaturesCol)
// set variancePower and linkPower if family is tweedie; otherwise, set link function
if (family.toLowerCase == "tweedie") {
glr = glr.setVariancePower(variancePower).setLinkPower(linkPower)
@felixcheung (Member)

Do we need to assign glr = here? Generally the setter methods update the instance in place, so the assignment shouldn't be necessary.

model <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
family = tweedie(var.power = 1.2, link.power = 1.0))
prediction <- predict(model, training)
expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "double")
@felixcheung (Member)

you might want to use dtypes instead?

@actuaryzhang (Contributor, Author)

Would you remind me what dtypes is and why we need to use it here? Thanks.
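For reference, dtypes in SparkR returns each column's name and declared Spark type without collecting any rows, so the column type can be checked directly. A rough sketch (assuming the prediction column is appended last):

```r
# dtypes reports declared column types, so no rows need to be collected.
prediction <- predict(model, training)
tail(dtypes(prediction), 1)  # last element expected to be c("prediction", "double")
```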

@@ -143,7 +150,12 @@ private[r] object GeneralizedLinearRegressionWrapper
val rDeviance: Double = summary.deviance
val rResidualDegreeOfFreedomNull: Long = summary.residualDegreeOfFreedomNull
val rResidualDegreeOfFreedom: Long = summary.residualDegreeOfFreedom
val rAic: Double = summary.aic
val rAic: Double = if (family.toLowerCase == "tweedie" &&
!Array(0.0, 1.0, 2.0).contains(variancePower)) {
@felixcheung (Member)

We are comparing double values here; do you know how reliable that is? Should the comparison use an epsilon?

@actuaryzhang (Contributor, Author)

Thanks for the suggestion. Changed it to comparison instead.

@SparkQA commented Jan 29, 2017

Test build #72131 has finished for PR 16729 at commit fb66ce0.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang (Contributor, Author)

@felixcheung Thanks so much for your quick and detailed review. I have made a new commit that removes the dependency on statmod and fixes the issues you pointed out. The major change is adding variancePower and linkPower as arguments to spark.glm, to avoid using the tweedie family from statmod. Let me know if this design is reasonable; then I can update the additional docs and the vignette. Thanks!

@SparkQA commented Jan 29, 2017

Test build #72132 has finished for PR 16729 at commit 0d722fd.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 30, 2017

Test build #72136 has finished for PR 16729 at commit d11fc4b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 30, 2017

Test build #72137 has finished for PR 16729 at commit 4c24158.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 6, 2017

Test build #73964 has finished for PR 16729 at commit ef65adc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang (Contributor, Author) commented Mar 6, 2017

@felixcheung Sorry for taking so long with this update.

I think your first suggestion makes the most sense, i.e., we do not expose the internal tweedie.
When statmod is loaded, users can use tweedie directly (from statmod); otherwise, they can use SparkR:::tweedie, which has the same syntax.

I have made this work. The following shows that it now works both when statmod is not loaded (using SparkR:::tweedie) and when statmod is loaded (using tweedie).

Let me know if there are any other issues. Thanks.

training <- suppressWarnings(createDataFrame(iris))
model1 <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
                     family = SparkR:::tweedie(var.power = 1.2, link.power = 1.0))
summary(model1)$coefficients

                     Estimate Std. Error    t value     Pr(>|t|)
(Intercept)         1.7009666 0.22970461   7.405017 9.638512e-12
Sepal_Length        0.3436703 0.04518882   7.605206 3.200329e-12
Species_versicolor -0.9703190 0.07090188 -13.685377 0.000000e+00
Species_virginica  -0.9852650 0.09129919 -10.791607 0.000000e+00

library(statmod)
model2 <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
                     family = tweedie(var.power = 1.2, link.power = 1.0))
summary(model2)$coefficients
                     Estimate Std. Error    t value     Pr(>|t|)
(Intercept)         1.7009666 0.22970461   7.405017 9.638512e-12
Sepal_Length        0.3436703 0.04518882   7.605206 3.200329e-12
Species_versicolor -0.9703190 0.07090188 -13.685377 0.000000e+00
Species_virginica  -0.9852650 0.09129919 -10.791607 0.000000e+00

@felixcheung (Member)

Thanks for working on this. To clarify: this only works with SparkR:::tweedie (i.e. three colons), since it would be a private implementation?

@actuaryzhang (Contributor, Author)

@felixcheung Yes, the SparkR tweedie is not exported. See below.

 model1 <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
+                     family = SparkR::tweedie(var.power = 1.2, link.power = 1.0))
Error: 'tweedie' is not an exported object from 'namespace:SparkR'

@actuaryzhang (Contributor, Author)

@felixcheung Could you take a look at this new fix when you get a chance? Thanks.

@felixcheung (Member)

Yeah, I'm sorry if it was confusing. I was referring to SparkR::tweedie (two colons), not SparkR:::tweedie (three colons), which was why I wasn't sure it could be done.

In the past there have been concerns about exposing methods privately, so I'm not sure we want to encourage accessing the tweedie function that way.

Perhaps then #3 would be the only option (and it would be like Python).

@actuaryzhang (Contributor, Author)

@felixcheung If we go with #3, do we still want compatibility with statmod::tweedie? It's confusing to have two different ways of specifying the same model.

@felixcheung (Member)

@actuaryzhang that's true, it's not ideal.

This is a somewhat unusual case for R, for several reasons.
In my head the guiding principles are:

  • we avoid depending on another package (e.g. statmod)
  • we avoid naming a method/class in a way that masks another popular package
  • we can't ask users to call into a private API
  • we could use another package for convenience if it's loaded, but generally (I'd agree) we should avoid that, to minimize confusion or unpredictable behavior.

But since this method has the odd design of taking an R function (an R glm family) and not really using it, I thought it made sense to allow the user to pass it in. That's why I suggested earlier:

add var.power and link.power as parameters? I realize that is closer to what you had originally, but in addition to the two new parameters we could also check whether tolower(family$family) == "tweedie" and, if so, get family$var.power instead. That way it would work when statmod is there, and it would still work when it is not. (In addition, for consistency, we would need to make passing "tweedie" as a string also work, but that wouldn't be much different from tweedie being a function.)

In the case where statmod is loaded and the user passes tweedie as the family param, do we not need to explicitly check for it here, since it is a valid function? If we are going down this path, perhaps we should have a whitelist instead of a blacklist?

Also, the other concern is that we have another glm signature where we "overload" the existing stats::glm one. In that case we can't add additional params like var.power or link.power, I think?

@actuaryzhang (Contributor, Author) commented Mar 9, 2017

@felixcheung OK, here is a new implementation of #3. It now works in two ways:

  1. family = "tweedie" + variancePower + linkPower
  2. When statmod is available, tweedie()

Both work for spark.glm and the overloaded glm. Please take another look. Thanks.

# 1. Use variancePower and linkPower directly
> model <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
+                      family = "tweedie", variancePower = 1.2, linkPower = 0.0)
> summary(model)$coefficients
                     Estimate Std. Error    t value     Pr(>|t|)
(Intercept)         0.6455411 0.07672839   8.413327 3.330669e-14
Sepal_Length        0.1169143 0.01508433   7.750714 1.425526e-12
Species_versicolor -0.3224752 0.02345653 -13.747781 0.000000e+00
Species_virginica  -0.3282173 0.03042303 -10.788450 0.000000e+00

# 2. Use statmod
> library(statmod)
> model <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species, family = tweedie(1.2, 0))
> summary(model)$coefficients
                     Estimate Std. Error    t value     Pr(>|t|)
(Intercept)         0.6455411 0.07672839   8.413327 3.330669e-14
Sepal_Length        0.1169143 0.01508433   7.750714 1.425526e-12
Species_versicolor -0.3224752 0.02345653 -13.747781 0.000000e+00
Species_virginica  -0.3282173 0.03042303 -10.788450 0.000000e+00

@SparkQA commented Mar 9, 2017

Test build #74242 has finished for PR 16729 at commit 5ce4c84.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 9, 2017

Test build #74243 has finished for PR 16729 at commit aeeb3f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang (Contributor, Author)

One other change I could make is to rename variancePower and linkPower to var.power and link.power, to be consistent with statmod. But I would like to get your feedback on this new design first.

@felixcheung (Member)

I like the example in this implementation, thanks!
Yes, I think we should name them var.power and link.power.

@felixcheung (Member) left a review comment

Looking good, thanks for working on this!
You can follow up with the programming guide change separately if you want.

summary(tweedieGLM1)
```
We can try other distributions in the tweedie family, for example, a compound Poisson distribution with a log link:
```{r}
tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = SparkR:::tweedie(1.2, 0.0))
tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie",
variancePower = 1.2, linkPower = 0.0)
summary(tweedieGLM2)
@felixcheung (Member)

Let's add an example with statmod too, either here or in the roxygen2 API doc (the latter might be a better place?).

#'
#' Note that there are two ways to specify the tweedie family.
#' a) Set \code{family = "tweedie"} and specify the variancePower and linkPower
#' b) When package \code{statmod} is loaded, the tweedie family is specified using the
@felixcheung (Member)

roxygen2 will collapse these two lines; I suggest separating them with a semicolon or using \item.

@felixcheung (Member)

basically roxygen2 trims all the "insignificant whitespace"
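For illustration, one way to keep the two alternatives from being collapsed is an explicit \itemize block (the wording here is only a sketch, not taken from the final patch):

```r
#' Note that there are two ways to specify the tweedie family:
#' \itemize{
#'  \item Set \code{family = "tweedie"} and specify var.power and link.power.
#'  \item When package \code{statmod} is loaded, specify the tweedie family via
#'        its \code{tweedie()} family definition.
#' }
```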

@@ -100,6 +120,12 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
print(family)
stop("'family' not recognized")
}
# Handle when family = statmod::tweedie()
if (tolower(family$family) == "tweedie" && !is.null(family$variance)) {
@felixcheung (Member)

I assume it handles the "fake" family created on L111 correctly? That one doesn't have a variance component.

@actuaryzhang (Contributor, Author)

This part only handles the case where statmod::tweedie is specified: it retrieves var.power and link.power and constructs a list with the family name and link name to be used.
The check for a non-null variance is there to skip the "fake" family; all we need when specifying family = "tweedie" is a list with the family name and link name.
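To make the retrieval concrete: for a Tweedie family, variance(mu) = mu^var.power and, for a power link, linkfun(mu) = mu^link.power, so evaluating at mu = e and taking logs recovers the powers (the log link correctly yields 0). A small sketch, assuming statmod is installed:

```r
# Recover the powers from a statmod tweedie() family object,
# as the spark.glm change in this PR does.
library(statmod)
fam <- tweedie(var.power = 1.2, link.power = 0.5)
log(fam$variance(exp(1)))  # 1.2
log(fam$linkfun(exp(1)))   # 0.5
```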

family <- get(family, mode = "function", envir = parent.frame())
# Handle when family = "tweedie"
if (tolower(family) == "tweedie") {
family <- list(family = "tweedie", link = "linkNotUsed")
@felixcheung (Member)

nit: I think you can set link = NULL

if (tolower(family$family) == "tweedie" && !is.null(family$variance)) {
variancePower <- log(family$variance(exp(1)))
linkPower <- log(family$linkfun(exp(1)))
family <- list(family = "tweedie", link = "linkNotUsed")
@felixcheung (Member)

ditto, link = NULL

#'
#' # fit tweedie model
#' model <- spark.glm(df, Freq ~ Sex + Age, family = "tweedie",
#' variancePower = 1.2, linkPower = 0)
@felixcheung (Member)

could you add an example with statmod?

@actuaryzhang (Contributor, Author)

@felixcheung Thanks for the feedback. I made a new commit that:

  1. changes variancePower and linkPower to var.power and link.power;
  2. uses link = NULL for the tweedie family;
  3. adds an example of using statmod.

Let me know if there is anything else needed.

@SparkQA commented Mar 13, 2017

Test build #74417 has finished for PR 16729 at commit 4cffc40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) left a review comment

Looking good! Just this earlier comment: #16729 (comment)

@actuaryzhang (Contributor, Author)

Sorry that I forgot to address that comment. Fixed now.

@SparkQA commented Mar 13, 2017

Test build #74423 has finished for PR 16729 at commit 0b496a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang (Contributor, Author)

@felixcheung Could you merge this please? Thanks!

@felixcheung (Member)

merged to master

@asfgit asfgit closed this in f6314ea Mar 14, 2017
@actuaryzhang actuaryzhang deleted the sparkRTweedie branch May 9, 2017 05:17