Commit

update vignettes
actuaryzhang committed Mar 9, 2017
1 parent c11e57c commit 5ce4c84
Showing 3 changed files with 19 additions and 17 deletions.
15 changes: 7 additions & 8 deletions R/pkg/R/mllib_regression.R
@@ -114,19 +114,18 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
  }
  }
  if (is.function(family)) {
- # family = statmod::tweedie()
- if (tolower(family$family) == "tweedie") {
- family <- list(family = "tweedie", link = "linkNotUsed")
- variancePower <- log(family$variance(exp(1)))
- linkPower <- log(family$linkfun(exp(1)))
- } else {
- family <- family()
- }
+ family <- family()
  }
  if (is.null(family$family)) {
  print(family)
  stop("'family' not recognized")
  }
+ # family = statmod::tweedie()
+ if (tolower(family$family) == "tweedie" && !is.null(family$variance)) {
+ variancePower <- log(family$variance(exp(1)))
+ linkPower <- log(family$linkfun(exp(1)))
+ family <- list(family = "tweedie", link = "linkNotUsed")
+ }

  formula <- paste(deparse(formula), collapse = "")
  if (!is.null(weightCol) && weightCol == "") {
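
The relocated tweedie block recovers both power parameters by evaluating the family's functions at e and taking the log: if the variance function is mu^p, then log(e^p) = p, and for a log link (link power 0) log(log(e)) = 0. A minimal base-R sketch of that idea follows. `toyTweedie` and `recoverPowers` are hypothetical names introduced here for illustration, and the shape of the family object is an assumption modelled on a statmod-style tweedie family, not the statmod code itself.

```r
# Hypothetical stand-in for a statmod-style tweedie family object.
# Assumption: variance(mu) = mu^var.power, and linkfun(mu) = mu^link.power,
# or log(mu) when link.power is 0. This is not the statmod implementation.
toyTweedie <- function(var.power = 0, link.power = 1 - var.power) {
  linkfun <- if (link.power == 0) {
    function(mu) log(mu)
  } else {
    function(mu) mu^link.power
  }
  list(family = "Tweedie",
       variance = function(mu) mu^var.power,
       linkfun = linkfun)
}

# Same recovery trick as in the diff: evaluate at e, then take the log.
recoverPowers <- function(family) {
  c(variancePower = log(family$variance(exp(1))),
    linkPower = log(family$linkfun(exp(1))))
}

powers <- recoverPowers(toyTweedie(1.2, 0.0))
defaultPowers <- recoverPowers(toyTweedie(2.0))
print(powers)
print(defaultPowers)
```

The `!is.null(family$variance)` guard in the new code matters because a family given as the plain string `"tweedie"` carries no `variance` function to evaluate; in that case the powers come from the `variancePower` and `linkPower` arguments instead.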
12 changes: 6 additions & 6 deletions R/pkg/inst/tests/testthat/test_mllib_regression.R
@@ -91,9 +91,9 @@ test_that("spark.glm and predict", {
  #' print(coef(rModel))

  rCoef <- c(0.6455409, 0.1169143, -0.3224752, -0.3282174)
- rVals <- as.numeric(model.matrix(Sepal.Width ~ Sepal.Length + Species,
- data = iris) %*% rCoef)
- expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
+ rVals <- exp(as.numeric(model.matrix(Sepal.Width ~ Sepal.Length + Species,
+ data = iris) %*% rCoef))
+ expect_true(all(abs(rVals - vals) < 1e-5), rVals - vals)

  # Test stats::predict is working
  x <- rnorm(15)
@@ -281,9 +281,9 @@ test_that("glm and predict", {
  #' print(coef(rModel))

  rCoef <- c(0.6455409, 0.1169143, -0.3224752, -0.3282174)
- rVals <- as.numeric(model.matrix(Sepal.Width ~ Sepal.Length + Species,
- data = iris) %*% rCoef)
- expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
+ rVals <- exp(as.numeric(model.matrix(Sepal.Width ~ Sepal.Length + Species,
+ data = iris) %*% rCoef))
+ expect_true(all(abs(rVals - vals) < 1e-5), rVals - vals)

  # Test stats::predict is working
  x <- rnorm(15)
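
The updated expectations wrap the linear predictor in `exp()`, which is what a model fitted with a log link calls for: the predicted mean is exp(X %*% beta), not X %*% beta. The sketch below checks this relationship in base R. The Gamma family is an assumption chosen only for illustration; the diff does not show which family the tests actually fit.

```r
# Assumption: a Gamma GLM with a log link, chosen only for illustration.
model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris,
             family = Gamma(link = "log"))

# Linear predictor on the link scale: X %*% beta.
X <- model.matrix(Sepal.Width ~ Sepal.Length + Species, data = iris)
eta <- as.numeric(X %*% coef(model))

# Inverting the log link gives predictions on the response scale.
manual <- exp(eta)
predicted <- as.numeric(predict(model, type = "response"))

# The two should agree up to floating-point error.
maxDiff <- max(abs(manual - predicted))
print(maxDiff)
```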
9 changes: 6 additions & 3 deletions R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -682,7 +682,9 @@ There are three ways to specify the `family` argument.

  * Result returned by a family function, e.g. `family = poisson(link = log)`.

- * Note that when package `statmod` is loaded, the tweedie family is specified as `tweedie(var.power, link.power)`. Otherwise, one can use the SparkR internal definition `SparkR:::tweedie(var.power, link.power)`. In the above, `var.power` is the power index of the variance function and `link.power` is the index of the the power link function (the default value is `link.power = 1.0 - var.power`). This is consistent with the `tweedie` family defined in the `statmod` package. Some examples: `family = tweedie(0.0)` is gaussian with identity link, `family = tweedie(1.0)` poisson with log link, `family = tweedie(2.0)` Gamma with inverse link, and `family = tweedie(1.5, 0.0)` compound Poisson with log link.
+ * Note that there are two ways to specify the tweedie family:
+ a) Set `family = "tweedie"` and specify the `variancePower` and `linkPower`
+ b) When package `statmod` is loaded, the tweedie family is specified using the family definition therein, i.e., `tweedie()`.

  For more information regarding the families and their link functions, see the Wikipedia page [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model).

@@ -700,12 +702,13 @@ head(select(gaussianFitted, "model", "prediction", "mpg", "wt", "hp"))

  The following is the same fit using the tweedie family:
  ```{r}
- tweedieGLM1 <- spark.glm(carsDF, mpg ~ wt + hp, family = SparkR:::tweedie(0.0))
+ tweedieGLM1 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie", variancePower = 0.0)
  summary(tweedieGLM1)
  ```
  We can try other distributions in the tweedie family, for example, a compound Poisson distribution with a log link:
  ```{r}
- tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = SparkR:::tweedie(1.2, 0.0))
+ tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie",
+ variancePower = 1.2, linkPower = 0.0)
  summary(tweedieGLM2)
  ```
