Sparse model matrix, hatvalues, and HC2/HC3/HU standard errors #418
Conversation
BTW, probably need to make a few more changes to make it faster, so I'll push and circle back when I think it's in a good state!
- Add fixed effect support
- Add exact = FALSE option to approximate using the JLA algorithm (see the sketch below)
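For context, a rough sketch of the idea behind exact = FALSE follows (inferred from the option's description, not the PR's actual code): because the projection matrix P is symmetric and idempotent, P_ii = sum_j P_ij^2 = E[(P r)_i^2] for a random vector r with iid +-1 entries, so averaging over p random draws approximates the leverages.
# assumes library(Matrix) is loaded so crossprod()/solve() dispatch on a sparse X
jla_hatvalues = function(X, p = 500) {
  XtX = crossprod(X)
  acc = numeric(nrow(X))
  for (k in seq_len(p)) {
    r = sample(c(-1, 1), nrow(X), replace = TRUE)  # Rademacher draw
    Pr = X %*% solve(XtX, crossprod(X, r))         # P r = X (X'X)^{-1} X' r
    acc = acc + as.numeric(Pr)^2
  }
  acc / p  # E[(P r)_i^2] = sum_j P_ij^2 = P_ii, so this estimates the leverages
}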
@lrberge I think we are good to merge! Some dev notes for the HC2/HC3 implementation:
# retrieve the small-sample adjustment stored on the unadjusted vcov
ss_adj = attr(vcov_noAdj, "ss_adj")
if(!is.null(ss_adj)){
  # a function-valued adjustment is evaluated with the sample size and parameter count
  if(is.function(ss_adj)){
    ss_adj = ss_adj(n = n, K = K)
  }
  # drop the attribute so it does not propagate to downstream results
  attr(vcov_noAdj, "ss_adj") = NULL
}
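To illustrate the mechanism, a hypothetical function-valued ss_adj could look like this (ss_adj_example is a made-up name; the n/(n - K) factor is only an illustration, not what the PR uses for HC2/HC3):
# hypothetical function-valued small-sample adjustment, evaluated by the snippet above
ss_adj_example = function(n, K) n / (n - K)  # classic degrees-of-freedom correction
# it would be attached to the unadjusted vcov, then evaluated and stripped:
# attr(vcov_noAdj, "ss_adj") = ss_adj_example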
@kylebutts This looks very interesting. Can you provide some quick benchmarks/comparisons to show practical gains over the current approach? Ofc we'd expect memory allocations to decrease significantly, but I'm curious to see speed implications too. Thanks.
@grantmcdermott I think memory usage will be about the same in most cases, and the creation step is just slightly slower. The real advantage is not the speed at which you can generate the matrix but the uses afterwards. For example, s3alfisc/summclust#19 (comment) has a 13x speedup. Here's a quick example of the speedup from avoiding dense matrices:
library(data.table)
library(fixest)
library(Matrix)
# Test -------------------------------------------------------------------------
N = 10000
var_fes = 10
var_eps = 4
# Ensure P_ii exists
N_i = pmax(1, rpois(N, 3))
# Create dataset
df = data.table(id = rep(1:N, N_i))
df[, j := 1:.N, by = id]
df[, fe_i := rnorm(1, 0, sd = sqrt(var_fes)), by = id]
df[, x := rnorm(.N, mean = fe_i, sd = 1)]
df[, x2 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x3 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x4 := rnorm(.N, mean = fe_i, sd = 1)]
df[, y := x * 1 + fe_i + rnorm(.N, 0, sqrt(var_eps))]
# Estimate FE model
model = feols(y ~ x | id, df)
X = fixest:::sparse_model_matrix(
y ~ x | id, df, type = c("rhs", "fixef"),
combine = TRUE, collin.rm = FALSE
)
X_dense = as.matrix(X)
#> Warning in asMethod(object): sparse->dense coercion: allocating vector of size
#> 2.3 GiB
bench::mark(
Matrix::crossprod(X, df$y),
Matrix::crossprod(X_dense, df$y),
check = FALSE
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 Matrix::crossprod(X, df$y) 285µs 322µs 2620. 493KB 2.20
#> 2 Matrix::crossprod(X_dense, df$y) 488ms 499ms 2.00 89.8KB 0
Just to chime in here - there are indeed large performance advantages for the cluster jackknife CRV3 estimator as suggested by MNW (2023) if X is sparse. And if I read the MNW paper correctly, I think that many people will be interested in using the CRV3 estimator in the future =) Unfortunately, it does not play too well with (arbitrary) fixed effects, so one has to bite the bullet and create the design matrix of all dummies in most cases. Here is a quick implementation of a part of the algorithm, which works for both sparse and dense matrices:
library(Matrix)
N <- 100000
k <- 100
X <- factor(sample(1:k, N, TRUE))
df <- data.frame(X = X)
# build the sparse design matrix of dummies directly from the formula
X <- sparse.model.matrix(~ X, df)
beta <- rnorm(ncol(X))
y <- X %*% beta + rnorm(N)
cluster <- sample(letters, N, TRUE)
length(unique(cluster))
# 26 clusters
y_dense <- as.matrix(y)
X_dense <- as.matrix(X)
cluster_jackknife <- function(X, y, cluster){
  # full-sample cross products
  XX <- crossprod(X)
  Xy <- t(X) %*% y
  cluster_groups <- unique(cluster)
  # per-cluster cross products (the downdate uses X'X itself, not its inverse)
  XgXg_list <- lapply(cluster_groups, function(g) crossprod(X[cluster == g, , drop = FALSE]))
  Xgyg_list <- lapply(cluster_groups, function(g) t(X[cluster == g, , drop = FALSE]) %*% y[cluster == g])
  # compute leave-one-cluster-out regression coefs: beta_(g) = (X'X - Xg'Xg)^+ (X'y - Xg'yg)
  beta_jack <- lapply(seq_along(cluster_groups), function(g)
    MASS::ginv(as.matrix(XX - XgXg_list[[g]])) %*% as.matrix(Xy - Xgyg_list[[g]]))
  beta_jack
}
microbenchmark::microbenchmark(
cluster_jackknife(X_dense, y_dense, cluster),
cluster_jackknife(X, y, cluster),
times = 1
)
#                                          expr       min        lq      mean    median        uq       max neval
#  cluster_jackknife(X_dense, y_dense, cluster) 1794.6250 1794.6250 1794.6250 1794.6250 1794.6250 1794.6250     1
#              cluster_jackknife(X, y, cluster)  385.1208  385.1208  385.1208  385.1208  385.1208  385.1208     1
So in this case, the sparse version is around 4-5x faster. The larger k and the dimension of clusters, the larger the speed gains will be. Of course, you could argue that I could just create the model matrix myself (at the cost of some performance overhead), but another great advantage is that this PR creates the full model matrix of dummies, which will allow me to support all sorts of …
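For completeness, the jackknife coefficients would then be combined into the CRV3 variance roughly as follows (a sketch of the MNW-style formula; centering at the full-sample estimate is one common choice, and all names come from the snippet above):
beta_jack <- cluster_jackknife(X, y, cluster)
beta_hat <- MASS::ginv(as.matrix(crossprod(X))) %*% as.matrix(t(X) %*% y)  # full-sample OLS
G <- length(unique(cluster))
# CRV3: (G - 1) / G * sum_g (beta_(g) - beta_hat) (beta_(g) - beta_hat)'
V_crv3 <- (G - 1) / G *
  Reduce(`+`, lapply(beta_jack, function(b) tcrossprod(as.matrix(b) - beta_hat)))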
Figured out a faster way to implement this using the blockwise formula for the projection matrix: https://en.wikipedia.org/wiki/Projection_matrix#Blockwise_formula
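In words: writing X = [A B] with A the fixed-effect dummies and B the covariates, P = P_A + M_A B (B' M_A B)^{-1} B' M_A with M_A = I - P_A, so the diagonal of P_A is just 1 / (group size) and M_A B are the within-demeaned covariates. A minimal sketch with a single fixed effect and one covariate (reusing the simulated df from the earlier example; illustrative only, not the PR's code):
library(data.table)
dt = as.data.table(df)
dt[, x_dm := x - mean(x), by = id]  # M_A x: covariate demeaned within id
dt[, P_A := 1 / .N, by = id]        # diagonal of the dummy-block projection
# P_ii = P_A[i] + x_dm[i]^2 / sum(x_dm^2) for a single covariate
P_ii = dt$P_A + dt$x_dm^2 / sum(dt$x_dm^2)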
Before (using approx to speed it up):
library(data.table)
library(fixest)
library(Matrix)
library(tictoc)
#>
#> Attaching package: 'tictoc'
#> The following object is masked from 'package:data.table':
#>
#> shift
# devtools::load_all()
# Test -------------------------------------------------------------------------
N = 100000
var_fes = 10
var_eps = 4
# Ensure P_ii exists
N_i = pmax(2, rpois(N, 3))
# Create dataset
df = data.table(id = rep(1:N, N_i))
df[, j := 1:.N, by = id]
df[, fe_i := rnorm(1, 0, sd = sqrt(var_fes)), by = id]
df[, x := rnorm(.N, mean = fe_i, sd = 1)]
df[, x2 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x3 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x4 := rnorm(.N, mean = fe_i, sd = 1)]
df[, g := sample(1:25, .N, replace = TRUE)]
df[, y := x * 1 + fe_i + rnorm(.N, 0, sqrt(var_eps))]
df = as.data.frame(df)
# Estimate FE model
model = feols(y ~ x + x2 + x3 + x4 | id + g, df)
tic()
P_ii_orig = hatvalues(model, exact = FALSE, p = 500)
toc()
#> 12.39 sec elapsed
Created on 2023-06-17 with reprex v2.0.2
After (solving exactly):
library(data.table)
library(fixest)
library(Matrix)
library(tictoc)
#>
#> Attaching package: 'tictoc'
#> The following object is masked from 'package:data.table':
#>
#> shift
# devtools::load_all()
# Test -------------------------------------------------------------------------
N = 100000
var_fes = 10
var_eps = 4
# Ensure P_ii exists
N_i = pmax(2, rpois(N, 3))
# Create dataset
df = data.table(id = rep(1:N, N_i))
df[, j := 1:.N, by = id]
df[, fe_i := rnorm(1, 0, sd = sqrt(var_fes)), by = id]
df[, x := rnorm(.N, mean = fe_i, sd = 1)]
df[, x2 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x3 := rnorm(.N, mean = fe_i, sd = 1)]
df[, x4 := rnorm(.N, mean = fe_i, sd = 1)]
df[, g := sample(1:25, .N, replace = TRUE)]
df[, y := x * 1 + fe_i + rnorm(.N, 0, sqrt(var_eps))]
df = as.data.frame(df)
# Estimate FE model
model = feols(y ~ x + x2 + x3 + x4 | id + g, df)
tic()
P_ii_orig = hatvalues(model, exact = TRUE)
toc()
#> 1.1 sec elapsed
Created on 2023-06-17 with reprex v2.0.2
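For reference, once the leverages are available, HC2/HC3 standard errors follow the usual sandwich recipe. A rough sketch (the sparse_model_matrix call and the effect of collin.rm are assumed from this PR's examples, not verified internals; at this example's scale you would extract only the covariate block rather than form the full K x K sandwich):
e = resid(model)
w = e^2 / (1 - P_ii_orig)^2  # HC3 weights; HC2 would use e^2 / (1 - P_ii_orig)
X = fixest:::sparse_model_matrix(
  model, type = c("rhs", "fixef"), combine = TRUE, collin.rm = TRUE  # assumes redundant dummies get dropped
)
XtX = crossprod(X)                   # sparse K x K
meat = crossprod(sqrt(w) * X)        # X' diag(w) X: rows of X scaled by sqrt(w)
V = solve(XtX, t(solve(XtX, meat)))  # (X'X)^{-1} X' diag(w) X (X'X)^{-1}
se = sqrt(diag(V))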
Very cool PR! Nice work!
@lrberge I put a couple of things in this one PR. Want me to rebase onto …?
Just found a bug (column names are not correct for factor variables):
library(fixest)
est = feols(
mpg ~ factor(am) + disp | vs, mtcars
)
fixest::sparse_model_matrix(est, collin.rm = FALSE)
#> 32 x 3 sparse Matrix of class "dgCMatrix"
#> 0 1 disp
#> [1,] . 1 160.0
#> [2,] . 1 160.0
#> [3,] . 1 108.0
#> [4,] 1 . 258.0
#> [5,] 1 . 360.0
#> [6,] 1 . 225.0
#> [7,] 1 . 360.0
#> [8,] 1 . 146.7
#> [9,] 1 . 140.8
#> [10,] 1 . 167.6
#> [11,] 1 . 167.6
#> [12,] 1 . 275.8
#> [13,] 1 . 275.8
#> [14,] 1 . 275.8
#> [15,] 1 . 472.0
#> [16,] 1 . 460.0
#> [17,] 1 . 440.0
#> [18,] . 1 78.7
#> [19,] . 1 75.7
#> [20,] . 1 71.1
#> [21,] 1 . 120.1
#> [22,] 1 . 318.0
#> [23,] 1 . 304.0
#> [24,] 1 . 350.0
#> [25,] 1 . 400.0
#> [26,] . 1 79.0
#> [27,] . 1 120.3
#> [28,] . 1 95.1
#> [29,] . 1 351.0
#> [30,] . 1 145.0
#> [31,] . 1 301.0
#> [32,] . 1 121.0
fixest::sparse_model_matrix(est, collin.rm = TRUE)
#> 32 x 1 sparse Matrix of class "dgCMatrix"
#> disp
#> [1,] 160.0
#> [2,] 160.0
#> [3,] 108.0
#> [4,] 258.0
#> [5,] 360.0
#> [6,] 225.0
#> [7,] 360.0
#> [8,] 146.7
#> [9,] 140.8
#> [10,] 167.6
#> [11,] 167.6
#> [12,] 275.8
#> [13,] 275.8
#> [14,] 275.8
#> [15,] 472.0
#> [16,] 460.0
#> [17,] 440.0
#> [18,] 78.7
#> [19,] 75.7
#> [20,] 71.1
#> [21,] 120.1
#> [22,] 318.0
#> [23,] 304.0
#> [24,] 350.0
#> [25,] 400.0
#> [26,] 79.0
#> [27,] 120.3
#> [28,] 95.1
#> [29,] 351.0
#> [30,] 145.0
#> [31,] 301.0
#> [32,] 121.0
model.matrix(est)
#> factor(am)1 disp
#> Mazda RX4 1 160.0
#> Mazda RX4 Wag 1 160.0
#> Datsun 710 1 108.0
#> Hornet 4 Drive 0 258.0
#> Hornet Sportabout 0 360.0
#> Valiant 0 225.0
#> Duster 360 0 360.0
#> Merc 240D 0 146.7
#> Merc 230 0 140.8
#> Merc 280 0 167.6
#> Merc 280C 0 167.6
#> Merc 450SE 0 275.8
#> Merc 450SL 0 275.8
#> Merc 450SLC 0 275.8
#> Cadillac Fleetwood 0 472.0
#> Lincoln Continental 0 460.0
#> Chrysler Imperial 0 440.0
#> Fiat 128 1 78.7
#> Honda Civic 1 75.7
#> Toyota Corolla 1 71.1
#> Toyota Corona 0 120.1
#> Dodge Challenger 0 318.0
#> AMC Javelin 0 304.0
#> Camaro Z28 0 350.0
#> Pontiac Firebird 0 400.0
#> Fiat X1-9 1 79.0
#> Porsche 914-2 1 120.3
#> Lotus Europa 1 95.1
#> Ford Pantera L 1 351.0
#> Ferrari Dino 1 145.0
#> Maserati Bora 1 301.0
#> Volvo 142E 1 121.0
Created on 2024-05-09 with reprex v2.1.0
I've added the sparse_model_matrix implementation.
Some development notes:
- When sparse_model_matrix is passed a formula instead of a fixest object, I used fixef_terms to get the fe_vars and slope_var_list names. Then prepare_df gives me the variables (with ^ interactions). Then, making a sparse matrix with them approximately follows from what you wrote.
- When no data argument is passed, the original data is used. When na.rm == TRUE, all the rows with NA in the original regression are dropped. If data is passed (even if it's the same data), only the NAs in the particular part of the model.matrix are dropped (e.g. if type = "lhs" is used, then only NAs on the LHS are dropped). Alternatively, I could update the regression object, but I decided against it since that would also cause confusion, e.g. "why are rows in y being dropped, I don't have any NAs there?"
- You can pass a formula directly to sparse_model_matrix without even estimating for type %in% c("lhs", "rhs", "fixef"), but if you want na.rm = TRUE or collin.rm = TRUE, then the model needs to be estimated. This is useful as a "sparse matrix factory function".
- I didn't implement subset. I can, but the code seems complicated.
- I copied all the tests from model.matrix and added a few more pertaining to fixed effects.
- I didn't dive into getting poly to work when data is passed. I could think about adding a rough warning when "poly" appears in the formula and data is passed?
- There is the option combine which, when equal to FALSE, returns a named list with sparse matrices for each element in type (see the sketch under Examples below).
Examples
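The rendered reprex output was not captured in this thread; here is a minimal sketch of the combine = FALSE interface, with the call assumed from the notes above (illustrative, not a verified example):
library(fixest)
est = feols(mpg ~ disp | cyl, mtcars)
# combine = FALSE should return a named list with one sparse matrix per requested type
parts = fixest:::sparse_model_matrix(est, type = c("lhs", "rhs", "fixef"), combine = FALSE)
names(parts)  # expected (per the notes): "lhs" "rhs" "fixef"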
Created on 2023-05-17 with reprex v2.0.2