use grant's more general way to check for an intercept column · statdivlab/raoBust@67409a9

gthopkins · 2024-07-31T23:44:49Z

@svteichman the if statement identifies if a vector of 1s is in the column space, however we do not know that the vector is in the first column (so X <- X[, -1, drop = FALSE] will not always work). I think we should have an iterated solution: step 1 tries to remove any columns that are an intercept and step 2 stops the code with an error, if that does not resolve the issue:

Also, important note: I mistakenly set digits to 1e-20 in the previous code. It should be a natural number less than 16 (roughly the number of digits stored in double-precision numerical calculations). 12 seems appropriate to me.

# first remove any columns that represent an intercept
if (all(round(X %*% solve(t(X) %*% X) %*% t(X) %*% rep(1, nrow(X)), 12) == 1)) {
  X <- X[, rowMeans(apply(X, 1, function(v)(v == colMeans(X)))) != 1, drop = FALSE]
  # can instead use X <- X[, -1, drop = FALSE] if we only want to try to identify classical intercept (vector of 1s in the first column)
}

# confirm that there is no longer an intercept in the column space
if (all(round(X %*% solve(t(X) %*% X) %*% t(X) %*% rep(1, nrow(X)), 12) == 1)) {
  stop("An intercept is in the column space of X. This may happen if you have a
       column of 1s or a column for every level in a categorical variable, for
       example. Your design matrix cannot contain a vector of 1s in the column
       space when you use raoBust.")
}

For testing, I constructed the following two examples that elucidate why this two-stage system may be useful:

n <- 20

X1 <- matrix(data = c(rbinom(n = n, size = 12, prob = 0.7),
                      rep(0.3, n),
                      rep(0, n),
                      rnorm(n = n, mean = -5, sd = 2)),
             ncol = 4, byrow = FALSE)

X2 <- matrix(data = c(rep(1, n),
                      rbinom(n = n, size = 12, prob = 0.7),
                      c(rep(1, n / 2), rep(0, n / 2)),
                      c(rep(0, n / 2), rep(1, n / 2)),
                      rnorm(n = n, mean = -5, sd = 2)),
             ncol = 5, byrow = FALSE)

This is all probably excessive, but I figured why not make the code more general!

svteichman · 2024-07-31T23:48:02Z

good point, I need to make sure I remove the correct column!

I think that we don't necessarily need the two stage check but I do like the generality to check for different ways that an intercept can be encoded. I'll fix this in another PR.

-Original file line number
+Diff line change
@@ Expand Up / @@ -30,8 +30,8 @@ covariates in formula must be provided.") @@
         X <- model.matrix(formula, data)
       }
-      # if X has intercept column, remove it
-      if (sum(X[, 1] == 1) == nrow(X)) {
+      # if X has intercept column (or a column of repeating values), remove it
+      if (all(round((X %*% solve(t(X) %*% X) %*% t(X)) %*% rep(1, nrow(X)), 1e-20) == 1)) {
         X <- X[, -1, drop = FALSE]
       }
@@ Expand Down @@

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Commit

There are no files selected for viewing

2 comments on commit `67409a9`

gthopkins commented on `67409a9` Jul 31, 2024

svteichman commented on `67409a9` Jul 31, 2024

Commit

There are no files selected for viewing

2 comments on commit 67409a9

gthopkins commented on 67409a9 Jul 31, 2024

Choose a reason for hiding this comment

svteichman commented on 67409a9 Jul 31, 2024

Choose a reason for hiding this comment

2 comments on commit `67409a9`

gthopkins commented on `67409a9` Jul 31, 2024

svteichman commented on `67409a9` Jul 31, 2024