Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arbitrary test statistics in calculate() #542

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

simonpcouch
Copy link
Collaborator

Seeing that Allen hadn't been able to use infer for his posit::conf(2024) keynote notes due to our not supporting arbitrary test statistics inspired me to revisit the idea on my flight back home. For better or worse, I'm too cheap to buy WiFi on flights, so didn't have a chance to read through #175 and its friend #196 and think twice about whether to do it.

Some of the challenges discussed there are still relevant—namely, whether a user can likely generate data that reflect null hypotheses for an arbitrary stat. Some other questions raised there, though (e.g. multiple explanatory variables), have been addressed as the package has matured.

This implementation works by allowing stat to take either one of the existing string values or a function that returns a scalar value. Grouping is handled for the user, as it is in the case of the string-valued options.

So, some discussion still required: has the package matured enough since that feature was first proposed to bring this in scope? Should we only allow arbitrary test statistics if there's an explanatory variable (or if the user has specified a point null)?

@echasnovski
Copy link
Collaborator

I confess I didn't pay much attention to the updates here. Mostly because {infer} seems to have stabilized and I trust your judgment :)

Yet I remember my thoughts on allowing any function to calculate (confirmed by rereading #175). It basically boils down to "doesn't play nicely with hypothesize() (both in teaching and practice) while there are other convenient methods to achieve this (from other tidyverse tools)".

@simonpcouch
Copy link
Collaborator Author

Yup, I think your argument about point null hypotheses from that linked issue still applies and I don't have a good answer there. In the case where point = "independence", I do think we're able to generate data according to the null appropriately—my remaining questions there are 1) is that actually true and 2) if that's the case, is it still a value-add to infer to allow arbitrary statistics just in that case.

@echasnovski @andrewpbray @mine-cetinkaya-rundel @ismayc and others: feedback very much welcome! I totally accept if we decide this isn't a good move, just wanted to revisit the discussion now that we know what an implementation could look like. :)

@echasnovski
Copy link
Collaborator

To me personally allowing to use a function input only for some hypothesis type would be more confusing than universally suggesting "use other established tidyverse approach".

That said, I don't think it is a strong opinion, but rather an application of a "do one thing and do it well" (which, of course, can have exceptions).

@ismayc
Copy link
Collaborator

ismayc commented Aug 28, 2024

Thanks for bringing this back up! I haven't been keeping an eye on things much either. How did Allen implement his example in Python/R? Could we think about how that could be implemented here for guidance?

I'd love for even some functionality to allow for general functions! We have plenty of warnings throughout the package already when users do things we wouldn't expect.

Great to hear from you both too!

@simonpcouch
Copy link
Collaborator Author

Allen is a Python user but gave a go at infer to put together an R companion to his talk notes. He told us he ultimately decided not to use infer because of the lack of support for arbitrary test statistics. I'll paste his implementation here!

Allen's source
# install.packages("dplyr")
# install.packages("data.table")
# install.packages("ggplot2")

library(dplyr)
library(data.table)
library(ggplot2)

# Download the file
url <- "https://github.com/AllenDowney/ThinkStats/raw/v3/data/brfss.csv"
destfile <- "brfss.csv"
if (!file.exists(destfile)) {
  # Download the file only if it does not exist
  download.file(url, destfile)
  message("File downloaded.")
} else {
  message("File already exists.")
}

# read the csv file as a data.table
brfss <- fread("brfss.csv")

# Describe the dataset
print(summary(brfss))

#' Resample a Data Table with Specified Weights
#'
#' Draws a bootstrap sample from a data table, using a specified
#' column of probability weights.
#'
#' @param dt A data table that includes the data to be resampled.
#' @param weight_col The name of the column in `dt` that contains the probability weights. 
#'        This should be a string.
#' @return A resampled data table with the same number of rows as the input.
#' @details The function samples rows from the input data table with replacement, 
#' using the specified `weight_col` as the probability weights.
#' @examples
#' # Example usage:
#' resampled_data <- resample(dt, "_LLCPWT")
#' @export
resample <- function(dt, weight_col) {
  n <- nrow(dt)
  sampled_indices <- sample(1:n, size = n, replace = TRUE, prob = dt[[weight_col]])
  return(dt[sampled_indices])
}

# generate one sample for testing
boot_sample <- resample(brfss, "_LLCPWT")

## Sampling distribution of difference in means

#' Compute Difference in Means by Group
#'
#' This function calculates the difference in mean values of a specified column
#' between groups in a data table, where the difference is computed as the negative
#' value of the difference between the means of the two groups.
#'
#' @param sample A data table containing the data. It should include a column for group identifiers
#'               (e.g., `_SEX`) and a column for the values to compute the mean of (e.g., `HTM4`).
#' @return A numeric value representing the negative difference in means between groups
#'         defined in the `_SEX` column.
#' @details The function groups the data by the `_SEX` column, computes the mean of the `HTM4` column
#'          for each group, and then calculates the difference between these means. The result is
#'          returned as a negative value.
#' @examples
#' # Example usage:
#' dt <- data.table(_SEX = c(1, 1, 2, 2), HTM4 = c(170, 180, 160, 165))
#' diff_means(dt)
#' @export
diff_means <- function(sample) {
  stats <- sample[, .(stat = mean(HTM4, na.rm = TRUE)), by = `_SEX`][order(`_SEX`)]
  return(-diff(stats$stat))
}

print(diff_means(boot_sample))


#' Generate Bootstrap Sampling Distribution
#'
#' This function generates a bootstrap sampling distribution by repeatedly drawing
#' bootstrap samples from a data table and computing a specified test statistic.
#'
#' @param dt A data table to be resampled.
#' @param test_stat_function A function that takes a data table as input and returns a test statistic.
#' @param iters The number of bootstrap iterations to perform. Default is 101.
#' @return A numeric vector containing the computed test statistics from each bootstrap sample.
#' @details The function uses the `resample` function to generate bootstrap samples from
#' the data table. For each bootstrap sample, it computes the test statistic using
#' `test_stat_function` and stores the result in a vector. The final output is a vector
#' of test statistics from each iteration.
#' @examples
#' # Example usage:
#' test_stat_function <- function(sample) { mean(sample$HTM4) }
#' results <- sampling_dist(dt, test_stat_function, iters = 100)
#' @export
sampling_dist <- function(dt, test_stat_function, iters = 101) {
  results <- numeric(iters)  # Initialize a numeric vector to store results
  
  for (i in 1:iters) {
    boot_sample <- resample(dt, "_LLCPWT")  # Generate a bootstrap sample
    results[i] <- test_stat_function(boot_sample)  # Compute and store the test statistic
  }
  
  return(results)
}


t1 <- sampling_dist(brfss, diff_means)

#' Plot Sampling Distribution with Kernel Density Estimation
#'
#' This function creates a kernel density plot of the sampling distribution and prints
#' the mean and 95% confidence interval (CI) of the distribution.
#'
#' @param t A numeric vector containing the values of the sampling distribution.
#' @return The function does not return a value. It prints the mean and 95% confidence interval
#'         of the distribution and displays a kernel density plot using ggplot2.
#' @details The function first constructs a data frame from the numeric vector `t`. It then uses
#'          ggplot2 to create and display a kernel density plot of the sampling distribution. 
#'          Additionally, it prints the mean and the 95% confidence interval of the values in `t`.
#' @examples
#' # Example usage:
#' t <- rnorm(1000)  # Generate a random sample for demonstration
#' plot_sampling_distribution(t)
#' @export
plot_sampling_distribution <- function(t) {
  df_plot <- data.frame(stat = t)
  
  plot <- ggplot(df_plot, aes(x = stat)) +
    geom_density(fill = "orange", alpha = 0.5) +
    labs(title = "Kernel Density Plot", x = "Value", y = "Density")
  
  print(mean(t))
  ci <- quantile(t, probs = c(0.025, 0.975))
  print(ci)
  print(plot)
}

plot_sampling_distribution(t1)

## Sampling distribution of difference in standard deviation

#' Compute Difference in Standard Deviations by Group
#'
#' This function calculates the difference in standard deviations of a specified column
#' between groups in a data table. The difference is computed as the negative value of
#' the difference between the standard deviations of the two groups.
#'
#' @param sample A data table containing the data. It should include a column for group identifiers
#'               (e.g., `_SEX`) and a column for the values to compute the standard deviation of
#'               (e.g., `HTM4`).
#' @return A numeric value representing the negative difference in standard deviations between groups
#'         defined in the `_SEX` column.
#' @details The function groups the data by the `_SEX` column, computes the standard deviation of the
#'          `HTM4` column for each group, and then calculates the difference between these standard deviations.
#'          The result is returned as a negative value.
#' @examples
#' # Example usage:
#' dt <- data.table(_SEX = c(1, 1, 2, 2), HTM4 = c(170, 180, 160, 165))
#' diff_sds(dt)
#' @export
diff_sds <- function(sample) {
  stats <- sample[, .(stat = sd(HTM4, na.rm = TRUE)), by = `_SEX`][order(`_SEX`)]
  return(-diff(stats$stat))
}

print(diff_sds(boot_sample))

t2 <- sampling_dist(brfss, diff_sds)

plot_sampling_distribution(t2)

## Sampling distribution of difference in coefficient of variation

#' Compute Difference in Coefficients of Variation by Group
#'
#' This function calculates the difference in coefficients of variation (CV) of a specified column
#' between groups in a data table. The coefficient of variation is computed as the ratio of the
#' standard deviation to the mean. The difference is returned as the negative value of the difference
#' between the coefficients of variation of the two groups.
#'
#' @param sample A data table containing the data. It should include a column for group identifiers
#'               (e.g., `_SEX`) and a column for the values to compute the coefficient of variation of
#'               (e.g., `HTM4`).
#' @return A numeric value representing the negative difference in coefficients of variation between groups
#'         defined in the `_SEX` column.
#' @details The function groups the data by the `_SEX` column, computes both the mean and standard deviation
#'          of the `HTM4` column for each group, calculates the coefficient of variation for each group, and
#'          then computes the difference between these coefficients of variation. The result is returned as a
#'          negative value.
#' @examples
#' # Example usage:
#' dt <- data.table(_SEX = c(1, 1, 2, 2), HTM4 = c(170, 180, 160, 165))
#' diff_cvs(dt)
#' @export
diff_cvs <- function(sample) {
  stats <- sample[, .(mean = mean(HTM4, na.rm = TRUE), 
                      sd = sd(HTM4, na.rm = TRUE)), 
                  by = `_SEX`][order(`_SEX`)]
  return(-diff(stats$sd / stats$mean))
}


print(diff_cvs(boot_sample))

t3 <- sampling_dist(brfss, diff_cvs)

plot_sampling_distribution(t3)

His keynote example would be possible with this PR—just a null = "independence" and a custom stat.

@ismayc
Copy link
Collaborator

ismayc commented Sep 9, 2024

This looks good to me. I'm in favor of merging this in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants