diff --git a/NEWS.md b/NEWS.md index 2e2d33e..bd68e69 100644 --- a/NEWS.md +++ b/NEWS.md @@ -16,4 +16,12 @@ # ipd 0.1.4 -* +* Added a help topic for the package itself (`man/ipd-package.Rd`) via `R/ipd-package.R` and `roxygen2`. + +* Updated the documentation for `ipd()`: + + * Provided a more explicit description of the `model` argument, which specifies the downstream inferential model or parameter to be estimated. + + * Clarified that columns in the data are used only if explicitly referenced in the `formula` argument or in the `label` argument (when the data are passed as one stacked data frame). + +* Updated the documentation for `simdat()` to include a more thorough explanation of how to simulate data with this function. diff --git a/R/ipd-package.R b/R/ipd-package.R new file mode 100644 index 0000000..df2547e --- /dev/null +++ b/R/ipd-package.R @@ -0,0 +1,120 @@ +#' @keywords internal +"_PACKAGE" + +## usethis namespace: start +#' ipd: Inference on Predicted Data +#' +#' The `ipd` package provides tools for statistical modeling and inference when +#' a significant portion of the outcome data is predicted by AI/ML algorithms. +#' It implements several state-of-the-art methods for inference on predicted +#' data (IPD), offering a user-friendly interface to facilitate their use in +#' real-world applications. +#' +#' This package is particularly useful in scenarios where predicted values +#' (e.g., from machine learning models) are used as proxies for unobserved +#' outcomes, which can introduce biases in estimation and inference. The `ipd` +#' package integrates methods designed to address these challenges. +#' +#' @section Features: +#' - Multiple IPD methods: currently `PostPI`, `PPI`, `PPI++`, and `PSPA`. +#' - Flexible wrapper functions for ease of use. +#' - Custom methods for model inspection and evaluation. +#' - Seamless integration with common data structures in R. +#' - Comprehensive documentation and examples. +#' +#' @section Key Functions: +#' - \code{\link{ipd}}: Main wrapper function which implements various methods for inference on predicted data for a specified model/outcome type (e.g., mean estimation, linear regression). +#' - \code{\link{simdat}}: Simulates data for demonstrating the use of the various IPD methods. +#' - \code{\link{print.ipd}}: Prints a brief summary of the IPD method/model combination. +#' - \code{\link{summary.ipd}}: Summarizes the results of fitted IPD models. +#' - \code{\link{tidy.ipd}}: Tidies the IPD method/model fit into a data frame. +#' - \code{\link{glance.ipd}}: Glances at the IPD method/model fit, returning a one-row summary. +#' - \code{\link{augment.ipd}}: Augments the data used for an IPD method/model fit with additional information about each observation. +#' +#' @section Documentation: +#' The package includes detailed documentation for each function, including +#' usage examples. A vignette is also provided to guide users through common +#' workflows and applications of the package. +#' +#' @section References: +#' For details on the statistical methods implemented in this package, please +#' refer to the following manuscripts: +#' - \strong{PostPI}: Wang, S., McCormick, T. H., & Leek, J. T. (2020). Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences, 117(48), 30266-30275. +#' - \strong{PPI}: Angelopoulos, A. N., Bates, S., Fannjiang, C., Jordan, M. I., & Zrnic, T. (2023). Prediction-powered inference.
Science, 382(6671), 669-674. +#' - \strong{PPI++}: Angelopoulos, A. N., Duchi, J. C., & Zrnic, T. (2023). PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453. +#' - \strong{PSPA}: Miao, J., Miao, X., Wu, Y., Zhao, J., & Lu, Q. (2023). Assumption-lean and data-adaptive post-prediction inference. arXiv preprint arXiv:2311.14220. +#' +#' @name ipd-package +#' +#' @keywords package +#' +#' @examples +#' #-- Generate Example Data +#' +#' set.seed(12345) +#' +#' dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1) +#' +#' head(dat) +#' +#' formula <- Y - f ~ X1 +#' +#' #-- PostPI Analytic Correction (Wang et al., 2020) +#' +#' fit_postpi1 <- ipd(formula, method = "postpi_analytic", model = "ols", +#' +#' data = dat, label = "set") +#' +#' #-- PostPI Bootstrap Correction (Wang et al., 2020) +#' +#' nboot <- 200 +#' +#' fit_postpi2 <- ipd(formula, method = "postpi_boot", model = "ols", +#' +#' data = dat, label = "set", nboot = nboot) +#' +#' #-- PPI (Angelopoulos et al., 2023) +#' +#' fit_ppi <- ipd(formula, method = "ppi", model = "ols", +#' +#' data = dat, label = "set") +#' +#' #-- PPI++ (Angelopoulos et al., 2023) +#' +#' fit_plusplus <- ipd(formula, method = "ppi_plusplus", model = "ols", +#' +#' data = dat, label = "set") +#' +#' #-- PSPA (Miao et al., 2023) +#' +#' fit_pspa <- ipd(formula, method = "pspa", model = "ols", +#' +#' data = dat, label = "set") +#' +#' #-- Print the Model +#' +#' print(fit_postpi1) +#' +#' #-- Summarize the Model +#' +#' summ_fit_postpi1 <- summary(fit_postpi1) +#' +#' #-- Print the Model Summary +#' +#' print(summ_fit_postpi1) +#' +#' #-- Tidy the Model Output +#' +#' tidy(fit_postpi1) +#' +#' #-- Get a One-Row Summary of the Model +#' +#' glance(fit_postpi1) +#' +#' #-- Augment the Original Data with Fitted Values and Residuals +#' +#' augmented_df <- augment(fit_postpi1) +#' +#' head(augmented_df) +## usethis namespace: end +NULL diff --git a/R/ipd.R b/R/ipd.R index 6748ac5..baa7c91 100644 --- a/R/ipd.R +++ b/R/ipd.R @@ -1,8 +1,8 @@ #=============================================================================== -# ipd WRAPPER FUNCTION +# WRAPPER FUNCTION #=============================================================================== -#--- MAIN ipd WRAPPER FUNCTION ------------------------------------------------- +#--- MAIN WRAPPER FUNCTION ----------------------------------------------------- #' Inference on Predicted Data (ipd) #' @@ -15,31 +15,38 @@ #' labeled data, \code{f} is the name of the column corresponding to the #' predicted outcome in both labeled and unlabeled data, and \code{X} #' corresponds to the features of interest (i.e., \code{X = X1 + ... + Xp}). +#' See \strong{1. Formula} in the \strong{Details} below for more information. #' -#' @param method The method to be used for fitting the model. Must be one of +#' @param method The IPD method to be used for fitting the model. Must be one of #' \code{"postpi_analytic"}, \code{"postpi_boot"}, \code{"ppi"}, -#' \code{"pspa"}, or \code{"ppi_plusplus"}. +#' \code{"ppi_plusplus"}, or \code{"pspa"}. +#' See \strong{3. Method} in the \strong{Details} below for more information. #' -#' @param model The type of model to be fitted. Must be one of \code{"mean"}, -#' \code{"quantile"}, \code{"ols"}, or \code{"logistic"}. +#' @param model The type of downstream inferential model to be fitted, or the +#' parameter being estimated. Must be one of \code{"mean"}, +#' \code{"quantile"}, \code{"ols"}, \code{"logistic"}, or \code{"poisson"}. +#' See \strong{4. 
Model} in the \strong{Details} below for more information. #' #' @param data A \code{data.frame} containing the variables in the model, #' either a stacked data frame with a specific column identifying the labeled #' versus unlabeled observations (\code{label}), or only the labeled data #' set. Must contain columns for the observed outcomes (\code{Y}), the #' predicted outcomes (\code{f}), and the features (\code{X}) needed to specify -#' the \code{formula}. +#' the \code{formula}. See \strong{2. Data} in the \strong{Details} below for +#' more information. #' #' @param label A \code{string}, \code{int}, or \code{logical} specifying the #' column in the data that distinguishes between the labeled and unlabeled #' observations. See the \code{Details} section for more information. If NULL, -#' \code{unlabeled_data} must be specified. +#' \code{unlabeled_data} must be specified; see \strong{2. Data} in the +#' \strong{Details} below. #' #' @param unlabeled_data (optional) A \code{data.frame} of unlabeled data. If #' NULL, \code{label} must be specified. Specifying both the \code{label} and #' \code{unlabeled_data} arguments will result in an error message. If #' specified, must contain columns for the predicted outcomes (\code{f}), and -#' the features (\code{X}) needed to specify the \code{formula}. +#' the features (\code{X}) needed to specify the \code{formula}. See +#' \strong{2. Data} in the \strong{Details} below for more information. #' #' @param seed (optional) An \code{integer} seed for random number generation. #' @@ -53,15 +60,18 @@ #' one of \code{"two-sided"}, \code{"less"}, or \code{"greater"}. #' #' @param n_t (integer, optional) Size of the dataset used to train the -#' prediction function (necessary for the \code{"postpi"} methods if \code{n_t} < -#' \code{nrow(X_l)}. Defaults to \code{Inf}. +#' prediction function (necessary for the \code{"postpi_analytic"} and +#' \code{"postpi_boot"} methods if \code{n_t} < \code{nrow(X_l)}). +#' Defaults to \code{Inf}. #' #' @param na_action (string, optional) How missing covariate data should be #' handled. Currently \code{"na.fail"} and \code{"na.omit"} are accommodated. #' Defaults to \code{"na.fail"}. #' #' @param ... Additional arguments to be passed to the fitting function. See -#' the \code{Details} section for more information. +#' the \code{Details} section, specifically +#' \strong{5. Auxiliary Arguments} and \strong{6. Other Arguments}, for +#' more information. #' #' @returns a summary of model output. #' @@ -87,8 +97,10 @@ #' #' For option (1), provide one data argument (\code{data}) which contains a #' stacked \code{data.frame} with both the unlabeled and labeled data and a -#' \code{label} argument that specify the column that identifies the labeled -#' versus the unlabeled observations in the stacked \code{data.frame} +#' \code{label} argument that specifies the column identifying the labeled +#' versus the unlabeled observations in the stacked \code{data.frame} (e.g., +#' \code{label = "set"} if the column "set" in the stacked data denotes which +#' set an observation belongs to). #' #' NOTE: Labeled data identifiers can be: #' #' @@ -110,15 +122,20 @@ #' #' For option (2), provide separate data arguments for the labeled data set #' (\code{data}) and the unlabeled data set (\code{unlabeled_data}). If the -#' second argument is provided, the function ignores the label identifier and -#' assumes the data provided is stacked.
+#' second argument is provided, the function ignores the \code{label} identifier +#' and assumes the data provided are not stacked. +#' +#' NOTE: Columns in \code{data} or \code{unlabeled_data} are used only if +#' explicitly referenced in the \code{formula} argument or in the +#' \code{label} argument (if the data are passed as one stacked data frame). #' #' \strong{3. Method:} #' #' Use the \code{method} argument to specify the fitting method: #' #' \describe{ -#' \item{"postpi"}{Wang et al. (2020) Post-Prediction Inference (PostPI)} +#' \item{"postpi_analytic"}{Wang et al. (2020) Post-Prediction Inference (PostPI) Analytic Correction} +#' \item{"postpi_boot"}{Wang et al. (2020) Post-Prediction Inference (PostPI) Bootstrap Correction} #' \item{"ppi"}{Angelopoulos et al. (2023) Prediction-Powered Inference #' (PPI)} #' \item{"ppi_plusplus"}{Angelopoulos et al. (2023) PPI++} @@ -128,14 +145,15 @@ #' #' \strong{4. Model:} #' -#' Use the \code{model} argument to specify the type of model: +#' Use the \code{model} argument to specify the type of downstream inferential +#' model or parameter to be estimated: #' #' \describe{ -#' \item{"mean"}{Mean value of the outcome} -#' \item{"quantile"}{\code{q}th quantile of the outcome} -#' \item{"ols"}{Linear regression} -#' \item{"logistic"}{Logistic regression} -#' \item{"poisson"}{Poisson regression} +#' \item{"mean"}{Mean value of a continuous outcome} +#' \item{"quantile"}{\code{q}th quantile of a continuous outcome} +#' \item{"ols"}{Linear regression coefficients for a continuous outcome} +#' \item{"logistic"}{Logistic regression coefficients for a binary outcome} +#' \item{"poisson"}{Poisson regression coefficients for a count outcome} #' } #' #' The \code{ipd} wrapper function will concatenate the \code{method} and diff --git a/R/simdat.R b/R/simdat.R index b8497e0..cb81fab 100644 --- a/R/simdat.R +++ b/R/simdat.R @@ -15,8 +15,8 @@ #' @param sigma_Y Residual variance for the generated outcome. Defaults is 1. #' #' @param model The type of model to be generated. Must be one of -#' \code{"mean"}, \code{"quantile"}, \code{"ols"}, or \code{"logistic"}. -#' Default is \code{"ols"}. +#' \code{"mean"}, \code{"quantile"}, \code{"ols"}, \code{"logistic"}, or +#' \code{"poisson"}. Default is \code{"ols"}. #' #' @param shift Scalar shift of the predictions for continuous outcomes #' (i.e., "mean", "quantile", and "ols"). Defaults to 0. @@ -32,6 +32,168 @@ #' labeled, or unlabeled), and four independent, normally distributed #' predictors (X1, X2, X3, and X4), where applicable. #' +#' @details +#' +#' The `simdat` function generates three datasets consisting of independent +#' realizations of \eqn{Y} (for \code{model} = \code{"mean"} or +#' \code{"quantile"}), or \eqn{\{Y, \boldsymbol{X}\}} (for \code{model} = +#' \code{"ols"}, \code{"logistic"}, or \code{"poisson"}): a \emph{training} +#' dataset of size \eqn{n_t}, a \emph{labeled} dataset of size \eqn{n_l}, and +#' an \emph{unlabeled} dataset of size \eqn{n_u}. These sizes are specified by +#' the argument \code{n}. +#' +#' NOTE: In the \emph{unlabeled} data subset, outcome data are still generated +#' to provide a benchmark against an "oracle" model that uses +#' the true \eqn{Y^{\mathcal{U}}} values for estimation and inference. +#' +#' \strong{Generating Data} +#' +#' For \code{"mean"} and \code{"quantile"}, we simulate a continuous outcome, +#' \eqn{Y \in \mathbb{R}}, with mean given by the \code{effect} argument and +#' error variance given by the \code{sigma_Y} argument.
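+#' +#' As a minimal sketch of this setting (the seed, sample sizes, and object name below are arbitrary choices for illustration), the \code{"mean"} data can be generated and inspected as follows: +#' +#' \preformatted{ +#' set.seed(123) +#' +#' dat_mean <- simdat(n = c(100, 100, 100), effect = 1, sigma_Y = 1, model = "mean") +#' +#' table(dat_mean$set)  # counts for the training / labeled / unlabeled subsets +#' +#' head(dat_mean)  # observed (Y) and predicted (f) outcomes, plus the set indicator +#' }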
+#' +#' For \code{"ols"}, \code{"logistic"}, or \code{"poisson"} models, predictor +#' data, \eqn{\boldsymbol{X} \in \mathbb{R}^4} are simulated such that the +#' \eqn{i}th observation follows a standard multivariate normal distribution +#' with a zero mean vector and identity covariance matrix: +#' +#' \deqn{ +#' \boldsymbol{X_i} = (X_{i1}, X_{i2}, X_{i3}, X_{i4}) \sim \mathcal{N}_4(\boldsymbol{0}, \boldsymbol{I}). +#' } +#' +#' For \code{"ols"}, a continuous outcome \eqn{Y \in \mathbb{R}} is simulated +#' to depend on \eqn{X_1} through a linear term with the effect size specified +#' by the \code{effect} argument, while the other predictors, +#' \eqn{\boldsymbol{X} \setminus X_1}, have nonlinear effects: +#' +#' \deqn{ +#' Y_i = effect \times Z_{i1} + \frac{1}{2} Z_{i2}^2 + \frac{1}{3} Z_{i3}^3 + \frac{1}{4} Z_{i4}^2 + \varepsilon_y, +#' } +#' +#' and \eqn{\varepsilon_y \sim \mathcal{N}(0, sigma_y)}, where the +#' \code{sigma_y} argument specifies the error variance. +#' +#' For \code{"logistic"}, we simulate: +#' +#' \deqn{ +#' \Pr(Y_i = 1 \mid \boldsymbol{X}) = logit^{-1}(effect \times Z_{i1} + \frac{1}{2} Z_{i2}^2 + \frac{1}{3} Z_{i3}^3 + \frac{1}{4} Z_{i4}^2 + \varepsilon_y) +#' } +#' +#' and generate: +#' +#' \deqn{ +#' Y_i \sim Bern[1, \Pr(Y_i = 1 \mid \boldsymbol{X})] +#' } +#' +#' where \eqn{\varepsilon_y \sim \mathcal{N}(0, sigma\_y)}. +#' +#' For \code{"poisson"}, we simulate: +#' +#' \deqn{ +#' \lambda_Y = exp(effect \times Z_{i1} + \frac{1}{2} Z_{i2}^2 + \frac{1}{3} Z_{i3}^3 + \frac{1}{4} Z_{i4}^2 + \varepsilon_y) +#' } +#' +#' and generate: +#' +#' \deqn{ +#' Y_i \sim Poisson(\lambda_Y) +#' } +#' +#' \strong{Generating Predictions} +#' +#' To generate predicted outcomes for \code{"mean"} and \code{"quantile"}, we +#' simulate a continuous variable with mean given by the empirical mean of the +#' training data and error variance given by the \code{sigma_y} argument. +#' +#' For \code{"ols"}, we fit a generalized additive model (GAM) on the +#' simulated \emph{training} dataset and calculate predictions for the +#' \emph{labeled} and \emph{unlabeled} datasets as deterministic functions of +#' \eqn{\boldsymbol{X}}. Specifically, we fit the following GAM: +#' +#' \deqn{ +#' Y^{\mathcal{T}} = s_0 + s_1(X_1^{\mathcal{T}}) + s_2(X_2^{\mathcal{T}}) + +#' s_3(X_3^{\mathcal{T}}) + s_4(X_4^{\mathcal{T}}) + \varepsilon_p, +#' } +#' +#' where \eqn{\mathcal{T}} denotes the \emph{training} dataset, \eqn{s_0} is an +#' intercept term, and \eqn{s_1(\cdot)}, \eqn{s_2(\cdot)}, \eqn{s_3(\cdot)}, +#' and \eqn{s_4(\cdot)} are smoothing spline functions for \eqn{X_1}, \eqn{X_2}, +#' \eqn{X_3}, and \eqn{X_4}, respectively, with three target equivalent degrees +#' of freedom. Residual error is modeled as \eqn{\varepsilon_p}. +#' +#' Predictions for \emph{labeled} and \emph{unlabeled} datasets are calculated +#' as: +#' +#' \deqn{ +#' f(\boldsymbol{X}^{\mathcal{L}\cup\mathcal{U}}) = \hat{s}_0 + \hat{s}_1(X_1^{\mathcal{L}\cup\mathcal{U}}) + +#' \hat{s}_2(X_2^{\mathcal{L}\cup\mathcal{U}}) + \hat{s}_3(X_3^{\mathcal{L}\cup\mathcal{U}}) + +#' \hat{s}_4(X_4^{\mathcal{L}\cup\mathcal{U}}), +#' } +#' +#' where \eqn{\hat{s}_0, \hat{s}_1, \hat{s}_2, \hat{s}_3}, and \eqn{\hat{s}_4} +#' are estimates of \eqn{s_0, s_1, s_2, s_3}, and \eqn{s_4}, respectively. +#' +#' NOTE: For continuous outcomes, we provide optional arguments \code{shift} and +#' \code{scale} to further apply a location shift and scaling factor, +#' respectively, to the predicted outcomes. 
These default to \code{shift = 0} +#' and \code{scale = 1}, i.e., no location shift or scaling. +#' +#' For \code{"logistic"}, we train k-nearest neighbors (k-NN) classifiers on +#' the simulated \emph{training} dataset for values of \eqn{k} ranging from 1 +#' to 10. The optimal \eqn{k} is chosen via cross-validation, minimizing the +#' misclassification error on the validation folds. Predictions for the +#' \emph{labeled} and \emph{unlabeled} datasets are obtained by applying the +#' k-NN classifier with the optimal \eqn{k} to \eqn{\boldsymbol{X}}. +#' +#' Specifically, for each observation in the \emph{labeled} and \emph{unlabeled} +#' datasets: +#' +#' \deqn{ +#' \hat{Y} = \operatorname{argmax}_c \sum_{i \in \mathcal{N}_k} I(Y_i = c), +#' } +#' +#' where \eqn{\mathcal{N}_k} represents the set of \eqn{k} nearest neighbors in +#' the training dataset, \eqn{c} indexes the possible classes (0 or 1), and +#' \eqn{I(\cdot)} is an indicator function. +#' +#' For \code{"poisson"}, we fit a generalized linear model (GLM) with a log link +#' function to the simulated \emph{training} dataset. The model is of the form: +#' +#' \deqn{ +#' \log(\mu^{\mathcal{T}}) = \gamma_0 + \gamma_1 X_1^{\mathcal{T}} + \gamma_2 X_2^{\mathcal{T}} + +#' \gamma_3 X_3^{\mathcal{T}} + \gamma_4 X_4^{\mathcal{T}}, +#' } +#' +#' where \eqn{\mu^{\mathcal{T}}} is the expected count for the response variable +#' in the \emph{training} dataset, \eqn{\gamma_0} is the intercept, and +#' \eqn{\gamma_1}, \eqn{\gamma_2}, \eqn{\gamma_3}, and \eqn{\gamma_4} are the +#' regression coefficients for the predictors \eqn{X_1}, \eqn{X_2}, \eqn{X_3}, +#' and \eqn{X_4}, respectively. +#' +#' Predictions for the \emph{labeled} and \emph{unlabeled} datasets are +#' calculated as: +#' +#' \deqn{ +#' \hat{\mu}^{\mathcal{L} \cup \mathcal{U}} = \exp(\hat{\gamma}_0 + \hat{\gamma}_1 X_1^{\mathcal{L} \cup \mathcal{U}} + +#' \hat{\gamma}_2 X_2^{\mathcal{L} \cup \mathcal{U}} + \hat{\gamma}_3 X_3^{\mathcal{L} \cup \mathcal{U}} + +#' \hat{\gamma}_4 X_4^{\mathcal{L} \cup \mathcal{U}}), +#' } +#' +#' where \eqn{\hat{\gamma}_0}, \eqn{\hat{\gamma}_1}, \eqn{\hat{\gamma}_2}, \eqn{\hat{\gamma}_3}, +#' and \eqn{\hat{\gamma}_4} are the estimated coefficients. +#' #' @examples #' #' #-- Mean diff --git a/man/ipd-package.Rd b/man/ipd-package.Rd new file mode 100644 index 0000000..68e116a --- /dev/null +++ b/man/ipd-package.Rd @@ -0,0 +1,159 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/ipd-package.R +\docType{package} +\name{ipd-package} +\alias{ipd-package} +\title{ipd: Inference on Predicted Data} +\description{ +Performs valid statistical inference on predicted data (IPD) using recent methods, where for a subset of the data, the outcomes have been predicted by an algorithm. Provides a wrapper function with specified defaults for the type of model and method to be used for estimation and inference. Further provides methods for tidying and summarizing results. Salerno et al., (2024) \doi{10.48550/arXiv.2410.09665}. + +The \code{ipd} package provides tools for statistical modeling and inference when +a significant portion of the outcome data is predicted by AI/ML algorithms. +It implements several state-of-the-art methods for inference on predicted +data (IPD), offering a user-friendly interface to facilitate their use in +real-world applications. 
+} +\details{ +This package is particularly useful in scenarios where predicted values +(e.g., from machine learning models) are used as proxies for unobserved +outcomes, which can introduce biases in estimation and inference. The \code{ipd} +package integrates methods designed to address these challenges. +} +\section{Features}{ + +\itemize{ +\item Multiple IPD methods: currently \code{PostPI}, \code{PPI}, \verb{PPI++}, and \code{PSPA}. +\item Flexible wrapper functions for ease of use. +\item Custom methods for model inspection and evaluation. +\item Seamless integration with common data structures in R. +\item Comprehensive documentation and examples. +} +} + +\section{Key Functions}{ + +\itemize{ +\item \code{\link{ipd}}: Main wrapper function which implements various methods for inference on predicted data for a specified model/outcome type (e.g., mean estimation, linear regression). +\item \code{\link{simdat}}: Simulates data for demonstrating the use of the various IPD methods. +\item \code{\link{print.ipd}}: Prints a brief summary of the IPD method/model combination. +\item \code{\link{summary.ipd}}: Summarizes the results of fitted IPD models. +\item \code{\link{tidy.ipd}}: Tidies the IPD method/model fit into a data frame. +\item \code{\link{glance.ipd}}: Glances at the IPD method/model fit, returning a one-row summary. +\item \code{\link{augment.ipd}}: Augments the data used for an IPD method/model fit with additional information about each observation. +} +} + +\section{Documentation}{ + +The package includes detailed documentation for each function, including +usage examples. A vignette is also provided to guide users through common +workflows and applications of the package. +} + +\section{References}{ + +For details on the statistical methods implemented in this package, please +refer to the following manuscripts: +\itemize{ +\item \strong{PostPI}: Wang, S., McCormick, T. H., & Leek, J. T. (2020). Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences, 117(48), 30266-30275. +\item \strong{PPI}: Angelopoulos, A. N., Bates, S., Fannjiang, C., Jordan, M. I., & Zrnic, T. (2023). Prediction-powered inference. Science, 382(6671), 669-674. +\item \strong{PPI++}: Angelopoulos, A. N., Duchi, J. C., & Zrnic, T. (2023). PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453. +\item \strong{PSPA}: Miao, J., Miao, X., Wu, Y., Zhao, J., & Lu, Q. (2023). Assumption-lean and data-adaptive post-prediction inference. arXiv preprint arXiv:2311.14220.
+} +} + +\examples{ +#-- Generate Example Data + +set.seed(12345) + +dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1) + +head(dat) + +formula <- Y - f ~ X1 + +#-- PostPI Analytic Correction (Wang et al., 2020) + +fit_postpi1 <- ipd(formula, method = "postpi_analytic", model = "ols", + + data = dat, label = "set") + +#-- PostPI Bootstrap Correction (Wang et al., 2020) + +nboot <- 200 + +fit_postpi2 <- ipd(formula, method = "postpi_boot", model = "ols", + + data = dat, label = "set", nboot = nboot) + +#-- PPI (Angelopoulos et al., 2023) + +fit_ppi <- ipd(formula, method = "ppi", model = "ols", + + data = dat, label = "set") + +#-- PPI++ (Angelopoulos et al., 2023) + +fit_plusplus <- ipd(formula, method = "ppi_plusplus", model = "ols", + + data = dat, label = "set") + +#-- PSPA (Miao et al., 2023) + +fit_pspa <- ipd(formula, method = "pspa", model = "ols", + + data = dat, label = "set") + +#-- Print the Model + +print(fit_postpi1) + +#-- Summarize the Model + +summ_fit_postpi1 <- summary(fit_postpi1) + +#-- Print the Model Summary + +print(summ_fit_postpi1) + +#-- Tidy the Model Output + +tidy(fit_postpi1) + +#-- Get a One-Row Summary of the Model + +glance(fit_postpi1) + +#-- Augment the Original Data with Fitted Values and Residuals + +augmented_df <- augment(fit_postpi1) + +head(augmented_df) +} +\seealso{ +Useful links: +\itemize{ + \item \url{https://github.com/ipd-tools/ipd} + \item \url{https://ipd-tools.github.io/ipd/} + \item Report bugs at \url{https://github.com/ipd-tools/ipd/issues} +} + +} +\author{ +\strong{Maintainer}: Stephen Salerno \email{ssalerno@fredhutch.org} (\href{https://orcid.org/0000-0003-2763-0494}{ORCID}) [copyright holder] + +Authors: +\itemize{ + \item Jiacheng Miao \email{jmiao24@wisc.edu} + \item Awan Afiaz \email{aafiaz@uw.edu} + \item Kentaro Hoffman \email{khoffm3@uw.edu} + \item Anna Neufeld \email{acn2@williams.edu} + \item Qiongshi Lu \email{qlu@biostat.wisc.edu} + \item Tyler H McCormick \email{tylermc@uw.edu} + \item Jeffrey T Leek \email{jtleek@fredhutch.org} +} + +} +\keyword{internal} +\keyword{package} diff --git a/man/ipd.Rd b/man/ipd.Rd index 7bb6a51..900be74 100644 --- a/man/ipd.Rd +++ b/man/ipd.Rd @@ -26,32 +26,39 @@ the model to be fitted. Must be of the form \code{Y - f ~ X}, where \code{Y} is the name of the column corresponding to the observed outcome in the labeled data, \code{f} is the name of the column corresponding to the predicted outcome in both labeled and unlabeled data, and \code{X} -corresponds to the features of interest (i.e., \code{X = X1 + ... + Xp}).} +corresponds to the features of interest (i.e., \code{X = X1 + ... + Xp}). +See \strong{1. Formula} in the \strong{Details} below for more information.} -\item{method}{The method to be used for fitting the model. Must be one of +\item{method}{The IPD method to be used for fitting the model. Must be one of \code{"postpi_analytic"}, \code{"postpi_boot"}, \code{"ppi"}, -\code{"pspa"}, or \code{"ppi_plusplus"}.} +\code{"ppi_plusplus"}, or \code{"pspa"}. +See \strong{3. Method} in the \strong{Details} below for more information.} -\item{model}{The type of model to be fitted. Must be one of \code{"mean"}, -\code{"quantile"}, \code{"ols"}, or \code{"logistic"}.} +\item{model}{The type of downstream inferential model to be fitted, or the +parameter being estimated. Must be one of \code{"mean"}, +\code{"quantile"}, \code{"ols"}, \code{"logistic"}, or \code{"poisson"}. +See \strong{4. 
Model} in the \strong{Details} below for more information.} \item{data}{A \code{data.frame} containing the variables in the model, either a stacked data frame with a specific column identifying the labeled versus unlabeled observations (\code{label}), or only the labeled data set. Must contain columns for the observed outcomes (\code{Y}), the predicted outcomes (\code{f}), and the features (\code{X}) needed to specify -the \code{formula}.} +the \code{formula}. See \strong{2. Data} in the \strong{Details} below for +more information.} \item{label}{A \code{string}, \code{int}, or \code{logical} specifying the column in the data that distinguishes between the labeled and unlabeled observations. See the \code{Details} section for more information. If NULL, -\code{unlabeled_data} must be specified.} +\code{unlabeled_data} must be specified; see \strong{2. Data} in the +\strong{Details} below.} \item{unlabeled_data}{(optional) A \code{data.frame} of unlabeled data. If NULL, \code{label} must be specified. Specifying both the \code{label} and \code{unlabeled_data} arguments will result in an error message. If specified, must contain columns for the predicted outcomes (\code{f}), and -the features (\code{X}) needed to specify the \code{formula}.} +the features (\code{X}) needed to specify the \code{formula}. See +\strong{2. Data} in the \strong{Details} below for more information.} \item{seed}{(optional) An \code{integer} seed for random number generation.} @@ -65,15 +72,18 @@ model? Default is \code{TRUE}.} one of \code{"two-sided"}, \code{"less"}, or \code{"greater"}.} \item{n_t}{(integer, optional) Size of the dataset used to train the -prediction function (necessary for the \code{"postpi"} methods if \code{n_t} < -\code{nrow(X_l)}. Defaults to \code{Inf}.} +prediction function (necessary for the \code{"postpi_analytic"} and +\code{"postpi_boot"} methods if \code{n_t} < \code{nrow(X_l)}). +Defaults to \code{Inf}.} \item{na_action}{(string, optional) How missing covariate data should be handled. Currently \code{"na.fail"} and \code{"na.omit"} are accommodated. Defaults to \code{"na.fail"}.} \item{...}{Additional arguments to be passed to the fitting function. See -the \code{Details} section for more information.} +the \code{Details} section, specifically +\strong{5. Auxiliary Arguments} and \strong{6. Other Arguments}, for +more information.} } \value{ a summary of model output. } @@ -120,8 +130,10 @@ for the unlabeled data (\code{unlabeled_data}). For option (1), provide one data argument (\code{data}) which contains a stacked \code{data.frame} with both the unlabeled and labeled data and a -\code{label} argument that specify the column that identifies the labeled -versus the unlabeled observations in the stacked \code{data.frame} +\code{label} argument that specifies the column identifying the labeled +versus the unlabeled observations in the stacked \code{data.frame} (e.g., +\code{label = "set"} if the column "set" in the stacked data denotes which +set an observation belongs to). NOTE: Labeled data identifiers can be: @@ -143,15 +155,20 @@ Unlabeled data identifiers can be: For option (2), provide separate data arguments for the labeled data set (\code{data}) and the unlabeled data set (\code{unlabeled_data}). If the -second argument is provided, the function ignores the label identifier and -assumes the data provided is stacked.
+second argument is provided, the function ignores the \code{label} identifier +and assumes the data provided are not stacked. + +NOTE: Columns in \code{data} or \code{unlabeled_data} are used only if +explicitly referenced in the \code{formula} argument or in the +\code{label} argument (if the data are passed as one stacked data frame). \strong{3. Method:} Use the \code{method} argument to specify the fitting method: \describe{ -\item{"postpi"}{Wang et al. (2020) Post-Prediction Inference (PostPI)} +\item{"postpi_analytic"}{Wang et al. (2020) Post-Prediction Inference (PostPI) Analytic Correction} +\item{"postpi_boot"}{Wang et al. (2020) Post-Prediction Inference (PostPI) Bootstrap Correction} \item{"ppi"}{Angelopoulos et al. (2023) Prediction-Powered Inference (PPI)} \item{"ppi_plusplus"}{Angelopoulos et al. (2023) PPI++} @@ -161,14 +178,15 @@ Post-Prediction Inference (PSPA)} \strong{4. Model:} -Use the \code{model} argument to specify the type of model: +Use the \code{model} argument to specify the type of downstream inferential +model or parameter to be estimated: \describe{ -\item{"mean"}{Mean value of the outcome} -\item{"quantile"}{\code{q}th quantile of the outcome} -\item{"ols"}{Linear regression} -\item{"logistic"}{Logistic regression} -\item{"poisson"}{Poisson regression} +\item{"mean"}{Mean value of a continuous outcome} +\item{"quantile"}{\code{q}th quantile of a continuous outcome} +\item{"ols"}{Linear regression coefficients for a continuous outcome} +\item{"logistic"}{Logistic regression coefficients for a binary outcome} +\item{"poisson"}{Poisson regression coefficients for a count outcome} } The \code{ipd} wrapper function will concatenate the \code{method} and diff --git a/man/simdat.Rd b/man/simdat.Rd index 11b92d4..f909757 100644 --- a/man/simdat.Rd +++ b/man/simdat.Rd @@ -23,8 +23,8 @@ inference. Defaults is 1.} \item{sigma_Y}{Residual variance for the generated outcome. Defaults is 1.} \item{model}{The type of model to be generated. Must be one of -\code{"mean"}, \code{"quantile"}, \code{"ols"}, or \code{"logistic"}. -Default is \code{"ols"}.} +\code{"mean"}, \code{"quantile"}, \code{"ols"}, \code{"logistic"}, or +\code{"poisson"}. Default is \code{"ols"}.} \item{shift}{Scalar shift of the predictions for continuous outcomes (i.e., "mean", "quantile", and "ols"). Defaults to 0.} @@ -42,6 +42,167 @@ predictors (X1, X2, X3, and X4), where applicable. \description{ Data generation function for various underlying models } +\details{ +The \code{simdat} function generates three datasets consisting of independent +realizations of \eqn{Y} (for \code{model} = \code{"mean"} or +\code{"quantile"}), or \eqn{\{Y, \boldsymbol{X}\}} (for \code{model} = +\code{"ols"}, \code{"logistic"}, or \code{"poisson"}): a \emph{training} +dataset of size \eqn{n_t}, a \emph{labeled} dataset of size \eqn{n_l}, and +an \emph{unlabeled} dataset of size \eqn{n_u}. These sizes are specified by +the argument \code{n}. + +NOTE: In the \emph{unlabeled} data subset, outcome data are still generated +to provide a benchmark against an "oracle" model that uses +the true \eqn{Y^{\mathcal{U}}} values for estimation and inference. + +\strong{Generating Data} + +For \code{"mean"} and \code{"quantile"}, we simulate a continuous outcome, +\eqn{Y \in \mathbb{R}}, with mean given by the \code{effect} argument and +error variance given by the \code{sigma_Y} argument.
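+ +As a minimal sketch of this setting (the seed, sample sizes, and object name below are arbitrary choices for illustration), the \code{"mean"} data can be generated and inspected as follows: + +\preformatted{ +set.seed(123) + +dat_mean <- simdat(n = c(100, 100, 100), effect = 1, sigma_Y = 1, model = "mean") + +table(dat_mean$set)  # counts for the training / labeled / unlabeled subsets + +head(dat_mean)  # observed (Y) and predicted (f) outcomes, plus the set indicator +}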
+ +For \code{"ols"}, \code{"logistic"}, or \code{"poisson"} models, predictor +data, \eqn{\boldsymbol{X} \in \mathbb{R}^4} are simulated such that the +\eqn{i}th observation follows a standard multivariate normal distribution +with a zero mean vector and identity covariance matrix: + +\deqn{ + \boldsymbol{X_i} = (X_{i1}, X_{i2}, X_{i3}, X_{i4}) \sim \mathcal{N}_4(\boldsymbol{0}, \boldsymbol{I}). +} + +For \code{"ols"}, a continuous outcome \eqn{Y \in \mathbb{R}} is simulated +to depend on \eqn{X_1} through a linear term with the effect size specified +by the \code{effect} argument, while the other predictors, +\eqn{\boldsymbol{X} \setminus X_1}, have nonlinear effects: + +\deqn{ + Y_i = effect \times Z_{i1} + \frac{1}{2} Z_{i2}^2 + \frac{1}{3} Z_{i3}^3 + \frac{1}{4} Z_{i4}^2 + \varepsilon_y, +} + +and \eqn{\varepsilon_y \sim \mathcal{N}(0, sigma_y)}, where the +\code{sigma_y} argument specifies the error variance. + +For \code{"logistic"}, we simulate: + +\deqn{ + \Pr(Y_i = 1 \mid \boldsymbol{X}) = logit^{-1}(effect \times Z_{i1} + \frac{1}{2} Z_{i2}^2 + \frac{1}{3} Z_{i3}^3 + \frac{1}{4} Z_{i4}^2 + \varepsilon_y) +} + +and generate: + +\deqn{ + Y_i \sim Bern[1, \Pr(Y_i = 1 \mid \boldsymbol{X})] +} + +where \eqn{\varepsilon_y \sim \mathcal{N}(0, sigma\_y)}. + +For \code{"poisson"}, we simulate: + +\deqn{ + \lambda_Y = exp(effect \times Z_{i1} + \frac{1}{2} Z_{i2}^2 + \frac{1}{3} Z_{i3}^3 + \frac{1}{4} Z_{i4}^2 + \varepsilon_y) +} + +and generate: + +\deqn{ + Y_i \sim Poisson(\lambda_Y) +} + +\strong{Generating Predictions} + +To generate predicted outcomes for \code{"mean"} and \code{"quantile"}, we +simulate a continuous variable with mean given by the empirical mean of the +training data and error variance given by the \code{sigma_y} argument. + +For \code{"ols"}, we fit a generalized additive model (GAM) on the +simulated \emph{training} dataset and calculate predictions for the +\emph{labeled} and \emph{unlabeled} datasets as deterministic functions of +\eqn{\boldsymbol{X}}. Specifically, we fit the following GAM: + +\deqn{ + Y^{\mathcal{T}} = s_0 + s_1(X_1^{\mathcal{T}}) + s_2(X_2^{\mathcal{T}}) + + s_3(X_3^{\mathcal{T}}) + s_4(X_4^{\mathcal{T}}) + \varepsilon_p, +} + +where \eqn{\mathcal{T}} denotes the \emph{training} dataset, \eqn{s_0} is an +intercept term, and \eqn{s_1(\cdot)}, \eqn{s_2(\cdot)}, \eqn{s_3(\cdot)}, +and \eqn{s_4(\cdot)} are smoothing spline functions for \eqn{X_1}, \eqn{X_2}, +\eqn{X_3}, and \eqn{X_4}, respectively, with three target equivalent degrees +of freedom. Residual error is modeled as \eqn{\varepsilon_p}. + +Predictions for \emph{labeled} and \emph{unlabeled} datasets are calculated +as: + +\deqn{ + f(\boldsymbol{X}^{\mathcal{L}\cup\mathcal{U}}) = \hat{s}_0 + \hat{s}_1(X_1^{\mathcal{L}\cup\mathcal{U}}) + +\hat{s}_2(X_2^{\mathcal{L}\cup\mathcal{U}}) + \hat{s}_3(X_3^{\mathcal{L}\cup\mathcal{U}}) + +\hat{s}_4(X_4^{\mathcal{L}\cup\mathcal{U}}), +} + +where \eqn{\hat{s}_0, \hat{s}_1, \hat{s}_2, \hat{s}_3}, and \eqn{\hat{s}_4} +are estimates of \eqn{s_0, s_1, s_2, s_3}, and \eqn{s_4}, respectively. + +NOTE: For continuous outcomes, we provide optional arguments \code{shift} and +\code{scale} to further apply a location shift and scaling factor, +respectively, to the predicted outcomes. These default to \code{shift = 0} +and \code{scale = 1}, i.e., no location shift or scaling. + +For \code{"logistic"}, we train k-nearest neighbors (k-NN) classifiers on +the simulated \emph{training} dataset for values of \eqn{k} ranging from 1 +to 10. 
The optimal \eqn{k} is chosen via cross-validation, minimizing the +misclassification error on the validation folds. Predictions for the +\emph{labeled} and \emph{unlabeled} datasets are obtained by applying the +k-NN classifier with the optimal \eqn{k} to \eqn{\boldsymbol{X}}. + +Specifically, for each observation in the \emph{labeled} and \emph{unlabeled} +datasets: + +\deqn{ + \hat{Y} = \operatorname{argmax}_c \sum_{i \in \mathcal{N}_k} I(Y_i = c), +} + +where \eqn{\mathcal{N}_k} represents the set of \eqn{k} nearest neighbors in +the training dataset, \eqn{c} indexes the possible classes (0 or 1), and +\eqn{I(\cdot)} is an indicator function. + +For \code{"poisson"}, we fit a generalized linear model (GLM) with a log link +function to the simulated \emph{training} dataset. The model is of the form: + +\deqn{ + \log(\mu^{\mathcal{T}}) = \gamma_0 + \gamma_1 X_1^{\mathcal{T}} + \gamma_2 X_2^{\mathcal{T}} + + \gamma_3 X_3^{\mathcal{T}} + \gamma_4 X_4^{\mathcal{T}}, +} + +where \eqn{\mu^{\mathcal{T}}} is the expected count for the response variable +in the \emph{training} dataset, \eqn{\gamma_0} is the intercept, and +\eqn{\gamma_1}, \eqn{\gamma_2}, \eqn{\gamma_3}, and \eqn{\gamma_4} are the +regression coefficients for the predictors \eqn{X_1}, \eqn{X_2}, \eqn{X_3}, +and \eqn{X_4}, respectively. + +Predictions for the \emph{labeled} and \emph{unlabeled} datasets are +calculated as: + +\deqn{ + \hat{\mu}^{\mathcal{L} \cup \mathcal{U}} = \exp(\hat{\gamma}_0 + \hat{\gamma}_1 X_1^{\mathcal{L} \cup \mathcal{U}} + + \hat{\gamma}_2 X_2^{\mathcal{L} \cup \mathcal{U}} + \hat{\gamma}_3 X_3^{\mathcal{L} \cup \mathcal{U}} + + \hat{\gamma}_4 X_4^{\mathcal{L} \cup \mathcal{U}}), +} + +where \eqn{\hat{\gamma}_0}, \eqn{\hat{\gamma}_1}, \eqn{\hat{\gamma}_2}, \eqn{\hat{\gamma}_3}, +and \eqn{\hat{\gamma}_4} are the estimated coefficients. +} \examples{ #-- Mean