Updating documentation based on helpful feedback from review
salernos committed Dec 20, 2024
1 parent 827893c commit d4300a2
Showing 7 changed files with 672 additions and 50 deletions.
10 changes: 9 additions & 1 deletion NEWS.md
@@ -16,4 +16,12 @@

# ipd 0.1.4

*
* Added a help topic for the package itself (`man/ipd-package.Rd`) via `R/ipd-package.R` and `roxygen2`

* Updated the documentation for `ipd()`:

* Provided a more explicit description of the `model` argument, which is meant to specify the downstream inferential model or parameter to be estimated.

* Clarified that columns in `data` are used in prediction only if explicitly referenced in the `formula` argument, or in the `label` argument when the data are passed as one stacked data frame.

* Updated the documentation for `simdat()` to include a more thorough explanation of how to simulate data with this function.
120 changes: 120 additions & 0 deletions R/ipd-package.R
@@ -0,0 +1,120 @@
#' @keywords internal
"_PACKAGE"

## usethis namespace: start
#' ipd: Inference on Predicted Data
#'
#' The `ipd` package provides tools for statistical modeling and inference when
#' a significant portion of the outcome data is predicted by AI/ML algorithms.
#' It implements several state-of-the-art methods for inference on predicted
#' data (IPD), offering a user-friendly interface to facilitate their use in
#' real-world applications.
#'
#' This package is particularly useful in scenarios where predicted values
#' (e.g., from machine learning models) are used as proxies for unobserved
#' outcomes, which can introduce biases in estimation and inference. The `ipd`
#' package integrates methods designed to address these challenges.
#'
#' @section Features:
#' - Multiple IPD methods: `PostPI`, `PPI`, `PPI++`, and `PSPA` currently.
#' - Flexible wrapper functions for ease of use.
#' - Custom methods for model inspection and evaluation.
#' - Seamless integration with common data structures in R.
#' - Comprehensive documentation and examples.
#'
#' @section Key Functions:
#' - \code{\link{ipd}}: Main wrapper function which implements various methods for inference on predicted data for a specified model/outcome type (e.g., mean estimation, linear regression).
#' - \code{\link{simdat}}: Simulates data for demonstrating the use of the various IPD methods.
#' - \code{\link{print.ipd}}: Prints a brief summary of the IPD method/model combination.
#' - \code{\link{summary.ipd}}: Summarizes the results of fitted IPD models.
#' - \code{\link{tidy.ipd}}: Tidies the IPD method/model fit into a data frame.
#' - \code{\link{glance.ipd}}: Glances at the IPD method/model fit, returning a one-row summary.
#' - \code{\link{augment.ipd}}: Augments the data used for an IPD method/model fit with additional information about each observation.
#'
#' @section Documentation:
#' The package includes detailed documentation for each function, including
#' usage examples. A vignette is also provided to guide users through common
#' workflows and applications of the package.
#'
#' @section References:
#' For details on the statistical methods implemented in this package, please
#' refer to the associated manuscripts at the following references:
#' - \strong{PostPI}: Wang, S., McCormick, T. H., & Leek, J. T. (2020). Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences, 117(48), 30266-30275.
#' - \strong{PPI}: Angelopoulos, A. N., Bates, S., Fannjiang, C., Jordan, M. I., & Zrnic, T. (2023). Prediction-powered inference. Science, 382(6671), 669-674.
#' - \strong{PPI++}: Angelopoulos, A. N., Duchi, J. C., & Zrnic, T. (2023). PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453.
#' - \strong{PSPA}: Miao, J., Miao, X., Wu, Y., Zhao, J., & Lu, Q. (2023). Assumption-lean and data-adaptive post-prediction inference. arXiv preprint arXiv:2311.14220.
#'
#' @name ipd-package
#'
#' @keywords package
#'
#' @examples
#' #-- Generate Example Data
#'
#' set.seed(12345)
#'
#' dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)
#'
#' head(dat)
#'
#' formula <- Y - f ~ X1
#'
#' #-- PostPI Analytic Correction (Wang et al., 2020)
#'
#' fit_postpi1 <- ipd(formula, method = "postpi_analytic", model = "ols",
#'
#' data = dat, label = "set")
#'
#' #-- PostPI Bootstrap Correction (Wang et al., 2020)
#'
#' nboot <- 200
#'
#' fit_postpi2 <- ipd(formula, method = "postpi_boot", model = "ols",
#'
#' data = dat, label = "set", nboot = nboot)
#'
#' #-- PPI (Angelopoulos et al., 2023)
#'
#' fit_ppi <- ipd(formula, method = "ppi", model = "ols",
#'
#' data = dat, label = "set")
#'
#' #-- PPI++ (Angelopoulos et al., 2023)
#'
#' fit_plusplus <- ipd(formula, method = "ppi_plusplus", model = "ols",
#'
#' data = dat, label = "set")
#'
#' #-- PSPA (Miao et al., 2023)
#'
#' fit_pspa <- ipd(formula, method = "pspa", model = "ols",
#'
#' data = dat, label = "set")
#'
#' #-- Print the Model
#'
#' print(fit_postpi1)
#'
#' #-- Summarize the Model
#'
#' summ_fit_postpi1 <- summary(fit_postpi1)
#'
#' #-- Print the Model Summary
#'
#' print(summ_fit_postpi1)
#'
#' #-- Tidy the Model Output
#'
#' tidy(fit_postpi1)
#'
#' #-- Get a One-Row Summary of the Model
#'
#' glance(fit_postpi1)
#'
#' #-- Augment the Original Data with Fitted Values and Residuals
#'
#' augmented_df <- augment(fit_postpi1)
#'
#' head(augmented_df)
## usethis namespace: end
NULL
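The examples above pass one stacked data frame together with `label = "set"`. As a rough base-R sketch of that layout, the stacked form and the two-data-frame form (`data` plus `unlabeled_data`) carry the same information. The toy data frame, its values, and the `"labeled"`/`"unlabeled"` codes are assumptions made for illustration; they are not the output of `simdat()`.

```r
# Hypothetical toy data: the column names mirror the examples (Y, f, X1, set),
# but the values and the "labeled"/"unlabeled" codes are assumed.
set.seed(1)
n_l <- 5  # labeled rows: observed outcome Y and prediction f
n_u <- 8  # unlabeled rows: prediction f only

stacked <- data.frame(
  X1  = rnorm(n_l + n_u),
  set = rep(c("labeled", "unlabeled"), times = c(n_l, n_u))
)
stacked$f <- 1 + stacked$X1 + rnorm(n_l + n_u, sd = 0.5)

# Y is observed only for the labeled rows; it is missing elsewhere.
stacked$Y <- ifelse(stacked$set == "labeled",
                    1 + stacked$X1 + rnorm(n_l + n_u),
                    NA_real_)

# The same information as two separate data frames.
labeled   <- stacked[stacked$set == "labeled",   c("Y", "f", "X1")]
unlabeled <- stacked[stacked$set == "unlabeled", c("f", "X1")]
```

Only the columns named in the formula (`Y`, `f`, `X1`) and the label column (`set`) matter here, which is why the split drops nothing else of interest.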
64 changes: 41 additions & 23 deletions R/ipd.R
@@ -1,8 +1,8 @@
#===============================================================================
# ipd WRAPPER FUNCTION
# WRAPPER FUNCTION
#===============================================================================

#--- MAIN ipd WRAPPER FUNCTION -------------------------------------------------
#--- MAIN WRAPPER FUNCTION -----------------------------------------------------

#' Inference on Predicted Data (ipd)
#'
@@ -15,31 +15,38 @@
#' labeled data, \code{f} is the name of the column corresponding to the
#' predicted outcome in both labeled and unlabeled data, and \code{X}
#' corresponds to the features of interest (i.e., \code{X = X1 + ... + Xp}).
#' See \strong{1. Formula} in the \strong{Details} below for more information.
#'
#' @param method The method to be used for fitting the model. Must be one of
#' @param method The IPD method to be used for fitting the model. Must be one of
#' \code{"postpi_analytic"}, \code{"postpi_boot"}, \code{"ppi"},
#' \code{"pspa"}, or \code{"ppi_plusplus"}.
#' \code{"ppi_plusplus"}, or \code{"pspa"}.
#' See \strong{3. Method} in the \strong{Details} below for more information.
#'
#' @param model The type of model to be fitted. Must be one of \code{"mean"},
#' \code{"quantile"}, \code{"ols"}, or \code{"logistic"}.
#' @param model The type of downstream inferential model to be fitted, or the
#' parameter being estimated. Must be one of \code{"mean"},
#' \code{"quantile"}, \code{"ols"}, \code{"logistic"}, or \code{"poisson"}.
#' See \strong{4. Model} in the \strong{Details} below for more information.
#'
#' @param data A \code{data.frame} containing the variables in the model,
#' either a stacked data frame with a specific column identifying the labeled
#' versus unlabeled observations (\code{label}), or only the labeled data
#' set. Must contain columns for the observed outcomes (\code{Y}), the
#' predicted outcomes (\code{f}), and the features (\code{X}) needed to specify
#' the \code{formula}.
#' the \code{formula}. See \strong{2. Data} in the \strong{Details} below for
#' more information.
#'
#' @param label A \code{string}, \code{int}, or \code{logical} specifying the
#' column in the data that distinguishes between the labeled and unlabeled
#' observations. See the \code{Details} section for more information. If NULL,
#' \code{unlabeled_data} must be specified.
#' \code{unlabeled_data} must be specified. See \strong{2. Data} in the
#' \strong{Details} below for more information.
#'
#' @param unlabeled_data (optional) A \code{data.frame} of unlabeled data. If
#' NULL, \code{label} must be specified. Specifying both the \code{label} and
#' \code{unlabeled_data} arguments will result in an error message. If
#' specified, must contain columns for the predicted outcomes (\code{f}), and
#' the features (\code{X}) needed to specify the \code{formula}.
#' the features (\code{X}) needed to specify the \code{formula}. See
#' \strong{2. Data} in the \strong{Details} below for more information.
#'
#' @param seed (optional) An \code{integer} seed for random number generation.
#'
@@ -53,15 +60,18 @@
#' one of \code{"two-sided"}, \code{"less"}, or \code{"greater"}.
#'
#' @param n_t (integer, optional) Size of the dataset used to train the
#' prediction function (necessary for the \code{"postpi"} methods if \code{n_t} <
#' \code{nrow(X_l)}. Defaults to \code{Inf}.
#' prediction function (necessary for the \code{"postpi_analytic"} and
#' \code{"postpi_boot"} methods if \code{n_t} < \code{nrow(X_l)}).
#' Defaults to \code{Inf}.
#'
#' @param na_action (string, optional) How missing covariate data should be
#' handled. Currently \code{"na.fail"} and \code{"na.omit"} are accommodated.
#' Defaults to \code{"na.fail"}.
#'
#' @param ... Additional arguments to be passed to the fitting function. See
#' the \code{Details} section for more information.
#' the \code{Details} section for more information. See
#' \strong{5. Auxiliary Arguments} and \strong{6. Other Arguments} in the
#' \strong{Details} below for more information.
#'
#' @returns a summary of model output.
#'
@@ -87,8 +97,10 @@
#'
#' For option (1), provide one data argument (\code{data}) which contains a
#' stacked \code{data.frame} with both the unlabeled and labeled data and a
#' \code{label} argument that specify the column that identifies the labeled
#' versus the unlabeled observations in the stacked \code{data.frame}
#' \code{label} argument that specifies the column identifying the labeled
#' versus the unlabeled observations in the stacked \code{data.frame} (e.g.,
#' \code{label = "set"} if the column "set" in the stacked data denotes which
#' set an observation belongs to).
#'
#' NOTE: Labeled data identifiers can be:
#'
@@ -110,15 +122,20 @@
#'
#' For option (2), provide separate data arguments for the labeled data set
#' (\code{data}) and the unlabeled data set (\code{unlabeled_data}). If the
#' second argument is provided, the function ignores the label identifier and
#' assumes the data provided is stacked.
#' second argument is provided, the function ignores the \code{label} identifier
#' and assumes the data provided are not stacked.
#'
#' NOTE: Columns in \code{data} or \code{unlabeled_data} are used only if
#' they are explicitly referenced in the \code{formula} argument or in the
#' \code{label} argument (if the data are passed as one stacked data frame).
#'
#' \strong{3. Method:}
#'
#' Use the \code{method} argument to specify the fitting method:
#'
#' \describe{
#' \item{"postpi"}{Wang et al. (2020) Post-Prediction Inference (PostPI)}
#' \item{"postpi_analytic"}{Wang et al. (2020) Post-Prediction Inference (PostPI) Analytic Correction}
#' \item{"postpi_boot"}{Wang et al. (2020) Post-Prediction Inference (PostPI) Bootstrap Correction}
#' \item{"ppi"}{Angelopoulos et al. (2023) Prediction-Powered Inference
#' (PPI)}
#' \item{"ppi_plusplus"}{Angelopoulos et al. (2023) PPI++}
@@ -128,14 +145,15 @@
#'
#' \strong{4. Model:}
#'
#' Use the \code{model} argument to specify the type of model:
#' Use the \code{model} argument to specify the type of downstream inferential
#' model or parameter to be estimated:
#'
#' \describe{
#' \item{"mean"}{Mean value of the outcome}
#' \item{"quantile"}{\code{q}th quantile of the outcome}
#' \item{"ols"}{Linear regression}
#' \item{"logistic"}{Logistic regression}
#' \item{"poisson"}{Poisson regression}
#' \item{"mean"}{Mean value of a continuous outcome}
#' \item{"quantile"}{\code{q}th quantile of a continuous outcome}
#' \item{"ols"}{Linear regression coefficients for a continuous outcome}
#' \item{"logistic"}{Logistic regression coefficients for a binary outcome}
#' \item{"poisson"}{Poisson regression coefficients for a count outcome}
#' }
#'
#' The \code{ipd} wrapper function will concatenate the \code{method} and

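The updated `model` documentation describes each option as a downstream inferential model or parameter. For orientation, the sketch below computes the classical base-R counterpart of each option from fully observed outcomes. It is an illustration of the estimands only and makes no claim about how `ipd()` computes its estimates; the simulated variables (`x`, `y`, `yb`, `yc`) and the choice of the median for the quantile are assumptions.

```r
# Simulated data; every name and parameter value here is illustrative.
set.seed(2)
n  <- 200
x  <- rnorm(n)
y  <- 1 + 2 * x + rnorm(n)            # continuous outcome
yb <- rbinom(n, 1, plogis(0.5 * x))   # binary outcome
yc <- rpois(n, exp(0.2 + 0.3 * x))    # count outcome

# Classical analogue of each `model` option, fit on observed outcomes only.
classical <- list(
  mean     = mean(y),                                   # "mean"
  quantile = quantile(y, probs = 0.5),                  # "quantile" (median)
  ols      = coef(lm(y ~ x)),                           # "ols"
  logistic = coef(glm(yb ~ x, family = binomial())),    # "logistic"
  poisson  = coef(glm(yc ~ x, family = poisson()))      # "poisson"
)

str(classical)
```

The outcome types line up with the revised descriptions: continuous for `"mean"`, `"quantile"`, and `"ols"`, binary for `"logistic"`, and count for `"poisson"`.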