Updating documentation based on helpful feedback from review

ipd-tools · Dec 20, 2024 · d4300a2
1 parent 827893c

Showing 7 changed files with 672 additions and 50 deletions.
diff --git a/NEWS.md b/NEWS.md
@@ -16,4 +16,12 @@
 
 # ipd 0.1.4
 
-* 
+* Added a help topic for the package itself (`man/ipd-package.Rd`) via `R/ipd-package.R` and `roxygen2`.
+
+* Updated the documentation for `ipd()`:
+
+  * Provided a more explicit description of the `model` argument, which is meant to specify the downstream inferential model or parameter to be estimated.
+
+  * Clarified that columns in `data` are used only if they are explicitly referenced in the `formula` argument or, when the data are passed as one stacked data frame, in the `label` argument.
+
+* Updated the documentation for `simdat()` to include a more thorough explanation of how to simulate data with this function. 
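
The note on column usage can be illustrated with a small base R sketch. It does not call `ipd()` and is not how the package resolves columns internally; the data frame `stacked`, its values, and the `"labeled"`/`"unlabeled"` entries in `set` are hypothetical and chosen only to mirror the names used in the package examples.

```r
# Hypothetical stacked data frame: labeled rows have an observed outcome Y,
# unlabeled rows only have the prediction f and the features.
stacked <- data.frame(
  set = rep(c("labeled", "unlabeled"), times = c(3, 4)),
  Y   = c(1.2, 0.7, 2.1, NA, NA, NA, NA),
  f   = c(1.0, 0.9, 1.8, 1.1, 0.4, 2.0, 1.5),
  X1  = c(0.3, -0.2, 1.1, 0.2, -0.9, 1.3, 0.6),
  X2  = c(5, 3, 8, 1, 7, 2, 4)
)

formula <- Y - f ~ X1
label   <- "set"

# Columns named in the formula, plus the label column
referenced <- c(all.vars(formula), label)

# Columns present in the data but never referenced (here, X2)
unused <- setdiff(names(stacked), referenced)

# The same data split into separate labeled and unlabeled data frames,
# keeping only the columns each set needs
labeled   <- stacked[stacked[[label]] == "labeled", all.vars(formula)]
unlabeled <- stacked[stacked[[label]] == "unlabeled", c("f", "X1")]
```

Here `X2` is carried along in the stacked data but is not referenced by the formula or the label, which is the situation the updated documentation describes.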
diff --git a/R/ipd-package.R b/R/ipd-package.R
@@ -0,0 +1,120 @@
+#' @keywords internal
+"_PACKAGE"
+
+## usethis namespace: start
+#' ipd: Inference on Predicted Data
+#'
+#' The `ipd` package provides tools for statistical modeling and inference when
+#' a significant portion of the outcome data is predicted by AI/ML algorithms.
+#' It implements several state-of-the-art methods for inference on predicted
+#' data (IPD), offering a user-friendly interface to facilitate their use in
+#' real-world applications.
+#'
+#' This package is particularly useful in scenarios where predicted values
+#' (e.g., from machine learning models) are used as proxies for unobserved
+#' outcomes, which can introduce biases in estimation and inference. The `ipd`
+#' package integrates methods designed to address these challenges.
+#'
+#' @section Features:
+#' - Multiple IPD methods: `PostPI`, `PPI`, `PPI++`, and `PSPA` currently.
+#' - Flexible wrapper functions for ease of use.
+#' - Custom methods for model inspection and evaluation.
+#' - Seamless integration with common data structures in R.
+#' - Comprehensive documentation and examples.
+#'
+#' @section Key Functions:
+#' - \code{\link{ipd}}: Main wrapper function which implements various methods for inference on predicted data for a specified model/outcome type (e.g., mean estimation, linear regression).
+#' - \code{\link{simdat}}: Simulates data for demonstrating the use of the various IPD methods.
+#' - \code{\link{print.ipd}}: Prints a brief summary of the IPD method/model combination.
+#' - \code{\link{summary.ipd}}: Summarizes the results of fitted IPD models.
+#' - \code{\link{tidy.ipd}}: Tidies the IPD method/model fit into a data frame.
+#' - \code{\link{glance.ipd}}: Glances at the IPD method/model fit, returning a one-row summary.
+#' - \code{\link{augment.ipd}}: Augments the data used for an IPD method/model fit with additional information about each observation.
+#'
+#' @section Documentation:
+#' The package includes detailed documentation for each function, including
+#' usage examples. A vignette is also provided to guide users through common
+#' workflows and applications of the package.
+#'
+#' @section References:
+#' For details on the statistical methods implemented in this package, please
+#' refer to the associated manuscripts at the following references:
+#' - \strong{PostPI}: Wang, S., McCormick, T. H., & Leek, J. T. (2020). Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences, 117(48), 30266-30275.
+#' - \strong{PPI}: Angelopoulos, A. N., Bates, S., Fannjiang, C., Jordan, M. I., & Zrnic, T. (2023). Prediction-powered inference. Science, 382(6671), 669-674.
+#' - \strong{PPI++}: Angelopoulos, A. N., Duchi, J. C., & Zrnic, T. (2023). PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453.
+#' - \strong{PSPA}: Miao, J., Miao, X., Wu, Y., Zhao, J., & Lu, Q. (2023). Assumption-lean and data-adaptive post-prediction inference. arXiv preprint arXiv:2311.14220.
+#'
+#' @name ipd-package
+#'
+#' @keywords package
+#'
+#' @examples
+#' #-- Generate Example Data
+#'
+#' set.seed(12345)
+#'
+#' dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)
+#'
+#' head(dat)
+#'
+#' formula <- Y - f ~ X1
+#'
+#' #-- PostPI Analytic Correction (Wang et al., 2020)
+#'
+#' fit_postpi1 <- ipd(formula, method = "postpi_analytic", model = "ols",
+#'
+#'     data = dat, label = "set")
+#'
+#' #-- PostPI Bootstrap Correction (Wang et al., 2020)
+#'
+#' nboot <- 200
+#'
+#' fit_postpi2 <- ipd(formula, method = "postpi_boot", model = "ols",
+#'
+#'     data = dat, label = "set", nboot = nboot)
+#'
+#' #-- PPI (Angelopoulos et al., 2023)
+#'
+#' fit_ppi <- ipd(formula, method = "ppi", model = "ols",
+#'
+#'     data = dat, label = "set")
+#'
+#' #-- PPI++ (Angelopoulos et al., 2023)
+#'
+#' fit_plusplus <- ipd(formula, method = "ppi_plusplus", model = "ols",
+#'
+#'     data = dat, label = "set")
+#'
+#' #-- PSPA (Miao et al., 2023)
+#'
+#' fit_pspa <- ipd(formula, method = "pspa", model = "ols",
+#'
+#'     data = dat, label = "set")
+#'
+#' #-- Print the Model
+#'
+#' print(fit_postpi1)
+#'
+#' #-- Summarize the Model
+#'
+#' summ_fit_postpi1 <- summary(fit_postpi1)
+#'
+#' #-- Print the Model Summary
+#'
+#' print(summ_fit_postpi1)
+#'
+#' #-- Tidy the Model Output
+#'
+#' tidy(fit_postpi1)
+#'
+#' #-- Get a One-Row Summary of the Model
+#'
+#' glance(fit_postpi1)
+#'
+#' #-- Augment the Original Data with Fitted Values and Residuals
+#'
+#' augmented_df <- augment(fit_postpi1)
+#'
+#' head(augmented_df)
+## usethis namespace: end
+NULL
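
The bias that the package description refers to can be seen in a toy base R sketch of the prediction-powered mean estimator from Angelopoulos et al. (2023). This is an illustration under assumed conditions, not the `ipd` implementation: the helper `make_data()`, the sample sizes, and the data-generating process (true mean 2, predictions shifted down by 0.5) are all made up for the example.

```r
set.seed(1)

n_lab   <- 200   # labeled rows: both Y and f are available
n_unlab <- 2000  # unlabeled rows: only f is available

# Hypothetical data-generating process: E[Y] = 2, while the prediction f
# is systematically too low by 0.5.
make_data <- function(n) {
  x <- rnorm(n)
  data.frame(
    Y = 2 + x + rnorm(n),
    f = 1.5 + x + rnorm(n, sd = 0.3)
  )
}

lab   <- make_data(n_lab)
unlab <- make_data(n_unlab)
unlab$Y <- NA  # outcomes are unobserved in the unlabeled set

# Naive estimate: treat predictions as if they were observed outcomes
naive <- mean(unlab$f)

# PPI estimate: add the average prediction error measured on labeled data
rectifier <- mean(lab$Y - lab$f)
ppi <- naive + rectifier

# Standard error combines the variability of both terms
se <- sqrt(var(unlab$f) / n_unlab + var(lab$Y - lab$f) / n_lab)
ci <- ppi + c(-1, 1) * qnorm(0.975) * se

print(c(naive = naive, ppi = ppi, lower = ci[1], upper = ci[2]))
```

In this setup the naive estimate inherits the shift in the predictions, while the corrected estimate is centered near the true mean at the cost of a wider interval driven by the smaller labeled sample.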
diff --git a/R/ipd.R b/R/ipd.R
@@ -1,8 +1,8 @@
 #===============================================================================
-# ipd WRAPPER FUNCTION
+# WRAPPER FUNCTION
 #===============================================================================
 
-#--- MAIN ipd WRAPPER FUNCTION -------------------------------------------------
+#--- MAIN WRAPPER FUNCTION -----------------------------------------------------
 
 #' Inference on Predicted Data (ipd)
 #'
@@ -15,31 +15,38 @@
 #' labeled data, \code{f} is the name of the column corresponding to the
 #' predicted outcome in both labeled and unlabeled data, and \code{X}
 #' corresponds to the features of interest (i.e., \code{X = X1 + ... + Xp}).
+#' See \strong{1. Formula} in the \strong{Details} below for more information.
 #'
-#' @param method The method to be used for fitting the model. Must be one of
+#' @param method The IPD method to be used for fitting the model. Must be one of
 #' \code{"postpi_analytic"}, \code{"postpi_boot"}, \code{"ppi"},
-#' \code{"pspa"}, or \code{"ppi_plusplus"}.
+#' \code{"ppi_plusplus"}, or \code{"pspa"}.
+#' See \strong{3. Method} in the \strong{Details} below for more information.
 #'
-#' @param model The type of model to be fitted. Must be one of \code{"mean"},
-#' \code{"quantile"}, \code{"ols"}, or \code{"logistic"}.
+#' @param model The type of downstream inferential model to be fitted, or the
+#' parameter being estimated. Must be one of \code{"mean"},
+#' \code{"quantile"}, \code{"ols"}, \code{"logistic"}, or \code{"poisson"}.
+#' See \strong{4. Model} in the \strong{Details} below for more information.
 #'
 #' @param data A \code{data.frame} containing the variables in the model,
 #' either a stacked data frame with a specific column identifying the labeled
 #' versus unlabeled observations (\code{label}), or only the labeled data
 #' set. Must contain columns for the observed outcomes (\code{Y}), the
 #' predicted outcomes (\code{f}), and the features (\code{X}) needed to specify
-#' the \code{formula}.
+#' the \code{formula}. See \strong{2. Data} in the \strong{Details} below for
+#' more information.
 #'
 #' @param label A \code{string}, \code{int}, or \code{logical} specifying the
 #' column in the data that distinguishes between the labeled and unlabeled
 #' observations. See the \code{Details} section for more information. If NULL,
-#' \code{unlabeled_data} must be specified.
+#' \code{unlabeled_data} must be specified. See \strong{2. Data} in the
+#' \strong{Details} below for more information.
 #'
 #' @param unlabeled_data (optional) A \code{data.frame} of unlabeled data. If
 #' NULL, \code{label} must be specified. Specifying both the \code{label} and
 #' \code{unlabeled_data} arguments will result in an error message. If
 #' specified, must contain columns for the predicted outcomes (\code{f}), and
-#' the features (\code{X}) needed to specify the \code{formula}.
+#' the features (\code{X}) needed to specify the \code{formula}. See
+#' \strong{2. Data} in the \strong{Details} below for more information.
 #'
 #' @param seed (optional) An \code{integer} seed for random number generation.
 #'
@@ -53,15 +60,18 @@
 #' one of \code{"two-sided"}, \code{"less"}, or \code{"greater"}.
 #'
 #' @param n_t (integer, optional) Size of the dataset used to train the
-#' prediction function (necessary for the \code{"postpi"} methods if \code{n_t} <
-#' \code{nrow(X_l)}. Defaults to \code{Inf}.
+#' prediction function (necessary for the \code{"postpi_analytic"} and
+#' \code{"postpi_boot"} methods if \code{n_t} < \code{nrow(X_l)}).
+#' Defaults to \code{Inf}.
 #'
 #' @param na_action (string, optional) How missing covariate data should be
 #' handled. Currently \code{"na.fail"} and \code{"na.omit"} are accommodated.
 #' Defaults to \code{"na.fail"}.
 #'
 #' @param ... Additional arguments to be passed to the fitting function. See
-#' the \code{Details} section for more information.
+#' \strong{5. Auxiliary Arguments} and
+#' \strong{6. Other Arguments} in the \strong{Details} below for more
+#' information.
 #'
 #' @returns a summary of model output.
 #'
@@ -87,8 +97,10 @@
 #'
 #' For option (1), provide one data argument (\code{data}) which contains a
 #' stacked \code{data.frame} with both the unlabeled and labeled data and a
-#' \code{label} argument that specify the column that identifies the labeled
-#' versus the unlabeled observations in the stacked \code{data.frame}
+#' \code{label} argument that specifies the column identifying the labeled
+#' versus the unlabeled observations in the stacked \code{data.frame} (e.g.,
+#' \code{label = "set"} if the column "set" in the stacked data denotes which
+#' set an observation belongs to).
 #'
 #' NOTE: Labeled data identifiers can be:
 #'
@@ -110,15 +122,20 @@
 #'
 #' For option (2), provide separate data arguments for the labeled data set
 #' (\code{data}) and the unlabeled data set (\code{unlabeled_data}). If the
-#' second argument is provided, the function ignores the label identifier and
-#' assumes the data provided is stacked.
+#' second argument is provided, the function ignores the \code{label} identifier
+#' and assumes the data provided are not stacked.
+#'
+#' NOTE: Columns in \code{data} or \code{unlabeled_data} are used only if
+#' they are explicitly referenced in the \code{formula} argument or in the
+#' \code{label} argument (if the data are passed as one stacked data frame).
 #'
 #' \strong{3. Method:}
 #'
 #' Use the \code{method} argument to specify the fitting method:
 #'
 #' \describe{
-#'    \item{"postpi"}{Wang et al. (2020) Post-Prediction Inference (PostPI)}
+#'    \item{"postpi_analytic"}{Wang et al. (2020) Post-Prediction Inference (PostPI) Analytic Correction}
+#'    \item{"postpi_boot"}{Wang et al. (2020) Post-Prediction Inference (PostPI) Bootstrap Correction}
 #'    \item{"ppi"}{Angelopoulos et al. (2023) Prediction-Powered Inference
 #'    (PPI)}
 #'    \item{"ppi_plusplus"}{Angelopoulos et al. (2023) PPI++}
@@ -128,14 +145,15 @@
 #'
 #' \strong{4. Model:}
 #'
-#' Use the \code{model} argument to specify the type of model:
+#' Use the \code{model} argument to specify the type of downstream inferential
+#' model or parameter to be estimated:
 #'
 #' \describe{
-#'    \item{"mean"}{Mean value of the outcome}
-#'    \item{"quantile"}{\code{q}th quantile of the outcome}
-#'    \item{"ols"}{Linear regression}
-#'    \item{"logistic"}{Logistic regression}
-#'    \item{"poisson"}{Poisson regression}
+#'    \item{"mean"}{Mean value of a continuous outcome}
+#'    \item{"quantile"}{\code{q}th quantile of a continuous outcome}
+#'    \item{"ols"}{Linear regression coefficients for a continuous outcome}
+#'    \item{"logistic"}{Logistic regression coefficients for a binary outcome}
+#'    \item{"poisson"}{Poisson regression coefficients for a count outcome}
 #' }
 #'
 #' The \code{ipd} wrapper function will concatenate the \code{method} and