stubble

stubble helps you generate simple synthetic datasets matching the format of a supplied data frame-like object (including base R data frames, tibbles, data.tables, and lists of vectors).

stubble replicates the column names and types of the original data, but the synthetic data are randomly generated. By default, these values are completely random and contain no information about the original data, but it is also possible to tell stubble to draw synthetic values from the empirical distributions of the original data.

The original intended use of stubble was to generate simple test data for analysis projects and R package development. It could also be used in teaching, methods research, study design, and any other context where realistic-but-fake data may be useful.

Installation

You can install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("bjcairns/stubble")

Example

Here is a very simple example of using stubble on the included penguins_ext dataset (derived from palmerpenguins).

library(stubble)

# Example with several data types
# (character, factor, double, integer, logical and Date)
p <- penguins_ext[, c(
  "id",
  "species",
  "bill_length_mm",
  "body_mass_g",
  "clutch_completion",
  "date_egg"
)]

head(p)
#>     id species bill_length_mm body_mass_g clutch_completion   date_egg
#> 1 N1A1  Adelie           39.1        3750              TRUE 2007-11-11
#> 2 N1A2  Adelie           39.5        3800              TRUE 2007-11-11
#> 3 N2A1  Adelie           40.3        3250              TRUE 2007-11-16
#> 4 N2A2  Adelie             NA          NA              TRUE 2007-11-16
#> 5 N3A1  Adelie           36.7        3450              TRUE 2007-11-16
#> 6 N3A2  Adelie           39.3        3650              TRUE 2007-11-16

By default, stubble maintains strict data protection: unless you tell it otherwise, it generates nonsense values that bear no relation to the original values beyond sharing the same vector type.

# stubblise to obtain a dataset with the same structure, but random data
p_stbl <- stubblise(p)
head(p_stbl)
#>         id species bill_length_mm body_mass_g clutch_completion   date_egg
#> 1 *<n b#Ac       b   68.226909981          42              TRUE 1972-06-25
#> 2   5zp3>j       a   93.392560451          56             FALSE 2000-09-26
#> 3      a:#       b    5.276353188          56              TRUE 1984-05-14
#> 4  |!9SQp5       b    3.560817493          62             FALSE 1998-05-22
#> 5 H}&}kY^W       c   12.401691695          56             FALSE 2002-07-18
#> 6        N       d   77.027949371          44              TRUE 1991-07-23
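
The structural match between original and synthetic data can be checked directly. The following is a minimal sketch in base R only; same_structure() is a hypothetical helper written for illustration and is not part of stubble, and the two small data frames are stand-ins for an original and a synthetic dataset.

```r
# Hypothetical helper (not part of stubble): TRUE when two data frames have
# the same column names, in the same order, and the same class per column.
same_structure <- function(original, synthetic) {
  identical(names(original), names(synthetic)) &&
    identical(lapply(original, class), lapply(synthetic, class))
}

# Stand-in data: same layout, unrelated values
a <- data.frame(x = 1:3, y = c("u", "v", "w"))
b <- data.frame(x = 4:6, y = c("d", "e", "f"))

same_structure(a, b)
```

With the objects created above, same_structure(p, p_stbl) would run the same comparison on the penguins subset and its stubblised counterpart.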

More advanced use is also possible, such as generating values from the empirical distributions of each variable.

# Use method = "empirical" to obtain data with marginal distributions similar
# to the original. The emp_p_exc and emp_n_exc control parameters are set to 0
# here because every value of the id variable has only a few observations.
p_stbl_emp <- stubblise(p, method = "empirical", emp_p_exc = 0, emp_n_exc = 0)
head(p_stbl_emp)
#>      id species bill_length_mm body_mass_g clutch_completion   date_egg
#> 1 N88A2  Gentoo    41.47851321        3393              TRUE 2008-02-11
#> 2  N1A2  Adelie    48.18967018        5247              TRUE 2008-07-07
#> 3 N28A1  Adelie    47.11367577        3592              TRUE 2007-12-12
#> 4 N51A1  Gentoo    42.78340799        3363              TRUE 2008-02-17
#> 5 N58A2  Adelie    45.53277529        4202              TRUE 2008-03-02
#> 6 N27A1  Adelie    49.82995351        3780              TRUE 2009-08-12
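
To illustrate what drawing from an empirical distribution means in general, here is a minimal base R sketch. It is not stubble's implementation, whose internals are not described in this README; it only shows the generic idea of mapping uniform random numbers through the empirical quantile function of an observed column. The observed values are the body_mass_g entries from the first rows of p shown earlier.

```r
set.seed(1)

# Observed values, including a missing one (from head(p) above)
observed <- c(3750, 3800, 3250, NA, 3450, 3650)

# Draw uniform probabilities and map them through the empirical
# quantile function, ignoring missing values
u <- runif(10)
synthetic <- quantile(observed, probs = u, na.rm = TRUE, names = FALSE)

# Synthetic values stay within the range of the observed data
range(synthetic)
range(observed, na.rm = TRUE)
```

Because each column is handled separately in this sketch, it reproduces marginal distributions only and carries no information about relationships between columns.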

See the “Using stubblise()” (vignette("using-stubblise")) and “stub() and ble()” (vignette("stub-and-ble")) vignettes for further examples of usage. (If the vignettes are not installed, you can repeat the command from the Installation section above with the additional argument build_vignettes = TRUE.)
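
For convenience, the install command with vignettes enabled would look like the following. This is simply the Installation command with the build_vignettes argument added; building vignettes may need extra packages such as knitr and rmarkdown, which is an assumption about your setup rather than something stated here.

```r
# install.packages("remotes")
remotes::install_github("bjcairns/stubble", build_vignettes = TRUE)

# Once installed, open a vignette by name
vignette("using-stubblise", package = "stubble")
```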

Known issues

See Issues on the stubble GitHub repository.
