From be38de7002c6171c12d693af8585732da0daf7f6 Mon Sep 17 00:00:00 2001
From: ayushpatnaikgit
Date: Wed, 30 Nov 2022 15:46:16 +0530
Subject: [PATCH 01/18] Rearrange sections and other changes.
---
 docs/make.jl | 5 +-
 docs/src/R_comparison.md | 131 +++++++++++++++++++++++----------------
 docs/src/api.md | 32 ++++++++++
 docs/src/examples.md | 62 ------------------
 docs/src/index.md | 81 ++++++++++++++----------
 docs/src/performance.md | 74 ----------------------
 src/SurveyDesign.jl | 25 +++-----
 src/by.jl | 7 ++-
 src/mean.jl | 5 +-
 9 files changed, 175 insertions(+), 247 deletions(-)
 create mode 100644 docs/src/api.md
 delete mode 100644 docs/src/examples.md
 delete mode 100644 docs/src/performance.md
diff --git a/docs/make.jl b/docs/make.jl
index 961ed1e7..df10b794 100644
--- a/docs/make.jl
+++ b/docs/make.jl
@@ -15,9 +15,8 @@ makedocs(;
     ),
     pages=[
         "Home" => "index.md",
-        "Examples" => "examples.md",
-        "Comparison with R" => "R_comparison.md",
-        "Performance" => "performance.md",
+        "Moving from R" => "R_comparison.md",
+        "API reference" => "api.md"
     ],
     checkdocs=:exports,
 )
diff --git a/docs/src/R_comparison.md b/docs/src/R_comparison.md
index ba15935c..1021f42c 100644
--- a/docs/src/R_comparison.md
+++ b/docs/src/R_comparison.md
@@ -1,98 +1,121 @@
-# Comparison with R
+# Moving from R to Julia
+This sections presents examples to help move from R to Julia. Examples show R and Julia code for common operations in survey analysis.
+For the same operation, first the R and then the Julia code is presented.
-In the following examples, we'll compare Julia's performance to R's on the same set of operations.
+## Simple random sample
-## Installing and loading the package
-**R**
+The `apisrs` data, which is provided in both `survey` and `Survey`, is used as an example. It's a simple random sample of the Academic Performance Index of Californian schools.
-```r
-install.package("survey")
+### 1. Creating a design object
+The following example shows how to construct a design object for a simple random sample.
+
+```R
 library(survey)
+data(api)
+dsrs = svydesign(id = ~1, data = apisrs, weights = ~pw, fpc = ~fpc)
 ```
-**Julia**
 ```julia
-using Pkg
-Pkg.add(url = "https://github.com/xKDR/Survey.jl.git")
 using Survey
+srs = load_data("apisrs")
+dsrs = SimpleRandomSample(srs; popsize = :fpc)
 ```
-The following command in the Pkg REPL may also be used to install the package.
+### 2. Mean
+In the following example the mean of the variable `api00` is calculated.
+
+```R
+svymean(~api00, dsrs)
 ```
-add "https://github.com/xKDR/Survey.jl.git"
+```julia
+mean(:api00, dsrs)
 ```
-## API data
+### 3. Total
+In the following example the sum of the variable `api00` is calculated.
-The Academic Performance Index is computed for all California schools based on standardised testing of students. The [data sets](https://cran.r-project.org/web/packages/survey/survey.pdf) contain information for all schools with at least 100 students and for various probability samples of the data. apiclus1 is a cluster sample of school districts, apistrat is a sample stratified by stype.
+```R
+svytotal(~api00, dsrs)
+```
+```julia
+total(:api00, dsrs)
+```
-In the following examples, we'll use the apiclus1 data from the api dataset.
+### 4. Quantile
+In the following example the median of the variable `api00` is calculated.
+```R
+svyquantile(~api00, dsrs, 0.5)
+```
+```julia
+quantile(:api00, dsrs, 0.5)
+```
-The api dataset can be loaded using the following command:
+### 5. Domain estimation
+In the following example the mean of the variable `api00` is calculated grouped by the variable `cname`.
-**R**
-```r
-data(api)
+```R
+svyby(~api00, ~cname, dsrs, svymean)
 ```
-**Julia**
 ```julia
-apiclus1 = load_data("apiclus1")
+by(:api00, :cname, dsrs, mean)
 ```
-## svydesign
-[The ```svydesign``` object combines a data frame and all the survey design information needed to analyse it.](https://www.rdocumentation.org/packages/survey/versions/4.1-1/topics/svydesign)
+## Stratified sample
-A ```design``` object can be constructed with the following command:
+The `apistrat` data, which is provided in both `survey` and `Survey`, is used as an example. It's a stratified sample of the Academic Performance Index of Californian schools.
-**R**
-```r
-dclus1 <-svydesign(id = ~1, weights = ~pw, data = apiclus1, fpc = ~fpc)
+### 1. Creating a design object
+The following example shows how to construct a design object for a stratified sample.
+
+```R
+library(survey)
+data(api)
+dstrat = svydesign(id = ~1, data = apistrat, strata = ~stype, weights = ~pw, fpc = ~fpc)
 ```
-**Julia**
 ```julia
-dclus1 = design(id = :1, weights = :pw, data = apiclus1, fpc = :fpc)
+using Survey
+strat = load_data("apistrat")
+dstrat = StratifiedSample(strat, :stype; popsize = :fpc)
 ```
-## by
-The `by` function can be used to generate stratified estimates.
-
-### Mean
-Weighted mean of a variable by strata can be computed using the following command:
+### 2. Mean
+In the following example the mean of the variable `api00` is calculated.
-**R**
-```r
-svyby(~api00, by = ~cname, design = dclus1, svymean)
+```R
+svymean(~api00, dstrat)
 ```
-
-**Julia**
 ```julia
-by(:api00, :cname, dclus1, mean)
+mean(:api00, dstrat)
 ```
-### Sum
-Weighted sum of a variable by strata can be computed using the following command:
+### 3. Total
+In the following example the sum of the variable `api00` is calculated.
-**R**
-```r
-svyby(~api00, by = ~cname, design = dclus1, svytotal)
+```R
+svytotal(~api00, dstrat)
+```
+```julia
+total(:api00, dstrat)
 ```
-**Julia**
+### 4. Quantile
+In the following example the median of the variable `api00` is calculated.
+```R
+svyquantile(~api00, dstrat, 0.5)
+```
 ```julia
-by(:api00, :cname, dclus1, total)
+quantile(:api00, dstrat, 0.5)
 ```
-### Quantile
-Weighted quantile of a variable by strata can be computed using the following command:
+### 5. Domain estimation
+In the following example the mean of the variable `api00` is calculated grouped by the variable `cname`.
-**R**
-```r
-svyby(~api00, by = ~cname, design = dclus1, svyquantile, quantile = 0.63)
+```R
+svyby(~api00, ~cname, dstrat, svymean)
 ```
-**Julia**
 ```julia
-by(:api00, :cname, dclus1, quantile, 0.63)
-```
+by(:api00, :cname, dstrat, mean)
+```
\ No newline at end of file
diff --git a/docs/src/api.md b/docs/src/api.md
new file mode 100644
index 00000000..53028358
--- /dev/null
+++ b/docs/src/api.md
@@ -0,0 +1,32 @@
+# API
+
+## Index
+
+```@index
+Module = [Survey]
+Order = [:type, :function]
+Private = false
+```
+
+```@docs
+AbstractSurveyDesign
+SimpleRandomSample
+StratifiedSample
+load_data
+mean(x::Symbol, design::SimpleRandomSample)
+total(x::Symbol, design::SimpleRandomSample)
+quantile
+by
+colnames(design::AbstractSurveyDesign)
+dim(design::AbstractSurveyDesign)
+dimnames(design::AbstractSurveyDesign)
+plot(design::AbstractSurveyDesign, x::Symbol, y::Symbol; kwargs...)
+boxplot(design::AbstractSurveyDesign, x::Symbol, y::Symbol; kwargs...)
+hist(design::AbstractSurveyDesign, var::Symbol,
+     bins::Union{Integer, AbstractVector} = freedman_diaconis(design, var);
+     normalization = :density,
+     kwargs...
+     )
+freedman_diaconis
+sturges
+```
diff --git a/docs/src/examples.md b/docs/src/examples.md
deleted file mode 100644
index f488b2dd..00000000
--- a/docs/src/examples.md
+++ /dev/null
@@ -1,62 +0,0 @@
-# Examples
-
-The following examples use the
-[Academic Performance Index](https://r-survey.r-forge.r-project.org/survey/html/api.html)
-(API) dataset for Californian schools. The data sets contain information for all schools
-with at least 100 students and for various probability samples of the data.
-
-The API program has been discontinued at the end of 2018. Information is archived at
-[https://www.cde.ca.gov/re/pr/api.asp](https://www.cde.ca.gov/re/pr/api.asp)
-
-## Simple Random Sample
-
-Firstly, a survey design needs a dataset from which to gather information. A dataset
-can be loaded as a `DataFrame` using the `load_data` function:
-
-```julia
-julia> apisrs = load_data("apisrs");
-```
-
-Next, we can build a design. The most basic survey design is a simple random sample design.
-A [`SimpleRandomSample`](@ref) can be instantianted by calling the constructor:
-
-```julia
-julia> srs = SimpleRandomSample(apisrs; weights = :pw)
-SimpleRandomSample:
-data: 200x42 DataFrame
-weights: 31.0, 31.0, 31.0, ..., 31.0
-probs: 0.0323, 0.0323, 0.0323, ..., 0.0323
-fpc: 6194, 6194, 6194, ..., 6194
-popsize: 6194
-sampsize: 200
-sampfraction: 0.0323
-ignorefpc: false
-```
-
-With a `SimpleRandomSample` (as well as with any subtype of [`AbstractSurveyDesign`](@ref))
-it is possible to calculate estimates of the mean or population total for a given variable,
-along with the corresponding standard errors.
-
-```julia
-julia> mean(:api00, srs)
-1×2 DataFrame
- Row │ mean     sem
-     │ Float64  Float64
-─────┼──────────────────
-   1 │ 656.585  9.24972
-
-julia> total(:api00, srs)
-1×2 DataFrame
- Row │ total      se_total
-     │ Float64    Float64
-─────┼─────────────────────
-   1 │ 4.06689e6   57292.8
-```
-
-The design can be tweaked by specifying the population or sample size or whether
-or not to account for finite population correction (fpc). By default the weights
-are equal to one, the sample size is equal to the number of rows in `data` and the
-fpc is not ignored. The population size is calculated from the weights.
-
-When `ignorefpc` is set to `false` the `fpc` is calculated from the sample and population
-sizes. When it is set to `true` it is set to 1.
diff --git a/docs/src/index.md b/docs/src/index.md
index e8ef07ee..a099c73c 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -4,42 +4,59 @@ CurrentModule = Survey
 # Survey
-This package is the Julia implementation of the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html) developed by [Professor Thomas Lumley](https://www.stat.auckland.ac.nz/people/tlum005).
+This package is used to study complex survey data. It aims to be a fast alternative to the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html) developed by [Professor Thomas Lumley](https://www.stat.auckland.ac.nz/people/tlum005).
-## The need for moving the code to Julia.
+This package currently supports simple random sample and stratified sample. In future releases, it will support multistage sampling as well.
-At [xKDR](https://xkdr.org/) we processed millions of records from household surveys using the survey package in R. This process took hours of computing time. By implementing the code in Julia, we are able to do the processing in seconds. In this package we have implemented the functions `mean`, `quantile` and `sum`. We have kept the syntax between the two packages similar so that we can easily move our existing code to the new language.
+## Basic demo
-## Index
+The following demo uses the
+[Academic Performance Index](https://r-survey.r-forge.r-project.org/survey/html/api.html)
+(API) dataset for Californian schools. The data sets contain information for all schools
+with at least 100 students and for various probability samples of the data.
-```@index
-Module = [Survey]
-Order = [:type, :function]
-Private = false
+The API program has been discontinued at the end of 2018. Information is archived at
+[https://www.cde.ca.gov/re/pr/api.asp](https://www.cde.ca.gov/re/pr/api.asp)
+
+Firstly, a survey design needs a dataset from which to gather information.
+
+
+The sample datasets provided with the package can be loaded as `DataFrames` using the `load_data` function:
+
+```julia
+julia> apisrs = load_data("apisrs");
 ```
+`apisrs` is a simple random sample of the Academic Performance Index of Californian schools.
+
+Next, we can build a design. The design corresponding to a simple random sample is [`SimpleRandomSample`](@ref), which can be instantiated by calling the constructor:
+
+```julia
+julia> srs = SimpleRandomSample(apisrs; weights = :pw)
+SimpleRandomSample:
+data: 200x42 DataFrame
+weights: 31.0, 31.0, 31.0, ..., 31.0
+probs: 0.0323, 0.0323, 0.0323, ..., 0.0323
+fpc: 6194, 6194, 6194, ..., 6194
+popsize: 6194
+sampsize: 200
+sampfraction: 0.0323
+ignorefpc: false
+```
+
+With a `SimpleRandomSample` (as well as with any subtype of [`AbstractSurveyDesign`](@ref)) it is possible to calculate estimates of the mean, population total, etc., for a given variable, along with the corresponding standard errors.
+
+```julia
+julia> mean(:api00, srs)
+1×2 DataFrame
+ Row │ mean     sem
+     │ Float64  Float64
+─────┼──────────────────
+   1 │ 656.585  9.24972
-## API
-```@docs
-AbstractSurveyDesign
-SimpleRandomSample
-StratifiedSample
-ClusterSample
-design
-load_data
-mean(x::Symbol, design::SimpleRandomSample)
-total(x::Symbol, design::SimpleRandomSample)
-quantile
-by
-colnames(design::AbstractSurveyDesign)
-dim(design::AbstractSurveyDesign)
-dimnames(design::AbstractSurveyDesign)
-plot(design::AbstractSurveyDesign, x::Symbol, y::Symbol; kwargs...)
-boxplot(design::AbstractSurveyDesign, x::Symbol, y::Symbol; kwargs...)
-hist(design::AbstractSurveyDesign, var::Symbol,
-     bins::Union{Integer, AbstractVector} = freedman_diaconis(design, var);
-     normalization = :density,
-     kwargs...
-     )
-freedman_diaconis
-sturges
+julia> total(:api00, srs)
+1×2 DataFrame
+ Row │ total      se_total
+     │ Float64    Float64
+─────┼─────────────────────
+   1 │ 4.06689e6   57292.8
 ```
diff --git a/docs/src/performance.md b/docs/src/performance.md
deleted file mode 100644
index 478b8f73..00000000
--- a/docs/src/performance.md
+++ /dev/null
@@ -1,74 +0,0 @@
-# Performance
-
-## Grouping by a single column
-**R**
-
-```R
-> library(survey)
-> library(microbenchmark)
-> data(api)
-> dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
-> microbenchmark(svyby(~api00, by = ~cname, design = dclus1, svymean, keep.var = FALSE), units = "us")
-```
-
-```R
-Unit: microseconds
- expr
- svyby(~api00, by = ~cname, design = dclus1, svymean, keep.var = FALSE)
- min lq mean median uq max neval
- 9427.043 10587.81 11269.22 10938.55 11219.24 17620.25 100
-```
-
-**Julia**
-```julia
-using Survey, BenchmarkTools
-apiclus1 = load_data("apiclus1")
-dclus1 = svydesign(id=:dnum, weights=:pw, data = apiclus1, fpc=:fpc)
-@benchmark svyby(:api00, :cname, dclus1, svymean)
-```
-
-```julia
-BenchmarkTools.Trial: 10000 samples with 1 evaluation.
- Range (min … max): 43.567 μs … 5.905 ms ┊ GC (min … max): 0.00% … 90.27%
- Time (median): 53.680 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 58.090 μs ± 125.671 μs ┊ GC (mean ± σ): 4.36% ± 2.00%
-```
-
-**The median time is about 198 times lower in Julia as compared to R.**
-
-## Grouping by two columns.
-
-**R**
-
-```R
-> library(survey)
-> library(microbenchmark)
-> data(api)
-> dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
-> microbenchmark(svyby(~api00, by = ~cname+meals, design = dclus1, svymean, keep.var = FALSE), units = "us")
-```
-
-```R
-Unit: microseconds
- expr
- svyby(~api00, by = ~cname + meals, design = dclus1, svymean, keep.var = FALSE)
- min lq mean median uq max neval
- 120823.6 131472.8 141797.3 134375.8 140818.3 263964.3 100
-```
-
-**Julia**
-```julia
-using Survey, BenchmarkTools
-apiclus1 = load_data("apiclus1")
-dclus1 = svydesign(id=:dnum, weights=:pw, data = apiclus1, fpc=:fpc)
-@benchmark svyby(:api00, [:cname, :meals], dclus1, svymean)
-```
-
-```julia
-BenchmarkTools.Trial: 10000 samples with 1 evaluation.
- Range (min … max): 64.591 μs … 6.559 ms ┊ GC (min … max): 0.00% … 77.46%
- Time (median): 78.204 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 89.447 μs ± 235.344 μs ┊ GC (mean ± σ): 8.48% ± 3.19%
-```
-
- **The median time is about 1718 times lower in Julia as compared to R.**
diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl
index aa28973b..622a69f2 100644
--- a/src/SurveyDesign.jl
+++ b/src/SurveyDesign.jl
@@ -1,8 +1,7 @@
 """
     AbstractSurveyDesign
-Supertype for every survey design type: [`SimpleRandomSample`](@ref), [`StratifiedSample`](@ref)
-and [`ClusterSample`](@ref).
+Supertype for every survey design type: [`SimpleRandomSample`](@ref) and [`StratifiedSample`](@ref).
 !!! note
@@ -35,17 +34,13 @@ If `popsize` not given, `weights` or `probs` must be given, so that in combinati
 with `sampsize`, `popsize` can be calculated.
 ```jldoctest
-julia> apisrs_original = load_data("apisrs");
+julia> apisrs = load_data("apisrs");
-julia> apisrs_original[!, :derived_probs] = 1 ./ apisrs_original.pw;
-
-julia> apisrs_original[!, :derived_sampsize] = fill(200.0, size(apisrs_original, 1));
-
-julia> srs = SimpleRandomSample(apisrs_original; popsize=:fpc);
+julia> srs = SimpleRandomSample(apisrs; popsize=:fpc);
 julia> srs
 SimpleRandomSample:
-data: 200x44 DataFrame
+data: 200x42 DataFrame
 weights: 31.0, 31.0, 31.0, ..., 31.0
 probs: 0.0323, 0.0323, 0.0323, ..., 0.0323
 fpc: 6194, 6194, 6194, ..., 6194
@@ -225,17 +220,13 @@ If `popsize` not given, `weights` or `probs` must be given, so that in combinati
 with `sampsize`, `popsize` can be calculated.
 ```jldoctest
-julia> apistrat_original = load_data("apistrat");
-
-julia> apistrat_original[!, :derived_probs] = 1 ./ apistrat_original.pw;
-
-julia> apistrat_original[!, :derived_sampsize] = apistrat_original.fpc ./ apistrat_original.pw;
+julia> apistrat = load_data("apistrat");
-julia> strat_pop = StratifiedSample(apistrat_original, :stype; popsize=:fpc);
+julia> dstrat = StratifiedSample(apistrat, :stype; popsize=:fpc);
-julia> strat_pop
+julia> dstrat
 StratifiedSample:
-data: 200x47 DataFrame
+data: 200x45 DataFrame
 strata: stype
 weights: 44.2, 44.2, 44.2, ..., 15.1
 probs: 0.0226, 0.0226, 0.0226, ..., 0.0662
diff --git a/src/by.jl b/src/by.jl
index 381c3fef..cc4585ef 100644
--- a/src/by.jl
+++ b/src/by.jl
@@ -1,8 +1,9 @@
 """
     by(formula, by, design, function, params)
-Generate subsets of a survey design.
+Estimate the population parameters of for subpopulations of interest for a simple random sample. For example, you make have a simple random sample of heights of people, but you want the average height of male and female separately.
+In the following example, the mean `api00` is estimated for each county.
 ```jldoctest
 julia> apisrs = load_data("apisrs");
@@ -41,7 +42,9 @@ end
 """
     by(formula, by, design, function)
-Generate subsets of a StratifiedSample.
+Estimate the population parameters of for subpopulations of interest for a stratified sample. For example, you make have a simple of heights of people stratified by region, but you want the average height of male and female separately.
+
+In the following example, the average `api00` is estimated for each county.
 ```jldoctest
 julia> apistrat = load_data("apistrat");
diff --git a/src/mean.jl b/src/mean.jl
index 7716d8c5..edd471d6 100644
--- a/src/mean.jl
+++ b/src/mean.jl
@@ -20,8 +20,7 @@ end
 """
     mean(x, design)
-
-Compute the mean and SEM of the survey variable `x`.
+Estimate the population mean of a variable of a simple random sample, and the corresponding standard error.
 ```jldoctest
 julia> apisrs = load_data("apisrs");
@@ -121,7 +120,7 @@ function mean(x::AbstractVector, popsize::AbstractVector, sampsize::AbstractVect
 end
 """
-    Survey mean for StratifiedSample objects.
+Estimate the population mean of a variable of a stratified sample, and the corresponding standard error.
 Ref: Cochran (1977)
 """
 function mean(x::Symbol, design::StratifiedSample)
From 2cfd6b777061f0ba51e0e8ea478fdc2e7c93cde4 Mon Sep 17 00:00:00 2001
From: ayushpatnaikgit
Date: Thu, 1 Dec 2022 21:11:04 +0530
Subject: [PATCH 02/18] Adding information in API reference outside @docs macro
---
 docs/src/api.md | 21 ++++++++++++++-------
 src/SurveyDesign.jl | 8 +++++---
 src/by.jl | 4 ++--
 src/load_data.jl | 5 +----
 4 files changed, 22 insertions(+), 16 deletions(-)
diff --git a/docs/src/api.md b/docs/src/api.md
index 53028358..acbb5da0 100644
--- a/docs/src/api.md
+++ b/docs/src/api.md
@@ -7,19 +7,25 @@ Module = [Survey]
 Order = [:type, :function]
 Private = false
 ```
-
+Survey data can be loaded from a `DataFrame` into a survey design object. The package currently supports simple random sample and stratified sample designs.
 ```@docs
 AbstractSurveyDesign
 SimpleRandomSample
 StratifiedSample
+```
+
+```@docs
 load_data
-mean(x::Symbol, design::SimpleRandomSample)
+Survey.mean(x::Symbol, design::SimpleRandomSample)
 total(x::Symbol, design::SimpleRandomSample)
 quantile
+```
+
+It is often required to estimate population parameters for sub-populations of interest. For example, you make have of heights of people, but you want the average height of male and female separately.
+```@docs
 by
-colnames(design::AbstractSurveyDesign)
-dim(design::AbstractSurveyDesign)
-dimnames(design::AbstractSurveyDesign)
+```
+```@docs
 plot(design::AbstractSurveyDesign, x::Symbol, y::Symbol; kwargs...)
 boxplot(design::AbstractSurveyDesign, x::Symbol, y::Symbol; kwargs...)
 hist(design::AbstractSurveyDesign, var::Symbol,
@@ -27,6 +33,7 @@ hist(design::AbstractSurveyDesign, var::Symbol,
      normalization = :density,
      kwargs...
      )
-freedman_diaconis
-sturges
+dim(design::AbstractSurveyDesign)
+dimnames(design::AbstractSurveyDesign)
+colnames(design::AbstractSurveyDesign)
 ```
diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl
index 622a69f2..6075c22d 100644
--- a/src/SurveyDesign.jl
+++ b/src/SurveyDesign.jl
@@ -1,7 +1,7 @@
 """
     AbstractSurveyDesign
-Supertype for every survey design type: [`SimpleRandomSample`](@ref) and [`StratifiedSample`](@ref).
+Supertype for every survey design type.
 !!! note
@@ -13,7 +13,9 @@ abstract type AbstractSurveyDesign end
 """
     SimpleRandomSample <: AbstractSurveyDesign
-Survey design sampled by simple random sampling.
+
+A simple random sample dataset can be loaded from a data frame into a `SimpleRandomSample` object for downstream analyses.
+
 # Required arguments:
 data - This is the survey dataset loaded as a DataFrame in memory.
 Note: Keeping with Julia conventions, original data object
@@ -195,7 +197,7 @@ end
 """
     StratifiedSample <: AbstractSurveyDesign
-Survey design sampled by stratification.
+A stratified sample dataset can be loaded from a data frame into a `StatifiedSample` object for downstream analyses.
 `strata` must be specified as a Symbol name of a column in `data`.
diff --git a/src/by.jl b/src/by.jl
index cc4585ef..4e2ac360 100644
--- a/src/by.jl
+++ b/src/by.jl
@@ -1,7 +1,7 @@
 """
     by(formula, by, design, function, params)
-Estimate the population parameters of for subpopulations of interest for a simple random sample. For example, you make have a simple random sample of heights of people, but you want the average height of male and female separately.
+Estimate the population parameters of for subpopulations of interest for a simple random sample.
 In the following example, the mean `api00` is estimated for each county.
 ```jldoctest
@@ -42,7 +42,7 @@ end
 """
     by(formula, by, design, function)
-Estimate the population parameters of for subpopulations of interest for a stratified sample. For example, you make have a simple of heights of people stratified by region, but you want the average height of male and female separately.
+Estimate the population parameters of for subpopulations of interest for a stratified sample.
 In the following example, the average `api00` is estimated for each county.
diff --git a/src/load_data.jl b/src/load_data.jl
index bfae08d4..41f727ee 100644
--- a/src/load_data.jl
+++ b/src/load_data.jl
@@ -4,10 +4,7 @@ asset_path(args...) = joinpath(PKG_DIR, "assets", args...)
 """
     load_data(name)
-Load a dataset as a `DataFrame`.
-
-All available datasets can be found in the [`assets/`](https://github.com/xKDR/Survey.jl/tree/main/assets)
-directory.
+Load a sample dataset provided in the [`assets/`](https://github.com/xKDR/Survey.jl/tree/main/assets) directory a `DataFrame`.
 ```jldoctest
 julia> apisrs = load_data("apisrs")
From a87da92b9b7133d31a9923b8e1c7e3b4faffa19c Mon Sep 17 00:00:00 2001
From: Ayush Patnaik
Date: Fri, 2 Dec 2022 15:26:45 +0530
Subject: [PATCH 03/18] Update docs/src/R_comparison.md
---
 docs/src/R_comparison.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/src/R_comparison.md b/docs/src/R_comparison.md
index 1021f42c..2a953d2a 100644
--- a/docs/src/R_comparison.md
+++ b/docs/src/R_comparison.md
@@ -1,5 +1,5 @@
 # Moving from R to Julia
-This sections presents examples to help move from R to Julia. Examples show R and Julia code for common operations in survey analysis.
+This section presents examples to help move from R to Julia. Examples show R and Julia code for common operations in survey analysis.
 For the same operation, first the R and then the Julia code is presented.
 ## Simple random sample
From 3d832509c1213d96931750f0d6c639ea4b403356 Mon Sep 17 00:00:00 2001
From: Ayush Patnaik
Date: Fri, 2 Dec 2022 18:28:53 +0530
Subject: [PATCH 04/18] Update docs/src/R_comparison.md
---
 docs/src/R_comparison.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/src/R_comparison.md b/docs/src/R_comparison.md
index 2a953d2a..3e5e4535 100644
--- a/docs/src/R_comparison.md
+++ b/docs/src/R_comparison.md
@@ -4,7 +4,7 @@ For the same operation, first the R and then the Julia code is presented.
 ## Simple random sample
-The `apisrs` data, which is provided in both `survey` and `Survey`, is used as an example. It's a simple random sample of the Academic Performance Index of Californian schools.
+The `apisrs` data, which is provided in both `survey` and `Survey.jl`, is used as an example. It's a simple random sample of the Academic Performance Index of Californian schools.
 ### 1. Creating a design object
 The following example shows how to construct a design object for a simple random sample.
From 137428b67267b1a903937468da1e4d76f3e8e8e8 Mon Sep 17 00:00:00 2001
From: Ayush Patnaik
Date: Fri, 2 Dec 2022 18:48:44 +0530
Subject: [PATCH 05/18] Update docs/src/R_comparison.md

Co-authored-by: Iulia Dumitru <84318573+iuliadmtru@users.noreply.github.com>
---
 docs/src/R_comparison.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/src/R_comparison.md b/docs/src/R_comparison.md
index 3e5e4535..2b5c9a7c 100644
--- a/docs/src/R_comparison.md
+++ b/docs/src/R_comparison.md
@@ -7,7 +7,7 @@ For the same operation, first the R and then the Julia code is presented.
 The `apisrs` data, which is provided in both `survey` and `Survey.jl`, is used as an example. It's a simple random sample of the Academic Performance Index of Californian schools.
 ### 1. Creating a design object
-The following example shows how to construct a design object for a simple random sample.
+Instantiating a simple random sample survey design.
 ```R
 library(survey)
From 07349bf787337bdd0856ce0a74a9838482fbc3a0 Mon Sep 17 00:00:00 2001
From: Ayush Patnaik
Date: Fri, 2 Dec 2022 18:48:59 +0530
Subject: [PATCH 06/18] Update docs/src/R_comparison.md

Co-authored-by: Iulia Dumitru <84318573+iuliadmtru@users.noreply.github.com>
---
 docs/src/R_comparison.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/src/R_comparison.md b/docs/src/R_comparison.md
index 2b5c9a7c..ce119757 100644
--- a/docs/src/R_comparison.md
+++ b/docs/src/R_comparison.md
@@ -6,7 +6,7 @@ For the same operation, first the R and then the Julia code is presented.
 The `apisrs` data, which is provided in both `survey` and `Survey.jl`, is used as an example. It's a simple random sample of the Academic Performance Index of Californian schools.
-### 1. Creating a design object
+### 1. Creating a survey design
 Instantiating a simple random sample survey design.
 ```R
From 2a35d10cd35bb3ebf5e9bb34ffe2651c7da9cda5 Mon Sep 17 00:00:00 2001
From: Ayush Patnaik
Date: Fri, 2 Dec 2022 18:51:32 +0530
Subject: [PATCH 07/18] Update src/load_data.jl

Co-authored-by: Iulia Dumitru <84318573+iuliadmtru@users.noreply.github.com>
---
 src/load_data.jl | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/src/load_data.jl b/src/load_data.jl
index 41f727ee..66344051 100644
--- a/src/load_data.jl
+++ b/src/load_data.jl
@@ -4,7 +4,9 @@ asset_path(args...) = joinpath(PKG_DIR, "assets", args...)
 """
     load_data(name)
-Load a sample dataset provided in the [`assets/`](https://github.com/xKDR/Survey.jl/tree/main/assets) directory a `DataFrame`.
+Load a sample dataset as a `DataFrame`.
+
+All available datasets can be found [here](https://github.com/xKDR/Survey.jl/tree/main/assets).
 ```jldoctest
 julia> apisrs = load_data("apisrs")
From dc18ab0fc766ba2c67b97ef8a24bdc7c6aa8cd23 Mon Sep 17 00:00:00 2001
From: Iulia Dumitru
Date: Fri, 2 Dec 2022 16:26:07 +0200
Subject: [PATCH 08/18] Change writing of arguments in `SimpleRandomSample` docstring
---
 src/SurveyDesign.jl | 31 +++++++++++++------------------
 1 file changed, 13 insertions(+), 18 deletions(-)
diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl
index 6075c22d..3e3ac58a 100644
--- a/src/SurveyDesign.jl
+++ b/src/SurveyDesign.jl
@@ -16,24 +16,19 @@ abstract type AbstractSurveyDesign end
 A simple random sample dataset can be loaded from a data frame into a `SimpleRandomSample` object for downstream analyses.
-# Required arguments:
-data - This is the survey dataset loaded as a DataFrame in memory.
- Note: Keeping with Julia conventions, original data object
- is modified, not copied. Be careful
-# Optional arguments:
-sampsize - Sample size of the survey, given as Symbol name of
- column in `data`, an `Unsigned` integer, or a Vector
-popsize - The (expected) population size of survey, given as Symbol
- name of column in `data`, an `Unsigned` integer, or a Vector
-weights - Sampling weights, passed as Symbol or Vector
-probs - Sampling probabilities, passed as Symbol or Vector
-ignorefpc- Ignore finite population correction and assume all weights equal to 1.0
-
-Precedence order of using `popsize`, `weights` and `probs` is `popsize` > `weights` > `probs`
-Eg. if `popsize` given then assumed ground truth over `weights` or `probs`
-
-If `popsize` not given, `weights` or `probs` must be given, so that in combination
-with `sampsize`, `popsize` can be calculated.
+# Arguments:
+`data::AbstractDataFrame`: the survey dataset (!this gets modified by the constructor).
+`sampsize::Union{Nothing,Symbol,<:Unsigned,Vector{<:Real}}=UInt(nrow(data))`: the survey sample size.
+`popsize::Union{Nothing,Symbol,<:Unsigned,Vector{<:Real}}=nothing`: the (expected) survey population size.
+`weights::Union{Nothing,Symbol,Vector{<:Real}}=nothing`: the sampling weights.
+`probs::Union{Nothing,Symbol,Vector{<:Real}}=nothing: the sampling probabilities.
+`ignorefpc=false`: choose to ignore finite population correction and assume all weights equal to 1.0
+
+The precedence order of using `popsize`, `weights` and `probs` is `popsize` > `weights` > `probs`.
+E.g. If `popsize` is given then it is assumed to be the ground truth over `weights` or `probs`.
+
+If `popsize` is not given `weights` or `probs` must be given. `popsize` is then calculated
+using the weights and the sample size.
 ```jldoctest
 julia> apisrs = load_data("apisrs");
From 77cf28a155f4698e03865992fa4de9b65bb536f4 Mon Sep 17 00:00:00 2001
From: Iulia Dumitru
Date: Fri, 2 Dec 2022 16:35:17 +0200
Subject: [PATCH 09/18] Change writing of arguments in `StratifiedSample` docstring
---
 src/SurveyDesign.jl | 29 ++++++++++-------------------
 1 file changed, 10 insertions(+), 19 deletions(-)
diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl
index 3e3ac58a..281378f5 100644
--- a/src/SurveyDesign.jl
+++ b/src/SurveyDesign.jl
@@ -196,25 +196,16 @@ A stratified sample dataset can be loaded from a data frame into a `StatifiedSam
 `strata` must be specified as a Symbol name of a column in `data`.
-# Required arguments:
-data - This is the survey dataset loaded as a DataFrame in memory.
- Note: Keeping with Julia conventions, original data object
- is modified, not copied. Be careful
-strata - Column that is the stratification variable.
-# Optional arguments:
-sampsize - Sample size of the survey, given as Symbol name of
- column in `data`, an `Unsigned` integer, or a Vector
-popsize - The (expected) population size of survey, given as Symbol
- name of column in `data`, an `Unsigned` integer, or a Vector
-weights - Sampling weights, passed as Symbol or Vector
-probs - Sampling probabilities, passed as Symbol or Vector
-ignorefpc- Ignore finite population correction and assume all weights equal to 1.0
-
-Precedence order of using `popsize`, `weights` and `probs` is `popsize` > `weights` > `probs`
-Eg. if `popsize` given then assumed ground truth over `weights` or `probs`
-
-If `popsize` not given, `weights` or `probs` must be given, so that in combination
-with `sampsize`, `popsize` can be calculated.
+# Arguments:
+`data::AbstractDataFrame`: the survey dataset (!this gets modified by the constructor).
+`strata::Symbol`: the stratification variable - must be given as a column in `data`.
+`sampsize::Union{Nothing,Symbol,<:Unsigned,Vector{<:Real}}=UInt(nrow(data))`: the survey sample size.
+`popsize::Union{Nothing,Symbol,<:Unsigned,Vector{<:Real}}=nothing`: the (expected) survey population size.
+`weights::Union{Nothing,Symbol,Vector{<:Real}}=nothing`: the sampling weights.
+`probs::Union{Nothing,Symbol,Vector{<:Real}}=nothing: the sampling probabilities.
+`ignorefpc=false`: choose to ignore finite population correction and assume all weights equal to 1.0
+
+The `popsize`, `weights` and `probs` parameters follow the same rules as for [`SimpleRandomSample`](@ref).
 ```jldoctest
 julia> apistrat = load_data("apistrat");

From cdb1bd9ae2cac301f83f5a531def986e8f06a875 Mon Sep 17 00:00:00 2001
From: Ayush Patnaik
Date: Fri, 2 Dec 2022 20:44:18 +0530
Subject: [PATCH 10/18] Update docs/src/api.md

Co-authored-by: Iulia Dumitru <84318573+iuliadmtru@users.noreply.github.com>
---
 docs/src/api.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/api.md b/docs/src/api.md
index acbb5da0..98b67a82 100644
--- a/docs/src/api.md
+++ b/docs/src/api.md
@@ -21,7 +21,7 @@ total(x::Symbol, design::SimpleRandomSample)
 quantile
 ```
 
-It is often required to estimate population parameters for sub-populations of interest. For example, you make have of heights of people, but you want the average height of male and female separately.
+It is often required to estimate population parameters for sub-populations of interest. For example, you may have a sample of heights, but you want the average heights of males and females separately.
 ```@docs
 by
 ```

From a1b2e24afe98e79cc4859f9d6e0edf578eaa9194 Mon Sep 17 00:00:00 2001
From: Ayush Patnaik
Date: Fri, 2 Dec 2022 20:45:09 +0530
Subject: [PATCH 11/18] Update src/SurveyDesign.jl

Co-authored-by: Iulia Dumitru <84318573+iuliadmtru@users.noreply.github.com>
---
 src/SurveyDesign.jl | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl
index 281378f5..bf4156b4 100644
--- a/src/SurveyDesign.jl
+++ b/src/SurveyDesign.jl
@@ -34,7 +34,6 @@ using the weights and the sample size.
 julia> apisrs = load_data("apisrs");
 
 julia> srs = SimpleRandomSample(apisrs; popsize=:fpc);
-
 julia> srs
 SimpleRandomSample:
 data: 200x42 DataFrame

From 028fb72749cc8ed77aeeec82519c88bd4e66e559 Mon Sep 17 00:00:00 2001
From: Ayush Patnaik
Date: Fri, 2 Dec 2022 20:45:25 +0530
Subject: [PATCH 12/18] Update src/SurveyDesign.jl

Co-authored-by: Iulia Dumitru <84318573+iuliadmtru@users.noreply.github.com>
---
 src/SurveyDesign.jl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl
index bf4156b4..db99632e 100644
--- a/src/SurveyDesign.jl
+++ b/src/SurveyDesign.jl
@@ -33,7 +33,7 @@ using the weights and the sample size.
 ```jldoctest
 julia> apisrs = load_data("apisrs");
 
-julia> srs = SimpleRandomSample(apisrs; popsize=:fpc);
+julia> srs = SimpleRandomSample(apisrs; popsize=:fpc)
 julia> srs
 SimpleRandomSample:
 data: 200x42 DataFrame

From f0edfe946992fd5904420aaac26610abdee51438 Mon Sep 17 00:00:00 2001
From: Ayush Patnaik
Date: Fri, 2 Dec 2022 20:45:36 +0530
Subject: [PATCH 13/18] Update src/SurveyDesign.jl

Co-authored-by: Iulia Dumitru <84318573+iuliadmtru@users.noreply.github.com>
---
 src/SurveyDesign.jl | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl
index db99632e..d20f7674 100644
--- a/src/SurveyDesign.jl
+++ b/src/SurveyDesign.jl
@@ -210,7 +210,6 @@ The `popsize`, `weights` and `probs` parameters follow the same rules as for [`S
 julia> apistrat = load_data("apistrat");
 
 julia> dstrat = StratifiedSample(apistrat, :stype; popsize=:fpc);
-
 julia> dstrat
 StratifiedSample:
 data: 200x45 DataFrame

From 7469cca0ad44d4141227bce716828ccd797ce240 Mon Sep 17 00:00:00 2001
From: Ayush Patnaik
Date: Fri, 2 Dec 2022 20:45:50 +0530
Subject: [PATCH 14/18] Update src/SurveyDesign.jl

Co-authored-by: Iulia Dumitru <84318573+iuliadmtru@users.noreply.github.com>
---
 src/SurveyDesign.jl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl
index d20f7674..2beaa312 100644
--- a/src/SurveyDesign.jl
+++ b/src/SurveyDesign.jl
@@ -209,7 +209,7 @@ The `popsize`, `weights` and `probs` parameters follow the same rules as for [`S
 ```jldoctest
 julia> apistrat = load_data("apistrat");
 
-julia> dstrat = StratifiedSample(apistrat, :stype; popsize=:fpc);
+julia> dstrat = StratifiedSample(apistrat, :stype; popsize=:fpc)
 julia> dstrat
 StratifiedSample:
 data: 200x45 DataFrame

From 7bcabd4941f10b68ea985075e349941dc4bd2d0e Mon Sep 17 00:00:00 2001
From: Ayush Patnaik
Date: Fri, 2 Dec 2022 21:23:39 +0530
Subject: [PATCH 15/18] Update src/SurveyDesign.jl

Co-authored-by: Iulia Dumitru <84318573+iuliadmtru@users.noreply.github.com>
---
 src/SurveyDesign.jl | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl
index 2beaa312..1f2e18a0 100644
--- a/src/SurveyDesign.jl
+++ b/src/SurveyDesign.jl
@@ -34,7 +34,6 @@ using the weights and the sample size.
 julia> apisrs = load_data("apisrs");
 
 julia> srs = SimpleRandomSample(apisrs; popsize=:fpc)
-julia> srs
 SimpleRandomSample:
 data: 200x42 DataFrame
 weights: 31.0, 31.0, 31.0, ..., 31.0

From 6faf51a7e0f8e74ffea5eeb5fd862f933070abdf Mon Sep 17 00:00:00 2001
From: Ayush Patnaik
Date: Fri, 2 Dec 2022 21:42:37 +0530
Subject: [PATCH 16/18] Update src/SurveyDesign.jl

Co-authored-by: Iulia Dumitru <84318573+iuliadmtru@users.noreply.github.com>
---
 src/SurveyDesign.jl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl
index 1f2e18a0..9b55694a 100644
--- a/src/SurveyDesign.jl
+++ b/src/SurveyDesign.jl
@@ -190,7 +190,7 @@ end
 """
     StratifiedSample <: AbstractSurveyDesign
 
-A stratified sample dataset can be loaded from a data frame into a `StatifiedSample` object for downstream analyses.
+Survey design sampled by stratification.
 
 `strata` must be specified as a Symbol name of a column in `data`.
 
From e65b2de363ac80f942c1040cca0b813a712e5d38 Mon Sep 17 00:00:00 2001
From: Ayush Patnaik
Date: Fri, 2 Dec 2022 21:43:04 +0530
Subject: [PATCH 17/18] Update src/SurveyDesign.jl

Co-authored-by: Iulia Dumitru <84318573+iuliadmtru@users.noreply.github.com>
---
 src/SurveyDesign.jl | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl
index 9b55694a..4f6a5e57 100644
--- a/src/SurveyDesign.jl
+++ b/src/SurveyDesign.jl
@@ -209,7 +209,6 @@ The `popsize`, `weights` and `probs` parameters follow the same rules as for [`S
 julia> apistrat = load_data("apistrat");
 
 julia> dstrat = StratifiedSample(apistrat, :stype; popsize=:fpc)
-julia> dstrat
 StratifiedSample:
 data: 200x45 DataFrame
 strata: stype

From 87142f513e4265921014d029af3e32d244580c9f Mon Sep 17 00:00:00 2001
From: ayushpatnaikgit
Date: Fri, 2 Dec 2022 21:46:37 +0530
Subject: [PATCH 18/18] Make SRS docstring more concise.

---
 docs/src/api.md     | 2 +-
 src/SurveyDesign.jl | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/src/api.md b/docs/src/api.md
index 98b67a82..159ac6c5 100644
--- a/docs/src/api.md
+++ b/docs/src/api.md
@@ -7,7 +7,7 @@ Module = [Survey]
 Order = [:type, :function]
 Private = false
 ```
-Survey data can be loaded from a `DataFrame` into a survey design object. The package currently supports simple random sample and stratified sample designs.
+Survey data can be loaded from a `DataFrame` into a survey design. The package currently supports simple random sample and stratified sample designs.
 ```@docs
 AbstractSurveyDesign
 SimpleRandomSample
diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl
index 4f6a5e57..54e99be9 100644
--- a/src/SurveyDesign.jl
+++ b/src/SurveyDesign.jl
@@ -14,7 +14,7 @@ abstract type AbstractSurveyDesign end
 
     SimpleRandomSample <: AbstractSurveyDesign
 
-A simple random sample dataset can be loaded from a data frame into a `SimpleRandomSample` object for downstream analyses.
+Survey design sampled by simple random sampling.
 
 # Arguments:
 `data::AbstractDataFrame`: the survey dataset (!this gets modified by the constructor).