Documentation upgrade for new design #49

Closed
wants to merge 12 commits into from
87 changes: 84 additions & 3 deletions docs/src/examples.md
@@ -1,7 +1,88 @@
# Examples

The following examples use the Academic Performance Index (API) dataset for Californian schools.
The following examples use the [Academic Performance Index](https://r-survey.r-forge.r-project.org/survey/html/api.html) (API) dataset for Californian schools.

```@docs
svyby(formula::Symbol, by, design::svydesign, func::Function, params = [])
## Simple Random Sample

The most basic survey design is a simple random sample design. A
[`SimpleRandomSample`](@ref) can be instantiated by calling the constructor:

```julia
julia> apisrs = load_data("apisrs");

julia> srs = SimpleRandomSample(apisrs)
Simple Random Sample:
data: 200x42 DataFrame
probs: 1.0, 1.0, 1.0 ... 1.0
fpc: 1
popsize: 200
sampsize: 200
```

With a `SimpleRandomSample` (as well as with any subtype of [`AbstractSurveyDesign`](@ref))
it is possible to calculate estimates of the mean or population total for a given variable,
along with the corresponding standard errors.

```julia
julia> svymean(:api00, srs)
1×2 DataFrame
 Row │ mean     sem
     │ Float64  Float64
─────┼──────────────────
   1 │ 656.585  9.40277

julia> svytotal(:api00, srs)
1×2 DataFrame
 Row │ total     se_total
     │ Float64   Float64
─────┼────────────────────
   1 │ 131317.0   1880.55
```

The complexity of the design can be increased by specifying frequency or probability
weights, the population or sample size, and whether or not to account for the finite
population correction (fpc). By default the weights are equal to one, the sample size is
equal to the number of rows in `data`, and the fpc is ignored. The population size is
calculated from the weights.

```julia
julia> wsrs = SimpleRandomSample(apisrs; weights = :pw)
Simple Random Sample:
data: 200x42 DataFrame
weights: 31.0, 31.0, 31.0 ... 31.0
probs: 0.0323, 0.0323, 0.0323 ... 0.0323
fpc: 1
popsize: 6194
sampsize: 200

julia> fpcwsrs = SimpleRandomSample(apisrs; weights = :pw, ignorefpc = false)
Simple Random Sample:
data: 200x42 DataFrame
weights: 31.0, 31.0, 31.0 ... 31.0
probs: 0.0323, 0.0323, 0.0323 ... 0.0323
fpc: 0.968
popsize: 6194
sampsize: 200
```

When `ignorefpc` is set to `false`, the `fpc` is calculated from the sample and population
sizes.
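
The printed values above can be reproduced by hand. The relations below are a sketch inferred from the output, not taken from the constructor source: they assume that the population size is the sum of the weights, that each probability is the reciprocal of its weight, and that the fpc is one minus the sampling fraction. All weights are equal in this sample, so every ``w_i = N / n``.

```math
N = \sum_{i=1}^{n} w_i = 6194, \qquad
p_i = \frac{1}{w_i} = \frac{200}{6194} \approx 0.0323, \qquad
\mathrm{fpc} = 1 - \frac{n}{N} = 1 - \frac{200}{6194} \approx 0.968
```

Since the variance of the mean is computed as ``\mathrm{fpc} / n \cdot s^2``, turning the fpc on scales the standard error by ``\sqrt{\mathrm{fpc}}``:

```math
\mathrm{sem}_{\mathrm{fpc}} = \sqrt{\mathrm{fpc}} \cdot \mathrm{sem} \approx \sqrt{0.9677} \times 9.40277 \approx 9.25
```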

The estimated mean is unchanged, but its standard error, the population total and the
standard error of the total differ once the design takes weights and fpc into account:

```julia
julia> svymean(:api00, fpcwsrs)
1×2 DataFrame
 Row │ mean     sem
     │ Float64  Float64
─────┼──────────────────
   1 │ 656.585  9.24972

julia> svytotal(:api00, fpcwsrs)
1×2 DataFrame
 Row │ total      se_total
     │ Float64    Float64
─────┼─────────────────────
   1 │ 4.06689e6   57292.8
```
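
The way these four numbers relate can be sketched without the package. The helper `srs_estimates` below is hypothetical and not part of Survey.jl, and the data are made up. It assumes the variance of the mean is `fpc / n * var(x)` (as in `var_of_mean`) and that the total and its standard error are the mean and its standard error scaled by the population size, which is consistent with the outputs shown above.

```julia
using Statistics

# Hypothetical helper, NOT part of Survey.jl: simple random sample
# estimates computed from a plain vector and a known population size.
function srs_estimates(x::AbstractVector{<:Real}, popsize::Integer; ignorefpc::Bool = true)
    n = length(x)
    # Assumed form of the finite population correction: 1 - n/N.
    fpc = ignorefpc ? 1.0 : 1 - n / popsize
    m = mean(x)
    # Standard error of the mean: sqrt(fpc / n * s^2).
    se_mean = sqrt(fpc / n * var(x))
    # Assumption: total and its standard error scale with the population size.
    return (mean = m, sem = se_mean, total = popsize * m, se_total = popsize * se_mean)
end

x = [10.0, 12.0, 9.0, 11.0, 13.0]   # made-up sample of size 5
est = srs_estimates(x, 50; ignorefpc = false)
println(est)
```

With `ignorefpc = true` the same call returns a larger `sem` and `se_total`, while `mean` and `total` stay the same.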
37 changes: 28 additions & 9 deletions docs/src/index.md
@@ -1,17 +1,36 @@
```@meta
CurrentModule = Survey
```

# Survey
# Survey.jl

This package is the Julia implementation of the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html) developed by [Professor Thomas Lumley](https://www.stat.auckland.ac.nz/people/tlum005).

## The need for moving the code to Julia.
## Introduction

At [xKDR](https://xkdr.org/) we processed millions of records from household surveys using the survey package in R. This process took hours of computing time. By implementing the code in Julia, we are able to do the processing in seconds. In this package we have implemented the functions `svymean`, `svyquantile` and `svysum`. We have kept the syntax between the two packages similar so that we can easily move our existing code to the new language.

Documentation for [Survey](https://github.com/Survey.jl).
## Index

```@index
Module = [Survey]
Private = false
```

```@autodocs
Modules = [Survey]
## API
```@docs
load_data
AbstractSurveyDesign
SimpleRandomSample
StratifiedSample
ClusterSample
dim(design::AbstractSurveyDesign)
colnames(design::AbstractSurveyDesign)
dimnames(design::AbstractSurveyDesign)
svymean(x::Symbol, design::SimpleRandomSample)
svytotal(x::Symbol, design::SimpleRandomSample)
svyby
svyplot(design::AbstractSurveyDesign, x::Symbol, y::Symbol; kwargs...)
svyhist(design::AbstractSurveyDesign, var::Symbol,
bins::Union{Integer, AbstractVector} = freedman_diaconis(design, var);
normalization = :density,
kwargs...
)
svyboxplot(design::AbstractSurveyDesign, x::Symbol, y::Symbol; kwargs...)
```
2 changes: 1 addition & 1 deletion src/Survey.jl
@@ -23,7 +23,7 @@ include("svyboxplot.jl")
include("svyby.jl")

export load_data
export AbstractSurveyDesign, SimpleRandomSample, StratifiedSample
export AbstractSurveyDesign, SimpleRandomSample, StratifiedSample, ClusterSample
export svydesign
export svyglm
export svyby
28 changes: 21 additions & 7 deletions src/SurveyDesign.jl
@@ -13,11 +13,14 @@ function print_short(x)
end

"""
Supertype for every survey design type: `SimpleRandomSample`, `ClusterSample`
and `StratifiedSample`.
AbstractSurveyDesign

The data to a survey constructor is modified. To avoid this pass a copy of the data
instead of the original.
Supertype for survey designs. `SimpleRandomSample`, `ClusterSample`
and `StratifiedSample` are subtypes of this.

!!! note
    When passing data to a survey design, the user should make a copy of the
    data. The constructors modify the data passed as argument.
"""
abstract type AbstractSurveyDesign end

@@ -105,7 +108,7 @@ struct StratifiedSample <: AbstractSurveyDesign
end

# `show` method for printing information about a `StratifiedSample` after construction
function Base.show(io::IO, design::StratifiedSample)
function Base.show(io::IO, ::MIME"text/plain", design::StratifiedSample)
printstyled("Stratified Sample:\n"; bold = true)
printstyled("data: "; bold = true)
print(size(design.data, 1), "x", size(design.data, 2), " DataFrame")
@@ -130,11 +133,22 @@ Survey design sampled by clustering.
"""
struct ClusterSample <: AbstractSurveyDesign
data::DataFrame
function ClusterSample(data::DataFrame, id::Symbol; weights = ones(nrow(data)), probs = 1 ./ weights)
# add frequency weights, probability weights and sample size columns
data[!, :weights] = weights
data[!, :probs] = probs
# TODO: change `sampsize` and `popsize`
data[!, :popsize] = repeat([nrow(data)], nrow(data))
data[!, :sampsize] = repeat([nrow(data)], nrow(data))
data[!, :id] = data[!, id]

new(data)
end
end

# `show` method for printing information about a `ClusterSample` after construction
function Base.show(io::IO, design::ClusterSample)
printstyled("Cluster Sample:\n"; bold = true)
function Base.show(io::IO, ::MIME"text/plain", design::ClusterSample)
printstyled("Cluster Sample:\n"; bold = true)
printstyled("data: "; bold = true)
print(size(design.data, 1), "x", size(design.data, 2), " DataFrame")
printstyled("\nweights: "; bold = true)
3 changes: 1 addition & 2 deletions src/svyboxplot.jl
@@ -1,7 +1,6 @@
"""
```
svyboxplot(design, x, y; kwargs...)
```

Box plot of survey design variable `y` grouped by column `x`.

Weights can be specified by a Symbol using the keyword argument `weights`.
2 changes: 1 addition & 1 deletion src/svydesign.jl
@@ -18,7 +18,7 @@ Survey Design:
variables: 183x45 DataFrame
id: dnum
strata: 1, 1, 1 ... 1
probs: 0.029544719150814778, 0.029544719150814778, 0.029544719150814778 ... 0.029544719150814778
probs: 0.0295, 0.0295, 0.0295 ... 0.0295
fpc:
popsize: 757, 757, 757 ... 757
sampsize: 183, 183, 183 ... 183
4 changes: 2 additions & 2 deletions src/svymean.jl
@@ -22,15 +22,15 @@ function var_of_mean(x::AbstractVector, design::SimpleRandomSample)
return design.fpc / design.sampsize * var(x)
end

function sem(x, design::SimpleRandomSample)
function sem(x::Symbol, design::SimpleRandomSample)
return sqrt(var_of_mean(x, design))
end

function sem(x::AbstractVector, design::SimpleRandomSample)
return sqrt(var_of_mean(x, design))
end

function svymean(x, design::SimpleRandomSample)
function svymean(x::Symbol, design::SimpleRandomSample)
return DataFrame(mean = mean(design.data[!, x]), sem = sem(x, design::SimpleRandomSample))
end

3 changes: 1 addition & 2 deletions src/svyplot.jl
@@ -1,7 +1,6 @@
"""
```
svyplot(design, x, y; kwargs...)
```

Scatter plot of survey design variables `x` and `y`.

The plot takes into account the frequency weights specified by the user