Merge pull request #181 from xKDR/singledesign
WIP: Merge `singledesign` into `main`
ayushpatnaikgit authored Jan 29, 2023
2 parents 75710ca + 36e2bd5 commit 4d7bbf4
Showing 41 changed files with 1,269 additions and 1,621 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -8,4 +8,4 @@
/dev/*
.gitignore
.DS_Store
*.json
*.json
1 change: 1 addition & 0 deletions Project.toml
@@ -10,6 +10,7 @@ CairoMakie = "13f3f980-e62b-5c42-98c6-ff1f3baf88f0"
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
Missings = "e1d29d7a-bbdc-5cf2-9ac0-f12de2c33e28"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
218 changes: 139 additions & 79 deletions README.md
@@ -6,111 +6,171 @@
[![codecov](https://codecov.io/gh/xKDR/Survey.jl/branch/main/graph/badge.svg?token=4PFSF47BT2)](https://codecov.io/gh/xKDR/Survey.jl)
[![Milestones](https://img.shields.io/badge/-milestones-brightgreen)](https://github.com/xKDR/Survey.jl/milestones)

This package is used to study complex survey data. It aims to be a fast alternative
to the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html)
developed by [Professor Thomas Lumley](https://www.stat.auckland.ac.nz/people/tlum005).

This package is used to study complex survey data. It aims to be a fast alternative to the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html) developed by [Professor Thomas Lumley](https://www.stat.auckland.ac.nz/people/tlum005).
All types of survey design are supported by this package.

This package currently supports simple random sample and stratified sample. In future releases, it will support multistage sampling as well.
> **_NOTE:_** For multistage sampling a single stage approximation is used. For
more information see the [TODO](https://xkdr.github.io/Survey.jl/dev/) section of
the documentation.

## Documentation
See [Documentation](https://xkdr.github.io/Survey.jl/dev/) to learn how to use the package

## How to install
## Installation
```julia
] add "https://github.com/xKDR/Survey.jl.git"
```

## Basic usage

### Simple Random Sample
The `SurveyDesign` constructor can take data corresponding to any type of design.
Depending on the keyword arguments passed, the data is processed in order to obtain
correct results for the given design.

In the following example, we will load a simple random sample of the Academic Performance Index dataset for Californian schools and do basic analysis.
```julia
using Survey
The following examples show how to create and manipulate different survey designs
using the [Academic Performance Index dataset for Californian schools](https://r-survey.r-forge.r-project.org/survey/html/api.html).

### Constructing a survey design

A survey design can be created by calling the constructor with some keywords,
depending on the survey type. Let's create a simple random sample, a stratified
sample, a single-stage and a two-stage cluster sample.

srs = load_data("apisrs")
```julia
julia> apisrs = load_data("apisrs");

julia> srs = SurveyDesign(apisrs; weights=:pw)
SurveyDesign:
data: 200×47 DataFrame
strata: none
cluster: none
popsize: [6190.0, 6190.0, 6190.0 … 6190.0]
sampsize: [200, 200, 200 … 200]
weights: [31.0, 31.0, 31.0 … 31.0]
probs: [0.0323, 0.0323, 0.0323 … 0.0323]

julia> apistrat = load_data("apistrat");

julia> strat = SurveyDesign(apistrat; strata=:stype, weights=:pw)
SurveyDesign:
data: 200×46 DataFrame
strata: stype
[E, E, E … H]
cluster: none
popsize: [6190.0, 6190.0, 6190.0 … 6190.0]
sampsize: [200, 200, 200 … 200]
weights: [44.2, 44.2, 44.2 … 15.1]
probs: [0.0226, 0.0226, 0.0226 … 0.0662]

julia> apiclus1 = load_data("apiclus1");

julia> clus_one_stage = SurveyDesign(apiclus1; clusters=:dnum, weights=:pw)
SurveyDesign:
data: 183×46 DataFrame
strata: none
cluster: dnum
[637, 637, 637 … 448]
popsize: [6190.0, 6190.0, 6190.0 … 6190.0]
sampsize: [15, 15, 15 … 15]
weights: [33.8, 33.8, 33.8 … 33.8]
probs: [0.0295, 0.0295, 0.0295 … 0.0295]

julia> apiclus2 = load_data("apiclus2");

julia> clus_two_stage = SurveyDesign(apiclus2; clusters=[:dnum, :snum], weights=:pw)
SurveyDesign:
data: 126×47 DataFrame
strata: none
cluster: dnum
[15, 63, 83 … 795]
popsize: [5130.0, 5130.0, 5130.0 … 5130.0]
sampsize: [40, 40, 40 … 40]
weights: [18.9, 18.9, 18.9 … 18.9]
probs: [0.0528, 0.0528, 0.0528 … 0.0528]
```
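
As a rough illustration of how the `popsize`, `sampsize`, `weights` and `probs` fields printed for the simple random sample relate to each other, the sketch below uses plain Julia only. It assumes an equal-probability design, where each weight is the population size divided by the sample size and each probability is the reciprocal of the weight. The variable names are illustrative and are not part of the package.

```julia
# Values taken from the `srs` design printed in the README example.
popsize = 6190.0
sampsize = 200

# Assumption: equal-probability sampling, so every unit carries the same weight.
weight = popsize / sampsize       # population units represented by each sampled unit
prob = 1 / weight                 # selection probability is the reciprocal of the weight

weights = fill(weight, sampsize)  # one weight per sampled row

println("weight ≈ ", round(weight; digits=2))
println("prob   ≈ ", round(prob; digits=4))
println("sum of weights ≈ ", sum(weights))
```

Under this assumption the weights sum back to the population size, which is a quick sanity check on any weight column.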

dsrs = SimpleRandomSample(srs; weights = :pw)
Using these designs we can compute estimates of statistics such as mean and
population total. The designs must first be resampled using
[bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) in order
to compute the standard errors.
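
To make the idea of a bootstrap standard error concrete, here is a minimal sketch in plain Julia. It uses the naive bootstrap (resample the observations with replacement, recompute the mean, take the standard deviation of the replicate means) on simulated data. The function `bootstrap_se` and the data are hypothetical, and `bootweights` in the package may use a different, design-aware resampling scheme.

```julia
using Random, Statistics

# Hypothetical helper: naive bootstrap standard error of the sample mean.
function bootstrap_se(x::AbstractVector{<:Real};
                      replicates::Integer=1000,
                      rng::AbstractRNG=Random.default_rng())
    n = length(x)
    estimates = Vector{Float64}(undef, replicates)
    for r in 1:replicates
        resample = rand(rng, x, n)        # draw n values with replacement
        estimates[r] = mean(resample)     # replicate estimate of the mean
    end
    return std(estimates)                 # spread of the replicate estimates
end

rng = MersenneTwister(1234)
scores = 650 .+ 130 .* randn(rng, 200)    # simulated scores, not real API data

se = bootstrap_se(scores; replicates=1000, rng=rng)
println("bootstrap SE of the mean: ", se)
println("analytic SE of the mean:  ", std(scores) / sqrt(length(scores)))
```

For a simple random sample the two printed values should be close, which is why the number of replicates mainly affects the stability of the standard error rather than the point estimate.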

mean(:api00, dsrs)
```julia
julia> bootsrs = bootweights(srs; replicates=1000)
ReplicateDesign:
data: 200×1047 DataFrame
strata: none
cluster: none
popsize: [6190.0, 6190.0, 6190.0 … 6190.0]
sampsize: [200, 200, 200 … 200]
weights: [31.0, 31.0, 31.0 … 31.0]
probs: [0.0323, 0.0323, 0.0323 … 0.0323]
replicates: 1000

julia> mean(:api00, bootsrs)
1×2 DataFrame
Row │ mean SE
│ Float64 Float64
Row │ mean SE
│ Float64 Float64
─────┼──────────────────
1 │ 656.585 9.24972
1 │ 656.585 9.5409

total(:enroll, dsrs)
julia> total(:enroll, bootsrs)
1×2 DataFrame
Row │ total SE
│ Float64 Float64
─────┼─────────────────────
1 │ 3.62107e6 1.6952e5

mean(:api00, :cname, dsrs)
38×3 DataFrame
Row │ cname mean SE
│ String15 Float64 Float64
─────┼────────────────────────────────────
1 │ Kern 573.6 42.8026
2 │ Los Angeles 658.156 21.0728
3 │ Orange 749.333 27.0613
36 │ Napa 727.0 46.722
37 │ Lake 804.0 NaN
38 │ Merced 595.0 NaN

quantile(:enroll,dsrs,[0.1,0.2,0.5,0.75,0.95])
5×2 DataFrame
Row │ probability quantile
│ Float64 Float64
─────┼───────────────────────
1 │ 0.1 245.5
2 │ 0.2 317.6
3 │ 0.5 453.0
4 │ 0.75 668.5
5 │ 0.95 1473.1
Row │ total SE
│ Float64 Float64
─────┼──────────────────────
1 │ 3.62107e6 1.72846e5
```
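
For intuition about the `mean` and `total` columns, the sketch below computes weighted point estimates in plain Julia: the total is the weighted sum of the variable, and the mean is that total divided by the sum of the weights. The data and the helper names `weighted_total` and `weighted_mean` are hypothetical and only illustrate the standard estimators, not the package internals.

```julia
# Hypothetical toy data: three sampled units and their sampling weights.
y = [10.0, 20.0, 30.0]
w = [2.0, 2.0, 4.0]

# Estimated population total: each observation counts `w` times.
weighted_total(y, w) = sum(w .* y)

# Estimated population mean: the total divided by the estimated population size.
weighted_mean(y, w) = weighted_total(y, w) / sum(w)

println(weighted_total(y, w))  # 2*10 + 2*20 + 4*30 = 180.0
println(weighted_mean(y, w))   # 180 / 8 = 22.5
```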

### Stratified Sample

In the following example, we will load a stratified sample of the Academic Performance Index dataset for Californian schools and do basic analysis.
Now we know the mean academic performance index from the year 2000 and the total
number of students enrolled in the sampled Californian schools. We can also
calculate the statistic of multiple variables in one go...

```julia
using Survey
julia> mean([:api99, :api00], bootsrs)
2×3 DataFrame
Row │ names mean SE
│ String Float64 Float64
─────┼──────────────────────────
1 │ api99 624.685 9.84669
2 │ api00 656.585 9.5409
```

... or we can calculate domain estimates:

strat = load_data("apistrat")
```julia
julia> total(:enroll, :cname, bootsrs)
38×3 DataFrame
Row │ cname total SE
│ String15 Float64 Any
─────┼────────────────────────────────────────────
1 │ Kern 1.95823e5 74731.2
2 │ Los Angeles 867129.0 1.36622e5
3 │ Orange 1.68786e5 63858.0
4 │ San Luis Obispo 6720.49 6790.49
35 │ Calaveras 12976.4 13241.6
36 │ Napa 39239.0 30181.9
37 │ Lake 6410.79 6986.29
38 │ Merced 15392.1 15202.2
30 rows omitted
```

dstrat = StratifiedSample(strat, :stype; weights = :pw, popsize = :fpc)
This gives us the total number of enrolled students in each county.
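
A domain estimate can be thought of as the same weighted estimator applied within each group. The sketch below, in plain Julia with made-up data, accumulates weighted totals per county. The function `domain_totals` is a hypothetical helper written for this illustration and is not the package's implementation.

```julia
# Hypothetical toy data: enrollment, sampling weight and county for five schools.
enroll = [400.0, 250.0, 600.0, 300.0, 500.0]
weight = [31.0, 31.0, 31.0, 31.0, 31.0]
county = ["Kern", "Orange", "Kern", "Napa", "Orange"]

# Accumulate the weighted total of `values` separately for each domain.
function domain_totals(values, weights, domains)
    totals = Dict{eltype(domains), Float64}()
    for (v, w, d) in zip(values, weights, domains)
        totals[d] = get(totals, d, 0.0) + w * v
    end
    return totals
end

totals = domain_totals(enroll, weight, county)
for name in sort(collect(keys(totals)))
    println(name, " => ", totals[name])
end
```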

mean(:api00, dstrat)
1×2 DataFrame
Row │ mean SE
│ Float64 Float64
─────┼──────────────────
1 │ 662.287 9.40894
All functionalities are supported by each design type. For a more complete guide,
see the [Tutorial](https://xkdr.github.io/Survey.jl/dev/#Basic-demo) section in
the documentation.

total(:api00, dstrat)
1×2 DataFrame
Row │ total SE
│ Float64 Float64
─────┼────────────────────
1 │ 4.10221e6 58279.0

mean(:api00, :cname, dstrat)
40×3 DataFrame
Row │ cname mean SE
│ String15 Float64 Float64
─────┼───────────────────────────────────────
1 │ Los Angeles 633.511 21.3912
2 │ Ventura 707.172 31.6856
3 │ Kern 678.235 53.1337
39 │ Mendocino 632.018 1.04942
40 │ Butte 627.0 0.0
```
## Goals

## Strategic goals
We want to implement all the features provided by the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html)
We want to implement all the features provided by the
[Survey package in R](https://cran.r-project.org/web/packages/survey/index.html)
in a Julia-native way. The main goal is to have a complete package that provides
a large range of functionality and takes efficiency into consideration, such that
large surveys can be analysed fast.

The [milestones](https://github.com/xKDR/Survey.jl/milestones) sections of the repository contains a list of features that contributors can implement in the short-term.
The [milestones](https://github.com/xKDR/Survey.jl/milestones) section of the repository
contains a list of features that contributors can implement in the short-term.

## Support

16 changes: 12 additions & 4 deletions docs/make.jl
@@ -7,6 +7,7 @@ DocMeta.setdocmeta!(Survey, :DocTestSetup, :(using Survey); recursive=true)
makedocs(;
modules=[Survey],
authors="xKDR Forum",
# doctest = :fix,
repo="https://github.com/xKDR/Survey.jl/blob/{commit}{path}#{line}",
sitename="$Survey.jl",
format=Documenter.HTML(;
@@ -16,14 +17,21 @@ makedocs(;
),
pages=[
"Home" => "index.md",
"Moving from R" => "R_comparison.md",
"API reference" => "api.md"
"Getting Started" => "getting_started.md",
"Manual" => [
"DataFrames in Survey" => "man/dataframes.md",
"ReplicateDesign" => "man/replicate.md",
"Plotting" => "man/plotting.md",
"Comparison with other survey analysis tools" => "man/comparisons.md",
"Future plans" => "man/future.md",
],
"API reference" => "api.md",
],
checkdocs=:exports,
)

deploydocs(;
repo="github.com/xKDR/Survey.jl",
target = "build",
devbranch="main"
target="build",
devbranch="main",
)