xKDR · iuliadmtru · Jan 11, 2023
diff --git a/README.md b/README.md
@@ -6,111 +6,262 @@
 [![codecov](https://codecov.io/gh/xKDR/Survey.jl/branch/main/graph/badge.svg?token=4PFSF47BT2)](https://codecov.io/gh/xKDR/Survey.jl)
 [![Milestones](https://img.shields.io/badge/-milestones-brightgreen)](https://github.com/xKDR/Survey.jl/milestones)
 
+This package is used to study complex survey data. It aims to be a fast alternative
+to the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html)
+developed by [Professor Thomas Lumley](https://www.stat.auckland.ac.nz/people/tlum005).
 
-This package is used to study complex survey data. It aims to be a fast alternative to the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html) developed by [Professor Thomas Lumley](https://www.stat.auckland.ac.nz/people/tlum005).
+This package currently supports simple random samples, stratified samples, and
+one- and two-stage cluster samples, the latter using a single-stage approximation.
+For more details, see the [TODO](https://xkdr.github.io/Survey.jl/dev/) section of the
+documentation.
 
-This package currently supports simple random sample and stratified sample. In future releases, it will support multistage sampling as well. 
-
-## Documentation
-See [Documentation](https://xkdr.github.io/Survey.jl/dev/) to learn how to use the package 
-
-## How to install
+## Installation
 ```julia
 ]  add "https://github.com/xKDR/Survey.jl.git"
 ```
+
 ## Basic usage
 
-### Simple Random Sample
+The `SurveyDesign` constructor can take data corresponding to any type of design.
+Depending on the keyword arguments passed, the data is processed in order to obtain
+correct results for the given design.
+
+The following examples show how to create and manipulate different survey designs
+using the [Academic Performance Index dataset for Californian schools](https://r-survey.r-forge.r-project.org/survey/html/api.html).
+
+### Simple random sample
+
+A simple random sample can be created without specifying any special keywords. Here
+we will create a weighted simple random sample design.
 
-In the following example, we will load a simple random sample of the Academic Performance Index dataset for Californian schools and do basic analysis. 
 ```julia
-using Survey
+julia> apisrs = load_data("apisrs");
 
-srs = load_data("apisrs")
+julia> srs = SurveyDesign(apisrs; weights=:pw)
+SurveyDesign:
+data: 200x47 DataFrame
+cluster: false_cluster
+design.data[!,design.cluster]: 1, 2, 3, ..., 200
+popsize: popsize
+design.data[!,design.popsize]: 6190.0, 6190.0, 6190.0, ..., 6190.0
+sampsize: sampsize
+design.data[!,design.sampsize]: 200, 200, 200, ..., 200
+design.data[!,:probs]: 0.0323, 0.0323, 0.0323, ..., 0.0323
+design.data[!,:allprobs]: 0.0323, 0.0323, 0.0323, ..., 0.0323
+```
 
-dsrs = SimpleRandomSample(srs; weights = :pw)
+Using the `srs` design, we can compute estimates of statistics such as the mean and
+the population total. The design must first be resampled using
+[bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) in order
+to compute the standard errors.
 
-mean(:api00, dsrs)
+```julia
+julia> bootsrs = bootweights(srs; replicates=1000)
+ReplicateDesign:
+data: 200x1047 DataFrame
+cluster: false_cluster
+design.data[!,design.cluster]: 1, 2, 3, ..., 200
+popsize: popsize
+design.data[!,design.popsize]: 6190.0, 6190.0, 6190.0, ..., 6190.0
+sampsize: sampsize
+design.data[!,design.sampsize]: 200, 200, 200, ..., 200
+design.data[!,:probs]: 0.0323, 0.0323, 0.0323, ..., 0.0323
+design.data[!,:allprobs]: 0.0323, 0.0323, 0.0323, ..., 0.0323
+replicates: 1000
+
+julia> mean(:api00, bootsrs)
 1×2 DataFrame
- Row │ mean     SE      
-     │ Float64  Float64 
+ Row │ mean     SE
+     │ Float64  Float64
 ─────┼──────────────────
-   1 │ 656.585  9.24972
+   1 │ 656.585   9.5409
 
-total(:enroll, dsrs)
+julia> total(:enroll, bootsrs)
 1×2 DataFrame
- Row │ total      SE       
-     │ Float64    Float64  
-─────┼─────────────────────
-   1 │ 3.62107e6  1.6952e5  
+ Row │ total      SE
+     │ Float64    Float64
+─────┼──────────────────────
+   1 │ 3.62107e6  1.72846e5
+```
 
-mean(:api00, :cname, dsrs)
+Now we have estimates of the mean Academic Performance Index for the year 2000 and
+of the total number of students enrolled in Californian schools, each with its
+standard error. We can also compute a statistic for several variables in one call...
+
+```julia
+julia> mean([:api99, :api00], bootsrs)
+2×3 DataFrame
+ Row │ names   mean     SE
+     │ String  Float64  Float64
+─────┼──────────────────────────
+   1 │ api99   624.685  9.84669
+   2 │ api00   656.585  9.5409
+```
+
+... or we can calculate domain estimates:
+
+```julia
+julia> total(:enroll, :cname, bootsrs)
 38×3 DataFrame
- Row │ cname            mean     SE       
-     │ String15         Float64  Float64  
-─────┼────────────────────────────────────
-   1 │ Kern             573.6     42.8026
-   2 │ Los Angeles      658.156   21.0728
-   3 │ Orange           749.333   27.0613
-  ⋮  │        ⋮            ⋮        ⋮
-  36 │ Napa             727.0     46.722
-  37 │ Lake             804.0    NaN
-  38 │ Merced           595.0    NaN
-
-quantile(:enroll,dsrs,[0.1,0.2,0.5,0.75,0.95])
-5×2 DataFrame
- Row │ probability  quantile 
-     │ Float64      Float64  
-─────┼───────────────────────
-   1 │        0.1      245.5
-   2 │        0.2      317.6
-   3 │        0.5      453.0
-   4 │        0.75     668.5
-   5 │        0.95    1473.1
+ Row │ cname            total           SE
+     │ String15         Float64         Any
+─────┼────────────────────────────────────────────
+   1 │ Kern                  1.95823e5  74731.2
+   2 │ Los Angeles      867129.0        1.36622e5
+   3 │ Orange                1.68786e5  63858.0
+   4 │ San Luis Obispo    6720.49       6790.49
+  ⋮  │        ⋮               ⋮             ⋮
+  35 │ Calaveras         12976.4        13241.6
+  36 │ Napa              39239.0        30181.9
+  37 │ Lake               6410.79       6986.29
+  38 │ Merced            15392.1        15202.2
+                                   30 rows omitted
 ```
 
-### Stratified Sample
+This gives us the estimated total number of enrolled students in each county.
+
+### Stratified sample
 
-In the following example, we will load a stratified sample of the Academic Performance Index dataset for Californian schools and do basic analysis. 
+All the functionality described above is also supported for stratified sample
+designs. To create a stratified sample, pass the `strata` keyword argument to
+`SurveyDesign`.
 
 ```julia
-using Survey
+julia> apistrat = load_data("apistrat");
 
-strat = load_data("apistrat")
+julia> strat = SurveyDesign(apistrat; strata=:stype, weights=:pw)
+SurveyDesign:
+data: 200x46 DataFrame
+cluster: false_cluster
+design.data[!,design.cluster]: 1, 2, 3, ..., 200
+popsize: popsize
+design.data[!,design.popsize]: 6190.0, 6190.0, 6190.0, ..., 6190.0
+sampsize: sampsize
+design.data[!,design.sampsize]: 200, 200, 200, ..., 200
+design.data[!,:probs]: 0.0226, 0.0226, 0.0226, ..., 0.0662
+design.data[!,:allprobs]: 0.0226, 0.0226, 0.0226, ..., 0.0662
 
-dstrat = StratifiedSample(strat, :stype; weights = :pw, popsize = :fpc)
 
-mean(:api00, dstrat)
-1×2 DataFrame
- Row │ mean     SE      
-     │ Float64  Float64 
-─────┼──────────────────
-   1 │ 662.287  9.40894
+julia> bootstrat = bootweights(strat; replicates=1000)
+ReplicateDesign:
+data: 200x1046 DataFrame
+cluster: false_cluster
+design.data[!,design.cluster]: 1, 2, 3, ..., 200
+popsize: popsize
+design.data[!,design.popsize]: 6190.0, 6190.0, 6190.0, ..., 6190.0
+sampsize: sampsize
+design.data[!,design.sampsize]: 200, 200, 200, ..., 200
+design.data[!,:probs]: 0.0226, 0.0226, 0.0226, ..., 0.0662
+design.data[!,:allprobs]: 0.0226, 0.0226, 0.0226, ..., 0.0662
+replicates: 1000
 
-total(:api00, dstrat)
-1×2 DataFrame
- Row │ total      SE      
-     │ Float64    Float64 
-─────┼────────────────────
-   1 │ 4.10221e6  58279.0
 
-mean(:api00, :cname, dstrat)
+julia> mean([:api99, :api00], bootstrat)
+2×3 DataFrame
+ Row │ names   mean     SE
+     │ String  Float64  Float64
+─────┼───────────────────────────
+   1 │ api99   629.395  10.08
+   2 │ api00   662.287   9.56931
+
+julia> mean(:api00, :cname, bootstrat)
 40×3 DataFrame
- Row │ cname           mean     SE           
-     │ String15        Float64  Float64      
+ Row │ cname           mean     SE
+     │ String15        Float64  Any
+─────┼──────────────────────────────────
+   1 │ Los Angeles     633.511  21.6242
+   2 │ Ventura         707.172  34.2091
+   3 │ Kern            678.235  57.651
+   4 │ San Diego       704.121  33.0882
+  ⋮  │       ⋮            ⋮        ⋮
+  37 │ Napa            660.0    0.0
+  38 │ Mariposa        706.0    0.0
+  39 │ Mendocino       632.018  1.70573
+  40 │ Butte           627.0    0.0
+                         32 rows omitted
+```
+
+### Cluster sample
+
+For now, the package supports one- and two-stage cluster sampling. Cluster designs
+are created by passing the `clusters` keyword argument to `SurveyDesign`.
+
+```julia
+julia> apiclus1 = load_data("apiclus1");
+
+julia> clus_one_stage = SurveyDesign(apiclus1; clusters=:dnum, weights=:pw)
+SurveyDesign:
+data: 183x46 DataFrame
+cluster: dnum
+design.data[!,design.cluster]: 637, 637, 637, ..., 448
+popsize: popsize
+design.data[!,design.popsize]: 6190.0, 6190.0, 6190.0, ..., 6190.0
+sampsize: sampsize
+design.data[!,design.sampsize]: 15, 15, 15, ..., 15
+design.data[!,:probs]: 0.0295, 0.0295, 0.0295, ..., 0.0295
+design.data[!,:allprobs]: 0.0295, 0.0295, 0.0295, ..., 0.0295
+
+
+julia> apiclus2 = load_data("apiclus2");
+
+julia> clus_two_stage = SurveyDesign(apiclus2; clusters=[:dnum, :snum], weights=:pw)
+SurveyDesign:
+data: 126x47 DataFrame
+cluster: dnum
+design.data[!,design.cluster]: 15, 63, 83, ..., 795
+popsize: popsize
+design.data[!,design.popsize]: 5130.0, 5130.0, 5130.0, ..., 5130.0
+sampsize: sampsize
+design.data[!,design.sampsize]: 40, 40, 40, ..., 40
+design.data[!,:probs]: 0.0528, 0.0528, 0.0528, ..., 0.0528
+design.data[!,:allprobs]: 0.0528, 0.0528, 0.0528, ..., 0.0528
+```
+
+Again, all the functionality shown above is supported for cluster sample designs.
+
+```julia
+julia> bootclus_one_stage = bootweights(clus_one_stage; replicates=1000);
+
+julia> total([:enroll, Symbol("api.stu")], bootclus_one_stage)
+2×3 DataFrame
+ Row │ names    total      SE
+     │ String   Float64    Float64
+─────┼───────────────────────────────
+   1 │ enroll   3.40494e6  9.4505e5
+   2 │ api.stu  2.89321e6  8.10919e5
+
+julia> bootclus_two_stage = bootweights(clus_two_stage; replicates=1000);
+
+julia> mean(:api00, :cname, bootclus_two_stage)
+26×3 DataFrame
+ Row │ cname            mean     SE
+     │ String15         Float64  Any
 ─────┼───────────────────────────────────────
-   1 │ Los Angeles     633.511  21.3912
-   2 │ Ventura         707.172  31.6856
-   3 │ Kern            678.235  53.1337
-  ⋮  │       ⋮            ⋮          ⋮
-  39 │ Mendocino       632.018   1.04942
-  40 │ Butte           627.0     0.0
+   1 │ Placer           821.0    0.0
+   2 │ Tuolumne         773.0    0.0
+   3 │ San Mateo        743.091  92.7257
+   4 │ San Luis Obispo  811.0    0.0
+  ⋮  │        ⋮            ⋮          ⋮
+  23 │ Monterey         720.5    6.50969e-15
+  24 │ Tulare           607.5    106.359
+  25 │ Stanislaus       730.4    3.32051e-14
+  26 │ Contra Costa     864.0    0.0
+                              18 rows omitted
 ```
 
-## Strategic goals
-We want to implement all the features provided by the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html)
+For a more complete guide, see the [Tutorial](https://xkdr.github.io/Survey.jl/dev/#Basic-demo)
+section in the documentation.
+
+## Future goals
+
+We want to implement all the features provided by the
+[Survey package in R](https://cran.r-project.org/web/packages/survey/index.html)
+in a Julia-native way. The main goal is a complete package that provides a wide
+range of functionality and is efficient enough that large surveys can be analysed
+quickly.
 
-The [milestones](https://github.com/xKDR/Survey.jl/milestones) sections of the repository contains a list of features that contributors can implement in the short-term.
+The [milestones](https://github.com/xKDR/Survey.jl/milestones) section of the repository
+contains a list of features that contributors can implement in the short term.
 
 ## Support