Merge pull request #133 from xKDR/design_update

Design update
xKDR · Dec 9, 2022 · de80168
2 parents 7b6bdb3 + 02d749b

Showing 45 changed files with 1,615 additions and 3,327 deletions.
diff --git a/.gitignore b/.gitignore
@@ -5,3 +5,7 @@
 /docs/Manifest.toml
 /docs/build/
 /test/Manifest.toml
+/dev/*
+.gitignore
+.DS_Store
+*.json
diff --git a/Project.toml b/Project.toml
@@ -7,8 +7,8 @@ version = "0.11.1"
 AlgebraOfGraphics = "cbdf2221-f076-402e-a563-3d30da359d67"
 CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
 CairoMakie = "13f3f980-e62b-5c42-98c6-ff1f3baf88f0"
+CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
 DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
-GLM = "38e38edf-8417-5370-95a0-9cbb8c7f171a"
 LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
 Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
 StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
@@ -18,6 +18,5 @@ AlgebraOfGraphics = "0.6"
 CSV = "0.10"
 CairoMakie = "0.8, 0.9, 0.10"
 DataFrames = "1"
-GLM = "1"
 StatsBase = "0.33"
 julia = "1"
diff --git a/README.md b/README.md
@@ -7,121 +7,104 @@
 [![Milestones](https://img.shields.io/badge/-milestones-brightgreen)](https://github.com/xKDR/Survey.jl/milestones)
 
 
-This package is used to study complex survey data. It is the Julia implementation of the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html) developed by [Professor Thomas Lumley](https://www.stat.auckland.ac.nz/people/tlum005).
+This package is used to study complex survey data. It aims to be a fast alternative to the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html) developed by [Professor Thomas Lumley](https://www.stat.auckland.ac.nz/people/tlum005).
 
-As the size of survey datasets have become larger, processing the records can take hours or days in R. We endeavour to solve this problem by implementing the Survey package in Julia.
+This package currently supports simple random sample and stratified sample. In future releases, it will support multistage sampling as well. 
 
 ## How to install
-
-    add "https://github.com/xKDR/Survey.jl.git"
-
-## Basic usage
-
-In the following example, we will load the Academic Performance Index dataset for Californian schools and produce the weighted mean for each county.
 ```julia
-using Survey
-
-apiclus1 = load_data("apiclus1")
-## This function loads a commonly used dataset, Academic Performance Index (API), as an example.
-## Any DataFrame object can be used with this package.
-
-dclus1 = svydesign(id = :1, weights = :pw, data = apiclus1)
-
-svyby(:api00, :cname, dclus1, svymean)
-11×3 DataFrame
- Row │ cname        mean     SE
-     │ String15     Float64  Float64
-─────┼────────────────────────────────
-   1 │ Alameda      669.0    16.2135
-   2 │ Fresno       472.0     9.85278
-   3 │ Kern         452.5    29.5049
-   4 │ Los Angeles  647.267  23.5116
-   5 │ Mendocino    623.25   24.216
-   6 │ Merced       519.25   10.4925
-   7 │ Orange       710.562  28.9123
-   8 │ Plumas       709.556  13.2174
-   9 │ San Diego    659.436  12.2082
-  10 │ San Joaquin  551.189  11.578
-  11 │ Santa Clara  732.077  12.2291
-```
-
-This example is from the Survey package in R. The [examples section of the documentation](https://xkdr.github.io/Survey.jl/dev/examples/) shows the R and the Julia code side by side for this and a few other examples.
-
-## Performance
-We will measure the performance of the R and Julia for the example shown above.
-
-**R**
-
-```R
-library(survey)
-library(microbenchmark)
-data(api)
-dclus1 <- svydesign(id = ~1, weights = ~pw, data = apiclus1)
-microbenchmark(svyby(~api00, by = ~cname, design = dclus1, svymean), units = "us")
+]  add "https://github.com/xKDR/Survey.jl.git"
 ```
+## Basic usage
 
-```R
-                                                 expr      min       lq
- svyby(~api00, by = ~cname, design = dclus1, svymean) 10180.47 12102.61
-     mean   median       uq      max neval
- 12734.43 12421.93 12788.55 17242.35   100
-```
+### Simple Random Sample
 
-**Julia**
+In the following example, we will load a simple random sample of the Academic Performance Index dataset for Californian schools and do basic analysis. 
 ```julia
-using Survey, BenchmarkTools
-apiclus1 = load_data("apiclus1")
-dclus1 = svydesign(id=:1, weights=:pw, data = apiclus1)
-@benchmark svyby(:api00, :cname, dclus1, svymean)
-```
+using Survey
 
-```julia
-BenchmarkTools.Trial: 10000 samples with 1 evaluation.
- Range (min … max):  54.464 μs …   6.070 ms  ┊ GC (min … max): 0.00% … 94.01%
- Time  (median):     72.468 μs               ┊ GC (median):    0.00%
- Time  (mean ± σ):   81.833 μs ± 190.657 μs  ┊ GC (mean ± σ):  7.62% ±  3.23%
- ```
-
-The Julia code is about 171 times faster than the R code.
-
-We increase the complexity by grouping the data by two variables and then performing the same operations.
-**R**
-
-```R
-library(survey)
-library(microbenchmark)
-data(api)
-dclus1 <- svydesign(id = ~1, weights = ~pw, data = apiclus1)
-microbenchmark(svyby(~api00, by = ~cname+meals, design = dclus1, svymean, keep.var = FALSE), units = "us")
+srs = load_data("apisrs")
+
+dsrs = SimpleRandomSample(srs; weights = :pw)
+
+mean(:api00, dsrs)
+1×2 DataFrame
+ Row │ mean     SE      
+     │ Float64  Float64 
+─────┼──────────────────
+   1 │ 656.585  9.24972
+
+total(:enroll, dsrs)
+1×2 DataFrame
+ Row │ total      SE       
+     │ Float64    Float64  
+─────┼─────────────────────
+   1 │ 3.62107e6  1.6952e5  
+
+mean(:api00, :cname, dsrs)
+38×3 DataFrame
+ Row │ cname            mean     SE       
+     │ String15         Float64  Float64  
+─────┼────────────────────────────────────
+   1 │ Kern             573.6     42.8026
+   2 │ Los Angeles      658.156   21.0728
+   3 │ Orange           749.333   27.0613
+  ⋮  │        ⋮            ⋮        ⋮
+  36 │ Napa             727.0     46.722
+  37 │ Lake             804.0    NaN
+  38 │ Merced           595.0    NaN
+
+quantile(:enroll,dsrs,[0.1,0.2,0.5,0.75,0.95])
+5×2 DataFrame
+ Row │ probability  quantile 
+     │ Float64      Float64  
+─────┼───────────────────────
+   1 │        0.1      245.5
+   2 │        0.2      317.6
+   3 │        0.5      453.0
+   4 │        0.75     668.5
+   5 │        0.95    1473.1
 ```
 
-```R
-Unit: microseconds
-                                                         expr      min     lq
- svyby(~api00, by = ~cname + meals, design = dclus1, svymean) 132468.1 149914
-     mean   median       uq      max neval
- 166121.9 160571.3 172301.6 304979.2   100
-```
+### Stratified Sample
 
-**Julia**
-```julia
-using Survey, BenchmarkTools
-apiclus1 = load_data("apiclus1")
-dclus1 = svydesign(id=:1, weights=:pw, data = apiclus1)
-@benchmark svyby(:api00, [:cname, :meals], dclus1, svymean)
-```
+In the following example, we will load a stratified sample of the Academic Performance Index dataset for Californian schools and do basic analysis. 
 
 ```julia
-BenchmarkTools.Trial: 10000 samples with 1 evaluation.
- Range (min … max):  219.387 μs …   8.284 ms  ┊ GC (min … max):  0.00% … 90.94%
- Time  (median):     265.214 μs               ┊ GC (median):     0.00%
- Time  (mean ± σ):   325.100 μs ± 513.020 μs  ┊ GC (mean ± σ):  14.23% ±  8.58%
- ```
+using Survey
 
-The Julia code is about 605 times faster than the R code.
+strat = load_data("apistrat")
+
+dstrat = StratifiedSample(strat, :stype; weights = :pw, popsize = :fpc)
+
+mean(:api00, dstrat)
+1×2 DataFrame
+ Row │ mean     SE      
+     │ Float64  Float64 
+─────┼──────────────────
+   1 │ 662.287  9.40894
+
+total(:api00, dstrat)
+1×2 DataFrame
+ Row │ total      SE      
+     │ Float64    Float64 
+─────┼────────────────────
+   1 │ 4.10221e6  58279.0
+
+mean(:api00, :cname, dstrat)
+40×3 DataFrame
+ Row │ cname           mean     SE           
+     │ String15        Float64  Float64      
+─────┼───────────────────────────────────────
+   1 │ Los Angeles     633.511  21.3912
+   2 │ Ventura         707.172  31.6856
+   3 │ Kern            678.235  53.1337
+  ⋮  │       ⋮            ⋮          ⋮
+  39 │ Mendocino       632.018   1.04942
+  40 │ Butte           627.0     0.0
+```
 
 ## Strategic goals
-
 We want to implement all the features provided by the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html)
 
 The [milestones](https://github.com/xKDR/Survey.jl/milestones) sections of the repository contains a list of features that contributors can implement in the short-term.

diff --git a/docs/make.jl b/docs/make.jl
@@ -15,10 +15,10 @@ makedocs(;
     ),
     pages=[
         "Home" => "index.md",
-        "Examples" => "examples.md",
-        "Comparison with R" => "R_comparison.md",
-        "Performance" => "performance.md",
+        "Moving from R" => "R_comparison.md",
+        "API reference" => "api.md"
     ],
+    checkdocs=:exports,
 )
 
 deploydocs(;