Merge pull request #96 from ayushpatnaikgit/design_update
Update readme according to the new design
smishr authored Nov 28, 2022
2 parents b0b10c1 + efd9b38 commit 004eb07
Showing 1 changed file with 71 additions and 98 deletions.
169 changes: 71 additions & 98 deletions README.md
@@ -7,121 +7,94 @@
[![Milestones](https://img.shields.io/badge/-milestones-brightgreen)](https://github.com/xKDR/Survey.jl/milestones)


This package is used to study complex survey data. It is inspired by the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html) developed by [Professor Thomas Lumley](https://www.stat.auckland.ac.nz/people/tlum005), and initial development attempts to replicate the basic functionality of that package with the speed and performance enhancement of Julia.
This package is used to study complex survey data. It aims to be a fast alternative to the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html) developed by [Professor Thomas Lumley](https://www.stat.auckland.ac.nz/people/tlum005).

The R `survey` package can take hours or days to analyse large survey datasets (more than a few GB in memory), such as [CMIE CPHS](https://consumerpyramidsdx.cmie.com). One of the key goals of `Survey.jl` is to use the power of Julia to speed up processing times.
This package currently supports simple random sampling and stratified sampling. In future releases, it will support multistage sampling as well.

## How to install

add "https://github.com/xKDR/Survey.jl.git"

## Basic usage

In the following example, we will load the Academic Performance Index dataset for Californian schools and produce the weighted mean for each county.
```julia
using Survey

srs_design = SimpleRandomSample(apisrs, weights = :pw)
## This function loads a commonly used dataset, Academic Performance Index (API), as an example.
## Any DataFrame object can be used with this package.

# dclus1 = svydesign(id = :1, weights = :pw, data = apiclus1)

# svyby(:api00, :cname, dclus1, svymean)
# 11×3 DataFrame
#  Row │ cname        mean     SE
#      │ String15     Float64  Float64
# ─────┼────────────────────────────────
#    1 │ Alameda      669.0    16.2135
#    2 │ Fresno       472.0    9.85278
#    3 │ Kern         452.5    29.5049
#    4 │ Los Angeles  647.267  23.5116
#    5 │ Mendocino    623.25   24.216
#    6 │ Merced       519.25   10.4925
#    7 │ Orange       710.562  28.9123
#    8 │ Plumas       709.556  13.2174
#    9 │ San Diego    659.436  12.2082
#   10 │ San Joaquin  551.189  11.578
#   11 │ Santa Clara  732.077  12.2291
```

This example is from the Survey package in R. The [examples section of the documentation](https://xkdr.github.io/Survey.jl/dev/examples/) shows the R and the Julia code side by side for this and a few other examples.

## Performance
We will measure the performance of R and Julia on the example shown above.

**R**

```R
library(survey)
library(microbenchmark)
data(api)
dclus1 <- svydesign(id = ~1, weights = ~pw, data = apiclus1)
microbenchmark(svyby(~api00, by = ~cname, design = dclus1, svymean), units = "us")
] add "https://github.com/xKDR/Survey.jl.git"
```
## Basic usage

```R
expr min lq
svyby(~api00, by = ~cname, design = dclus1, svymean) 10180.47 12102.61
mean median uq max neval
12734.43 12421.93 12788.55 17242.35 100
```
### Simple Random Sample

**Julia**
In the following example, we will load a simple random sample of the Academic Performance Index dataset for Californian schools and do basic analysis.
```julia
using Survey, BenchmarkTools
apiclus1 = load_data("apiclus1")
dclus1 = svydesign(id=:1, weights=:pw, data = apiclus1)
@benchmark svyby(:api00, :cname, dclus1, svymean)
```
using Survey

```julia
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  54.464 μs …   6.070 ms  ┊ GC (min … max): 0.00% … 94.01%
 Time  (median):     72.468 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   81.833 μs ± 190.657 μs  ┊ GC (mean ± σ):  7.62% ±  3.23%
```

Comparing median times (12421.93 μs in R against 72.468 μs in Julia), the Julia code is about 171 times faster than the R code.

We increase the complexity by grouping the data by two variables and then performing the same operations.
**R**

```R
library(survey)
library(microbenchmark)
data(api)
dclus1 <- svydesign(id = ~1, weights = ~pw, data = apiclus1)
microbenchmark(svyby(~api00, by = ~cname+meals, design = dclus1, svymean, keep.var = FALSE), units = "us")
srs = load_data("apisrs")

dsrs = SimpleRandomSample(srs; weights = :pw)

svymean(:api00, dsrs)
1×2 DataFrame
 Row │ mean     sem
     │ Float64  Float64
─────┼──────────────────
   1 │ 656.585  9.24972

svytotal(:enroll, dsrs)
1×2 DataFrame
 Row │ total      se_total
     │ Float64    Float64
─────┼─────────────────────
   1 │ 3.62107e6  1.6952e5

svyby(:api00, :cname, dsrs, svymean)
38×3 DataFrame
 Row │ cname        mean     sem
     │ String15     Float64  Float64
─────┼────────────────────────────────────
   1 │ Kern         573.6    42.8026
   2 │ Los Angeles  658.156  21.0728
   3 │ Orange       749.333  27.0613
  ⋮  │      ⋮          ⋮        ⋮
  36 │ Napa         727.0    46.722
  37 │ Lake         804.0    NaN
  38 │ Merced       595.0    NaN
```

```R
Unit: microseconds
expr min lq
svyby(~api00, by = ~cname + meals, design = dclus1, svymean) 132468.1 149914
mean median uq max neval
166121.9 160571.3 172301.6 304979.2 100
```
### Stratified Sample

**Julia**
```julia
using Survey, BenchmarkTools
apiclus1 = load_data("apiclus1")
dclus1 = svydesign(id=:1, weights=:pw, data = apiclus1)
@benchmark svyby(:api00, [:cname, :meals], dclus1, svymean)
```
In the following example, we will load a stratified sample of the Academic Performance Index dataset for Californian schools and do basic analysis.

```julia
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  219.387 μs …   8.284 ms  ┊ GC (min … max):  0.00% … 90.94%
 Time  (median):     265.214 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   325.100 μs ± 513.020 μs  ┊ GC (mean ± σ):  14.23% ±  8.58%
```
using Survey

Comparing median times (160571.3 μs in R against 265.214 μs in Julia), the Julia code is about 605 times faster than the R code.
strat = load_data("apistrat")

dstrat = StratifiedSample(strat, :stype; weights = :pw, popsize = :fpc)

svymean(:api00, dstrat)
1×2 DataFrame
 Row │ Ȳ̂        SE
     │ Float64  Float64
─────┼──────────────────
   1 │ 662.287  9.40894

svytotal(:api00, dstrat)
1×2 DataFrame
 Row │ grand_total  SE
     │ Float64      Float64
─────┼──────────────────────
   1 │ 4.10221e6    58279.0

svyby(:api00, :cname, dstrat, svymean)
40×3 DataFrame
 Row │ cname        domain_mean  domain_mean_se
     │ String15     Float64      Float64
─────┼─────────────────────────────────────────────
   1 │ Los Angeles  633.511      21.3912
   2 │ Ventura      707.172      31.6856
   3 │ Kern         678.235      53.1337
  ⋮  │      ⋮            ⋮             ⋮
  38 │ Mariposa     706.0        0.0
  39 │ Mendocino    632.018      1.04942
  40 │ Butte        627.0        0.0
```

## Strategic goals

We want to implement all the features provided by the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html).

The [milestones](https://github.com/xKDR/Survey.jl/milestones) section of the repository contains a list of features that contributors can implement in the short term.