Rearrange sections and other changes. #106

Merged
merged 18 commits into from
Dec 3, 2022
5 changes: 2 additions & 3 deletions docs/make.jl
@@ -15,9 +15,8 @@ makedocs(;
),
pages=[
"Home" => "index.md",
"Examples" => "examples.md",
"Comparison with R" => "R_comparison.md",
"Performance" => "performance.md",
"Moving from R" => "R_comparison.md",
"API reference" => "api.md"
],
checkdocs=:exports,
)
131 changes: 77 additions & 54 deletions docs/src/R_comparison.md
@@ -1,98 +1,121 @@
# Comparison with R
# Moving from R to Julia
This section presents examples to help move from R to Julia. Examples show R and Julia code for common operations in survey analysis. <br>
For the same operation, first the R and then the Julia code is presented.

In the following examples, we'll compare Julia's performance to R's on the same set of operations.
## Simple random sample

## Installing and loading the package
**R**
The `apisrs` data, which is provided in both `survey` and `Survey.jl`, is used as an example. It's a simple random sample of the Academic Performance Index of Californian schools.

```r
install.packages("survey")
### 1. Creating a survey design
The following example instantiates a simple random sample survey design.

```R
library(survey)
data(api)
dsrs = svydesign(id = ~1, data = apisrs, weights = ~pw, fpc = ~fpc)
```

**Julia**
```julia
using Pkg
Pkg.add(url = "https://github.com/xKDR/Survey.jl.git")
using Survey
srs = load_data("apisrs")
dsrs = SimpleRandomSample(srs; popsize = :fpc)
```

The following command in the Pkg REPL may also be used to install the package.
### 2. Mean
In the following example the mean of the variable `api00` is calculated.

```R
svymean(~api00, dsrs)
```
add "https://github.com/xKDR/Survey.jl.git"
```julia
mean(:api00, dsrs)
```

## API data
### 3. Total
In the following example the estimated population total of the variable `api00` is calculated.

The Academic Performance Index is computed for all California schools based on standardised testing of students. The [data sets](https://cran.r-project.org/web/packages/survey/survey.pdf) contain information for all schools with at least 100 students and for various probability samples of the data. apiclus1 is a cluster sample of school districts, apistrat is a sample stratified by stype.
```R
svytotal(~api00, dsrs)
```
```julia
total(:api00, dsrs)
```

In the following examples, we'll use the apiclus1 data from the api dataset.
### 4. Quantile
In the following example the median of the variable `api00` is calculated.
```R
svyquantile(~api00, dsrs, 0.5)
```
```julia
quantile(:api00, dsrs, 0.5)
```

The api dataset can be loaded using the following command:
### 5. Domain estimation
In the following example the mean of the variable `api00` is calculated grouped by the variable `cname`.

**R**
```r
data(api)
```R
svyby(~api00, ~cname, dsrs, svymean)
```

**Julia**
```julia
apiclus1 = load_data("apiclus1")
by(:api00, :cname, dsrs, mean)
```

## svydesign
[The ```svydesign``` object combines a data frame and all the survey design information needed to analyse it.](https://www.rdocumentation.org/packages/survey/versions/4.1-1/topics/svydesign)
## Stratified sample

A ```design``` object can be constructed with the following command:
The `apistrat` data, which is provided in both `survey` and `Survey.jl`, is used as an example. It's a stratified sample of the Academic Performance Index of Californian schools.

**R**
```r
dclus1 <-svydesign(id = ~1, weights = ~pw, data = apiclus1, fpc = ~fpc)
### 1. Creating a design object
The following example shows how to construct a design object for a stratified sample.

```R
library(survey)
data(api)
dstrat = svydesign(id = ~1, data = apistrat, strata = ~stype, weights = ~pw, fpc = ~fpc)
```

**Julia**
```julia
dclus1 = design(id = :1, weights = :pw, data = apiclus1, fpc = :fpc)
using Survey
strat = load_data("apistrat")
dstrat = StratifiedSample(strat, :stype; popsize = :fpc)
```

## by
The `by` function can be used to generate stratified estimates.

### Mean
Weighted mean of a variable by strata can be computed using the following command:
### 2. Mean
In the following example the mean of the variable `api00` is calculated.

**R**
```r
svyby(~api00, by = ~cname, design = dclus1, svymean)
```R
svymean(~api00, dstrat)
```

**Julia**
```julia
by(:api00, :cname, dclus1, mean)
mean(:api00, dstrat)
```

### Sum
Weighted sum of a variable by strata can be computed using the following command:
### 3. Total
In the following example the estimated population total of the variable `api00` is calculated.

**R**
```r
svyby(~api00, by = ~cname, design = dclus1, svytotal)
```R
svytotal(~api00, dstrat)
```
```julia
total(:api00, dstrat)
```

**Julia**
### 4. Quantile
In the following example the median of the variable `api00` is calculated.
```R
svyquantile(~api00, dstrat, 0.5)
```
```julia
by(:api00, :cname, dclus1, total)
quantile(:api00, dstrat, 0.5)
```

### Quantile
Weighted quantile of a variable by strata can be computed using the following command:
### 5. Domain estimation
In the following example the mean of the variable `api00` is calculated grouped by the variable `cname`.

**R**
```r
svyby(~api00, by = ~cname, design = dclus1, svyquantile, quantile = 0.63)
```R
svyby(~api00, ~cname, dstrat, svymean)
```

**Julia**
```julia
by(:api00, :cname, dclus1, quantile, 0.63)
```
by(:api00, :cname, dstrat, mean)
```
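
As a rough illustration of what a stratified mean involves, the textbook estimator weights each stratum's sample mean by that stratum's share of the population. The sketch below uses only the `Statistics` standard library. The helper `stratified_mean` and the toy data are hypothetical and are not part of `Survey.jl`, whose internals may differ.

```julia
using Statistics

# Illustrative helper (not part of Survey.jl): textbook stratified mean,
# i.e. the sum over strata of (N_h / N) * mean of the sample in stratum h.
function stratified_mean(y::AbstractVector{<:Real}, strata::AbstractVector, popsizes::Dict)
    N = sum(values(popsizes))                      # total population size
    return sum(popsizes[h] / N * mean(y[strata .== h]) for h in keys(popsizes))
end

# Toy data (assumed for illustration): two strata with known population sizes.
y = [10.0, 12.0, 20.0, 24.0]
strata = ["E", "E", "H", "H"]
popsizes = Dict("E" => 300, "H" => 100)

est = stratified_mean(y, strata, popsizes)
println(est)
```

Here stratum `"E"` holds three quarters of the population, so its sample mean contributes three quarters of the estimate.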
39 changes: 39 additions & 0 deletions docs/src/api.md
@@ -0,0 +1,39 @@
# API

## Index

```@index
Module = [Survey]
Order = [:type, :function]
Private = false
```
Survey data can be loaded from a `DataFrame` into a survey design. The package currently supports simple random sample and stratified sample designs.
```@docs
AbstractSurveyDesign
SimpleRandomSample
StratifiedSample
```

```@docs
load_data
Survey.mean(x::Symbol, design::SimpleRandomSample)
total(x::Symbol, design::SimpleRandomSample)
quantile
```

It is often required to estimate population parameters for sub-populations of interest. For example, you may have a sample of heights, but you want the average heights of males and females separately.
```@docs
by
```
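
The heights example above can be sketched in plain Julia. This is a minimal, unweighted illustration that assumes every observation carries the same weight, as in a simple random sample. The helper `domain_means` and the toy data are hypothetical and are not the implementation of `by`.

```julia
using Statistics

# Toy data (assumed for illustration): heights in cm with a sex label.
heights = [178.0, 165.0, 182.0, 160.0, 171.0, 158.0]
sex = ["M", "F", "M", "F", "M", "F"]

# Illustrative helper (not part of Survey.jl): split the observations by
# domain and average within each domain.
function domain_means(values::AbstractVector{<:Real}, groups::AbstractVector)
    result = Dict{eltype(groups), Float64}()
    for g in unique(groups)
        result[g] = mean(values[groups .== g])
    end
    return result
end

means = domain_means(heights, sex)
println(means)
```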
```@docs
plot(design::AbstractSurveyDesign, x::Symbol, y::Symbol; kwargs...)
boxplot(design::AbstractSurveyDesign, x::Symbol, y::Symbol; kwargs...)
hist(design::AbstractSurveyDesign, var::Symbol,
bins::Union{Integer, AbstractVector} = freedman_diaconis(design, var);
normalization = :density,
kwargs...
)
dim(design::AbstractSurveyDesign)
dimnames(design::AbstractSurveyDesign)
colnames(design::AbstractSurveyDesign)
```
62 changes: 0 additions & 62 deletions docs/src/examples.md

This file was deleted.

81 changes: 49 additions & 32 deletions docs/src/index.md
@@ -4,42 +4,59 @@ CurrentModule = Survey

# Survey

This package is the Julia implementation of the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html) developed by [Professor Thomas Lumley](https://www.stat.auckland.ac.nz/people/tlum005).
This package is used to study complex survey data. It aims to be a fast alternative to the [Survey package in R](https://cran.r-project.org/web/packages/survey/index.html) developed by [Professor Thomas Lumley](https://www.stat.auckland.ac.nz/people/tlum005).

## The need for moving the code to Julia.
This package currently supports simple random sample and stratified sample designs. Future releases will support multistage sampling as well.

At [xKDR](https://xkdr.org/) we processed millions of records from household surveys using the survey package in R. This process took hours of computing time. By implementing the code in Julia, we are able to do the processing in seconds. In this package we have implemented the functions `mean`, `quantile` and `sum`. We have kept the syntax between the two packages similar so that we can easily move our existing code to the new language.
## Basic demo

## Index
The following demo uses the
[Academic Performance Index](https://r-survey.r-forge.r-project.org/survey/html/api.html)
(API) dataset for Californian schools. The data sets contain information for all schools
with at least 100 students and for various probability samples of the data.

```@index
Module = [Survey]
Order = [:type, :function]
Private = false
The API program was discontinued at the end of 2018. Information is archived at
[https://www.cde.ca.gov/re/pr/api.asp](https://www.cde.ca.gov/re/pr/api.asp).

Firstly, a survey design needs a dataset from which to gather information.


The sample datasets provided with the package can be loaded as `DataFrame`s using the `load_data` function:

```julia
julia> apisrs = load_data("apisrs");
```
`apisrs` is a simple random sample of the Academic Performance Index of Californian schools.

Next, we can build a design. The design corresponding to a simple random sample is [`SimpleRandomSample`](@ref), which can be instantiated by calling the constructor:

```julia
julia> srs = SimpleRandomSample(apisrs; weights = :pw)
SimpleRandomSample:
data: 200x42 DataFrame
weights: 31.0, 31.0, 31.0, ..., 31.0
probs: 0.0323, 0.0323, 0.0323, ..., 0.0323
fpc: 6194, 6194, 6194, ..., 6194
popsize: 6194
sampsize: 200
sampfraction: 0.0323
ignorefpc: false
```
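
The printed fields are related by simple arithmetic. The sketch below, which uses only the `popsize` and `sampsize` values shown in the output above, recovers the sampling fraction and the per-unit weight. It assumes equal-probability sampling and is not taken from the package source.

```julia
# Values copied from the design output above.
popsize = 6194
sampsize = 200

# Under equal-probability sampling, each unit is selected with probability n / N
# and represents N / n population units.
sampfraction = sampsize / popsize   # about 0.0323, matching `probs` and `sampfraction`
weight = popsize / sampsize         # about 30.97, shown rounded as 31.0 in `weights`

println((sampfraction, weight))
```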

With a `SimpleRandomSample` (as well as with any subtype of [`AbstractSurveyDesign`](@ref)) it is possible to calculate estimates of the mean, population total, etc., for a given variable, along with the corresponding standard errors.

```julia
julia> mean(:api00, srs)
1×2 DataFrame
 Row │ mean     sem
     │ Float64  Float64
─────┼──────────────────
   1 │ 656.585  9.24972

## API
```@docs
AbstractSurveyDesign
SimpleRandomSample
StratifiedSample
ClusterSample
design
load_data
mean(x::Symbol, design::SimpleRandomSample)
total(x::Symbol, design::SimpleRandomSample)
quantile
by
colnames(design::AbstractSurveyDesign)
dim(design::AbstractSurveyDesign)
dimnames(design::AbstractSurveyDesign)
plot(design::AbstractSurveyDesign, x::Symbol, y::Symbol; kwargs...)
boxplot(design::AbstractSurveyDesign, x::Symbol, y::Symbol; kwargs...)
hist(design::AbstractSurveyDesign, var::Symbol,
bins::Union{Integer, AbstractVector} = freedman_diaconis(design, var);
normalization = :density,
kwargs...
)
freedman_diaconis
sturges
julia> total(:api00, srs)
1×2 DataFrame
 Row │ total      se_total
     │ Float64    Float64
─────┼─────────────────────
   1 │ 4.06689e6   57292.8
```
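
The two results in the demo are consistent with each other: the total is the mean scaled by `popsize` (6194 × 656.585 ≈ 4.06689e6), and the same holds for the standard errors (6194 × 9.24972 ≈ 57292.8). The sketch below shows the textbook simple random sample estimators with a finite population correction. It uses only the `Statistics` standard library. `srs_estimates` and the toy data are hypothetical, and the package's own implementation may differ.

```julia
using Statistics

# Illustrative helper (not part of Survey.jl): textbook estimators for a
# simple random sample of size n drawn from a population of size N.
function srs_estimates(y::AbstractVector{<:Real}, popsize::Integer)
    n = length(y)
    ybar = mean(y)
    fpc = 1 - n / popsize                 # finite population correction
    sem = sqrt(fpc * var(y) / n)          # standard error of the mean
    return (mean = ybar, sem = sem,
            total = popsize * ybar,       # total = N * mean
            se_total = popsize * sem)     # its standard error scales the same way
end

# Toy data (assumed for illustration).
est = srs_estimates([2.0, 4.0, 6.0, 8.0], 100)
println(est)
```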