New data structures to support multiple replicate methods. #297

Merged
merged 21 commits into from
May 11, 2023
Changes from 10 commits
Commits
6346972
Adding new structs for different replicate methods, and exporting them.
codetalker7 May 2, 2023
ca964b1
Incorporating inference method to the `ReplicateDesign` struct, and
codetalker7 May 2, 2023
be94678
Incorporating inference method in `bootweights`.
codetalker7 May 2, 2023
084c900
Incorporating inference method in `jackknifeweights`.
codetalker7 May 2, 2023
662d356
Forcing `ReplicateType` to be a subtype of `InferenceMethod`.
codetalker7 May 2, 2023
95803b2
Correcting docstrings in `src/SurveyDesign.jl`.
codetalker7 May 2, 2023
dfc4f4a
Fixing documentation for `bootweights` and `jackknife.jl`. Also forcing
codetalker7 May 2, 2023
b1b3eb5
Adding documentation for the inference method types.
codetalker7 May 2, 2023
e9acc5d
Minor fixes in tests to incorporate the inference method types.
codetalker7 May 2, 2023
69b7aff
Added the inference method types to the API reference.
codetalker7 May 2, 2023
67be0b1
Renaming `jackknife_variance` to `variance`.
codetalker7 May 3, 2023
f11c6c6
Minor correction; qualifying the `variance` method to `Survey`'s
codetalker7 May 3, 2023
93d23db
Removing redundant `mean` function from `mean.jl`, and moving it to the
codetalker7 May 5, 2023
aa4cd67
Minor change; changing column name from `mean` to `estimator` in
codetalker7 May 5, 2023
1d9bf3e
Adding the `mean` function back, which now uses the `variance` function.
codetalker7 May 5, 2023
cd8554a
Minor change in `mean` function; renaming `estimator` column to `mean`.
codetalker7 May 5, 2023
8910308
Correcting test for `jackknife`.
codetalker7 May 5, 2023
feef09b
Minor change in docstring.
codetalker7 May 5, 2023
79b4c39
Minor fix in docstring.
codetalker7 May 5, 2023
64019c6
Rewriting `quantile` using `variance`.
codetalker7 May 8, 2023
d99f389
Rewriting `total` using `variance`.
codetalker7 May 8, 2023
2 changes: 2 additions & 0 deletions docs/src/api.md
@@ -12,6 +12,8 @@ Private = false
AbstractSurveyDesign
SurveyDesign
ReplicateDesign
BootstrapReplicates
JackknifeReplicates
load_data
bootweights
jackknifeweights
1 change: 1 addition & 0 deletions src/Survey.jl
@@ -29,6 +29,7 @@ include("jackknife.jl")

export load_data
export AbstractSurveyDesign, SurveyDesign, ReplicateDesign
export BootstrapReplicates, JackknifeReplicates
export dim, colnames, dimnames
export mean, total, quantile
export plot
93 changes: 60 additions & 33 deletions src/SurveyDesign.jl
@@ -123,46 +123,71 @@ struct SurveyDesign <: AbstractSurveyDesign
end
end

"""
InferenceMethod

Abstract type for inference methods.
"""
abstract type InferenceMethod end

"""
BootstrapReplicates <: InferenceMethod

Type for the bootstrap replicates method. For more details, see [`bootweights`](@ref).
"""
struct BootstrapReplicates <: InferenceMethod
replicates::UInt
end

"""
JackknifeReplicates <: InferenceMethod

Type for the Jackknife replicates method. For more details, see [`jackknifeweights`](@ref).
"""
struct JackknifeReplicates <: InferenceMethod
replicates::UInt
end

"""
ReplicateDesign <: AbstractSurveyDesign

Survey design obtained by replicating an original design using [`bootweights`](@ref). If
replicate weights are available, then they can be used to directly create a `ReplicateDesign`.
Survey design obtained by replicating an original design using an inference method like [`bootweights`](@ref) or [`jackknifeweights`](@ref). If
replicate weights are available, then they can be used to directly create a `ReplicateDesign` object.

# Constructors

```julia
ReplicateDesign(
ReplicateDesign{ReplicateType}(
data::AbstractDataFrame,
replicate_weights::Vector{Symbol};
clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
strata::Union{Nothing,Symbol} = nothing,
popsize::Union{Nothing,Symbol} = nothing,
weights::Union{Nothing,Symbol} = nothing
)
) where {ReplicateType <: InferenceMethod}

ReplicateDesign(
ReplicateDesign{ReplicateType}(
data::AbstractDataFrame,
replicate_weights::UnitRange{Int};
clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
strata::Union{Nothing,Symbol} = nothing,
popsize::Union{Nothing,Symbol} = nothing,
weights::Union{Nothing,Symbol} = nothing
)
) where {ReplicateType <: InferenceMethod}

ReplicateDesign(
ReplicateDesign{ReplicateType}(
data::AbstractDataFrame,
replicate_weights::Regex;
clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
strata::Union{Nothing,Symbol} = nothing,
popsize::Union{Nothing,Symbol} = nothing,
weights::Union{Nothing,Symbol} = nothing
)
) where {ReplicateType <: InferenceMethod}
```

# Arguments

The constructor has the same arguments as [`SurveyDesign`](@ref). The only additional argument is `replicate_weights`, which can
`ReplicateType` must be one of the supported inference types; currently the package supports [`BootstrapReplicates`](@ref) and [`JackknifeReplicates`](@ref). The constructor has the same arguments as [`SurveyDesign`](@ref). The only additional argument is `replicate_weights`, which can
be of one of the following types.

- `Vector{Symbol}`: In this case, each `Symbol` in the vector should represent a column of `data` containing the replicate weights.
@@ -173,15 +198,15 @@ All the columns containing the replicate weights will be renamed to the form `re

# Examples

Here is an example where the [`bootweights`](@ref) function is used to create a `ReplicateDesign`.
Here is an example where the [`bootweights`](@ref) function is used to create a `ReplicateDesign{BootstrapReplicates}`.

```jldoctest replicate-design; setup = :(using Survey, CSV, DataFrames)
julia> apistrat = load_data("apistrat");

julia> dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw);

julia> bootstrat = bootweights(dstrat; replicates=1000) # creating a ReplicateDesign using bootweights
ReplicateDesign:
ReplicateDesign{BootstrapReplicates}:
data: 200×1044 DataFrame
strata: stype
[E, E, E … H]
@@ -210,8 +235,8 @@ julia> CSV.write("apistrat_withreplicates.csv", bootstrat.data);
We can now pass the replicate weights directly to the `ReplicateDesign` constructor, either as a `Vector{Symbol}`, a `UnitRange` or a `Regex`.

```jldoctest replicate-design
julia> bootstrat_direct = ReplicateDesign(CSV.read("apistrat_withreplicates.csv", DataFrame), [Symbol("r_"*string(replicate)) for replicate in 1:1000]; strata=:stype, weights=:pw)
ReplicateDesign:
julia> bootstrat_direct = ReplicateDesign{BootstrapReplicates}(CSV.read("apistrat_withreplicates.csv", DataFrame), [Symbol("r_"*string(replicate)) for replicate in 1:1000]; strata=:stype, weights=:pw)
ReplicateDesign{BootstrapReplicates}:
data: 200×1044 DataFrame
strata: stype
[E, E, E … H]
@@ -223,8 +248,8 @@ allprobs: [0.0226, 0.0226, 0.0226 … 0.0662]
type: bootstrap
replicates: 1000

julia> bootstrat_unitrange = ReplicateDesign(CSV.read("apistrat_withreplicates.csv", DataFrame), UnitRange(45:1044);strata=:stype, weights=:pw)
ReplicateDesign:
julia> bootstrat_unitrange = ReplicateDesign{BootstrapReplicates}(CSV.read("apistrat_withreplicates.csv", DataFrame), UnitRange(45:1044);strata=:stype, weights=:pw)
ReplicateDesign{BootstrapReplicates}:
data: 200×1044 DataFrame
strata: stype
[E, E, E … H]
@@ -236,8 +261,8 @@ allprobs: [0.0226, 0.0226, 0.0226 … 0.0662]
type: bootstrap
replicates: 1000

julia> bootstrat_regex = ReplicateDesign(CSV.read("apistrat_withreplicates.csv", DataFrame), r"r_\\d";strata=:stype, weights=:pw)
ReplicateDesign:
julia> bootstrat_regex = ReplicateDesign{BootstrapReplicates}(CSV.read("apistrat_withreplicates.csv", DataFrame), r"r_\\d";strata=:stype, weights=:pw)
ReplicateDesign{BootstrapReplicates}:
data: 200×1044 DataFrame
strata: stype
[E, E, E … H]
@@ -252,7 +277,7 @@ replicates: 1000
```

"""
struct ReplicateDesign <: AbstractSurveyDesign
struct ReplicateDesign{ReplicateType} <: AbstractSurveyDesign
data::AbstractDataFrame
cluster::Symbol
popsize::Symbol
@@ -264,9 +289,10 @@ struct ReplicateDesign <: AbstractSurveyDesign
type::String
replicates::UInt
replicate_weights::Vector{Symbol}
inference_method::ReplicateType

# default constructor
function ReplicateDesign(
function ReplicateDesign{ReplicateType}(
data::DataFrame,
cluster::Symbol,
popsize::Symbol,
@@ -277,21 +303,21 @@ struct ReplicateDesign <: AbstractSurveyDesign
pps::Bool,
type::String,
replicates::UInt,
replicate_weights::Vector{Symbol}
)
new(data, cluster, popsize, sampsize, strata, weights, allprobs,
pps, type, replicates, replicate_weights)
replicate_weights::Vector{Symbol},
) where {ReplicateType <: InferenceMethod}
new{ReplicateType}(data, cluster, popsize, sampsize, strata, weights, allprobs,
pps, type, replicates, replicate_weights, ReplicateType(replicates))
end

# constructor with given replicate_weights
function ReplicateDesign(
function ReplicateDesign{ReplicateType}(
data::AbstractDataFrame,
replicate_weights::Vector{Symbol};
clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
strata::Union{Nothing,Symbol} = nothing,
popsize::Union{Nothing,Symbol} = nothing,
weights::Union{Nothing,Symbol} = nothing
)
) where {ReplicateType <: InferenceMethod}
# rename the replicate weights if needed
rename!(data, [replicate_weights[index] => "replicate_"*string(index) for index in 1:length(replicate_weights)])

@@ -303,7 +329,7 @@ struct ReplicateDesign <: AbstractSurveyDesign
popsize=popsize,
weights=weights
)
new(
new{ReplicateType}(
base_design.data,
base_design.cluster,
base_design.popsize,
@@ -314,20 +340,21 @@ struct ReplicateDesign <: AbstractSurveyDesign
base_design.pps,
"bootstrap",
length(replicate_weights),
replicate_weights
replicate_weights,
ReplicateType(length(replicate_weights))
)
end

# replicate weights given as a range of columns
ReplicateDesign(
ReplicateDesign{ReplicateType}(
data::AbstractDataFrame,
replicate_weights::UnitRange{Int};
clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
strata::Union{Nothing,Symbol} = nothing,
popsize::Union{Nothing,Symbol} = nothing,
weights::Union{Nothing,Symbol} = nothing
) =
ReplicateDesign(
) where {ReplicateType <: InferenceMethod} =
ReplicateDesign{ReplicateType}(
data,
Symbol.(names(data)[replicate_weights]);
clusters=clusters,
@@ -337,15 +364,15 @@ struct ReplicateDesign <: AbstractSurveyDesign
)

# replicate weights given as regular expression
ReplicateDesign(
ReplicateDesign{ReplicateType}(
data::AbstractDataFrame,
replicate_weights::Regex;
clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
strata::Union{Nothing,Symbol} = nothing,
popsize::Union{Nothing,Symbol} = nothing,
weights::Union{Nothing,Symbol} = nothing
) =
ReplicateDesign(
) where {ReplicateType <: InferenceMethod} =
ReplicateDesign{ReplicateType}(
data,
Symbol.(names(data)[findall(name -> occursin(replicate_weights, name), names(data))]);
clusters=clusters,
9 changes: 5 additions & 4 deletions src/bootstrap.jl
@@ -1,5 +1,5 @@
"""
Use bootweights to create replicate weights using Rao-Wu bootstrap. The function accepts a `SurveyDesign` and returns a `ReplicateDesign` which has additional columns for replicate weights.
Use bootweights to create replicate weights using Rao-Wu bootstrap. The function accepts a `SurveyDesign` and returns a `ReplicateDesign{BootstrapReplicates}` which has additional columns for replicate weights.

```jldoctest
julia> using Random
@@ -9,7 +9,7 @@ julia> apiclus1 = load_data("apiclus1");
julia> dclus1 = SurveyDesign(apiclus1; clusters = :dnum, popsize=:fpc);

julia> bootweights(dclus1; replicates=1000, rng=MersenneTwister(111)) # choose a seed for deterministic results
ReplicateDesign:
ReplicateDesign{BootstrapReplicates}:
data: 183×1044 DataFrame
strata: none
cluster: dnum
@@ -20,6 +20,7 @@ weights: [50.4667, 50.4667, 50.4667 … 50.4667]
allprobs: [0.0198, 0.0198, 0.0198 … 0.0198]
type: bootstrap
replicates: 1000

```
"""
function bootweights(design::SurveyDesign; replicates = 4000, rng = MersenneTwister(1234))
@@ -37,7 +38,7 @@ function bootweights(design::SurveyDesign; replicates = 4000, rng = MersenneTwis
substrata_dfs[h] = cluster_sorted
end
df = reduce(vcat, substrata_dfs)
return ReplicateDesign(
return ReplicateDesign{BootstrapReplicates}(
df,
design.cluster,
design.popsize,
@@ -48,7 +49,7 @@ function bootweights(design::SurveyDesign; replicates = 4000, rng = MersenneTwis
design.pps,
"bootstrap",
UInt(replicates),
[Symbol("replicate_"*string(replicate)) for replicate in 1:replicates]
[Symbol("replicate_"*string(replicate)) for replicate in 1:replicates],
)
end

10 changes: 5 additions & 5 deletions src/jackknife.jl
@@ -22,7 +22,7 @@ julia> apistrat = load_data("apistrat");
julia> dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw);

julia> rstrat = jackknifeweights(dstrat)
ReplicateDesign:
ReplicateDesign{JackknifeReplicates}:
data: 200×244 DataFrame
strata: stype
[E, E, E … M]
@@ -67,7 +67,7 @@ function jackknifeweights(design::SurveyDesign)
end
end

return ReplicateDesign(
return ReplicateDesign{JackknifeReplicates}(
df,
design.cluster,
design.popsize,
@@ -83,7 +83,7 @@ function jackknifeweights(design::SurveyDesign)
end

"""
jackknife_variance(x::Symbol, func::Function, design::ReplicateDesign)
jackknife_variance(x::Symbol, func::Function, design::ReplicateDesign{JackknifeReplicates})

Compute variance of column `x` for the given `func` using the Jackknife method. The formula to compute this variance is the following.

@@ -102,7 +102,7 @@ julia> apistrat = load_data("apistrat");
julia> dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw);

julia> rstrat = jackknifeweights(dstrat)
ReplicateDesign:
ReplicateDesign{JackknifeReplicates}:
data: 200×244 DataFrame
strata: stype
[E, E, E … M]
@@ -127,7 +127,7 @@ julia> jackknife_variance(:api00, weightedmean, rstrat)
# Reference
pg 380-382, Section 9.3.2 Jackknife - Sharon Lohr, Sampling Design and Analysis (2010)
"""
function jackknife_variance(x::Symbol, func::Function, design::ReplicateDesign)
function jackknife_variance(x::Symbol, func::Function, design::ReplicateDesign{JackknifeReplicates})
Member

It should just be called `variance`. There should be two dispatches:

  1. variance(x::Symbol, func::Function, design::ReplicateDesign{JackknifeReplicates})
  2. variance(x::Symbol, func::Function, design::ReplicateDesign{BootstrapReplicates})
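A minimal sketch of how dispatching on the type parameter could look. It uses toy stand-in types that mirror the names in this PR; `ToyReplicateDesign` and `method_name` are hypothetical and are not part of Survey.jl.

```julia
# Toy stand-ins that mirror the names in this PR; not the Survey.jl source.
abstract type InferenceMethod end

struct BootstrapReplicates <: InferenceMethod
    replicates::UInt
end

struct JackknifeReplicates <: InferenceMethod
    replicates::UInt
end

# Hypothetical, stripped-down design: the type parameter records the method.
struct ToyReplicateDesign{ReplicateType<:InferenceMethod}
    replicates::UInt
    inference_method::ReplicateType
end

# Build the inference method from the type parameter, in the spirit of the
# PR's `ReplicateType(replicates)` call inside its constructors.
ToyReplicateDesign{T}(replicates::Integer) where {T<:InferenceMethod} =
    ToyReplicateDesign{T}(UInt(replicates), T(replicates))

# One method per inference type; Julia selects the method from the parameter.
method_name(::ToyReplicateDesign{BootstrapReplicates}) = "bootstrap"
method_name(::ToyReplicateDesign{JackknifeReplicates}) = "jackknife"

boot = ToyReplicateDesign{BootstrapReplicates}(1000)
jack = ToyReplicateDesign{JackknifeReplicates}(200)
println(method_name(boot), " ", method_name(jack))
```

With this layout, a single generic name such as `variance` can carry one method per replicate type, and callers never branch on the method themselves.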

Member Author

Okay, but we don't have a `variance` method for `BootstrapReplicates` yet, do we?

Member Author (codetalker7, May 3, 2023)

Also, should I rewrite the mean function using multiple dispatch now, or should that be done as a separate PR?

Member

We should rewrite the functions now.
And we need a `variance` function for `BootstrapReplicates`. It's pretty simple: it's just the simple variance of the estimates.

Member Author

> We should rewrite the functions now. And we need a variance function for BootstrapReplicates. It's pretty simple, it's just the simple variance of the estimates.

Okay. Also, just to confirm: haven't we already implemented the variance computation for `BootstrapReplicates` inside the `mean` function, at line 48 of this file? https://github.com/xKDR/Survey.jl/blob/main/src/mean.jl

So we should probably get rid of this `mean` function, right?

Member

Yes, we should get rid of it, and add:

function variance(x::Symbol, func::Function, design::ReplicateDesign{BootstrapReplicates})
...
variance = sum((θ̂t .- θ̂) .^ 2) / design.replicates
end
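The elided body above could be filled in many ways; the following is only a sketch of the formula on plain vectors, assuming each replicate is a full vector of weights. `bootstrap_variance` and its argument layout are hypothetical, and `weightedmean` here is a local helper, not the package's function.

```julia
# Local helper: weighted mean of `x` under weights `w`.
weightedmean(x, w) = sum(x .* w) / sum(w)

# Bootstrap variance: mean squared deviation of the replicate estimates from
# the full-sample estimate, i.e. sum((theta_t .- theta) .^ 2) / replicates.
function bootstrap_variance(
    x::Vector{Float64},
    weights::Vector{Float64},
    replicate_weights::Vector{Vector{Float64}},
    func::Function,
)
    theta = func(x, weights)                               # full-sample estimate
    theta_t = [func(x, rw) for rw in replicate_weights]    # one estimate per replicate
    return sum((theta_t .- theta) .^ 2) / length(replicate_weights)
end

# Tiny hand-made data set with three replicate weight vectors.
x = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 1.0, 1.0, 1.0]
reps = [[2.0, 0.0, 1.0, 1.0], [1.0, 1.0, 2.0, 0.0], [0.0, 2.0, 1.0, 1.0]]

v = bootstrap_variance(x, w, reps, weightedmean)
println(v)
```

In the package itself the replicate weights live in `replicate_*` columns of `design.data`, so a real method would read them from the data frame instead of taking them as a vector of vectors.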

Member Author

Done

df = design.data
# sort!(df, [design.strata, design.cluster])
stratified_gdf = groupby(df, design.strata)
20 changes: 10 additions & 10 deletions test/runtests.jl
@@ -13,28 +13,28 @@ apisrs = load_data("apisrs") # Load API dataset
srs = SurveyDesign(apisrs, weights = :pw)
unitrange = UnitRange((length(names(apisrs)) + 1):(TOTAL_REPLICATES + length(names(apisrs))))
bsrs = srs |> bootweights # Create replicate design
bsrs_direct = ReplicateDesign(bsrs.data, REPLICATES_VECTOR, weights = :pw) # using ReplicateDesign constructor
bsrs_unitrange = ReplicateDesign(bsrs.data, unitrange, weights = :pw) # using ReplicateDesign constructor
bsrs_regex = ReplicateDesign(bsrs.data, REPLICATES_REGEX, weights = :pw) # using ReplicateDesign constructor
bsrs_direct = ReplicateDesign{BootstrapReplicates}(bsrs.data, REPLICATES_VECTOR, weights = :pw) # using ReplicateDesign constructor
bsrs_unitrange = ReplicateDesign{BootstrapReplicates}(bsrs.data, unitrange, weights = :pw) # using ReplicateDesign constructor
bsrs_regex = ReplicateDesign{BootstrapReplicates}(bsrs.data, REPLICATES_REGEX, weights = :pw) # using ReplicateDesign constructor

# Stratified sample
apistrat = load_data("apistrat") # Load API dataset
dstrat = SurveyDesign(apistrat, strata = :stype, weights = :pw) # Create SurveyDesign
unitrange = UnitRange((length(names(apistrat)) + 1):(TOTAL_REPLICATES + length(names(apistrat))))
bstrat = dstrat |> bootweights # Create replicate design
bstrat_direct = ReplicateDesign(bstrat.data, REPLICATES_VECTOR, strata=:stype, weights=:pw) # using ReplicateDesign constructor
bstrat_unitrange = ReplicateDesign(bstrat.data, unitrange, strata=:stype, weights=:pw) # using ReplicateDesign constructor
bstrat_regex = ReplicateDesign(bstrat.data, REPLICATES_REGEX, strata=:stype, weights=:pw) # using ReplicateDesign constructor
bstrat_direct = ReplicateDesign{BootstrapReplicates}(bstrat.data, REPLICATES_VECTOR, strata=:stype, weights=:pw) # using ReplicateDesign constructor
bstrat_unitrange = ReplicateDesign{BootstrapReplicates}(bstrat.data, unitrange, strata=:stype, weights=:pw) # using ReplicateDesign constructor
bstrat_regex = ReplicateDesign{BootstrapReplicates}(bstrat.data, REPLICATES_REGEX, strata=:stype, weights=:pw) # using ReplicateDesign constructor

# One-stage cluster sample
apiclus1 = load_data("apiclus1") # Load API dataset
apiclus1[!, :pw] = fill(757 / 15, (size(apiclus1, 1),)) # Correct api mistake for pw column
dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw) # Create SurveyDesign
unitrange = UnitRange((length(names(apiclus1)) + 1):(TOTAL_REPLICATES + length(names(apiclus1))))
dclus1_boot = dclus1 |> bootweights # Create replicate design
dclus1_boot_direct = ReplicateDesign(dclus1_boot.data, REPLICATES_VECTOR, clusters=:dnum, weights=:pw) # using ReplicateDesign constructor
dclus1_boot_unitrange = ReplicateDesign(dclus1_boot.data, unitrange, clusters=:dnum, weights=:pw) # using ReplicateDesign constructor
dclus1_boot_regex = ReplicateDesign(dclus1_boot.data, REPLICATES_REGEX, clusters=:dnum, weights=:pw) # using ReplicateDesign constructor
dclus1_boot_direct = ReplicateDesign{BootstrapReplicates}(dclus1_boot.data, REPLICATES_VECTOR, clusters=:dnum, weights=:pw) # using ReplicateDesign constructor
dclus1_boot_unitrange = ReplicateDesign{BootstrapReplicates}(dclus1_boot.data, unitrange, clusters=:dnum, weights=:pw) # using ReplicateDesign constructor
dclus1_boot_regex = ReplicateDesign{BootstrapReplicates}(dclus1_boot.data, REPLICATES_REGEX, clusters=:dnum, weights=:pw) # using ReplicateDesign constructor

# Two-stage cluster sample
apiclus2 = load_data("apiclus2") # Load API dataset
@@ -63,4 +63,4 @@ include("hist.jl")
include("boxplot.jl")
include("ratio.jl")
include("show.jl")
include("jackknife.jl")
include("jackknife.jl")