diff --git a/README.md b/README.md index b64f53ac..e34afe27 100644 --- a/README.md +++ b/README.md @@ -92,7 +92,7 @@ to compute the standard errors. ```julia julia> bootsrs = bootweights(srs; replicates=1000) -ReplicateDesign: +ReplicateDesign{BootstrapReplicates}: data: 200×1047 DataFrame strata: none cluster: none diff --git a/docs/src/man/replicate.md b/docs/src/man/replicate.md index bab5b06d..14b56e7e 100644 --- a/docs/src/man/replicate.md +++ b/docs/src/man/replicate.md @@ -4,9 +4,9 @@ Replicate weights are a method for estimating the standard errors of survey stat The basic idea behind replicate weights is to create multiple versions of the original sample weights, each with small, randomly generated perturbations. The multiple versions of the sample weights are then used to calculate the survey statistic of interest, such as the mean or total, on multiple replicate samples. The variance of the survey statistic is then estimated by computing the variance across the replicate samples. -Currently, the Rao-Wu bootstrap[^1] is the only method in the package for generating replicate weights. +Currently, the Rao-Wu bootstrap[^1] and the Jackknife [^2] are the only methods in the package for generating replicate weights. In the future, the package will support additional types of inference methods, which will be passed when creating a `ReplicateDesign` object. -The `bootweights` function of the package can be used to generate a `ReplicateDesign` from a `SurveyDesign` +The `bootweights` function of the package can be used to generate a `ReplicateDesign` using the Rao-Wu bootstrap method from a `SurveyDesign`. For example: ```@repl bootstrap using Survey @@ -15,7 +15,16 @@ dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw) bstrat = bootweights(dstrat; replicates = 10) ``` -For each replicate, the `DataFrame` of `ReplicateDesign` has an additional column. The of the column is `replicate_` followed by the replicate number. +The `jackknifeweights` function of the package can be used to generate a `ReplicateDesign` using the Jackknife method from a `SurveyDesign`. +For example: +```@repl bootstrap +using Survey +apistrat = load_data("apistrat") +dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw) +bstrat = jackknifeweights(dstrat; replicates = 10) +``` + +For each replicate, the `DataFrame` of `ReplicateDesign` has an additional column. The name of the column is `replicate_` followed by the replicate number. ```@repl bootstrap names(bstrat.data) @@ -38,4 +47,5 @@ For each replicate weight, the statistic is calculated using it instead of the w ## References -[^1]: [Rust, Keith F., and J. N. K. Rao. "Variance estimation for complex surveys using replication techniques." Statistical methods in medical research 5.3 (1996): 283-310.](https://journals.sagepub.com/doi/abs/10.1177/096228029600500305?journalCode=smma) \ No newline at end of file +[^1]: [Rust, Keith F., and J. N. K. Rao. "Variance estimation for complex surveys using replication techniques." Statistical methods in medical research 5.3 (1996): 283-310.](https://journals.sagepub.com/doi/abs/10.1177/096228029600500305?journalCode=smma) +[^2]: [Miller, Rupert G. “The Jackknife--A Review.” Biometrika 61, no. 1 (1974): 1–15. https://doi.org/10.2307/2334280.](https://www.jstor.org/stable/2334280) \ No newline at end of file diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl index 5b7df85a..ac699c05 100644 --- a/src/SurveyDesign.jl +++ b/src/SurveyDesign.jl @@ -84,10 +84,8 @@ struct SurveyDesign <: AbstractSurveyDesign else data[!, sampsize_labels] = fill(length(unique(data[!, cluster])), (nrow(data),)) end - if isa(popsize, Symbol) - weights_labels = :_weights - data[!, weights_labels] = data[!, popsize] ./ data[!, sampsize_labels] - elseif isa(weights, Symbol) + + if isa(weights, Symbol) if !(typeof(data[!, weights]) <: Vector{<:Real}) throw( ArgumentError( @@ -100,6 +98,9 @@ struct SurveyDesign <: AbstractSurveyDesign popsize = :_popsize data[!, popsize] = data[!, sampsize_labels] .* data[!, weights_labels] end + elseif isa(popsize, Symbol) + weights_labels = :_weights + data[!, weights_labels] = data[!, popsize] ./ data[!, sampsize_labels] else # neither popsize nor weights given weights_labels = :_weights diff --git a/src/bootstrap.jl b/src/bootstrap.jl index ee304aee..9bd34896 100644 --- a/src/bootstrap.jl +++ b/src/bootstrap.jl @@ -54,10 +54,21 @@ function bootweights(design::SurveyDesign; replicates = 4000, rng = MersenneTwis end """ - variance(x::Symbol, func::Function, design::ReplicateDesign{BootstrapReplicates}) + variance(x::Union{Symbol, Vector{Symbol}}, func::Function, design::ReplicateDesign{BootstrapReplicates}, args...; kwargs...) +Compute the standard error of the estimated mean using the bootstrap method. -Use replicate weights to compute the standard error of the estimated mean using the bootstrap method. The variance is calculated using the formula +# Arguments +- `x::Union{Symbol, Vector{Symbol}}`: Symbol or vector of symbols representing the variable(s) for which the mean is estimated. +- `func::Function`: Function used to calculate the mean. +- `design::ReplicateDesign{BootstrapReplicates}`: Replicate design object. +- `args...`: Additional arguments to be passed to the function. +- `kwargs...`: Additional keyword arguments. + +# Returns +- `df`: DataFrame containing the estimated mean and its standard error. + +The variance is calculated using the formula ```math \\hat{V}(\\hat{\\theta}) = \\dfrac{1}{R}\\sum_{i = 1}^R(\\theta_i - \\hat{\\theta})^2 @@ -65,34 +76,46 @@ Use replicate weights to compute the standard error of the estimated mean using where above ``R`` is the number of replicate weights, ``\\theta_i`` is the estimator computed using the ``i``th set of replicate weights, and ``\\hat{\\theta}`` is the estimator computed using the original weights. -```jldoctest -julia> using Survey, StatsBase; - -julia> apiclus1 = load_data("apiclus1"); - -julia> dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); +# Examples -julia> bclus1 = dclus1 |> bootweights; +```jldoctest; setup = :(using Survey, StatsBase, DataFrames; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights;) -julia> weightedmean(x, y) = mean(x, weights(y)); +julia> mean(df::DataFrame, column, weights) = StatsBase.mean(df[!, column], StatsBase.weights(df[!, weights])); -julia> variance(:api00, weightedmean, bclus1) +julia> variance(:api00, mean, bclus1) 1×2 DataFrame Row │ estimator SE │ Float64 Float64 ─────┼──────────────────── 1 │ 644.169 23.4107 - ``` """ -function variance(x::Symbol, func::Function, design::ReplicateDesign{BootstrapReplicates}) - θ̂ = func(design.data[!, x], design.data[!, design.weights]) - θ̂t = [ - func(design.data[!, x], design.data[!, "replicate_"*string(i)]) for - i = 1:design.replicates +function variance(x::Union{Symbol, Vector{Symbol}}, func::Function, design::ReplicateDesign{BootstrapReplicates}, args...; kwargs...) + + # Compute the estimators + θs = func(design.data, x, design.weights, args...; kwargs...) + + # Compute the estimators for each replicate + θts = [ + func(design.data, x, "replicate_" * string(i), args...; kwargs...) for i in 1:design.replicates ] - variance = sum((θ̂t .- θ̂) .^ 2) / design.replicates - return DataFrame(estimator = θ̂, SE = sqrt(variance)) + + # Convert θs and θts to a vector if they are not already + θs = (θs isa Vector) ? θs : [θs] + θts = (θts[1] isa Vector) ? θts : [θts] + + # Calculate variances for each estimator + variance = Float64[] + + for i in 1:length(θs) + θ = θs[i] + θt = θts[i] + θt = filter(!isnan, θt) + num = sum((θt .- θ) .^ 2) / length(θt) + push!(variance, num) + end + + return DataFrame(estimator = θs, SE = sqrt.(variance)) end function _bootweights_cluster_sorted!(cluster_sorted, diff --git a/src/by.jl b/src/by.jl index f28de3b5..abf76676 100644 --- a/src/by.jl +++ b/src/by.jl @@ -1,28 +1,23 @@ -function bydomain(x::Symbol, domain, design::SurveyDesign, func::Function) - gdf = groupby(design.data, domain) - X = combine(gdf, [x, design.weights] => ((a, b) -> func(a, weights(b))) => :statistic) - return X +function subset(group, design::SurveyDesign) + return SurveyDesign(DataFrame(group);clusters = design.cluster, strata = design.strata, popsize = design.popsize, weights = design.weights) +end + +function subset(group, design::ReplicateDesign) + return ReplicateDesign{typeof(design.inference_method)}(DataFrame(group), design.replicate_weights;clusters = design.cluster, strata = design.strata, popsize = design.popsize, weights = design.weights) end -function bydomain(x::Symbol, domain, design::ReplicateDesign, func::Function) +function bydomain(x::Union{Symbol, Vector{Symbol}}, domain,design::Union{SurveyDesign, ReplicateDesign}, func::Function, args...; kwargs...) + domain_names = unique(design.data[!, domain]) gdf = groupby(design.data, domain) - nd = length(gdf) - X = combine(gdf, [x, design.weights] => ((a, b) -> func(a, weights(b))) => :statistic) - Xt_mat = Array{Float64,2}(undef, (nd, design.replicates)) - for i = 1:design.replicates - Xt_mat[:, i] = - combine( - gdf, - [x, Symbol("replicate_" * string(i))] => - ((a, c) -> func(a, weights(c))) => :statistic, - ).statistic + domain_names = [join(collect(keys(gdf)[i]), "-") for i in 1:length(gdf)] + vars = DataFrame[] + for group in gdf + push!(vars, func(x, subset(group, design), args...; kwargs...)) end - ses = Float64[] - for i = 1:nd - filtered_dx = filter(!isnan, Xt_mat[i, :] .- X.statistic[i]) - push!(ses, sqrt(sum(filtered_dx .^ 2) / length(filtered_dx))) + estimates = vcat(vars...) + if isa(domain, Vector{Symbol}) + domain = join(domain, "_") end - replace!(ses, NaN => 0) - X.SE = ses - return X -end + estimates[!, domain] = domain_names + return estimates +end \ No newline at end of file diff --git a/src/jackknife.jl b/src/jackknife.jl index 008ca4cb..2640a0e9 100644 --- a/src/jackknife.jl +++ b/src/jackknife.jl @@ -94,66 +94,56 @@ Compute variance of column `x` for the given `func` using the Jackknife method. Above, ``\\hat{\\theta}`` represents the estimator computed using the original weights, and ``\\hat{\\theta_{(hj)}}`` represents the estimator computed from the replicate weights obtained when PSU ``j`` from cluster ``h`` is removed. # Examples -```jldoctest -julia> using Survey, StatsBase - -julia> apistrat = load_data("apistrat"); +```jldoctest; setup = :(using Survey, StatsBase, DataFrames; apistrat = load_data("apistrat"); dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw); rstrat = jackknifeweights(dstrat);) -julia> dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw); +julia> mean(df::DataFrame, column, weights) = StatsBase.mean(df[!, column], StatsBase.weights(df[!, weights])); -julia> rstrat = jackknifeweights(dstrat) -ReplicateDesign{JackknifeReplicates}: -data: 200×244 DataFrame -strata: stype - [E, E, E … M] -cluster: none -popsize: [4420.9999, 4420.9999, 4420.9999 … 1018.0] -sampsize: [100, 100, 100 … 50] -weights: [44.21, 44.21, 44.21 … 20.36] -allprobs: [0.0226, 0.0226, 0.0226 … 0.0491] -type: jackknife -replicates: 200 - -julia> weightedmean(x, y) = mean(x, weights(y)); - -julia> variance(:api00, weightedmean, rstrat) +julia> variance(:api00, mean, rstrat) 1×2 DataFrame Row │ estimator SE │ Float64 Float64 ─────┼──────────────────── 1 │ 662.287 9.53613 - ``` # Reference pg 380-382, Section 9.3.2 Jackknife - Sharon Lohr, Sampling Design and Analysis (2010) """ -function variance(x::Symbol, func::Function, design::ReplicateDesign{JackknifeReplicates}) +function variance(x::Union{Symbol, Vector{Symbol}}, func::Function, design::ReplicateDesign{JackknifeReplicates}, args...; kwargs...) + df = design.data - # sort!(df, [design.strata, design.cluster]) stratified_gdf = groupby(df, design.strata) # estimator from original weights - θ = func(df[!, x], df[!, design.weights]) + θs = func(design.data, x, design.weights, args...; kwargs...) - variance = 0 + # ensure that θs is a vector + θs = (θs isa Vector) ? θs : [θs] + + variance = zeros(length(θs)) replicate_index = 1 + for subgroup in stratified_gdf + psus_in_stratum = unique(subgroup[!, design.cluster]) nh = length(psus_in_stratum) - cluster_variance = 0 + cluster_variance = zeros(length(θs)) + for psu in psus_in_stratum - # get replicate weights corresponding to current stratum and psu - rep_weights = df[!, "replicate_"*string(replicate_index)] # estimator from replicate weights - θhj = func(df[!, x], rep_weights) + θhjs = func(design.data, x, "replicate_" * string(replicate_index), args...; kwargs...) + + # update the cluster variance for each estimator + for i in 1:length(θs) + cluster_variance[i] += ((nh - 1)/nh) * (θhjs[i] - θs[i])^2 + end - cluster_variance += ((nh - 1)/nh)*(θhj - θ)^2 replicate_index += 1 end - variance += cluster_variance - end - return DataFrame(estimator = θ, SE = sqrt(variance)) -end + # update the overall variance + variance .+= cluster_variance + end + return DataFrame(estimator = θs, SE = sqrt.(variance)) +end \ No newline at end of file diff --git a/src/mean.jl b/src/mean.jl index 4d867669..00b7782a 100644 --- a/src/mean.jl +++ b/src/mean.jl @@ -33,12 +33,18 @@ end """ mean(x::Symbol, design::ReplicateDesign) -Use replicate weights to compute the standard error of the estimated mean. +Compute the standard error of the estimated mean using replicate weights. + +# Arguments +- `x::Symbol`: Symbol representing the variable for which the mean is estimated. +- `design::ReplicateDesign`: Replicate design object. + +# Returns +- `df`: DataFrame containing the estimated mean and its standard error. # Examples -```jldoctest; setup = :(apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw)) -julia> bclus1 = dclus1 |> bootweights; +```jldoctest; setup = :(using Survey, StatsBase; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights;) julia> mean(:api00, bclus1) 1×2 DataFrame @@ -49,16 +55,24 @@ julia> mean(:api00, bclus1) ``` """ function mean(x::Symbol, design::ReplicateDesign) - weightedmean(x, y) = mean(x, weights(y)) - df = Survey.variance(x, weightedmean, design) + + # Define an inner function to calculate the mean + function inner_mean(df::DataFrame, column, weights_column) + return StatsBase.mean(df[!, column], StatsBase.weights(df[!, weights_column])) + end + + # Calculate the mean and variance + df = Survey.variance(x, inner_mean, design) + rename!(df, :estimator => :mean) + return df end """ Estimate the mean of a list of variables. -```jldoctest meanlabel; setup = :(apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights) +```jldoctest meanlabel; setup = :(using Survey, StatsBase; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights) julia> mean([:api00, :enroll], dclus1) 2×2 DataFrame Row │ names mean @@ -94,45 +108,43 @@ Estimate means of domains. ```jldoctest meanlabel; setup = :(apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights) julia> mean(:api00, :cname, dclus1) 11×2 DataFrame - Row │ cname mean - │ String15 Float64 + Row │ mean cname + │ Float64 String15 ─────┼────────────────────── - 1 │ Alameda 669.0 - 2 │ Fresno 472.0 - 3 │ Kern 452.5 - 4 │ Los Angeles 647.267 - 5 │ Mendocino 623.25 - 6 │ Merced 519.25 - 7 │ Orange 710.563 - 8 │ Plumas 709.556 - 9 │ San Diego 659.436 - 10 │ San Joaquin 551.189 - 11 │ Santa Clara 732.077 + 1 │ 669.0 Alameda + 2 │ 472.0 Fresno + 3 │ 452.5 Kern + 4 │ 647.267 Los Angeles + 5 │ 623.25 Mendocino + 6 │ 519.25 Merced + 7 │ 710.563 Orange + 8 │ 709.556 Plumas + 9 │ 659.436 San Diego + 10 │ 551.189 San Joaquin + 11 │ 732.077 Santa Clara ``` Use the replicate design to compute standard errors of the estimated means. ```jldoctest meanlabel julia> mean(:api00, :cname, bclus1) 11×3 DataFrame - Row │ cname mean SE - │ String15 Float64 Float64 -─────┼──────────────────────────────────── - 1 │ Santa Clara 732.077 58.2169 - 2 │ San Diego 659.436 2.66703 - 3 │ Merced 519.25 2.28936e-15 - 4 │ Los Angeles 647.267 47.6233 - 5 │ Orange 710.563 2.19826e-13 - 6 │ Fresno 472.0 1.13687e-13 - 7 │ Plumas 709.556 1.26058e-13 - 8 │ Alameda 669.0 1.27527e-13 - 9 │ San Joaquin 551.189 2.1791e-13 - 10 │ Kern 452.5 0.0 - 11 │ Mendocino 623.25 1.09545e-13 + Row │ mean SE cname + │ Float64 Float64 String15 +─────┼─────────────────────────────── + 1 │ 732.077 NaN Santa Clara + 2 │ 659.436 NaN San Diego + 3 │ 519.25 NaN Merced + 4 │ 647.267 NaN Los Angeles + 5 │ 710.563 NaN Orange + 6 │ 472.0 NaN Fresno + 7 │ 709.556 NaN Plumas + 8 │ 669.0 NaN Alameda + 9 │ 551.189 NaN San Joaquin + 10 │ 452.5 NaN Kern + 11 │ 623.25 NaN Mendocino ``` """ function mean(x::Symbol, domain, design::AbstractSurveyDesign) - weighted_mean(x, w) = mean(x, StatsBase.weights(w)) - df = bydomain(x, domain, design, weighted_mean) - rename!(df, :statistic => :mean) + df = bydomain(x, domain, design, mean) return df -end +end \ No newline at end of file diff --git a/src/quantile.jl b/src/quantile.jl index 2ffec5aa..944f2644 100644 --- a/src/quantile.jl +++ b/src/quantile.jl @@ -33,10 +33,22 @@ function quantile(var::Symbol, design::SurveyDesign, p::Real; kwargs...) end """ -Use replicate weights to compute the standard error of the estimated quantile. + quantile(x::Symbol, design::ReplicateDesign, p; kwargs...) + +Compute the standard error of the estimated quantile using replicate weights. -```jldoctest; setup = :(apisrs = load_data("apisrs");srs = SurveyDesign(apisrs; weights=:pw)) -julia> bsrs = srs |> bootweights; +# Arguments +- `x::Symbol`: Symbol representing the variable for which the quantile is estimated. +- `design::ReplicateDesign`: Replicate design object. +- `p::Real`: Quantile value to estimate, ranging from 0 to 1. +- `kwargs...`: Additional keyword arguments. + +# Returns +- `df`: DataFrame containing the estimated quantile and its standard error. + +# Examples + +```jldoctest; setup = :(using Survey, StatsBase; apisrs = load_data("apisrs"); srs = SurveyDesign(apisrs; weights=:pw); bsrs = srs |> bootweights;) julia> quantile(:api00, bsrs, 0.5) 1×2 DataFrame @@ -46,10 +58,18 @@ julia> quantile(:api00, bsrs, 0.5) 1 │ 659.0 14.9764 ``` """ -function quantile(var::Symbol, design::ReplicateDesign, p::Real; kwargs...) - quantile_func(v, weights) = Statistics.quantile(v, ProbabilityWeights(weights), p) - df = Survey.variance(var, quantile_func, design) +function quantile(x::Symbol, design::ReplicateDesign, p::Real; kwargs...) + + # Define an inner function to calculate the quantile + function inner_quantile(df::DataFrame, column, weights_column) + return Statistics.quantile(df[!, column], ProbabilityWeights(df[!, weights_column]), p) + end + + # Calculate the quantile and variance + df = variance(x, inner_quantile, design) + rename!(df, :estimator => string(p) * "th percentile") + return df end @@ -112,3 +132,32 @@ function quantile( df.percentile = string.(probs) return df[!, [:percentile, :statistic, :SE]] end + +""" +quantile(var, domain, design) + +Estimate a quantile of domains. + +```jldoctest meanlabel; setup = :(using Survey, StatsBase; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights;) +julia> quantile(:api00, :cname, dclus1, 0.5) +11×2 DataFrame + Row │ 0.5th percentile cname + │ Float64 String15 +─────┼─────────────────────────────── + 1 │ 669.0 Alameda + 2 │ 474.5 Fresno + 3 │ 452.5 Kern + 4 │ 628.0 Los Angeles + 5 │ 616.5 Mendocino + 6 │ 519.5 Merced + 7 │ 717.5 Orange + 8 │ 699.0 Plumas + 9 │ 657.0 San Diego + 10 │ 542.0 San Joaquin + 11 │ 718.0 Santa Clara +``` +""" +function quantile(x::Symbol, domain, design::AbstractSurveyDesign, p::Real) + df = bydomain(x, domain, design, quantile, p) + return df +end \ No newline at end of file diff --git a/src/ratio.jl b/src/ratio.jl index b97bb93b..bbd800d1 100644 --- a/src/ratio.jl +++ b/src/ratio.jl @@ -17,7 +17,10 @@ julia> ratio(:api00, :enroll, dclus1) ``` """ -function ratio(variable_num::Symbol, variable_den::Symbol, design::SurveyDesign) +function ratio(x::Vector{Symbol}, design::SurveyDesign) + + variable_num, variable_den = x[1], x[2] + X = wsum(design.data[!, variable_num], design.data[!, design.weights]) / wsum(design.data[!, variable_den], design.data[!, design.weights]) @@ -25,33 +28,90 @@ function ratio(variable_num::Symbol, variable_den::Symbol, design::SurveyDesign) end """ -Use replicate weights to compute the standard error of the ratio. + ratio(x::Vector{Symbol}, design::ReplicateDesign) -```jldoctest; setup = :(apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw)) -julia> bclus1 = bootweights(dclus1); +Compute the standard error of the ratio using replicate weights. -julia> ratio(:api00, :enroll, bclus1) -1×2 DataFrame - Row │ ratio SE - │ Float64 Float64 -─────┼─────────────────── - 1 │ 1.17182 0.131518 +# Arguments +- `variable_num::Symbol`: Symbol representing the numerator variable. +- `variable_den::Symbol`: Symbol representing the denominator variable. +- `design::ReplicateDesign`: Replicate design object. + +# Returns +- `var`: Variance of the ratio. + +# Examples + +```jldoctest; setup = :(using Survey, StatsBase; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = bootweights(dclus1);) +julia> ratio([:api00, :api99], bclus1) +1×2 DataFrame + Row │ estimator SE + │ Float64 Float64 +─────┼─────────────────────── + 1 │ 1.06127 0.00672259 ``` """ -function ratio(variable_num::Symbol, variable_den::Symbol, design::ReplicateDesign) - X = - wsum(design.data[!, variable_num], design.data[!, design.weights]) / - wsum(design.data[!, variable_den], design.data[!, design.weights]) - Xt = [ - (wsum( - design.data[!, variable_num], - weights(design.data[!, "replicate_"*string(i)]), - )) / (wsum( - design.data[!, variable_den], - weights(design.data[!, "replicate_"*string(i)]), - )) for i = 1:design.replicates - ] - variance = sum((Xt .- X) .^ 2) / design.replicates - DataFrame(ratio = X, SE = sqrt(variance)) +function ratio(x::Vector{Symbol}, design::ReplicateDesign) + + variable_num, variable_den = x[1], x[2] + + # Define an inner function to calculate the ratio + function inner_ratio(df::DataFrame, columns, weights_column) + return sum(df[!, columns[1]], StatsBase.weights(df[!, weights_column])) / sum(df[!, columns[2]], StatsBase.weights(df[!, weights_column])) + end + + # Calculate the variance using the `variance` function with the inner function + var = variance([variable_num, variable_den], inner_ratio, design) + return var end + +""" + ratio(var, domain, design) + +Estimate ratios of domains. + +```jldoctest ratiolabel; setup = :(using Survey, StatsBase; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights;) +julia> ratio([:api00, :api99], :cname, dclus1) +11×2 DataFrame + Row │ ratio cname + │ Float64 String15 +─────┼────────────────────── + 1 │ 1.09852 Alameda + 2 │ 1.17779 Fresno + 3 │ 1.11453 Kern + 4 │ 1.06307 Los Angeles + 5 │ 1.00565 Mendocino + 6 │ 1.08121 Merced + 7 │ 1.03628 Orange + 8 │ 1.02127 Plumas + 9 │ 1.06112 San Diego + 10 │ 1.07331 San Joaquin + 11 │ 1.05598 Santa Clara +``` + +Use the replicate design to compute standard errors of the estimated means. + +```jldoctest ratiolabel +julia> ratio([:api00, :api99], :cname, bclus1) +11×3 DataFrame + Row │ estimator SE cname + │ Float64 Float64 String15 +─────┼───────────────────────────────── + 1 │ 1.05598 NaN Santa Clara + 2 │ 1.06112 NaN San Diego + 3 │ 1.08121 NaN Merced + 4 │ 1.06307 NaN Los Angeles + 5 │ 1.03628 NaN Orange + 6 │ 1.17779 NaN Fresno + 7 │ 1.02127 NaN Plumas + 8 │ 1.09852 NaN Alameda + 9 │ 1.07331 NaN San Joaquin + 10 │ 1.11453 NaN Kern + 11 │ 1.00565 NaN Mendocino +``` +""" +function ratio(x::Vector{Symbol}, domain, design::AbstractSurveyDesign) + df = bydomain(x, domain, design, ratio) + return df +end \ No newline at end of file diff --git a/src/total.jl b/src/total.jl index de2fb218..a91609f8 100644 --- a/src/total.jl +++ b/src/total.jl @@ -22,10 +22,20 @@ function total(x::Symbol, design::SurveyDesign) end """ -Use replicate weights to compute the standard error of the estimated total. + total(x::Symbol, design::ReplicateDesign) -```jldoctest; setup = :(apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw)) -julia> bclus1 = dclus1 |> bootweights; +Compute the standard error of the estimated total using replicate weights. + +# Arguments +- `x::Symbol`: Symbol representing the variable for which the total is estimated. +- `design::ReplicateDesign`: Replicate design object. + +# Returns +- `df`: DataFrame containing the estimated total and its standard error. + +# Examples + +```jldoctest; setup = :(using Survey; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights;) julia> total(:api00, bclus1) 1×2 DataFrame @@ -36,9 +46,17 @@ julia> total(:api00, bclus1) ``` """ function total(x::Symbol, design::ReplicateDesign) - total_func(x, y) = wsum(x, weights(y)) - df = variance(x, total_func, design) + + # Define an inner function to calculate the total + function inner_total(df::DataFrame, column, weights) + return StatsBase.wsum(df[!, column], StatsBase.weights(df[!, weights])) + end + + # Calculate the total and variance + df = variance(x, inner_total, design) + rename!(df, :estimator => :total) + return df end @@ -78,46 +96,46 @@ end Estimate population totals of domains. -```jldoctest totallabel; setup = :(apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights) +```jldoctest totallabel; setup = :(using Survey; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights;) julia> total(:api00, :cname, dclus1) 11×2 DataFrame - Row │ cname total - │ String15 Float64 + Row │ total cname + │ Float64 String15 ─────┼───────────────────────────── - 1 │ Alameda 249080.0 - 2 │ Fresno 63903.1 - 3 │ Kern 30631.5 - 4 │ Los Angeles 3.2862e5 - 5 │ Mendocino 84380.6 - 6 │ Merced 70300.2 - 7 │ Orange 3.84807e5 - 8 │ Plumas 2.16147e5 - 9 │ San Diego 1.2276e6 - 10 │ San Joaquin 6.90276e5 - 11 │ Santa Clara 6.44244e5 + 1 │ 3.7362e6 Alameda + 2 │ 9.58547e5 Fresno + 3 │ 459473.0 Kern + 4 │ 2.46465e6 Los Angeles + 5 │ 1.26571e6 Mendocino + 6 │ 1.0545e6 Merced + 7 │ 5.7721e6 Orange + 8 │ 3.2422e6 Plumas + 9 │ 9.20698e6 San Diego + 10 │ 1.03541e7 San Joaquin + 11 │ 3.22122e6 Santa Clara ``` Use the replicate design to compute standard errors of the estimated totals. ```jldoctest totallabel julia> total(:api00, :cname, bclus1) 11×3 DataFrame - Row │ cname total SE - │ String15 Float64 Float64 -─────┼──────────────────────────────────────────── - 1 │ Santa Clara 6.44244e5 4.2273e5 - 2 │ San Diego 1.2276e6 8.62727e5 - 3 │ Merced 70300.2 71336.3 - 4 │ Los Angeles 3.2862e5 2.93936e5 - 5 │ Orange 3.84807e5 3.88014e5 - 6 │ Fresno 63903.1 64781.7 - 7 │ Plumas 2.16147e5 2.12089e5 - 8 │ Alameda 249080.0 2.49228e5 - 9 │ San Joaquin 6.90276e5 6.81604e5 - 10 │ Kern 30631.5 30870.3 - 11 │ Mendocino 84380.6 80215.9 + Row │ total SE cname + │ Float64 Float64 String15 +─────┼──────────────────────────────────────── + 1 │ 3.22122e6 2.6143e6 Santa Clara + 2 │ 9.20698e6 8.00251e6 San Diego + 3 │ 1.0545e6 9.85983e5 Merced + 4 │ 2.46465e6 2.15017e6 Los Angeles + 5 │ 5.7721e6 5.40929e6 Orange + 6 │ 9.58547e5 8.95488e5 Fresno + 7 │ 3.2422e6 3.03494e6 Plumas + 8 │ 3.7362e6 3.49184e6 Alameda + 9 │ 1.03541e7 9.69862e6 San Joaquin + 10 │ 459473.0 4.30027e5 Kern + 11 │ 1.26571e6 1.18696e6 Mendocino ``` """ function total(x::Symbol, domain, design::AbstractSurveyDesign) - df = bydomain(x, domain, design, wsum) - rename!(df, :statistic => :total) -end + df = bydomain(x, domain, design, total) + return df +end \ No newline at end of file diff --git a/test/SurveyDesign.jl b/test/SurveyDesign.jl index 5084c1e4..4cb65b79 100644 --- a/test/SurveyDesign.jl +++ b/test/SurveyDesign.jl @@ -36,6 +36,18 @@ ### Weights as non-numeric error apisrs = copy(apisrs_original) @test_throws ArgumentError SurveyDesign(apisrs, weights = :stype) + + ### popsize and weights as symbols + + apisrs = copy(apisrs_original) + srs_pop_weights = SurveyDesign(apisrs, weights =:pw, popsize = :fpc) + @test srs_pop_weights.data[!, srs_pop_weights.weights][1] ≈ 30.97 atol = 1e-4 + @test srs_pop_weights.data[!, srs_pop_weights.weights] == 1 ./ srs_pop_weights.data[!, srs_pop_weights.allprobs] + @test srs_pop_weights.data[!, srs_pop_weights.allprobs] ≈ srs_pop_weights.data[!, :derived_probs] atol = 1e-4 + @test srs_pop_weights.data[!, srs_pop_weights.sampsize] ≈ srs_pop_weights.data[!, :derived_sampsize] atol = 1e-4 + ### Both ways should achieve same weights and allprobs! + @test srs_pop_weights.data[!, srs_pop_weights.weights] == srs_weights.data[!, srs_weights.weights] + end @testset "SurveyDesign_strat" begin diff --git a/test/ratio.jl b/test/ratio.jl index b9865f09..5d46bb75 100644 --- a/test/ratio.jl +++ b/test/ratio.jl @@ -1,14 +1,14 @@ @testset "ratio.jl" begin # on :api00 - @test ratio(:api00, :enroll, srs).ratio[1] ≈ 1.1231 atol = 1e-4 - @test ratio(:api00, :enroll, dclus1).ratio[1] ≈ 1.17182 atol = 1e-4 - @test ratio(:api00, :enroll, dclus1_boot).SE[1] ≈ 0.1275446 atol = 1e-1 - @test ratio(:api00, :enroll, dclus1_boot).ratio[1] ≈ 1.17182 atol = 1e-4 - @test ratio(:api00, :enroll, bstrat).ratio[1] ≈ 1.11256 atol = 1e-4 - @test ratio(:api00, :enroll, bstrat).SE[1] ≈ 0.04185 atol = 1e-1 - @test ratio(:api00, :enroll, bstrat).ratio[1] ≈ 1.11256 atol = 1e-4 + @test ratio([:api00, :enroll], srs).ratio[1] ≈ 1.1231 atol = 1e-4 + @test ratio([:api00, :enroll], dclus1).ratio[1] ≈ 1.17182 atol = 1e-4 + @test ratio([:api00, :enroll], dclus1_boot).SE[1] ≈ 0.1275446 atol = 1e-1 + @test ratio([:api00, :enroll], dclus1_boot).estimator[1] ≈ 1.17182 atol = 1e-4 + @test ratio([:api00, :enroll], bstrat).estimator[1] ≈ 1.11256 atol = 1e-4 + @test ratio([:api00, :enroll], bstrat).SE[1] ≈ 0.04185 atol = 1e-1 + @test ratio([:api00, :enroll], bstrat).estimator[1] ≈ 1.11256 atol = 1e-4 # on :api99 - @test ratio(:api99, :enroll, bsrs).ratio[1] ≈ 1.06854 atol = 1e-4 - @test ratio(:api99, :enroll, dstrat).ratio[1] ≈ 1.0573 atol = 1e-4 - @test ratio(:api99, :enroll, bstrat).ratio[1] ≈ 1.0573 atol = 1e-4 + @test ratio([:api99, :enroll], bsrs).estimator[1] ≈ 1.06854 atol = 1e-4 + @test ratio([:api99, :enroll], dstrat).ratio[1] ≈ 1.0573 atol = 1e-4 + @test ratio([:api99, :enroll], bstrat).estimator[1] ≈ 1.0573 atol = 1e-4 end diff --git a/test/total.jl b/test/total.jl index b7780263..119d6e45 100644 --- a/test/total.jl +++ b/test/total.jl @@ -171,8 +171,8 @@ end # Test multiple domains passed at once tot = total(:api00, [:stype,:cname], dclus1_boot) - @test filter(row -> row[:cname] == "Los Angeles" && row[:stype] == "E", tot).SE[1] ≈ 343365 rtol = STAT_TOL - @test filter(row -> row[:cname] == "Merced" && row[:stype] == "H", tot).SE[1] ≈ 27090.33 rtol = STAT_TOL + @test filter(row -> row[:stype_cname] == "E-Los Angeles", tot).SE[1] ≈ 343365 rtol = STAT_TOL + @test filter(row -> row[:stype_cname] == "H-Merced", tot).SE[1] ≈ 27090.33 rtol = STAT_TOL #### Why doesnt this syntax produce domain estimates?? # Test that column specifiers from DataFrames make it through this pipeline