diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index ca099c87..986e1713 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -2,6 +2,17 @@ # Contributing to Survey.jl + * [Overview](#overview) + * [Reporting Issues](#reporting-issues) + * [Recommended workflow setup](#recommended-workflow-setup) + * [Modifying an existing docstring in `src/`](#modifying-an-existing-docstring-in--src--) + * [Adding a new docstring to `src/`](#adding-a-new-docstring-to--src--) + * [Doctests](#doctests) + * [Integration with exisiting API](#integration-with-exisiting-api) + * [Contributing](#contributing) + * [Style Guidelines](#style-guidelines) + * [Git Recommendations For Pull Requests](#git-recommendations-for-pull-requests) + ## Overview Thank you for thinking about making contributions to Survey.jl! We aim to keep consistency in contribution guidelines to DataFrames.jl, which is the main upstream dependency for the project. @@ -16,6 +27,46 @@ Reading through the ColPrac guide for collaborative practices is highly recommen (`Pkg.add(name="Survey", rev="main")`) is a good gut check and can streamline the process, along with including the first two lines of output from `versioninfo()` +## Setting up development workflow + +Below tutorial uses Windows Subsystem for Linux (WSL) and VSCode. Linux/MacOS/BSD can ignore WSL specific steps. + +1. Install Ubuntu on WSL from the [Ubuntu website](https://ubuntu.com/wsl) or the Microsoft Store +2. Create a fork of the [Survey.jl repository](https://github.com/xKDR/Survey.jl). You will only be ever working on this fork, and submitting Pull Requests to the main repo. +3. Copy the SSH link from your fork by clicking the green `<> Code` icon and then `SSH`. + - You must already have SSH setup for this to work. If you don't, look up a tutorial on how to clone a github repository using SSH. +4. Open a WSL terminal, and run : + - `curl -fsSL https://install.julialang.org | sh` + - `git clone git@github.com:your_username/Survey.jl.git` -- replace "*your_username**" + - `julia` +3. You are now in the Julia REPL, run : + - `import Pkg; Pkg.add("Revise")` + - `import Pkg; Pkg.add("Survey")` + - `import Pkg; Pkg.add("Test")` + - `] dev .` +4. Open VSCode and install the following extensions : + - WSL + - Julia +5. Go back to your WSL terminal, navigate to the folder of your repo, and run `code .` to open VSCode in that folder +6. Create a `dev` folder (only if you want, it is gitignored by default), and a `test.jl` file in the file. Paste this block of code and save : + +```julia +using Revise, Survey, Test + +@testset "ratio.jl" begin + apiclus1 = load_data("apiclus1") + dclus1 = SurveyDesign(apiclus1; clusters=:dnum, strata=:stype, weights=:pw) + @test ratio(:api00, :enroll, dclus1).ratio[1] ≈ 1.17182 atol = 1e-4 +end +``` + +9. In the WSL terminal (not Julia REPL), run `julia dev/test.jl` +✅ If you get no errors, your setup is now complete ! + +You can keep working in the `dev` folder, which is .gitignored. +Once you have working code and tests, you can move them to the appropriate folders, commit, push, and submit a Pull Request. +Make sure to read the rest of this document so you can learn the best practices and guidelines for this project. + ## Modifying an existing docstring in `src/` All docstrings are written inline above the methods or types they are associated with and can @@ -94,7 +145,7 @@ This way you are modifying as little as possible of previously written code, and * If you want to propose a new functionality it is strongly recommended to open an issue first and reach a decision on the final design. Then a pull request serves an implementation of the agreed way how things should work. * If you are a new contributor and would like to get a guidance on what area - you could focus your first PR please do not hesitate to ask and JuliaData members + you could focus your first PR please do not hesitate to ask community members will help you with picking a topic matching your experience. * Feel free to open, or comment on, an issue and solicit feedback early on, especially if you're unsure about aligning with design goals and direction, @@ -104,22 +155,15 @@ This way you are modifying as little as possible of previously written code, and * Aim for atomic commits, if possible, e.g. `change 'foo' behavior like so` & `'bar' handles such and such corner case`, rather than `update 'foo' and 'bar'` & `fix typo` & `fix 'bar' better`. -* Pull requests are tested against release and development branches of Julia, - so using `Pkg.test("DataFrames")` as you develop can be helpful. +* Pull requests are tested against release branches of Julia, + so using `Pkg.test("Survey")` as you develop can be helpful. * The style guidelines outlined below are not the personal style of most contributors, but for consistency throughout the project, we've adopted them. -* It is recommended to disable GitHub Actions on your fork; check Settings > Actions. * If a PR adds a new exported name then make sure to add a docstring for it and add a reference to it in the documentation. * A PR with breaking changes should have `[BREAKING]` as a first part of its name. -* If a PR changes or adds functionality please update NEWS.md file accordingly as - a part of the PR (along with the link to the PR); please do not add entries - to NEWS.md for changes that are bug fixes or are not user visible, such as - adding tests, updating documentation or improving code layout. -* If you make a PR please try to avoid pushing many small commits to GitHub in - a sequence as each such commit triggers a separate CI job, which takes over - an hour. This has a consequence of making other PRs in packages from the JuliaData - ecosystem wait for such CI jobs to finish as hey share a common pool of CI resources. +* A PR which is still draft or work in progress should have `WIP:` as a first part of its name. +* If you make a PR please try to avoid pushing many small commits to GitHub in a sequence as each such commit triggers a separate CI job, which takes compuational time, and not a good use of the small pool of CI resources. ## Style Guidelines diff --git a/Project.toml b/Project.toml index 792d0adc..7ecf4b9a 100644 --- a/Project.toml +++ b/Project.toml @@ -1,7 +1,7 @@ name = "Survey" uuid = "c1a98b4d-6cd2-47ec-b9e9-69b59c35373c" authors = ["Ayush Patnaik "] -version = "0.1.0" +version = "0.2.0" [deps] AlgebraOfGraphics = "cbdf2221-f076-402e-a563-3d30da359d67" diff --git a/README.md b/README.md index 70b5f82b..b64f53ac 100644 --- a/README.md +++ b/README.md @@ -99,7 +99,8 @@ cluster: none popsize: [6190.0, 6190.0, 6190.0 … 6190.0] sampsize: [200, 200, 200 … 200] weights: [31.0, 31.0, 31.0 … 31.0] -probs: [0.0323, 0.0323, 0.0323 … 0.0323] +allprobs: [0.0323, 0.0323, 0.0323 … 0.0323] +type: bootstrap replicates: 1000 julia> mean(:api00, bootsrs) diff --git a/docs/Project.toml b/docs/Project.toml index 76d580a7..f11e8201 100644 --- a/docs/Project.toml +++ b/docs/Project.toml @@ -1,5 +1,7 @@ [deps] +CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" +DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4" -Survey = "c1a98b4d-6cd2-47ec-b9e9-69b59c35373c" Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c" StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91" +Survey = "c1a98b4d-6cd2-47ec-b9e9-69b59c35373c" diff --git a/docs/src/api.md b/docs/src/api.md index 341dcbba..e3dd4f08 100644 --- a/docs/src/api.md +++ b/docs/src/api.md @@ -14,6 +14,8 @@ SurveyDesign ReplicateDesign load_data bootweights +jackknifeweights +jackknife_variance mean total quantile diff --git a/src/Survey.jl b/src/Survey.jl index 62f263a2..6960cfb5 100644 --- a/src/Survey.jl +++ b/src/Survey.jl @@ -25,6 +25,7 @@ include("boxplot.jl") include("show.jl") include("ratio.jl") include("by.jl") +include("jackknife.jl") export load_data export AbstractSurveyDesign, SurveyDesign, ReplicateDesign @@ -35,5 +36,6 @@ export hist, sturges, freedman_diaconis export boxplot export bootweights export ratio +export jackknifeweights, jackknife_variance end diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl index 295ce50c..f4c28ab4 100644 --- a/src/SurveyDesign.jl +++ b/src/SurveyDesign.jl @@ -126,14 +126,117 @@ end """ ReplicateDesign <: AbstractSurveyDesign -Survey design obtained by replicating an original design using [`bootweights`](@ref). +Survey design obtained by replicating an original design using [`bootweights`](@ref). If +replicate weights are available, then they can be used to directly create a `ReplicateDesign`. -```jldoctest +# Constructors + +```julia +ReplicateDesign( + data::AbstractDataFrame, + replicate_weights::Vector{Symbol}; + clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing, + strata::Union{Nothing,Symbol} = nothing, + popsize::Union{Nothing,Symbol} = nothing, + weights::Union{Nothing,Symbol} = nothing +) + +ReplicateDesign( + data::AbstractDataFrame, + replicate_weights::UnitIndex{Int}; + clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing, + strata::Union{Nothing,Symbol} = nothing, + popsize::Union{Nothing,Symbol} = nothing, + weights::Union{Nothing,Symbol} = nothing +) + +ReplicateDesign( + data::AbstractDataFrame, + replicate_weights::Regex; + clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing, + strata::Union{Nothing,Symbol} = nothing, + popsize::Union{Nothing,Symbol} = nothing, + weights::Union{Nothing,Symbol} = nothing +) +``` + +# Arguments + +The constructor has the same arguments as [`SurveyDesign`](@ref). The only additional argument is `replicate_weights`, which can +be of one of the following types. + +- `Vector{Symbol}`: In this case, each `Symbol` in the vector should represent a column of `data` containing the replicate weights. +- `UnitIndex{Int}`: For instance, this could be UnitRange(5:10). This will mean that the replicate weights are contained in columns 5 through 10. +- `Regex`: In this case, all the columns of `data` which match this `Regex` will be treated as the columns containing the replicate weights. + +All the columns containing the replicate weights will be renamed to the form `replicate_i`, where `i` ranges from 1 to the number of columns containing the replicate weights. + +# Examples + +Here is an example where the [`bootweights`](@ref) function is used to create a `ReplicateDesign`. + +```jldoctest replicate-design; setup = :(using Survey, CSV, DataFrames) julia> apistrat = load_data("apistrat"); julia> dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw); -julia> bootstrat = bootweights(dstrat; replicates=1000) +julia> bootstrat = bootweights(dstrat; replicates=1000) # creating a ReplicateDesign using bootweights +ReplicateDesign: +data: 200×1044 DataFrame +strata: stype + [E, E, E … H] +cluster: none +popsize: [4420.9999, 4420.9999, 4420.9999 … 755.0] +sampsize: [100, 100, 100 … 50] +weights: [44.21, 44.21, 44.21 … 15.1] +allprobs: [0.0226, 0.0226, 0.0226 … 0.0662] +type: bootstrap +replicates: 1000 + +``` + +If the replicate weights are given to us already, then we can directly pass them to the `ReplicateDesign` constructor. For instance, in +the above example, suppose we had the `bootstrat` data as a CSV file (for this example, we also rename the columns containing the replicate weights to the form `r_i`). + +```jldoctest replicate-design +julia> using CSV; + +julia> DataFrames.rename!(bootstrat.data, ["replicate_"*string(index) => "r_"*string(index) for index in 1:1000]); + +julia> CSV.write("apistrat_withreplicates.csv", bootstrat.data); + +``` + +We can now pass the replicate weights directly to the `ReplicateDesign` constructor, either as a `Vector{Symbol}`, a `UnitRange` or a `Regex`. + +```jldoctest replicate-design +julia> bootstrat_direct = ReplicateDesign(CSV.read("apistrat_withreplicates.csv", DataFrame), [Symbol("r_"*string(replicate)) for replicate in 1:1000]; strata=:stype, weights=:pw) +ReplicateDesign: +data: 200×1044 DataFrame +strata: stype + [E, E, E … H] +cluster: none +popsize: [4420.9999, 4420.9999, 4420.9999 … 755.0] +sampsize: [100, 100, 100 … 50] +weights: [44.21, 44.21, 44.21 … 15.1] +allprobs: [0.0226, 0.0226, 0.0226 … 0.0662] +type: bootstrap +replicates: 1000 + +julia> bootstrat_unitrange = ReplicateDesign(CSV.read("apistrat_withreplicates.csv", DataFrame), UnitRange(45:1044);strata=:stype, weights=:pw) +ReplicateDesign: +data: 200×1044 DataFrame +strata: stype + [E, E, E … H] +cluster: none +popsize: [4420.9999, 4420.9999, 4420.9999 … 755.0] +sampsize: [100, 100, 100 … 50] +weights: [44.21, 44.21, 44.21 … 15.1] +allprobs: [0.0226, 0.0226, 0.0226 … 0.0662] +type: bootstrap +replicates: 1000 + +julia> bootstrat_regex = ReplicateDesign(CSV.read("apistrat_withreplicates.csv", DataFrame), r"r_\\d";strata=:stype, weights=:pw) ReplicateDesign: data: 200×1044 DataFrame strata: stype @@ -143,8 +246,11 @@ popsize: [4420.9999, 4420.9999, 4420.9999 … 755.0] sampsize: [100, 100, 100 … 50] weights: [44.21, 44.21, 44.21 … 15.1] allprobs: [0.0226, 0.0226, 0.0226 … 0.0662] +type: bootstrap replicates: 1000 + ``` + """ struct ReplicateDesign <: AbstractSurveyDesign data::AbstractDataFrame @@ -155,5 +261,96 @@ struct ReplicateDesign <: AbstractSurveyDesign weights::Symbol # Effective weights in case of singlestage approx supported allprobs::Symbol # Right now only singlestage approx supported pps::Bool + type::String replicates::UInt + replicate_weights::Vector{Symbol} + + # default constructor + function ReplicateDesign( + data::DataFrame, + cluster::Symbol, + popsize::Symbol, + sampsize::Symbol, + strata::Symbol, + weights::Symbol, + allprobs::Symbol, + pps::Bool, + type::String, + replicates::UInt, + replicate_weights::Vector{Symbol} + ) + new(data, cluster, popsize, sampsize, strata, weights, allprobs, + pps, type, replicates, replicate_weights) + end + + # constructor with given replicate_weights + function ReplicateDesign( + data::AbstractDataFrame, + replicate_weights::Vector{Symbol}; + clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing, + strata::Union{Nothing,Symbol} = nothing, + popsize::Union{Nothing,Symbol} = nothing, + weights::Union{Nothing,Symbol} = nothing + ) + # rename the replicate weights if needed + rename!(data, [replicate_weights[index] => "replicate_"*string(index) for index in 1:length(replicate_weights)]) + + # call the SurveyDesign constructor + base_design = SurveyDesign( + data; + clusters=clusters, + strata=strata, + popsize=popsize, + weights=weights + ) + new( + base_design.data, + base_design.cluster, + base_design.popsize, + base_design.sampsize, + base_design.strata, + base_design.weights, + base_design.allprobs, + base_design.pps, + "bootstrap", + length(replicate_weights), + replicate_weights + ) + end + + # replicate weights given as a range of columns + ReplicateDesign( + data::AbstractDataFrame, + replicate_weights::UnitRange{Int}; + clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing, + strata::Union{Nothing,Symbol} = nothing, + popsize::Union{Nothing,Symbol} = nothing, + weights::Union{Nothing,Symbol} = nothing + ) = + ReplicateDesign( + data, + Symbol.(names(data)[replicate_weights]); + clusters=clusters, + strata=strata, + popsize=popsize, + weights=weights + ) + + # replicate weights given as regular expression + ReplicateDesign( + data::AbstractDataFrame, + replicate_weights::Regex; + clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing, + strata::Union{Nothing,Symbol} = nothing, + popsize::Union{Nothing,Symbol} = nothing, + weights::Union{Nothing,Symbol} = nothing + ) = + ReplicateDesign( + data, + Symbol.(names(data)[findall(name -> occursin(replicate_weights, name), names(data))]); + clusters=clusters, + strata=strata, + popsize=popsize, + weights=weights + ) end diff --git a/src/bootstrap.jl b/src/bootstrap.jl index 7e01247f..ef66ad9f 100644 --- a/src/bootstrap.jl +++ b/src/bootstrap.jl @@ -18,33 +18,25 @@ popsize: [757, 757, 757 … 757] sampsize: [15, 15, 15 … 15] weights: [50.4667, 50.4667, 50.4667 … 50.4667] allprobs: [0.0198, 0.0198, 0.0198 … 0.0198] +type: bootstrap replicates: 1000 ``` """ function bootweights(design::SurveyDesign; replicates = 4000, rng = MersenneTwister(1234)) stratified = groupby(design.data, design.strata) H = length(keys(stratified)) - substrata_dfs = DataFrame[] + substrata_dfs = Vector{DataFrame}(undef, H) for h = 1:H substrata = DataFrame(stratified[h]) cluster_sorted = sort(substrata, design.cluster) - psus = unique(cluster_sorted[!, design.cluster]) - npsus = [(count(==(i), cluster_sorted[!, design.cluster])) for i in psus] - nh = length(psus) + cluster_sorted_designcluster = cluster_sorted[!, design.cluster] cluster_weights = cluster_sorted[!, design.weights] - for replicate = 1:replicates - randinds = rand(rng, 1:(nh), (nh - 1)) - cluster_sorted[!, "replicate_"*string(replicate)] = - vcat( - [ - fill((count(==(i), randinds)) * (nh / (nh - 1)), npsus[i]) for - i = 1:nh - ]..., - ) .* cluster_weights - end - push!(substrata_dfs, cluster_sorted) + # Perform the inner loop in a type-stable function to improve runtime. + _bootweights_cluster_sorted!(cluster_sorted, cluster_weights, + cluster_sorted_designcluster, replicates, rng) + substrata_dfs[h] = cluster_sorted end - df = vcat(substrata_dfs...) + df = reduce(vcat, substrata_dfs) return ReplicateDesign( df, design.cluster, @@ -54,6 +46,27 @@ function bootweights(design::SurveyDesign; replicates = 4000, rng = MersenneTwis design.weights, design.allprobs, design.pps, - replicates, + "bootstrap", + UInt(replicates), + [Symbol("replicate_"*string(replicate)) for replicate in 1:replicates] ) end + +function _bootweights_cluster_sorted!(cluster_sorted, + cluster_weights, cluster_sorted_designcluster, replicates, rng) + + psus = unique(cluster_sorted_designcluster) + npsus = [count(==(i), cluster_sorted_designcluster) for i in psus] + nh = length(psus) + for replicate = 1:replicates + randinds = rand(rng, 1:(nh), (nh - 1)) + cluster_sorted[!, "replicate_"*string(replicate)] = + reduce(vcat, + [ + fill((count(==(i), randinds)) * (nh / (nh - 1)), npsus[i]) for + i = 1:nh + ] + ) .* cluster_weights + end + cluster_sorted +end diff --git a/src/by.jl b/src/by.jl index 3f955b15..f28de3b5 100644 --- a/src/by.jl +++ b/src/by.jl @@ -1,13 +1,12 @@ -function bydomain(x::Symbol, domain::Symbol, design::SurveyDesign, func::Function) +function bydomain(x::Symbol, domain, design::SurveyDesign, func::Function) gdf = groupby(design.data, domain) - nd = length(unique(design.data[!, domain])) X = combine(gdf, [x, design.weights] => ((a, b) -> func(a, weights(b))) => :statistic) return X end -function bydomain(x::Symbol, domain::Symbol, design::ReplicateDesign, func::Function) +function bydomain(x::Symbol, domain, design::ReplicateDesign, func::Function) gdf = groupby(design.data, domain) - nd = length(unique(design.data[!, domain])) + nd = length(gdf) X = combine(gdf, [x, design.weights] => ((a, b) -> func(a, weights(b))) => :statistic) Xt_mat = Array{Float64,2}(undef, (nd, design.replicates)) for i = 1:design.replicates diff --git a/src/jackknife.jl b/src/jackknife.jl new file mode 100644 index 00000000..315de348 --- /dev/null +++ b/src/jackknife.jl @@ -0,0 +1,159 @@ +""" + jackknifeweights(design::SurveyDesign) + +Delete-1 Jackknife algorithm for replication weights from sampling weights. The replicate weights are calculated using the following formula. + +```math +w_{i(hj)} = +\\begin{cases} + w_i\\quad\\quad &\\text{if observation unit }i\\text{ is not in stratum }h\\\\ + 0\\quad\\quad &\\text{if observation unit }i\\text{ is in psu }j\\text{of stratum }h\\\\ + \\dfrac{n_h}{n_h - 1}w_i \\quad\\quad &\\text{if observation unit }i\\text{ is in stratum }h\\text{ but not in psu }j\\\\ +\\end{cases} +``` +In the above formula, ``w_i`` represent the original weights, ``w_{i(hj)}`` represent the replicate weights when the ``j``th PSU from cluster ``h`` is removed, and ``n_h`` represents the number of unique PSUs within cluster ``h``. Replicate weights are added as columns to `design.data`, and these columns have names of the form `replicate_i`, where `i` ranges from 1 to the number of replicate weight columns. + +# Examples +```jldoctest +julia> using Survey; + +julia> apistrat = load_data("apistrat"); + +julia> dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw); + +julia> rstrat = jackknifeweights(dstrat) +ReplicateDesign: +data: 200×244 DataFrame +strata: stype + [E, E, E … M] +cluster: none +popsize: [4420.9999, 4420.9999, 4420.9999 … 1018.0] +sampsize: [100, 100, 100 … 50] +weights: [44.21, 44.21, 44.21 … 20.36] +allprobs: [0.0226, 0.0226, 0.0226 … 0.0491] +type: jackknife +replicates: 200 + +``` + +# Reference +pg 380-382, Section 9.3.2 Jackknife - Sharon Lohr, Sampling Design and Analysis (2010) +""" +function jackknifeweights(design::SurveyDesign) + sort!(design.data, [design.strata, design.cluster]) + df = design.data + + stratified_gdf = groupby(df, design.strata) + replicate_index = 0 + for (key, subgroup) in pairs(stratified_gdf) + stratum = key[design.strata] # current stratum + psus_in_stratum = unique(subgroup[!, design.cluster]) + nh = length(psus_in_stratum) + + for psu in psus_in_stratum + replicate_index += 1 + colname = "replicate_"*string(replicate_index) + + # Initialise replicate_i with original sampling weights + df[!, colname] = Vector(df[!, design.weights]) + + # getting indexes + same_psu = (df[!, design.strata] .== stratum) .&& (df[!, design.cluster] .== psu) + different_psu = (df[!, design.strata] .== stratum) .&& (df[!, design.cluster] .!== psu) + + # scaling weights appropriately + df[same_psu, colname] .*= 0 + df[different_psu, colname] .*= nh/(nh - 1) + end + end + + return ReplicateDesign( + df, + design.cluster, + design.popsize, + design.sampsize, + design.strata, + design.weights, + design.allprobs, + design.pps, + "jackknife", + UInt(replicate_index), + [Symbol("replicate_"*string(index)) for index in 1:replicate_index] + ) +end + +""" + jackknife_variance(x::Symbol, func::Function, design::ReplicateDesign) + +Compute variance of column `x` for the given `func` using the Jackknife method. The formula to compute this variance is the following. + +```math +\\hat{V}_{\\text{JK}}(\\hat{\\theta}) = \\sum_{h = 1}^H \\dfrac{n_h - 1}{n_h}\\sum_{j = 1}^{n_h}(\\hat{\\theta}_{(hj)} - \\hat{\\theta})^2 +``` + +Above, ``\\hat{\\theta}`` represents the estimator computed using the original weights, and ``\\hat{\\theta_{(hj)}}`` represents the estimator computed from the replicate weights obtained when PSU ``j`` from cluster ``h`` is removed. + +# Examples +```jldoctest +julia> using Survey, StatsBase + +julia> apistrat = load_data("apistrat"); + +julia> dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw); + +julia> rstrat = jackknifeweights(dstrat) +ReplicateDesign: +data: 200×244 DataFrame +strata: stype + [E, E, E … M] +cluster: none +popsize: [4420.9999, 4420.9999, 4420.9999 … 1018.0] +sampsize: [100, 100, 100 … 50] +weights: [44.21, 44.21, 44.21 … 20.36] +allprobs: [0.0226, 0.0226, 0.0226 … 0.0491] +type: jackknife +replicates: 200 + +julia> weightedmean(x, y) = mean(x, weights(y)); + +julia> jackknife_variance(:api00, weightedmean, rstrat) +1×2 DataFrame + Row │ estimator SE + │ Float64 Float64 +─────┼──────────────────── + 1 │ 662.287 9.53613 + +``` +# Reference +pg 380-382, Section 9.3.2 Jackknife - Sharon Lohr, Sampling Design and Analysis (2010) +""" +function jackknife_variance(x::Symbol, func::Function, design::ReplicateDesign) + df = design.data + # sort!(df, [design.strata, design.cluster]) + stratified_gdf = groupby(df, design.strata) + + # estimator from original weights + θ = func(df[!, x], df[!, design.weights]) + + variance = 0 + replicate_index = 1 + for subgroup in stratified_gdf + psus_in_stratum = unique(subgroup[!, design.cluster]) + nh = length(psus_in_stratum) + cluster_variance = 0 + for psu in psus_in_stratum + # get replicate weights corresponding to current stratum and psu + rep_weights = df[!, "replicate_"*string(replicate_index)] + + # estimator from replicate weights + θhj = func(df[!, x], rep_weights) + + cluster_variance += ((nh - 1)/nh)*(θhj - θ)^2 + replicate_index += 1 + end + variance += cluster_variance + end + + return DataFrame(estimator = θ, SE = sqrt(variance)) +end + diff --git a/src/mean.jl b/src/mean.jl index 4db9fed1..580c63f2 100644 --- a/src/mean.jl +++ b/src/mean.jl @@ -46,13 +46,19 @@ julia> mean(:api00, bclus1) ``` """ function mean(x::Symbol, design::ReplicateDesign) - X = mean(design.data[!, x], weights(design.data[!, design.weights])) - Xt = [ - mean(design.data[!, x], weights(design.data[!, "replicate_"*string(i)])) for - i = 1:design.replicates - ] - variance = sum((Xt .- X) .^ 2) / design.replicates - DataFrame(mean = X, SE = sqrt(variance)) + if design.type == "bootstrap" + θ̂ = mean(design.data[!, x], weights(design.data[!, design.weights])) + θ̂t = [ + mean(design.data[!, x], weights(design.data[!, "replicate_"*string(i)])) for + i = 1:design.replicates + ] + variance = sum((θ̂t .- θ̂) .^ 2) / design.replicates + return DataFrame(mean = θ̂, SE = sqrt(variance)) + # Jackknife integration + elseif design.type == "jackknife" + weightedmean(x, y) = mean(x, weights(y)) + return jackknife_variance(x, weightedmean, design) + end end """ @@ -130,7 +136,7 @@ julia> mean(:api00, :cname, bclus1) 11 │ Mendocino 623.25 1.09545e-13 ``` """ -function mean(x::Symbol, domain::Symbol, design::AbstractSurveyDesign) +function mean(x::Symbol, domain, design::AbstractSurveyDesign) weighted_mean(x, w) = mean(x, StatsBase.weights(w)) df = bydomain(x, domain, design, weighted_mean) rename!(df, :statistic => :mean) diff --git a/src/show.jl b/src/show.jl index 2c323704..09d22b48 100644 --- a/src/show.jl +++ b/src/show.jl @@ -37,6 +37,7 @@ Base.show(io::IO, ::MIME"text/plain", design::SurveyDesign) = surveyshow(io, des function Base.show(io::IO, ::MIME"text/plain", design::ReplicateDesign) # new_io = IOContext(io, :compact=>true, :limit=>true, :displaysize=>(50, 50)) surveyshow(io, design) + printinfo(io, "\ntype", design.type; newline = false) printinfo(io, "\nreplicates", design.replicates; newline = false) end diff --git a/src/total.jl b/src/total.jl index cc9bb06d..a451324c 100644 --- a/src/total.jl +++ b/src/total.jl @@ -120,7 +120,7 @@ julia> total(:api00, :cname, bclus1) 11 │ Mendocino 84380.6 80215.9 ``` """ -function total(x::Symbol, domain::Symbol, design::AbstractSurveyDesign) +function total(x::Symbol, domain, design::AbstractSurveyDesign) df = bydomain(x, domain, design, wsum) rename!(df, :statistic => :total) end diff --git a/test/SurveyDesign.jl b/test/SurveyDesign.jl index 0d42fc19..5084c1e4 100644 --- a/test/SurveyDesign.jl +++ b/test/SurveyDesign.jl @@ -259,3 +259,45 @@ end yrbs = copy(yrbs_original) dyrbs = SurveyDesign(yrbs; clusters = :psu, strata = :stratum, weights = :weight) end + +@testset "ReplicateDesign_direct" begin + for (sample, sample_direct) in [(bsrs, bsrs_direct), (bstrat, bstrat_direct), (dclus1_boot, dclus1_boot_direct)] + @test isequal(sample.data, sample_direct.data) + @test isequal(sample.popsize, sample_direct.popsize) + @test isequal(sample.sampsize, sample_direct.sampsize) + @test isequal(sample.strata, sample_direct.strata) + @test isequal(sample.weights, sample_direct.weights) + @test isequal(sample.allprobs, sample_direct.allprobs) + @test isequal(sample.pps, sample_direct.pps) + @test isequal(sample.replicates, sample_direct.replicates) + @test isequal(sample.replicate_weights, sample_direct.replicate_weights) + end +end + +@testset "ReplicateDesign_unitrange" begin + for (sample, sample_unitrange) in [(bsrs, bsrs_unitrange), (bstrat, bstrat_unitrange), (dclus1_boot, dclus1_boot_unitrange)] + @test isequal(sample.data, sample_unitrange.data) + @test isequal(sample.popsize, sample_unitrange.popsize) + @test isequal(sample.sampsize, sample_unitrange.sampsize) + @test isequal(sample.strata, sample_unitrange.strata) + @test isequal(sample.weights, sample_unitrange.weights) + @test isequal(sample.allprobs, sample_unitrange.allprobs) + @test isequal(sample.pps, sample_unitrange.pps) + @test isequal(sample.replicates, sample_unitrange.replicates) + @test isequal(sample.replicate_weights, sample_unitrange.replicate_weights) + end +end + +@testset "ReplicateDesign_regex" begin + for (sample, sample_regex) in [(bsrs, bsrs_regex), (bstrat, bstrat_regex), (dclus1_boot, dclus1_boot_regex)] + @test isequal(sample.data, sample_regex.data) + @test isequal(sample.popsize, sample_regex.popsize) + @test isequal(sample.sampsize, sample_regex.sampsize) + @test isequal(sample.strata, sample_regex.strata) + @test isequal(sample.weights, sample_regex.weights) + @test isequal(sample.allprobs, sample_regex.allprobs) + @test isequal(sample.pps, sample_regex.pps) + @test isequal(sample.replicates, sample_regex.replicates) + @test isequal(sample.replicate_weights, sample_regex.replicate_weights) + end +end diff --git a/test/jackknife.jl b/test/jackknife.jl new file mode 100644 index 00000000..73ab5b71 --- /dev/null +++ b/test/jackknife.jl @@ -0,0 +1,88 @@ +@testset "jackknife" begin + # Create Jackknife replicate designs + bsrs_jk = srs |> jackknifeweights # SRS + dstrat_jk = dstrat |> jackknifeweights # Stratified + dclus2_jk = dclus2 |> jackknifeweights # Two stage cluster + dnhanes_jk = dnhanes |> jackknifeweights # multistage stratified + + mean_srs_jk = mean([:api00,:api99], bsrs_jk) + @test mean_srs_jk.SE[1] ≈ 9.4028 atol = 1e-3 + @test mean_srs_jk.SE[2] ≈ 9.6575 atol = 1e-3 + + mean_strat_jk = mean([:api00,:api99],dstrat_jk) + @test mean_strat_jk.SE[1] ≈ 9.5361 atol = 1e-3 + @test mean_strat_jk.SE[2] ≈ 10.097 atol = 1e-3 + + mean_clus2_jk = mean([:api00,:api99],dclus2_jk) + @test mean_clus2_jk.SE[1] ≈ 34.9388 atol = 1e-3 # R gives 34.928 + @test mean_clus2_jk.SE[2] ≈ 34.6565 atol = 1e-3 # R gives 34.645 + + # Tests using for NHANES + mean_nhanes_jk = mean([:seq1, :seq2],dnhanes_jk) + @test mean_nhanes_jk.estimator[1] ≈ 21393.96 atol = 1e-3 + @test mean_nhanes_jk.SE[1] ≈ 143.371 atol = 1e-3 # R is slightly diff in 2nd decimal place + @test mean_nhanes_jk.estimator[2] ≈ 38508.328 atol = 1e-3 + @test mean_nhanes_jk.SE[2] ≈ 258.068 atol = 1e-3 # R is slightly diff in 2nd decimal place +end + +# # R code for correctness above +# library(survey) +# data(api) +# apiclus1$pw = rep(757/15,nrow(apiclus1)) +# No corrections needed for apiclus2, it has correct weights by default + +# ############# +# ###### 23.03.22 PR#260 + +# ## srs with fpc (doesnt match Julia) +# srs <- svydesign(id=~1,data=apisrs, weights=~pw, fpc=~fpc) +# bsrs_jk <- as.svrepdesign(srs, type="JK1", compress=FALSE) +# svymean(~api00,bsrs_jk) +# # mean SE +# # api00 656.58 9.2497 + +# ## srs without fpc (matches Julia) +# srs <- svydesign(id=~1,data=apisrs, weights=~pw)#, fpc=~fpc) +# bsrs_jk <- as.svrepdesign(srs, type="JK1", compress=FALSE) +# svymean(~api00+api99,bsrs_jk) +# # mean SE +# # api00 656.58 9.4028 +# # api99 624.68 9.6575 + +# # strat with fpc (doesnt match Julia) +# dstrat <- svydesign(data=apistrat, id=~1, weights=~pw, strata=~stype,fpc=~fpc) +# dstrat_jk <- as.svrepdesign(dstrat, type="JKn", compress=FALSE) +# svymean(~api99,dstrat_jk) +# # mean SE +# # api99 629.39 9.9639 + +# # strat without fpc (matches Julia) +# dstrat <- svydesign(data=apistrat, id=~1, weights=~pw, strata=~stype)#,fpc=~fpc) +# dstrat_jk <- as.svrepdesign(dstrat, type="JKn", compress=FALSE) +# svymean(~api00+api99,dstrat_jk) +# # mean SE +# # api00 662.29 9.5361 +# # api99 629.39 10.0970 + +#### apiclus2 +# > # clus2 without fpc (doesnt match Julia) +# > dclus2 <- svydesign(id=~dnum+snum, weights=~pw, data=apiclus2) +# > dclus2_jk <- as.svrepdesign(dclus2, type="JK1", compress=FALSE) +# > svymean(~api00+api99,dclus2) +# mean SE +# api00 670.81 30.712 +# api99 645.03 30.308 +# > svymean(~api00+api99,dclus2_jk) +# mean SE +# api00 670.81 34.928 +# api99 645.03 34.645 + +# NHANES test +# > data("nhanes") +# > nhanes$seq1 = seq(1.0, 5*8591.0, by = 5) +# > nhanes$seq2 = seq(1.0, 9*8591.0, by = 9)#, length.out=8591) +# > dnhanes <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=TRUE, data=nhanes) +# > svymean(~seq1+seq2,dnhanes) +# mean SE +# seq1 21394 143.34 +# seq2 38508 258.01 \ No newline at end of file diff --git a/test/ratio.jl b/test/ratio.jl index d0f5f975..b9865f09 100644 --- a/test/ratio.jl +++ b/test/ratio.jl @@ -1,5 +1,14 @@ @testset "ratio.jl" begin + # on :api00 + @test ratio(:api00, :enroll, srs).ratio[1] ≈ 1.1231 atol = 1e-4 @test ratio(:api00, :enroll, dclus1).ratio[1] ≈ 1.17182 atol = 1e-4 @test ratio(:api00, :enroll, dclus1_boot).SE[1] ≈ 0.1275446 atol = 1e-1 @test ratio(:api00, :enroll, dclus1_boot).ratio[1] ≈ 1.17182 atol = 1e-4 + @test ratio(:api00, :enroll, bstrat).ratio[1] ≈ 1.11256 atol = 1e-4 + @test ratio(:api00, :enroll, bstrat).SE[1] ≈ 0.04185 atol = 1e-1 + @test ratio(:api00, :enroll, bstrat).ratio[1] ≈ 1.11256 atol = 1e-4 + # on :api99 + @test ratio(:api99, :enroll, bsrs).ratio[1] ≈ 1.06854 atol = 1e-4 + @test ratio(:api99, :enroll, dstrat).ratio[1] ≈ 1.0573 atol = 1e-4 + @test ratio(:api99, :enroll, bstrat).ratio[1] ≈ 1.0573 atol = 1e-4 end diff --git a/test/runtests.jl b/test/runtests.jl index f2fc89d4..d709103c 100644 --- a/test/runtests.jl +++ b/test/runtests.jl @@ -4,21 +4,49 @@ using CategoricalArrays const STAT_TOL = 1e-5 const SE_TOL = 1e-1 +TOTAL_REPLICATES = 4000 +REPLICATES_VECTOR = [Symbol("replicate_"*string(i)) for i in 1:TOTAL_REPLICATES] +REPLICATES_REGEX = r"r*_\d" # Simple random sample apisrs = load_data("apisrs") # Load API dataset srs = SurveyDesign(apisrs, weights = :pw) +unitrange = UnitRange((length(names(apisrs)) + 1):(TOTAL_REPLICATES + length(names(apisrs)))) bsrs = srs |> bootweights # Create replicate design +bsrs_direct = ReplicateDesign(bsrs.data, REPLICATES_VECTOR, weights = :pw) # using ReplicateDesign constructor +bsrs_unitrange = ReplicateDesign(bsrs.data, unitrange, weights = :pw) # using ReplicateDesign constructor +bsrs_regex = ReplicateDesign(bsrs.data, REPLICATES_REGEX, weights = :pw) # using ReplicateDesign constructor + # Stratified sample apistrat = load_data("apistrat") # Load API dataset dstrat = SurveyDesign(apistrat, strata = :stype, weights = :pw) # Create SurveyDesign +unitrange = UnitRange((length(names(apistrat)) + 1):(TOTAL_REPLICATES + length(names(apistrat)))) bstrat = dstrat |> bootweights # Create replicate design +bstrat_direct = ReplicateDesign(bstrat.data, REPLICATES_VECTOR, strata=:stype, weights=:pw) # using ReplicateDesign constructor +bstrat_unitrange = ReplicateDesign(bstrat.data, unitrange, strata=:stype, weights=:pw) # using ReplicateDesign constructor +bstrat_regex = ReplicateDesign(bstrat.data, REPLICATES_REGEX, strata=:stype, weights=:pw) # using ReplicateDesign constructor # One-stage cluster sample apiclus1 = load_data("apiclus1") # Load API dataset apiclus1[!, :pw] = fill(757 / 15, (size(apiclus1, 1),)) # Correct api mistake for pw column dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw) # Create SurveyDesign +unitrange = UnitRange((length(names(apiclus1)) + 1):(TOTAL_REPLICATES + length(names(apiclus1)))) dclus1_boot = dclus1 |> bootweights # Create replicate design +dclus1_boot_direct = ReplicateDesign(dclus1_boot.data, REPLICATES_VECTOR, clusters=:dnum, weights=:pw) # using ReplicateDesign constructor +dclus1_boot_unitrange = ReplicateDesign(dclus1_boot.data, unitrange, clusters=:dnum, weights=:pw) # using ReplicateDesign constructor +dclus1_boot_regex = ReplicateDesign(dclus1_boot.data, REPLICATES_REGEX, clusters=:dnum, weights=:pw) # using ReplicateDesign constructor + +# Two-stage cluster sample +apiclus2 = load_data("apiclus2") # Load API dataset +dclus2 = SurveyDesign(apiclus2; clusters = :dnum, weights = :pw) # Create SurveyDesign +dclus2_boot = dclus2 |> bootweights # Create replicate design + +# NHANES +nhanes = load_data("nhanes") +nhanes.seq1 = collect(1.0:5.0:42955.0) +nhanes.seq2 = collect(1.0:9.0:77319.0) # [9k for k in 0:8590.0] +dnhanes = SurveyDesign(nhanes; clusters = :SDMVPSU, strata = :SDMVSTRA, weights = :WTMEC2YR) +dnhanes_boot = dnhanes |> bootweights @testset "Survey.jl" begin @test size(load_data("apiclus1")) == (183, 40) @@ -35,3 +63,4 @@ include("hist.jl") include("boxplot.jl") include("ratio.jl") include("show.jl") +include("jackknife.jl") \ No newline at end of file diff --git a/test/show.jl b/test/show.jl index d21cb76f..3035190f 100644 --- a/test/show.jl +++ b/test/show.jl @@ -23,6 +23,7 @@ sampsize: [200, 200, 200 … 200] weights: [30.97, 30.97, 30.97 … 30.97] allprobs: [0.0323, 0.0323, 0.0323 … 0.0323] + type: bootstrap replicates: 4000""" show(io, MIME("text/plain"), bsrs) @@ -58,6 +59,7 @@ end sampsize: [100, 100, 100 … 50] weights: [44.21, 44.21, 44.21 … 15.1] allprobs: [0.0226, 0.0226, 0.0226 … 0.0662] + type: bootstrap replicates: 4000""" show(io, MIME("text/plain"), bstrat) @@ -93,6 +95,7 @@ end sampsize: [15, 15, 15 … 15] weights: [50.4667, 50.4667, 50.4667 … 50.4667] allprobs: [0.0198, 0.0198, 0.0198 … 0.0198] + type: bootstrap replicates: 4000""" show(io, MIME("text/plain"), dclus1_boot) diff --git a/test/total.jl b/test/total.jl index b8d91795..b7780263 100644 --- a/test/total.jl +++ b/test/total.jl @@ -168,4 +168,59 @@ end # equivalent R code (results cause clutter): # > svyby(~api00, ~cname, clus1rep, svytotal) # > svyby(~api00, ~cname, clus1rep, svymean) + + # Test multiple domains passed at once + tot = total(:api00, [:stype,:cname], dclus1_boot) + @test filter(row -> row[:cname] == "Los Angeles" && row[:stype] == "E", tot).SE[1] ≈ 343365 rtol = STAT_TOL + @test filter(row -> row[:cname] == "Merced" && row[:stype] == "H", tot).SE[1] ≈ 27090.33 rtol = STAT_TOL + + #### Why doesnt this syntax produce domain estimates?? + # Test that column specifiers from DataFrames make it through this pipeline + # These tests replicate what you see above...just with a different syntax. + # tot = total(:api00, Survey.DataFrames.Cols(==(:cname)), dclus1_boot) + ######## Above Survey.DataFrames.Cols(==(:cname)) syntax doesnt give domain estimates + # @test size(tot)[1] == apiclus1.cname |> unique |> length + # @test filter(:cname => ==("Los Angeles"), tot).total[1] ≈ 489980.87 rtol = STAT_TOL + # @test filter(:cname => ==("Los Angeles"), tot).SE[1] ≈ 430469.28 rtol = SE_TOL + # @test filter(:cname => ==("San Diego"), tot).total[1] ≈ 1830375.53 rtol = STAT_TOL + # @test filter(:cname => ==("San Diego"), tot).SE[1] ≈ 1298696.64 rtol = SE_TOL end + +#### R code for vector{Symbol} domain estimation +# > data(api) +# > apiclus1$pw = rep(757/15,nrow(apiclus1)) +# > ### 9.04.23 +# > dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1); +# > rclus1<-as.svrepdesign(dclus1, type="subbootstrap", compress=FALSE, replicates = 4000) +# > svyby(~api00, ~stype+cname, rclus1, svytotal) +# stype cname api00 se +# E.Alameda E Alameda 273428.40 275423.33 +# H.Alameda H Alameda 30683.73 30907.60 +# M.Alameda M Alameda 67272.07 67762.88 +# E.Fresno E Fresno 48599.40 47484.67 +# H.Fresno H Fresno 22356.73 21843.93 +# M.Fresno M Fresno 24324.93 23766.99 +# E.Kern E Kern 24930.53 24847.76 +# M.Kern M Kern 20741.80 20672.93 +# E.Los Angeles E Los Angeles 395154.00 341692.92 +# M.Los Angeles M Los Angeles 94826.87 95416.42 +# E.Mendocino E Mendocino 58844.13 57711.15 +# H.Mendocino H Mendocino 35124.80 34448.51 +# M.Mendocino M Mendocino 31844.47 31231.33 +# E.Merced E Merced 50517.13 51424.65 +# H.Merced H Merced 26696.87 27176.47 +# M.Merced M Merced 27605.27 28101.18 +# E.Orange E Orange 463536.33 465047.76 +# M.Orange M Orange 110219.20 110578.59 +# E.Plumas E Plumas 144284.20 146672.86 +# H.Plumas H Plumas 143729.07 146108.54 +# M.Plumas M Plumas 34266.87 34834.16 +# E.San Diego E San Diego 1670497.13 1233144.04 +# H.San Diego H San Diego 63386.13 63693.54 +# M.San Diego M San Diego 96492.27 96960.22 +# E.San Joaquin E San Joaquin 848243.73 848605.33 +# H.San Joaquin H San Joaquin 79585.93 79619.86 +# M.San Joaquin M San Joaquin 101387.53 101430.75 +# E.Santa Clara E Santa Clara 737418.93 484164.71 +# H.Santa Clara H Santa Clara 35478.07 35311.28 +# M.Santa Clara M Santa Clara 187685.53 131278.63 \ No newline at end of file