Merge pull request #256 from xKDR/v0.1.1

Version 0.1.1 into main
xKDR · Apr 10, 2023 · 0703cfc
2 parents f9aa828 + 31a80ef
commit 0703cfc
Showing 19 changed files with 700 additions and 48 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -2,6 +2,17 @@
 
 # Contributing to Survey.jl
 
+  * [Overview](#overview)
+  * [Reporting Issues](#reporting-issues)
+  * [Setting up development workflow](#setting-up-development-workflow)
+  * [Modifying an existing docstring in `src/`](#modifying-an-existing-docstring-in-src)
+  * [Adding a new docstring to `src/`](#adding-a-new-docstring-to-src)
+  * [Doctests](#doctests)
+  * [Integration with exisiting API](#integration-with-exisiting-api)
+  * [Contributing](#contributing)
+  * [Style Guidelines](#style-guidelines)
+  * [Git Recommendations For Pull Requests](#git-recommendations-for-pull-requests)
+
 ## Overview
 Thank you for thinking about making contributions to Survey.jl!  
 We aim to keep our contribution guidelines consistent with those of DataFrames.jl, which is the main upstream dependency for the project. 
@@ -16,6 +27,46 @@ Reading through the ColPrac guide for collaborative practices is highly recommen
   (`Pkg.add(name="Survey", rev="main")`) is a good gut check and can streamline the process,
   along with including the first two lines of output from `versioninfo()`
 
+## Setting up development workflow
+
+The tutorial below uses Windows Subsystem for Linux (WSL) and VSCode. Linux/macOS/BSD users can skip the WSL-specific steps.
+
+1. Install Ubuntu on WSL from the [Ubuntu website](https://ubuntu.com/wsl) or the Microsoft Store
+2. Create a fork of the [Survey.jl repository](https://github.com/xKDR/Survey.jl). You will only ever work on this fork, and submit Pull Requests from it to the main repo. 
+3. Copy the SSH link from your fork by clicking the green `<> Code` icon and then `SSH`. 
+    - You must already have SSH set up for this to work. If you don't, look up a tutorial on how to clone a GitHub repository using SSH.
+4. Open a WSL terminal, and run:
+    - `curl -fsSL https://install.julialang.org | sh`
+    - `git clone git@github.com:your_username/Survey.jl.git` (replace `your_username` with your GitHub username)
+    - `cd Survey.jl`
+    - `julia`
+5. You are now in the Julia REPL. Run:
+    - `import Pkg; Pkg.add("Revise")`
+    - `import Pkg; Pkg.add("Survey")`
+    - `import Pkg; Pkg.add("Test")`
+    - `] dev .`
+6. Open VSCode and install the following extensions:
+    - WSL 
+    - Julia
+7. Go back to your WSL terminal, navigate to the folder of your repo, and run `code .` to open VSCode in that folder.
+8. Optionally, create a `dev` folder (it is gitignored by default) and a `test.jl` file inside it. Paste this block of code and save:
+
+```julia
+using Revise, Survey, Test
+
+@testset "ratio.jl" begin
+    apiclus1 = load_data("apiclus1")
+    dclus1 = SurveyDesign(apiclus1; clusters=:dnum, strata=:stype, weights=:pw)
+    @test ratio(:api00, :enroll, dclus1).ratio[1] ≈ 1.17182 atol = 1e-4
+end
+```
+
+9. In the WSL terminal (not the Julia REPL), run `julia dev/test.jl`  
+✅ If you get no errors, your setup is now complete!
+
+You can keep working in the `dev` folder, which is .gitignored.  
+Once you have working code and tests, you can move them to the appropriate folders, commit, push, and submit a Pull Request.  
+Make sure to read the rest of this document so you can learn the best practices and guidelines for this project.  
+
 ## Modifying an existing docstring in `src/`
 
 All docstrings are written inline above the methods or types they are associated with and can
@@ -94,7 +145,7 @@ This way you are modifying as little as possible of previously written code, and
 * If you want to propose a new functionality it is strongly recommended to open an issue first and reach a decision on the final design.
   Then a pull request serves as an implementation of the agreed design.
 * If you are a new contributor and would like to get a guidance on what area
-  you could focus your first PR please do not hesitate to ask and JuliaData members
+  you could focus your first PR on, please do not hesitate to ask, and community members
   will help you with picking a topic matching your experience.
 * Feel free to open, or comment on, an issue and solicit feedback early on,
   especially if you're unsure about aligning with design goals and direction,
@@ -104,22 +155,15 @@ This way you are modifying as little as possible of previously written code, and
 * Aim for atomic commits, if possible, e.g. `change 'foo' behavior like so` &
   `'bar' handles such and such corner case`,
   rather than `update 'foo' and 'bar'` & `fix typo` & `fix 'bar' better`.
-* Pull requests are tested against release and development branches of Julia,
-  so using `Pkg.test("DataFrames")` as you develop can be helpful.
+* Pull requests are tested against release branches of Julia,
+  so using `Pkg.test("Survey")` as you develop can be helpful.
 * The style guidelines outlined below are not the personal style of most contributors,
   but for consistency throughout the project, we've adopted them.
-* It is recommended to disable GitHub Actions on your fork; check Settings > Actions.
 * If a PR adds a new exported name then make sure to add a docstring for it and
   add a reference to it in the documentation.
 * A PR with breaking changes should have `[BREAKING]` as a first part of its name.
-* If a PR changes or adds functionality please update NEWS.md file accordingly as
-  a part of the PR (along with the link to the PR); please do not add entries
-  to NEWS.md for changes that are bug fixes or are not user visible, such as
-  adding tests, updating documentation or improving code layout.
-* If you make a PR please try to avoid pushing many small commits to GitHub in
-  a sequence as each such commit triggers a separate CI job, which takes over
-  an hour. This has a consequence of making other PRs in packages from the JuliaData
-  ecosystem wait for such CI jobs to finish as hey share a common pool of CI resources.
+* A PR which is still draft or work in progress should have `WIP:` as a first part of its name.
+* If you make a PR, please try to avoid pushing many small commits to GitHub in a sequence, as each such commit triggers a separate CI job. This takes computational time and is not a good use of the small pool of CI resources.
 
 ## Style Guidelines
 

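Note on the `dev/test.jl` snippet in the CONTRIBUTING.md diff above: it relies on the trailing `atol` form of `@test`. The sketch below shows that this form is equivalent to calling `isapprox` with an absolute tolerance. The two numbers are hypothetical values chosen for illustration and are not output from Survey.jl.

```julia
using Test

# Hypothetical values for illustration only; not produced by Survey.jl.
estimate = 1.17185
expected = 1.17182

@testset "atol comparison" begin
    # Trailing keyword form, as used in dev/test.jl
    @test estimate ≈ expected atol = 1e-4
    # Equivalent explicit call
    @test isapprox(estimate, expected; atol = 1e-4)
    # The difference (about 3e-5) is larger than a tolerance of 1e-6
    @test !isapprox(estimate, expected; atol = 1e-6)
end
```

When `atol` is given and positive, the relative tolerance defaults to zero, so only the absolute difference decides whether the test passes.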
diff --git a/Project.toml b/Project.toml
@@ -1,7 +1,7 @@
 name = "Survey"
 uuid = "c1a98b4d-6cd2-47ec-b9e9-69b59c35373c"
 authors = ["Ayush Patnaik <[email protected]>"]
-version = "0.1.0"
+version = "0.2.0"
 
 [deps]
 AlgebraOfGraphics = "cbdf2221-f076-402e-a563-3d30da359d67"

diff --git a/README.md b/README.md
@@ -99,7 +99,8 @@ cluster: none
 popsize: [6190.0, 6190.0, 6190.0  …  6190.0]
 sampsize: [200, 200, 200  …  200]
 weights: [31.0, 31.0, 31.0  …  31.0]
-probs: [0.0323, 0.0323, 0.0323  …  0.0323]
+allprobs: [0.0323, 0.0323, 0.0323  …  0.0323]
+type: bootstrap
 replicates: 1000
 
 julia> mean(:api00, bootsrs)

diff --git a/docs/Project.toml b/docs/Project.toml
@@ -1,5 +1,7 @@
 [deps]
+CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
+DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
 Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
-Survey = "c1a98b4d-6cd2-47ec-b9e9-69b59c35373c"
 Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
 StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
+Survey = "c1a98b4d-6cd2-47ec-b9e9-69b59c35373c"
diff --git a/docs/src/api.md b/docs/src/api.md
@@ -14,6 +14,8 @@ SurveyDesign
 ReplicateDesign
 load_data
 bootweights
+jackknifeweights
+jackknife_variance
 mean
 total
 quantile

diff --git a/src/Survey.jl b/src/Survey.jl
@@ -25,6 +25,7 @@ include("boxplot.jl")
 include("show.jl")
 include("ratio.jl")
 include("by.jl")
+include("jackknife.jl")
 
 export load_data
 export AbstractSurveyDesign, SurveyDesign, ReplicateDesign
@@ -35,5 +36,6 @@ export hist, sturges, freedman_diaconis
 export boxplot
 export bootweights
 export ratio
+export jackknifeweights, jackknife_variance
 
 end
diff --git a/src/SurveyDesign.jl b/src/SurveyDesign.jl
@@ -126,14 +126,117 @@ end
 """
     ReplicateDesign <: AbstractSurveyDesign
 
-Survey design obtained by replicating an original design using [`bootweights`](@ref).
+Survey design obtained by replicating an original design using [`bootweights`](@ref). If
+replicate weights are available, then they can be used to directly create a `ReplicateDesign`.
 
-```jldoctest
+# Constructors
+
+```julia
+ReplicateDesign(
+    data::AbstractDataFrame,
+    replicate_weights::Vector{Symbol};
+    clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
+    strata::Union{Nothing,Symbol} = nothing,
+    popsize::Union{Nothing,Symbol} = nothing,
+    weights::Union{Nothing,Symbol} = nothing
+)
+
+ReplicateDesign(
+    data::AbstractDataFrame,
+    replicate_weights::UnitRange{Int};
+    clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
+    strata::Union{Nothing,Symbol} = nothing,
+    popsize::Union{Nothing,Symbol} = nothing,
+    weights::Union{Nothing,Symbol} = nothing
+)
+
+ReplicateDesign(
+    data::AbstractDataFrame,
+    replicate_weights::Regex;
+    clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
+    strata::Union{Nothing,Symbol} = nothing,
+    popsize::Union{Nothing,Symbol} = nothing,
+    weights::Union{Nothing,Symbol} = nothing
+)
+```
+
+# Arguments
+
+The constructor has the same arguments as [`SurveyDesign`](@ref). The only additional argument is `replicate_weights`, which can
+be of one of the following types.
+
+- `Vector{Symbol}`: In this case, each `Symbol` in the vector should represent a column of `data` containing the replicate weights.
+- `UnitRange{Int}`: For instance, this could be `5:10`, meaning that the replicate weights are contained in columns 5 through 10.
+- `Regex`: In this case, all the columns of `data` which match this `Regex` will be treated as the columns containing the replicate weights.
+
+All the columns containing the replicate weights will be renamed to the form `replicate_i`, where `i` ranges from 1 to the number of columns containing the replicate weights.
+
+# Examples
+
+Here is an example where the [`bootweights`](@ref) function is used to create a `ReplicateDesign`.
+
+```jldoctest replicate-design; setup = :(using Survey, CSV, DataFrames)
 julia> apistrat = load_data("apistrat");
 
 julia> dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw);
 
-julia> bootstrat = bootweights(dstrat; replicates=1000)
+julia> bootstrat = bootweights(dstrat; replicates=1000)     # creating a ReplicateDesign using bootweights
+ReplicateDesign:
+data: 200×1044 DataFrame
+strata: stype
+    [E, E, E  …  H]
+cluster: none
+popsize: [4420.9999, 4420.9999, 4420.9999  …  755.0]
+sampsize: [100, 100, 100  …  50]
+weights: [44.21, 44.21, 44.21  …  15.1]
+allprobs: [0.0226, 0.0226, 0.0226  …  0.0662]
+type: bootstrap
+replicates: 1000
+
+```
+
+If the replicate weights are given to us already, then we can directly pass them to the `ReplicateDesign` constructor. For instance, in
+the above example, suppose we had the `bootstrat` data as a CSV file (for this example, we also rename the columns containing the replicate weights to the form `r_i`).
+
+```jldoctest replicate-design
+julia> using CSV;
+
+julia> DataFrames.rename!(bootstrat.data, ["replicate_"*string(index) => "r_"*string(index) for index in 1:1000]);
+
+julia> CSV.write("apistrat_withreplicates.csv", bootstrat.data);
+
+```
+
+We can now pass the replicate weights directly to the `ReplicateDesign` constructor, either as a `Vector{Symbol}`, a `UnitRange` or a `Regex`.
+
+```jldoctest replicate-design
+julia> bootstrat_direct = ReplicateDesign(CSV.read("apistrat_withreplicates.csv", DataFrame), [Symbol("r_"*string(replicate)) for replicate in 1:1000]; strata=:stype, weights=:pw)
+ReplicateDesign:
+data: 200×1044 DataFrame
+strata: stype
+    [E, E, E  …  H]
+cluster: none
+popsize: [4420.9999, 4420.9999, 4420.9999  …  755.0]
+sampsize: [100, 100, 100  …  50]
+weights: [44.21, 44.21, 44.21  …  15.1]
+allprobs: [0.0226, 0.0226, 0.0226  …  0.0662]
+type: bootstrap
+replicates: 1000
+
+julia> bootstrat_unitrange = ReplicateDesign(CSV.read("apistrat_withreplicates.csv", DataFrame), UnitRange(45:1044);strata=:stype, weights=:pw)
+ReplicateDesign:
+data: 200×1044 DataFrame
+strata: stype
+    [E, E, E  …  H]
+cluster: none
+popsize: [4420.9999, 4420.9999, 4420.9999  …  755.0]
+sampsize: [100, 100, 100  …  50]
+weights: [44.21, 44.21, 44.21  …  15.1]
+allprobs: [0.0226, 0.0226, 0.0226  …  0.0662]
+type: bootstrap
+replicates: 1000
+
+julia> bootstrat_regex = ReplicateDesign(CSV.read("apistrat_withreplicates.csv", DataFrame), r"r_\\d";strata=:stype, weights=:pw)
 ReplicateDesign:
 data: 200×1044 DataFrame
 strata: stype
@@ -143,8 +246,11 @@ popsize: [4420.9999, 4420.9999, 4420.9999  …  755.0]
 sampsize: [100, 100, 100  …  50]
 weights: [44.21, 44.21, 44.21  …  15.1]
 allprobs: [0.0226, 0.0226, 0.0226  …  0.0662]
+type: bootstrap
 replicates: 1000
+
 ```
+
 """
 struct ReplicateDesign <: AbstractSurveyDesign
     data::AbstractDataFrame
@@ -155,5 +261,96 @@ struct ReplicateDesign <: AbstractSurveyDesign
     weights::Symbol # Effective weights in case of singlestage approx supported
     allprobs::Symbol # Right now only singlestage approx supported
     pps::Bool
+    type::String
     replicates::UInt
+    replicate_weights::Vector{Symbol}
+
+    # default constructor
+    function ReplicateDesign(
+        data::DataFrame,
+        cluster::Symbol,
+        popsize::Symbol,
+        sampsize::Symbol,
+        strata::Symbol,
+        weights::Symbol,
+        allprobs::Symbol,
+        pps::Bool,
+        type::String,
+        replicates::UInt,
+        replicate_weights::Vector{Symbol}
+    )
+        new(data, cluster, popsize, sampsize, strata, weights, allprobs,
+           pps, type, replicates, replicate_weights)
+    end
+
+    # constructor with given replicate_weights
+    function ReplicateDesign(
+        data::AbstractDataFrame,
+        replicate_weights::Vector{Symbol};
+        clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
+        strata::Union{Nothing,Symbol} = nothing,
+        popsize::Union{Nothing,Symbol} = nothing,
+        weights::Union{Nothing,Symbol} = nothing
+    )
+        # rename the replicate weights if needed
+        rename!(data, [replicate_weights[index] => "replicate_"*string(index) for index in 1:length(replicate_weights)])
+
+        # call the SurveyDesign constructor
+        base_design = SurveyDesign(
+                        data;
+                        clusters=clusters,
+                        strata=strata,
+                        popsize=popsize,
+                        weights=weights
+                      )
+        new(
+            base_design.data,
+            base_design.cluster,
+            base_design.popsize,
+            base_design.sampsize,
+            base_design.strata,
+            base_design.weights,
+            base_design.allprobs,
+            base_design.pps,
+            "bootstrap",
+            length(replicate_weights),
+            replicate_weights
+        )
+    end
+
+    # replicate weights given as a range of columns
+    ReplicateDesign(
+        data::AbstractDataFrame,
+        replicate_weights::UnitRange{Int};
+        clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
+        strata::Union{Nothing,Symbol} = nothing,
+        popsize::Union{Nothing,Symbol} = nothing,
+        weights::Union{Nothing,Symbol} = nothing
+    ) =
+        ReplicateDesign(
+            data,
+            Symbol.(names(data)[replicate_weights]);
+            clusters=clusters,
+            strata=strata,
+            popsize=popsize,
+            weights=weights
+        )
+
+    # replicate weights given as regular expression
+    ReplicateDesign(
+        data::AbstractDataFrame,
+        replicate_weights::Regex;
+        clusters::Union{Nothing,Symbol,Vector{Symbol}} = nothing,
+        strata::Union{Nothing,Symbol} = nothing,
+        popsize::Union{Nothing,Symbol} = nothing,
+        weights::Union{Nothing,Symbol} = nothing
+    ) =
+        ReplicateDesign(
+            data,
+            Symbol.(names(data)[findall(name -> occursin(replicate_weights, name), names(data))]);
+            clusters=clusters,
+            strata=strata,
+            popsize=popsize,
+            weights=weights
+        )
 end
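
Note on the three `ReplicateDesign` constructors above: the `UnitRange{Int}` and `Regex` methods both reduce their argument to a `Vector{Symbol}` and then delegate to the first method. The sketch below mirrors that column-selection logic in plain Julia, using a hypothetical vector of column names as a stand-in for `names(data)`, so it runs without Survey.jl or DataFrames.jl. It is an illustration of the idea, not the package's code.

```julia
# Stand-in for `names(data)`; these column names are hypothetical.
column_names = ["cds", "stype", "pw", "r_1", "r_2", "r_3"]

# 1. Vector{Symbol}: the replicate weight columns are named explicitly.
from_symbols = [Symbol("r_" * string(i)) for i in 1:3]

# 2. UnitRange{Int}: column positions are looked up in the list of names.
from_range = Symbol.(column_names[4:6])

# 3. Regex: every name in which the pattern occurs is selected.
from_regex = Symbol.(filter(name -> occursin(r"r_\d", name), column_names))

# Rename pairs of the same shape as those passed to `rename!` in the
# Vector{Symbol} constructor: old name => "replicate_i".
rename_pairs = [from_symbols[i] => "replicate_" * string(i) for i in eachindex(from_symbols)]
```

All three selections give the same three symbols here. Because `occursin` matches anywhere in a name, an anchored pattern such as `r"^r_\d+$"` is a safer choice when other columns might contain the same fragment. The docstring writes the pattern as `r"r_\\d"` only because the backslash has to be escaped inside the docstring's string literal.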