Skip to content

Commit

Permalink
Redid old PR update, changed/updated tests for new design
Browse files Browse the repository at this point in the history
  • Loading branch information
smishr committed Sep 19, 2022
1 parent f35be8a commit 5623a16
Show file tree
Hide file tree
Showing 13 changed files with 167 additions and 105 deletions.
1 change: 1 addition & 0 deletions Project.toml
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@ version = "0.11.1"
AlgebraOfGraphics = "cbdf2221-f076-402e-a563-3d30da359d67"
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
CairoMakie = "13f3f980-e62b-5c42-98c6-ff1f3baf88f0"
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
GLM = "38e38edf-8417-5370-95a0-9cbb8c7f171a"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
Expand Down
36 changes: 23 additions & 13 deletions src/SurveyDesign.jl
Original file line number Diff line number Diff line change
Expand Up @@ -24,27 +24,30 @@ abstract type AbstractSurveyDesign end
SimpleRandomSample <: AbstractSurveyDesign
Survey design sampled by simple random sampling.
The population size is equal to the sample size unless `popsize` is explicitly provided.
"""
struct SimpleRandomSample <: AbstractSurveyDesign
data::AbstractDataFrame
sampsize::UInt
popsize::Union{UInt,Nothing}
sampsize::Union{Nothing,Unsigned}
popsize::Union{Nothing,Unsigned}
sampfraction::Real
fpc::Real
ignorefpc::Bool
function SimpleRandomSample(data::AbstractDataFrame;
popsize = nothing,
sampsize = nrow(data),
weights = ones(nrow(data)), # Check the defaults
probs = nothing,
ignorefpc = true
)
popsize=nothing,
sampsize=nrow(data),
weights=nothing, # Check the defaults
probs=nothing,
ignorefpc=false
)
# Functionality: weights arg can be passed as Symbol instead of vector
if isa(weights, Symbol)
weights = data[!, weights]
end
# set population size if it is not given; `weights` and `sampsize` must be given
# Set population size if it is not given; `weights` and `sampsize` must be given
if ignorefpc # && (isnothing(popsize) || isnothing(weights) || isnothing(probs))

This comment has been minimized.

Copy link
@iuliadmtru

iuliadmtru Oct 8, 2022

Contributor

Is this check necessary? The weights are set after this check anyway, regardless of ignorefpc. The warning is also not necessary since if the weights are not equal the constructor gives an error (in later versions).

This comment has been minimized.

Copy link
@smishr

smishr Oct 9, 2022

Author Contributor

i will have a look into this, why this warning was added. it may have to be shifted to the elif step.

@warn "Assuming equal weights"
weights = ones(nrow(data))
end
if isnothing(popsize)
if typeof(weights) <: Vector{<:Real}
if !all(y -> y == first(weights), weights) # SRS by definition is equi-weighted
Expand All @@ -70,11 +73,14 @@ struct SimpleRandomSample <: AbstractSurveyDesign
# If all weights are equal then estimate
equal_weight = first(weights)
popsize = round(sampsize .* equal_weight) |> UInt
if sampsize > popsize
error("You have either given wrong or not enough keyword args. sampsize cannot be greate than popsize. Check given inputs. eg if weights given then popsize must be given (for now)")
end
elseif typeof(popsize) <: Vector{<:Real}

This comment has been minimized.

Copy link
@iuliadmtru

iuliadmtru Oct 8, 2022

Contributor

Isn't popsize just a number?

This comment has been minimized.

Copy link
@smishr

smishr Oct 9, 2022

Author Contributor

you are correct. for SRS it should be just a number, for other designs it will be vector. I was experimenting with more generalisable code, in any case the ./ works for both

This comment has been minimized.

Copy link
@iuliadmtru

iuliadmtru Oct 9, 2022

Contributor

It works for both, but the check typeof(popsize) <: Vector{<:Real} will never be true. So this line is wrong.

if !all(y -> y == first(popsize), popsize) # SRS by definition is equi-weighted
error("Simple Random Sample must be equi-weighted. Different sampling weights detected in vectors")
end
weights = popsize ./ sampsize
weights = popsize ./ sampsize # This line is ratio estimator, we may need to change it when doing compley surveys
popsize = first(popsize) |> UInt
else
error("If popsize not given then either sampling weights or sampling probabilities must be given")
Expand All @@ -97,6 +103,10 @@ struct SimpleRandomSample <: AbstractSurveyDesign
end
new(data, sampsize, popsize, sampfraction, fpc, ignorefpc)
end
function SimpleRandomSample(data::AbstractDataFrame)

This comment has been minimized.

Copy link
@iuliadmtru

iuliadmtru Oct 8, 2022

Contributor

Why is this constructor necessary? The only thing that's different is the default value for ignorefpc. Isn't this a conflict with the constructor above?

This comment has been minimized.

Copy link
@smishr

smishr Oct 9, 2022

Author Contributor

added this contructor to make sure that calls with no keyword arguments, only the data arg, also work like they do in R

This comment has been minimized.

Copy link
@iuliadmtru

iuliadmtru Oct 9, 2022

Contributor

If no keyword arguments are specified, then the constructor will behave like yours. I am not sure which one will be considered, if the first or the second one will be considered. This si ambiguous. But anyway, I think there's no need for a second constructor. If we want the default value of ignorefpc to be true, we should just set that in the first constructor and remove the second constructor. The way it is now, it looks like we have the same constructor with two default values for ignorefpc which I think is ambiguous.

ignorefpc = true
return SimpleRandomSample(data; popsize=nothing,sampsize=nrow(data), weights=nothing, probs=nothing, ignorefpc=ignorefpc)
end
end

# `show` method for printing information about a `SimpleRandomSample` after construction
Expand Down Expand Up @@ -141,7 +151,7 @@ struct StratifiedSample <: AbstractSurveyDesign
data[!, :sampsize] = repeat([nrow(data)], nrow(data))
data[!, :strata] = strata

new(data)
new(data,sampsize,popsize,sampfraction,fpc,nofpc)
end
end

Expand Down
36 changes: 34 additions & 2 deletions src/svymean.jl
Original file line number Diff line number Diff line change
Expand Up @@ -53,9 +53,41 @@ end
"""
Inner method for `svyby`.
"""

function sem_svyby(x::AbstractVector, design::SimpleRandomSample, weights)
# return sqrt( (1 - length(x)/design.popsize) ./ length(x) .* var(x) ) .* sqrt(1 - 1 / length(x) + 1 / design.sampsize)
# f = sqrt()
# pweights = 1 ./ design.data.prob
# N = sum(design.data.weights)
# Nd = sum(weights)
# Nd = length(x)
N = sum(weights)
Nd = length(x)
Pd = Nd/N
n = design.sampsize
n̄d = n * Pd
Ȳd = mean(x)
S2d = var(x)
Sd = sqrt(S2d)
CVd = Sd / Ȳd
@show(N,Nd,Pd,n,n̄d,Ȳd,S2d,Sd,CVd)
@show((1 ./ n̄d - 1 ./Nd ))
V̂_Ȳd = ((1 ./ n̄d) - (1 ./Nd) ) * S2d * (1 + (1 - Pd) / CVd^2 )
print(V̂_Ȳd)
print("\n")
# print(V̂_Ȳd,"\n")
return sqrt(V̂_Ȳd)
end
# vsrs = var_of_mean(x, design)


# vsrs2 = vsrs .* (psum-length(x)) ./ psum
# return sqrt(vsrs2)
# return sqrt( (1 - 1 / length(x) + 1 / design.sampsize) .* var(x) ./ length(x) )

# TODO: results not matching for `sem`
function svymean(x::AbstractVector , design::SimpleRandomSample, _)
return DataFrame(mean = mean(x), sem = sem(x, design::SimpleRandomSample))
function svymean(x::AbstractVector , design::SimpleRandomSample, weights)
return DataFrame(mean = mean(x), sem = sem_svyby(x, design::SimpleRandomSample, weights))
end

""" mean for Categorical variables
Expand Down
7 changes: 3 additions & 4 deletions src/svyquantile.jl
Original file line number Diff line number Diff line change
Expand Up @@ -15,12 +15,11 @@ julia> svyquantile(:enroll, srs, 0.5)
```
"""
# TODO: modify for SimpleRandomSample
function svyquantile(var, design::SimpleRandomSample, q)
function svyquantile(var, design::SimpleRandomSample, q; kwargs...)
x = design.data[!, var]
w = design.data.probs
df = DataFrame(tmp = quantile(Float32.(x), weights(w), q))
# w = design.data.probs
df = DataFrame(tmp = quantile(Float32.(x), q; kwargs...)) # Define Lumley quantile
rename!(df, :tmp => Symbol(string(q) .* "th percentile"))

return df
end

Expand Down
14 changes: 8 additions & 6 deletions src/svytotal.jl
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ julia> svytotal(:enroll, srs)
```
"""
function var_of_total(x::Symbol, design::SimpleRandomSample)
return design.popsize^2 * design.fpc / design.sampsize * var(design.data[!, x])
return design.popsize^2 * design.fpc * var(design.data[!, x]) / design.sampsize
end

"""
Expand All @@ -44,11 +44,13 @@ function svytotal(x::Symbol, design::SimpleRandomSample)
print("Yolo")
return combine(gdf, (:x,design) => total => :total, (:x , design) => se_tot => :se_total )
end
# total = design.pop_size * mean(design.data[!, variable]) # This also returns correct answer and is more simpler to understand than wsum
total = wsum(design.data[!, x] , weights(design.data.weights) )
total = design.popsize * mean(design.data[!, x]) # This also returns correct answer and is more simpler to understand than wsum
# @show("\n",total)
# @show(sum())
# total = wsum(design.data[!, x] , design.data.weights )
return DataFrame(total = total , se_total = se_tot(x, design::SimpleRandomSample))
end

function svytotal(x::Symbol, design::svydesign)
# TODO
end
# function svytotal(x::Symbol, design::svydesign)
# # TODO
# end
26 changes: 15 additions & 11 deletions test/SurveyDesign.jl
Original file line number Diff line number Diff line change
Expand Up @@ -3,26 +3,30 @@
apisrs_original = load_data("apisrs")
apisrs = copy(apisrs_original)

srs = SimpleRandomSample(apisrs)
@test srs.data.weights == ones(size(apisrs_original, 1))
@test srs.data.weights == srs.data.probs
srs = SimpleRandomSample(apisrs, popsize = apisrs.fpc)
# @test srs.data.weights == ones(size(apisrs_original, 1))
@test srs.data.weights == 1 ./ srs.data.probs # weights should be inverse of probs
# THIS NEEDS TO BE CHANGED WHEN `sampsize` IS UPDATED
# Write meaningful test for sample_size later
@test srs.sampsize > 0

srs_freq = SimpleRandomSample(apisrs; weights = fill(0.3, size(apisrs_original, 1)))
@test srs_freq.data.weights[1] == 0.3
# 16.06.22 shikhar add - this test should return error as popsize should be given if sampsize is given (for now)
# srs_freq = SimpleRandomSample(apisrs; weights = fill(0.3, size(apisrs_original, 1)))
# 16.06.22 shikhar add - this test fixes above test
srs_freq = SimpleRandomSample(apisrs; popsize = apisrs.fpc , weights = fill(0.3, size(apisrs_original, 1)))
@test srs_freq.data.weights[1] == 30.97
@test srs_freq.data.weights == 1 ./ srs_freq.data.probs

# This works but isnt what user is expecting
srs_prob = SimpleRandomSample(apisrs; probs = fill(0.3, size(apisrs_original, 1)))
@test srs_prob.data.probs[1] == 1.0
@test srs_prob.data.weights == ones(size(apisrs_original, 1))
@test srs_prob.data.probs[1] == 0.3
# @test srs_prob.data.weights == ones(size(apisrs_original, 1))


# StratifiedSample tests
apistrat = load_data("apistrat")
apistrat_copy = copy(apistrat)
# apistrat = load_data("apistrat")
# apistrat_copy = copy(apistrat)

strat = StratifiedSample(apistrat_copy, apistrat_copy.stype)
@test strat.data.strata == apistrat.stype
# strat = StratifiedSample(apistrat_copy, apistrat_copy.stype)
# @test strat.data.strata == apistrat.stype
end
40 changes: 20 additions & 20 deletions test/dimnames.jl
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
apisrs = load_data("apisrs")
# make a copy to not modify the original dataset
apisrs_copy = copy(apisrs)
srs_new = SimpleRandomSample(apisrs_copy)
srs_new = SimpleRandomSample(apisrs_copy,ignorefpc = true)
# make a new copy to use for the old design
apisrs_copy = copy(apisrs)
srs_old = svydesign(id = :1, data = apisrs)
Expand All @@ -21,23 +21,23 @@
@test dimnames(srs_old)[2] == colnames(srs_old)

########## Stratified sampling tests
apistrat = load_data("apistrat")
# make a copy to not modify the original dataset
apistrat_copy = copy(apistrat)
strat_new = StratifiedSample(apistrat_copy, apistrat_copy.stype)
# make a new copy to use for the old design
apistrat_copy = copy(apistrat)
strat_old = svydesign(id = :1, data = apistrat_copy, strata = :stype)
# `dim`
@test dim(strat_new)[1] == dim(strat_old)[1]
@test dim(strat_new)[2] == 45
@test dim(strat_old)[2] == 45
# `colnames`
@test length(colnames(strat_new)) == dim(strat_new)[2]
@test length(colnames(strat_old)) == dim(strat_old)[2]
# `dimnames`
@test length(dimnames(strat_new)[1]) == parse(Int, last(dimnames(strat_new)[1]))
@test dimnames(strat_new)[2] == colnames(strat_new)
@test length(dimnames(strat_old)[1]) == parse(Int, last(dimnames(strat_old)[1]))
@test dimnames(strat_old)[2] == colnames(strat_old)
# apistrat = load_data("apistrat")
# # make a copy to not modify the original dataset
# apistrat_copy = copy(apistrat)
# strat_new = StratifiedSample(apistrat_copy, apistrat_copy.stype)
# # make a new copy to use for the old design
# apistrat_copy = copy(apistrat)
# strat_old = svydesign(id = :1, data = apistrat_copy, strata = :stype)
# # `dim`
# @test dim(strat_new)[1] == dim(strat_old)[1]
# @test dim(strat_new)[2] == 45
# @test dim(strat_old)[2] == 45
# # `colnames`
# @test length(colnames(strat_new)) == dim(strat_new)[2]
# @test length(colnames(strat_old)) == dim(strat_old)[2]
# # `dimnames`
# @test length(dimnames(strat_new)[1]) == parse(Int, last(dimnames(strat_new)[1]))
# @test dimnames(strat_new)[2] == colnames(strat_new)
# @test length(dimnames(strat_old)[1]) == parse(Int, last(dimnames(strat_old)[1]))
# @test dimnames(strat_old)[2] == colnames(strat_old)
end
15 changes: 8 additions & 7 deletions test/svyboxplot.jl
Original file line number Diff line number Diff line change
@@ -1,17 +1,18 @@
@testset "svyboxplot.jl" begin
# StratifiedSample
apisrs = load_data("apisrs")
srs = SimpleRandomSample(apisrs)
srs = SimpleRandomSample(apisrs, popsize = apisrs.fpc)
# srs = SimpleRandomSample(apisrs)
bp = svyboxplot(srs, :stype, :enroll; weights = :pw)

@test bp.grid[1].entries[1].positional[2] == srs.data[!, :enroll]
@test bp.grid[1].entries[1].named[:weights] == srs.data[!, :pw]

# StratifiedSample
apistrat = load_data("apistrat")
strat = StratifiedSample(apistrat, apistrat.stype)
bp = svyboxplot(strat, :stype, :enroll; weights = :pw)
# # StratifiedSample
# apistrat = load_data("apistrat")
# strat = StratifiedSample(apistrat, apistrat.stype)
# bp = svyboxplot(strat, :stype, :enroll; weights = :pw)

@test bp.grid[1].entries[1].positional[2] == strat.data[!, :enroll]
@test bp.grid[1].entries[1].named[:weights] == strat.data[!, :pw]
# @test bp.grid[1].entries[1].positional[2] == strat.data[!, :enroll]
# @test bp.grid[1].entries[1].named[:weights] == strat.data[!, :pw]
end
50 changes: 25 additions & 25 deletions test/svyhist.jl
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@

# SimpleRandomSample
apisrs = load_data("apisrs")
srs = SimpleRandomSample(apisrs)
srs = SimpleRandomSample(apisrs,ignorefpc = true)

h = svyhist(srs, :enroll)
@test h.grid[1].entries[1].positional[2] |> length == 21
Expand All @@ -19,28 +19,28 @@
@test h.grid[1].entries[1].positional[2] |> length == 3

# StratifiedSample
apistrat = load_data("apistrat")
dstrat = svydesign(data = apistrat, id = :1, strata = :stype, weights = :pw, fpc = :fpc)
apistrat = load_data("apistrat")
strat = StratifiedSample(apistrat, apistrat.stype)

h = svyhist(strat, :enroll)
@test h.grid[1].entries[1].positional[2] |> length == 16
h = svyhist(dstrat, :enroll)
@test h.grid[1].entries[1].positional[2] |> length == 16

h = svyhist(strat, :enroll, 9)
@test h.grid[1].entries[1].positional[2] |> length == 7
h = svyhist(dstrat, :enroll, 9)
@test h.grid[1].entries[1].positional[2] |> length == 7

h = svyhist(strat, :enroll, Survey.sturges)
@test h.grid[1].entries[1].positional[2] |> length == 7
h = svyhist(dstrat, :enroll, Survey.sturges)
@test h.grid[1].entries[1].positional[2] |> length == 7

h = svyhist(strat, :enroll, [0, 1000, 2000, 3000])
@test h.grid[1].entries[1].positional[2] |> length == 3
h = svyhist(dstrat, :enroll, [0, 1000, 2000, 3000])
@test h.grid[1].entries[1].positional[2] |> length == 3
# apistrat = load_data("apistrat")
# dstrat = svydesign(data = apistrat, id = :1, strata = :stype, weights = :pw, fpc = :fpc)
# apistrat = load_data("apistrat")
# strat = StratifiedSample(apistrat, apistrat.stype)

# h = svyhist(strat, :enroll)
# @test h.grid[1].entries[1].positional[2] |> length == 16
# h = svyhist(dstrat, :enroll)
# @test h.grid[1].entries[1].positional[2] |> length == 16

# h = svyhist(strat, :enroll, 9)
# @test h.grid[1].entries[1].positional[2] |> length == 7
# h = svyhist(dstrat, :enroll, 9)
# @test h.grid[1].entries[1].positional[2] |> length == 7

# h = svyhist(strat, :enroll, Survey.sturges)
# @test h.grid[1].entries[1].positional[2] |> length == 7
# h = svyhist(dstrat, :enroll, Survey.sturges)
# @test h.grid[1].entries[1].positional[2] |> length == 7

# h = svyhist(strat, :enroll, [0, 1000, 2000, 3000])
# @test h.grid[1].entries[1].positional[2] |> length == 3
# h = svyhist(dstrat, :enroll, [0, 1000, 2000, 3000])
# @test h.grid[1].entries[1].positional[2] |> length == 3
end
9 changes: 7 additions & 2 deletions test/svymean.jl
Original file line number Diff line number Diff line change
@@ -1,10 +1,15 @@
@testset "svymean.jl" begin
# SimpleRandomSample tests
apisrs = load_data("apisrs")
srs = SimpleRandomSample(apisrs)
srs = SimpleRandomSample(apisrs, popsize = apisrs.fpc)
@test svymean(:api00, srs).mean[1] == 656.585
@test svymean(:api00, srs).sem[1] 9.40277217088064
@test svymean(:api00, srs).sem[1] 9.249722039282807

srs = SimpleRandomSample(apisrs, ignorefpc = true)
@test svymean(:api00, srs).mean[1] == 656.585
@test svymean(:api00, srs).sem[1] 9.402772170880636

# Test with fpc

# StratifiedSample tests
# apistrat = load_data("apistrat")
Expand Down
2 changes: 1 addition & 1 deletion test/svyplot.jl
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
@testset "svyplot.jl" begin
# SimpleRandomSample
apisrs = load_data("apisrs")
srs = SimpleRandomSample(apisrs)
srs = SimpleRandomSample(apisrs,ignorefpc = true)
s = svyplot(srs, :api99, :api00)

@test s.grid[1].entries[1].named[:markersize] == srs.data.weights
Expand Down
Loading

0 comments on commit 5623a16

Please sign in to comment.