Cleaning total and mean. Removing by and other files. #114

Merged
merged 8 commits into from
Dec 4, 2022
Changes from 3 commits
3 changes: 2 additions & 1 deletion docs/src/api.md
@@ -23,7 +23,8 @@ quantile

It is often necessary to estimate population parameters for sub-populations of interest. For example, you may have data on the heights of people, but want the average height of males and females separately.
```@docs
by
mean(x::Symbol, by::Symbol, design::SimpleRandomSample)
total(x::Symbol, by::Symbol, design::SimpleRandomSample)
```
```@docs
plot(design::AbstractSurveyDesign, x::Symbol, y::Symbol; kwargs...)
2 changes: 0 additions & 2 deletions src/Survey.jl
@@ -20,8 +20,6 @@ include("hist.jl")
include("plot.jl")
include("dimnames.jl")
include("boxplot.jl")
include("by.jl")
include("ht.jl")
include("show.jl")

export load_data
82 changes: 0 additions & 82 deletions src/by.jl

This file was deleted.

40 changes: 0 additions & 40 deletions src/ht.jl

This file was deleted.

180 changes: 100 additions & 80 deletions src/mean.jl
@@ -1,23 +1,3 @@
# SimpleRandomSample

"""
var_of_mean(x, design)

Compute the variance of the mean for the variable `x`.
"""
function var_of_mean(x::Symbol, design::SimpleRandomSample)
return design.fpc * var(design.data[!, x]) / design.sampsize
end

"""
sem(x, design)

Compute the standard error of the mean for the variable `x`.
"""
function sem(x::Symbol, design::SimpleRandomSample)
return sqrt(var_of_mean(x, design))
end

"""
mean(x, design)
Estimate the population mean of a variable of a simple random sample, and the corresponding standard error.
@@ -29,99 +9,139 @@ julia> srs = SimpleRandomSample(apisrs;popsize=:fpc);

julia> mean(:enroll, srs)
1×2 DataFrame
Row │ mean sem
Row │ mean SE
│ Float64 Float64
─────┼──────────────────
1 │ 584.61 27.3684
```
"""
function mean(x::Symbol, design::SimpleRandomSample)
function se(x::Symbol, design::SimpleRandomSample)
variance = design.fpc * Statistics.var(design.data[!, x]) / design.sampsize
return sqrt(variance)
end
if isa(design.data[!, x], CategoricalArray)
gdf = groupby(design.data, x)
p = combine(gdf, nrow => :counts)
p.mean = p.counts ./ sum(p.counts)
# variance of proportion
p.var = design.fpc .* p.mean .* (1 .- p.mean) ./ (design.sampsize - 1)
p.sem = sqrt.(p.var)
p.se = sqrt.(p.var)
return select(p, Not([:counts, :var]))
end
return DataFrame(mean=mean(design.data[!, x]), sem=sem(x, design))
return DataFrame(mean=mean(design.data[!, x]), SE=se(x, design))
end
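
As a reading aid, the standard errors computed in this method can be written out as below. This is a sketch that assumes `design.fpc` holds the finite population correction $1 - n/N$ and `design.sampsize` holds the sample size $n$; neither definition is shown in this diff. Here $s^2$ is the sample variance returned by `Statistics.var`, and $\hat{p}_k$ is the sample share of category $k$ in the `CategoricalArray` branch.

$$
\mathrm{SE}(\bar{y}) = \sqrt{\mathrm{fpc}\cdot\frac{s^2}{n}},
\qquad
\mathrm{SE}(\hat{p}_k) = \sqrt{\mathrm{fpc}\cdot\frac{\hat{p}_k\,(1-\hat{p}_k)}{n-1}}
$$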

"""
mean(x, design)
Estimate the population mean of a variable of a simple random sample, and the corresponding standard error.

```jldoctest
julia> apisrs = load_data("apisrs");

julia> srs = SimpleRandomSample(apisrs;popsize=:fpc);

julia> mean([:api00, :api99], srs)
2×3 DataFrame
Row │ names mean SE
│ String Float64 Float64
─────┼──────────────────────────
1 │ api00 656.585 9.24972
2 │ api99 624.685 9.5003
```
"""
function mean(x::Vector{Symbol}, design::SimpleRandomSample)
means_list = []
for i in x
push!(means_list, mean(i, design))
end
df = reduce(vcat, means_list)
df = reduce(vcat, [mean(i, design) for i in x])
insertcols!(df, 1, :names => String.(x))
return df
end

"""
Inner method for `by`
"""
# Inner methods for `by`
function sem_by(x::AbstractVector, design::SimpleRandomSample)
# domain size
dsize = length(x)
# sample size
ssize = design.sampsize
# fpc
fpc = design.fpc
# variance of the mean
variance = (dsize / ssize)^(-2) / ssize * fpc * ((dsize - 1) / (ssize - 1)) * var(x)
# return the standard error
return sqrt(variance)
end
Calculates domain mean.
```jldoctest
julia> using Survey;

julia> srs = load_data("apisrs");

julia> srs = SimpleRandomSample(srs; popsize = :fpc);

julia> mean(:api00, :cname, srs) |> first
DataFrameRow
Row │ cname mean SE
│ String15 Float64 Float64
─────┼────────────────────────────
1 │ Kern 573.6 42.8026
```
"""
Inner method for `by` for SimpleRandomSample
"""
function mean(x::AbstractVector, design::SimpleRandomSample, weights)
return DataFrame(mean=Statistics.mean(x), sem=sem_by(x, design))
function mean(x::Symbol, by::Symbol, design::SimpleRandomSample)
function domain_mean(x::AbstractVector, design::SimpleRandomSample, weights)
function se(x::AbstractVector, design::SimpleRandomSample)
nd = length(x) # domain size
n = design.sampsize
fpc = design.fpc
variance = (nd / n)^(-2) / n * fpc * ((nd - 1) / (n - 1)) * var(x)
return sqrt(variance)
end
return DataFrame(mean=Statistics.mean(x), SE=se(x, design))
end
gdf = groupby(design.data, by)
combine(gdf, [x, :weights] => ((a, b) -> domain_mean(a, design, b)) => AsTable)
end
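
The nested `se` function above appears to implement the following domain variance under simple random sampling. The symbols are labels chosen here for illustration: $n_d$ is the number of sampled units in the domain (`nd`), $n$ is the full sample size, $s_d^2$ is the sample variance within the domain, and $\mathrm{fpc}$ is assumed to be the finite population correction stored in `design.fpc`.

$$
\widehat{V}(\bar{y}_d)
= \left(\frac{n_d}{n}\right)^{-2}\cdot\frac{1}{n}\cdot\mathrm{fpc}\cdot\frac{n_d-1}{n-1}\cdot s_d^2
= \mathrm{fpc}\cdot\frac{n}{n-1}\cdot\frac{\sum_{i\in d}\left(y_i-\bar{y}_d\right)^2}{n_d^{2}},
\qquad
\mathrm{SE}(\bar{y}_d)=\sqrt{\widehat{V}(\bar{y}_d)}
$$

The second form follows because $(n_d-1)\,s_d^2$ is the within-domain sum of squared deviations.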

"""
Inner method for `by` for StratifiedSample
Calculates the domain mean and its standard error, based on example 10.3.3 on p. 394 of Sarndal (1992).

```jldoctest
julia> using Survey;

julia> strat = load_data("apistrat");

julia> dstrat = StratifiedSample(strat, :stype; popsize = :fpc);

julia> mean(:api00, :cname, dstrat) |> first
DataFrameRow
Row │ cname mean SE
│ String15 Float64 Float64
─────┼───────────────────────────────
1 │ Los Angeles 633.511 21.3912
```
"""
function mean(x::AbstractVector, popsize::AbstractVector, sampsize::AbstractVector, sampfraction::AbstractVector, strata::AbstractVector)
df = DataFrame(x=x, popsize=popsize, sampsize=sampsize, sampfraction=sampfraction, strata=strata)
nsdh = []
substrata_domain_totals = []
Nh = []
nh = []
fh = []
ȳsdh = []
sigma_ȳsh_squares = []
grouped_frame = groupby(df, :strata)
for each_strata in keys(grouped_frame)
nsh = nrow(grouped_frame[each_strata])#, nrow=>:nsdh).nsdh
push!(nsdh, nsh)
substrata_domain_total = sum(grouped_frame[each_strata].x)
ȳdh = Statistics.mean(grouped_frame[each_strata].x)
push!(ȳsdh, ȳdh)
push!(substrata_domain_totals, substrata_domain_total)
popsizes = first(grouped_frame[each_strata].popsize)
push!(Nh, popsizes)
sampsizes = first(grouped_frame[each_strata].sampsize)
push!(nh, sampsizes)
sampfrac = first(grouped_frame[each_strata].sampfraction)
push!(fh, sampfrac)
push!(sigma_ȳsh_squares, sum((grouped_frame[each_strata].x .- ȳdh) .^ 2))
function mean(x::Symbol, by::Symbol, design::StratifiedSample)
function domain_mean(x::AbstractVector, popsize::AbstractVector, sampsize::AbstractVector, sampfraction::AbstractVector, strata::AbstractVector)
df = DataFrame(x=x, popsize=popsize, sampsize=sampsize, sampfraction=sampfraction, strata=strata)
function calculate_components(x, popsize, sampsize, sampfraction)
return DataFrame(nsdh = length(x), nsh = length(x), substrata_domain_totals = sum(x), ȳsdh = mean(x), Nh = first(popsize), nh = first(sampsize),fh = first(sampfraction), sigma_ȳsh_squares = sum((x .- mean(x)).^2))
end
components = combine(groupby(df, :strata), [:x, :popsize, :sampsize, :sampfraction] => calculate_components => AsTable)
Review comment from a contributor:
Suggested change
components = combine(groupby(df, :strata), [:x, :popsize, :sampsize, :sampfraction] => calculate_components => AsTable)
components = combine(groupby(df, :strata), [:x, :popsize, :sampsize, :sampfraction]
=> calculate_components => AsTable)

domain_mean = sum(components.Nh .* components.substrata_domain_totals ./ components.nh) / sum(components.Nh .* components.nsdh ./ components.nh)
Review comment from a contributor:
Suggested change
domain_mean = sum(components.Nh .* components.substrata_domain_totals ./ components.nh) / sum(components.Nh .* components.nsdh ./ components.nh)
domain_mean = sum(components.Nh .* components.substrata_domain_totals ./ components.nh) /
sum(components.Nh .* components.nsdh ./ components.nh)

pdh = components.nsdh ./ components.nh
N̂d = sum(components.Nh .* pdh)
domain_var = sum(components.Nh .^ 2 .* (1 .- components.fh) .* (components.sigma_ȳsh_squares .+ (components.nsdh .* (1 .- pdh) .* (components.ȳsdh .- domain_mean) .^ 2)) ./ (components.nh .* (components.nh .- 1))) ./ N̂d .^ 2
Review comment from a contributor:
This is probably not great, but I don't know how else to make this line shorter. The formatter does something weird with it, so this is my suggestion:

Suggested change
domain_var = sum(components.Nh .^ 2 .* (1 .- components.fh) .* (components.sigma_ȳsh_squares .+ (components.nsdh .* (1 .- pdh) .* (components.ȳsdh .- domain_mean) .^ 2)) ./ (components.nh .* (components.nh .- 1))) ./ N̂d .^ 2
domain_var = sum(components.Nh .^ 2 .* (1 .- components.fh) .* (components.sigma_ȳsh_squares
.+ (components.nsdh .* (1 .- pdh) .* (components.ȳsdh .- domain_mean) .^ 2))
./ (components.nh .* (components.nh .- 1))) ./ N̂d .^ 2

domain_mean_se = sqrt(domain_var)
return DataFrame(mean=domain_mean, SE=domain_mean_se)
end
domain_mean = sum(Nh .* substrata_domain_totals ./ nh) / sum(Nh .* nsdh ./ nh)
pdh = nsdh ./ nh
N̂d = sum(Nh .* pdh)
domain_var = sum(Nh .^ 2 .* (1 .- fh) .* (sigma_ȳsh_squares .+ (nsdh .* (1 .- pdh) .* (ȳsdh .- domain_mean) .^ 2)) ./ (nh .* (nh .- 1))) ./ N̂d .^ 2
domain_mean_se = sqrt(domain_var)
return DataFrame(domain_mean=domain_mean, domain_mean_se=domain_mean_se)
gdf_domain = groupby(design.data, by)
combine(gdf_domain, [x, :popsize,:sampsize,:sampfraction, design.strata] => domain_mean => AsTable)
end
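
Written out, the new `domain_mean` closure appears to compute the estimator below. The notation is chosen here to mirror the column names in `components` and is not taken from the PR: for stratum $h$, $N_h$ and $n_h$ are the population and sample sizes, $f_h$ is the sampling fraction, $n_{dh}$ is the number of sampled domain units (`nsdh`), $t_{dh}$ is their total (`substrata_domain_totals`), $\bar{y}_{dh}$ is their mean (`ȳsdh`), and $SS_{dh}$ is their sum of squared deviations (`sigma_ȳsh_squares`).

$$
\bar{y}_d = \frac{\sum_h N_h\, t_{dh} / n_h}{\sum_h N_h\, n_{dh} / n_h},
\qquad
p_{dh} = \frac{n_{dh}}{n_h},
\qquad
\widehat{N}_d = \sum_h N_h\, p_{dh}
$$

$$
\widehat{V}(\bar{y}_d) = \frac{1}{\widehat{N}_d^{\,2}} \sum_h \frac{N_h^2\,(1-f_h)}{n_h\,(n_h-1)}
\left[\, SS_{dh} + n_{dh}\,(1-p_{dh})\,\left(\bar{y}_{dh}-\bar{y}_d\right)^2 \right],
\qquad
\mathrm{SE}(\bar{y}_d) = \sqrt{\widehat{V}(\bar{y}_d)}
$$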

"""
Estimate the population mean of a variable of a stratified sample, and the corresponding standard error.
Review comment from a contributor:
Again this mean should not be documented, since it's the same functionality as above (starting with line 1). And I think this should be grouped with the first two methods.

Ref: Cochran (1977)

```jldoctest
julia> using Survey;

julia> strat = load_data("apistrat");

julia> dstrat = StratifiedSample(strat, :stype; popsize = :fpc);

julia> mean(:api00, dstrat)
1×2 DataFrame
Row │ mean SE
│ Float64 Float64
─────┼──────────────────
1 │ 662.287 9.40894

```
"""
function mean(x::Symbol, design::StratifiedSample)
if x == design.strata
@@ -149,7 +169,7 @@ function mean(x::Symbol, design::StratifiedSample)
s²ₕ = combine(gdf, x => var => :s²h).s²h
V̂Ȳ̂ = sum((Wₕ .^ 2) .* (1 .- fₕ) .* s²ₕ ./ nₕ)
SE = sqrt(V̂Ȳ̂)
return DataFrame(Ȳ̂=Ȳ̂, SE=SE)
return DataFrame(mean=Ȳ̂, SE=SE)
end
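
The variance line visible in this hunk corresponds to the expression below. The definitions of `Wₕ` and `fₕ` sit in the collapsed part of the diff, so this sketch assumes the usual stratified-sampling meanings: $W_h = N_h/N$ is the stratum weight and $f_h = n_h/N_h$ is the stratum sampling fraction. $s_h^2$ is the within-stratum sample variance and $n_h$ the stratum sample size.

$$
\widehat{V}\left(\hat{\bar{Y}}\right) = \sum_h W_h^2\,(1-f_h)\,\frac{s_h^2}{n_h},
\qquad
\mathrm{SE} = \sqrt{\widehat{V}\left(\hat{\bar{Y}}\right)}
$$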

function mean(::Bool; x::Symbol, design::StratifiedSample)
3 changes: 0 additions & 3 deletions src/poststratify.jl

This file was deleted.

14 changes: 0 additions & 14 deletions src/ratio.jl

This file was deleted.
