CategoricalArrays ("Factor") support as multiple dispatch functions and testing benchmarking #52

Closed
smishr opened this issue Sep 5, 2022 · 9 comments
Labels
bug Something isn't working

Comments

@smishr
Contributor

smishr commented Sep 5, 2022

Add multiple dispatch methods for columns of type CategoricalArray in the dataset, so that they behave the way factor variables do in R.

@smishr smishr changed the title Add CategoricalArrays ("Factor") support, multiple dispatch Add CategoricalArrays ("Factor") support, behaviour like R Sep 5, 2022
@iuliadmtru
Contributor

iuliadmtru commented Oct 7, 2022

I think this was solved with PR #58, right?

@smishr
Contributor Author

smishr commented Nov 1, 2022

Yes, some CategoricalArray support has been added for SimpleRandomSample and StratifiedSample, and it achieves slightly faster groupby times. We still need thorough testing and benchmarking to show that, as the number of stratification levels increases, storing the strata vector as a CategoricalArray performs better than storing it as a StringX type.

@smishr
Contributor Author

smishr commented Nov 9, 2022

I think it would be great to turn the CategoricalArray enhancements that were added inside the if-else ladders of svymean and svytotal into multiple dispatch methods.

So instead of doing elseif isa(x, Symbol) && isa(design.data[!, x], CategoricalArray) inside svymean(x::Symbol, design::StratifiedSample), we can express those conditions through multiple dispatch, with a separate function for each case, for better readability.
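As a rough illustration of the idea, here is a minimal sketch that replaces an isa-based branch with dispatch on the column type. The name colsummary and its return columns are hypothetical and are not the package's API; the statistics are unweighted and ignore the survey design, so this only shows the dispatch pattern.

```julia
using CategoricalArrays, DataFrames, Statistics

# Hypothetical entry point: fetch the column, then let dispatch choose the method.
colsummary(x::Symbol, df::AbstractDataFrame) = colsummary(df[!, x], x)

# Numeric columns: unweighted mean and standard error of the mean.
function colsummary(col::AbstractVector{<:Real}, name::Symbol)
    return DataFrame(mean = mean(col), sem = std(col) / sqrt(length(col)))
end

# Categorical columns: counts and proportions per level.
function colsummary(col::CategoricalArray, name::Symbol)
    out = combine(groupby(DataFrame(name => col), name), nrow => :counts)
    out.proportion = out.counts ./ length(col)
    return out
end

# Usage with a small made-up table.
df = DataFrame(enroll = [100.0, 200.0, 300.0],
               stype = categorical(["E", "H", "E"]))
colsummary(:enroll, df)   # dispatches to the numeric method
colsummary(:stype, df)    # dispatches to the CategoricalArray method
```

A column of any other type (for example plain Strings) would raise a MethodError here, which makes unsupported cases explicit instead of falling through an else branch.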

@smishr smishr changed the title Add CategoricalArrays ("Factor") support, behaviour like R CategoricalArrays ("Factor") support as multiple dispatch functions Nov 9, 2022
@smishr smishr changed the title CategoricalArrays ("Factor") support as multiple dispatch functions CategoricalArrays ("Factor") support as multiple dispatch functions and testing benchmarking Nov 9, 2022
@smishr
Contributor Author

smishr commented Nov 9, 2022

Further, we need to quantify and benchmark the improvement from grouping by CategoricalArrays instead of Strings (which would be the naive default for a categorical variable).
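One possible way to set up such a comparison is sketched below. The data sizes, the column names and the helper total are assumptions made for illustration, and @elapsed gives only a single rough timing; a proper benchmark would repeat the measurement (for example with BenchmarkTools) over a range of level counts.

```julia
using CategoricalArrays, DataFrames, Random

Random.seed!(1)
n, nlevels = 1_000_000, 1_000              # assumed sizes, vary these
strata = string.("stratum_", rand(1:nlevels, n))
y = rand(n)

df_str = DataFrame(strata = strata, y = y)               # String strata
df_cat = DataFrame(strata = categorical(strata), y = y)  # CategoricalArray strata

# Hypothetical helper: a grouped total, standing in for the real estimator.
total(df) = combine(groupby(df, :strata), :y => sum => :total)

total(df_str); total(df_cat)               # warm-up runs to exclude compilation time
t_str = @elapsed total(df_str)
t_cat = @elapsed total(df_cat)
println("String: ", t_str, " s; CategoricalArray: ", t_cat, " s")
```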

@iuliadmtru
Contributor

svymean and svytotal give a different output for CategoricalArray input than for Symbol input:

julia> apisrs = load_data("apisrs");

julia> srs = SimpleRandomSample(apisrs; weights = :pw);

julia> srs.data.stype = categorical(srs.data.stype);

julia> svymean(:enroll, srs)
1×2 DataFrame
 Row │ mean     sem
     │ Float64  Float64
─────┼──────────────────
   1 │  584.61  27.3684

julia> svymean(:stype, srs)
3×5 DataFrame
 Row │ stype  counts  proportion  var          se
     │ Cat…   Int64   Float64     Float64      Float64
─────┼───────────────────────────────────────────────────
   1 │ E         142       0.71   0.00100126   0.0316428
   2 │ H          25       0.125  0.000531876  0.0230624
   3 │ M          33       0.165  0.000669982  0.025884

Also, the standard error for the CategoricalArray method doesn't exactly match R:

> library(survey)
> data(api)
> srs <- svydesign(id = ~1, weights = ~pw, data = apistrat)
> svymean(~stype, srs)
       mean     SE
stypeE 0.71376 0.0291
stypeH 0.12189 0.0177
stypeM 0.16435 0.0229
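For reference, the var and se columns in the Julia output above appear to be consistent with the unweighted simple random sampling formula with a finite population correction, var = p(1 - p)/(n - 1) * (1 - n/N). The sketch below checks this arithmetic with n = 200 (the sum of the counts column) and N = 6194; N is an assumption that is not stated in this thread.

```julia
# Check the reported var/se values against the SRS formula with fpc.
n = 200                       # sample size, sum of the counts column
N = 6194                      # assumed population size, not given in the thread
p = [0.71, 0.125, 0.165]      # proportions for E, H, M from the Julia output

v  = p .* (1 .- p) ./ (n - 1) .* (1 - n / N)
se = sqrt.(v)

println(v)                    # compare with the var column
println(se)                   # compare with the se column
```

Note that the R call in this comment uses apistrat with weights, while the Julia call uses apisrs, so the two outputs are not estimates of the same design.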

@smishr
Contributor Author

smishr commented Nov 17, 2022

Okay, I'll have a look.

@smishr
Contributor Author

smishr commented Nov 29, 2022

bump

@smishr smishr added this to the 0.2.0 release milestone Nov 29, 2022
@smishr smishr added the bug Something isn't working label Nov 29, 2022
@ayushpatnaikgit
Member

> srs <- svydesign(id = ~1, data = apistrat, fpc = ~fpc)
> svymean(~stype, srs)
           mean     SE
stypeE 0.832972 0.0194
stypeH 0.071126 0.0109
stypeM 0.095902 0.0144
> srs <- svydesign(id = ~1, weights = ~pw, data = apistrat, fpc = ~fpc)
> svymean(~stype, srs)
          mean     SE
stypeE 0.71376 0.0285
stypeH 0.12189 0.0173
stypeM 0.16435 0.0224

These two also differ, both from our Julia result and from each other. They are also related to #93. It seems that R doesn't derive weights from fpc.

@smishr
Copy link
Contributor Author

smishr commented Jan 4, 2023

Closing, as the codebase has changed quite a lot.

3 participants