
multinomial sampling should rely on binomial #287

Open

YeungOnion wants to merge 5 commits into master from perf/multinomial-sampling

Conversation

YeungOnion (Contributor)

Addresses #183, but does not fully fix it, since the binomial sampler is still naive.

If we rely on binomial sampling, then we only need to optimize in one place. This also updates the testing_boiler macro to work with Multinomial; note that the tests are all written to use dynamically sized vectors and to rely on norms for approximate absolute equality.
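For readers following along, the decomposition being referred to can be sketched like this; it's a minimal NumPy illustration of the conditional-binomial technique, not the statrs code (the function name and structure are hypothetical):

```python
import numpy as np

def multinomial_via_binomial(n, p, rng):
    """Draw one Multinomial(n, p) sample as a chain of binomial draws.

    Conditional on the counts already assigned, the count for category i
    is distributed Binomial(remaining_n, p[i] / remaining_p), so a fast
    binomial sampler yields a fast multinomial sampler for free.
    """
    counts = np.zeros(len(p), dtype=np.int64)
    remaining_n, remaining_p = n, 1.0
    for i, p_i in enumerate(p[:-1]):
        # min() guards against floating-point drift pushing the ratio past 1.
        counts[i] = rng.binomial(remaining_n, min(p_i / remaining_p, 1.0))
        remaining_n -= counts[i]
        remaining_p -= p_i
    counts[-1] = remaining_n  # whatever is left falls in the last category
    return counts

rng = np.random.default_rng(0)
assert multinomial_via_binomial(1000, [0.2] * 5, rng).sum() == 1000
```

The "optimize once" point is that any speedup to the binomial sampler (e.g. a rejection-based method instead of naive Bernoulli summation) then carries over to the multinomial sampler automatically.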

There is also a stochastic test added, but I'm doubtful that both the test and the sampling are correct, because it doesn't pass often enough. (Perhaps my choice of parameters in the test doesn't converge closely enough to a normal distribution?) A second pair of eyes on this would be helpful.

codecov bot commented Sep 11, 2024

Codecov Report

Attention: Patch coverage is 57.93103% with 122 lines in your changes missing coverage. Please review.

Project coverage is 93.08%. Comparing base (993a4b5) to head (66b8d7e).
Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/distribution/multinomial.rs | 55.14% | 122 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #287      +/-   ##
==========================================
- Coverage   93.73%   93.08%   -0.66%     
==========================================
  Files          53       53              
  Lines       12010    12196     +186     
==========================================
+ Hits        11258    11353      +95     
- Misses        752      843      +91     


@YeungOnion (Contributor, Author)

Perhaps I need proper test statistics to assess whether the samples are likely drawn from the distribution...

@FreezyLemon (Contributor)

> notice how the tests are written to all use dynamically sized vectors

Hm... the static types (e.g. SMatrix, SVector) have more type safety, so I generally agree that the dynamically sized types need more testing. But I'm not sure about testing only the dynamic types; we might miss problems with specific nalgebra code paths, or even incorrect trait bounds that would exclude the static types from being usable with the statrs API altogether.

Looking at #165, this would also exclude basically all multivariate tests from no_std environments without an allocator.

Not 100% sure the existing boiler macro can cover everything... the nalgebra-based distributions are a fair bit different, after all. But I think that, e.g., new_from_nalgebra should be preferred over plain new, since it works for both kinds of matrix/vector.

@YeungOnion (Contributor, Author)

I agree that we should be testing our API across size generics, and that tests should work on no_std where we support it. I'll revert that.

Do you have any other ideas to verify that samples are correctly generated?

@YeungOnion added this to the 0.18 milestone on Sep 18, 2024
@FreezyLemon (Contributor)

> Do you have any other ideas to verify that samples are correctly generated?

Not really, to be honest. Beyond very rudimentary testing (checking that samples fall in the correct domain/range of values), you'd probably need to implement proper statistical test methods...
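For what it's worth, one step up from range checks is a Pearson chi-square goodness-of-fit test on the pooled category counts. A sketch, assuming samples are collected as an m-by-k array (this is not code from the PR):

```python
import numpy as np
from scipy import stats

def chisq_gof_multinomial(samples, n, p):
    """Pearson chi-square test that multinomial draws match (n, p).

    Pools all m draws of size n: the total count in category i should be
    close to m * n * p[i]. Note this only checks the pooled marginals;
    per-component two-sample tests (as used later in this thread) are stricter.
    """
    samples = np.asarray(samples)
    observed = samples.sum(axis=0)
    expected = samples.shape[0] * n * np.asarray(p)
    return stats.chisquare(observed, f_exp=expected)

rng = np.random.default_rng(0)
draws = rng.multinomial(1000, [0.2] * 5, size=10_000)
print(chisq_gof_multinomial(draws, 1000, [0.2] * 5))  # large p-value expected
```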

@YeungOnion (Contributor, Author)

I'll rely on other implementations of statistical tests and RV sampling to verify my implementation for now. Once we've got tests and verification for them, I'll consider bootstrapping features.

@YeungOnion force-pushed the perf/multinomial-sampling branch from 82501a1 to c40567c on September 24, 2024 04:11
I'm still unsure whether I'm sampling incorrectly or using a bad verification technique; this should rely on a common GoF test instead of my homebrew statistics.
@YeungOnion force-pushed the perf/multinomial-sampling branch from c40567c to 7b4180c on September 24, 2024 04:30
Also adds a comment demonstrating why the unwrap succeeds.
@YeungOnion force-pushed the perf/multinomial-sampling branch from 6b20e22 to 66b8d7e on September 24, 2024 17:53
@YeungOnion (Contributor, Author)

Tests now use both static and dynamic allocation. They aren't broken out into separate tests, but they also aren't interleaved through the test suite, so it shouldn't be a big lift to conditionally compile them for a no_std option.

@YeungOnion mentioned this pull request on Sep 24, 2024
@YeungOnion (Contributor, Author)

Certainly not a thorough analysis, but it's safe to say the implementation is likely not correct.

>>> stats.cramervonmises_2samp(scipy_mnom_gen(), data)  # scipy vs data
CramerVonMisesResult(statistic=[4.46216824 4.365037   4.57871893 4.39759045 4.64458355], pvalue=[4.92114127e-11 7.65537633e-11 3.19479998e-11 6.57013333e-11 2.73471246e-11])
>>> stats.cramervonmises_2samp(scipy_mnom_gen(), scipy_mnom_gen())  # scipy vs scipy
CramerVonMisesResult(statistic=[0.1209659  0.05345861 0.15851277 0.10590302 0.13809556], pvalue=[0.49155904 0.8550833  0.36467217 0.55674849 0.42800996])

Tested with n=1000, p=5*[0.2], where scipy and statrs each generated 10k samples.
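For reproducibility, the comparison above presumably looked something like the sketch below. scipy_mnom_gen is reconstructed hypothetically (the actual script isn't in the thread), data stands in for the statrs-generated samples, and the test is applied per component; the output shown earlier suggests a vectorized call, but an explicit loop is version-agnostic:

```python
import numpy as np
from scipy import stats

N, P, M = 1000, [0.2] * 5, 10_000
rng = np.random.default_rng()

def scipy_mnom_gen():
    # Hypothetical stand-in: M multinomial draws with n=N, p=P.
    return rng.multinomial(N, P, size=M)

# `data` should be the M samples produced by statrs, loaded from wherever
# the Rust side wrote them; a scipy draw is used here only as a placeholder.
data = scipy_mnom_gen()

ref = scipy_mnom_gen()
results = [stats.cramervonmises_2samp(ref[:, i], data[:, i]) for i in range(len(P))]
print([r.pvalue for r in results])  # small p-values indicate a mismatch
```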

@YeungOnion modified the milestones: 0.18, Future, 0.19 on Dec 3, 2024