Add geometric and hypergeometric distributions #1062
Conversation
Thanks, these look like good additions! We are using a rejection algorithm by Kachitvichyanukul and Schmeiser from 1988 for sampling the binomial distribution, so we can hopefully find their hypergeometric variant as well. |
I was able to find their article on hypergeometric sampling on ResearchGate. It seems the PDF is available there and complete. |
Awesome, thank you! I'll get to work on that tomorrow. |
I think it's worth keeping the full name. |
No problem, I'll do the rename first thing in the morning. My thought process was that it would pair better with

By the way, do you folks think it's worth having an optimized implementation for

```rust
pub struct Geometric1in2(); // WTH should this be called, though?

impl Distribution<u64> for Geometric1in2 {
    fn sample<R: Rng>(&self, rng: &mut R) -> u64 {
        let mut result = 0;
        loop {
            let x = rng.gen::<u64>().leading_ones() as u64;
            result += x;
            if x < 64 { break; }
        }
        result
    }
}
```

I didn't test or benchmark this yet, but I've used similar code before (without the loop, which makes this exact), and in the vast majority of cases, this is just a couple of extra cycles on top of the

I'd be happy to add both of those as well. Seemed like too much scope creep for an unprompted PR, though. (And again, I'm not sure it's actually worth having, so...) |
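To illustrate the leading-ones trick from the snippet above outside the `rand` crate, here is a minimal, standalone sketch. The names `geometric_half` and `next_word` are hypothetical; `next_word` stands in for `rng.gen::<u64>()`, so the function can be driven deterministically.

```rust
// Hypothetical standalone version of the Geometric(1/2) sampler sketched
// above. Each leading one bit counts as a "failure" (probability 1/2);
// the first zero bit is the "success" that ends the trial. The loop
// handles the rare all-ones word, which is what makes the sampler exact.
fn geometric_half(mut next_word: impl FnMut() -> u64) -> u64 {
    let mut result = 0;
    loop {
        let x = next_word().leading_ones() as u64;
        result += x;
        // Fewer than 64 leading ones means a zero bit was seen.
        if x < 64 {
            break;
        }
    }
    result
}

fn main() {
    // A word starting with bits 110... has two leading ones -> sample of 2.
    let mut words = [0xC000_0000_0000_0000u64].iter().copied();
    let sample = geometric_half(|| words.next().unwrap());
    println!("{sample}"); // prints 2
}
```

Driving it with an all-ones word followed by zero shows the loop in action: the sampler consumes both words and returns 64.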
For the geometric distribution with small values of p, there is an exact and efficient algorithm published in Bringmann, K. and Friedrich, T., 2013, July. "Exact and efficient generation of geometric random variates and random graphs", in International Colloquium on Automata, Languages, and Programming (pp. 267–278). However, the algorithm is not exactly trivial, and its description there is not necessarily programmer-friendly. EDIT (Oct. 27): Nevertheless, here is my implementation of this algorithm in Python. |
Depends on whether the performance benefits are worth the increased API surface. Maybe you could call it |
Well, that took a bit longer to implement than I expected. There are at least two print errors in the paper, one causing an infinite loop, and one incomplete expression. The tests pass now, but I had to allow for much larger errors in the mean and especially the variance; not sure if that's due to an unlucky seed, or if there's a subtle bug left in there. Could also be a problem with the algorithm itself; the paper doesn't include these measurements. Algorithm HIN is fine, the problem is only with the tests exercising algorithm H2PE.

EDIT: As a quick test, I removed steps 4.1 and 4.2, which are alternate acceptance conditions, purely intended as optimizations. The error of the measured mean goes way down to acceptable levels, but the variance is still out of whack.

EDIT 2: That helped narrow it down. Found the errors, one

The AppVeyor build seems to have failed due to unrelated network issues.

@peteroupc My implementation just truncates samples from an

As for

Out of curiosity, I also compared |
It is exact in theory, assuming that computers can store, process, and generate any real number of any precision (even infinite precision), but not exact in practice, especially where floating-point number formats with a fixed precision are involved. On real-life computers, exact sampling can generally be achieved only by rejection sampling and/or arbitrary-precision arithmetic. See the Bringmann paper, especially Appendices A and B, for details, but again, it isn't trivial since it involves (approximations of) arbitrary-precision logarithms. On efficiency: In the case of the geometric distribution, when p is at 1/3 or greater, the trivial algorithm of drawing Bernoulli(p) trials until a success happens "is probably difficult to beat in any programming environment" (Devroye, L., "Non-Uniform Random Variate Generation", 1986, p. 498). (You can see that already with your optimized geometric(1/2) sampler.) This trivial algorithm, though, is not necessarily efficient when p is close to 0, unlike your implementation that employs inversion (by truncating exponentials), which is efficient for any parameter p (again with the assumption above). Translating this inversion algorithm to be exact on real-life computers will likewise lead to an efficient algorithm in practice (in fact this is one of two geometric sampling algorithms in the Bringmann paper; this one is in Appendix A because the paper also showcases another with an optimal time complexity compared to this one). |
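The inversion method mentioned above (truncating an exponential variate) can be sketched in a few lines. This is a minimal illustration, not the PR's implementation, and the name `geometric_by_inversion` is made up; as the comment above explains, in `f64` arithmetic it inherits rounding error, so it is only exact under the idealized real-arithmetic assumption.

```rust
// Sketch of geometric sampling by inversion: for U uniform in (0, 1),
// floor(ln(U) / ln(1 - p)) is Geometric(p)-distributed (number of
// failures before the first success). ln(U) is an exponential variate
// up to sign; dividing by ln(1 - p) rescales it, and the floor truncates.
fn geometric_by_inversion(u: f64, p: f64) -> u64 {
    assert!(u > 0.0 && u < 1.0, "u must lie in the open interval (0, 1)");
    assert!(p > 0.0 && p < 1.0, "p must lie in the open interval (0, 1)");
    (u.ln() / (1.0 - p).ln()).floor() as u64
}

fn main() {
    // With p = 0.5 and u = 0.3: ln(0.3)/ln(0.5) ≈ 1.737, so the sample is 1.
    println!("{}", geometric_by_inversion(0.3, 0.5)); // prints 1
}
```

Unlike the Bernoulli-trials loop, this costs one uniform variate and two logarithms regardless of p, which is why it stays efficient as p approaches 0.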
@peteroupc Ah, I see my error now. I thought that this would be exact given an exact sampling algorithm for the exponential distribution. I failed to consider possible errors in the calculation of

I can sort of make sense of the algorithm given in the paper: choose

What I don't get is why the bitwise sampling of

Now that I think about it, I guess the docs should mention that it's technically the bounded geometric distribution with |
Indeed, sampling |
@teryror Could you please add the two new distributions to the list in the docs in |
Already did that here, and updated it during the rename. I listed hypergeometric under miscellaneous distributions, though. |
Alright, I implemented the suggested exact algorithm (minus the bitwise sampling), and with that, I think I'm done here, unless anyone finds issues for me to fix. |
This looks good! The only possibly important issue I found when comparing to the reference was a use of |
All done! |
Fair enough. Could you please add a comment documenting why we are deviating from the reference here?
…On Fri, Nov 20, 2020, 17:09 Tristan Dannenberg ***@***.***> wrote:

In rand_distr/src/hypergeometric.rs:
```rust
let n = total_population_size;
let (mut sign_x, mut offset_x) = (1, 0);
let (n1, n2) = {
    // switch around success and failure states if necessary to ensure n1 <= n2
    let population_without_feature = n - population_with_feature;
    if population_with_feature > population_without_feature {
        sign_x = -1;
        offset_x = sample_size as i64;
        (population_without_feature, population_with_feature)
    } else {
        (population_with_feature, population_without_feature)
    }
};
// when sampling more than half the total population, take the smaller
// group as sampled instead (we can then return n1-x instead):
let k = if sample_size <= n / 2 {
```
Maybe so; I'm confident this is right, though.

When `n` is even, it doesn't make a difference whether we switch when `k == n/2`.

When `n` is odd, `n/2 < n - n/2`, so if we switch when `k == n/2`, we're actually taking the bigger group as sampled, which runs counter to the stated intent.
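The integer-division argument above can be checked numerically; this tiny snippet (not part of the PR) just evaluates both group sizes for an odd and an even `n`:

```rust
// For odd n, integer division makes n/2 strictly smaller than its
// complement n - n/2, so switching at k == n/2 would pick the larger
// group; for even n the two halves are equal and it doesn't matter.
fn main() {
    for n in [7u64, 8] {
        let (half, rest) = (n / 2, n - n / 2);
        println!("n = {n}: n/2 = {half}, n - n/2 = {rest}");
    }
    // n = 7: 3 vs 4 (strictly smaller); n = 8: 4 vs 4 (equal)
}
```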
Great, thanks!
@vks we now have 16 commits in the git log not specifically linked to this PR. I'd prefer we didn't "rebase merge" with more than one (or maybe two) commits. Merge commits are the best in this case, or squashing if there's no desire to keep individual commits in the log (they remain visible in the PR in any case). CC @newpavlov |
This is why I strongly prefer the "squash and merge" option. :) Having one commit per PR also helps in updating changelogs before release. In most of the RustCrypto repositories, I even set it as the only available option for merging PRs. |
@dhardy Sure! |
I needed both these distributions recently (for procedural generation and numerical simulation, respectively), and I figured I might as well contribute them for other people's convenience.
I use the obvious sampling algorithm for the hypergeometric distribution, which is exact, but runs in `O(n)` time and requires `n` uniform variates. That was good enough for my purposes, but a better, rejection-based algorithm called `H2PEC` apparently exists; I just couldn't find any material explaining it. The original 1988 paper seems to be lost to time, save for the first two pages. If someone can point me in the right direction, I'd be willing to do the rest of the work there.

I also touched rand_distr/src/normal.rs so as to run the tests with `--no-default-features`.
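The "obvious" `O(n)` algorithm referred to above can be sketched like this. This is an illustration under my own naming (`hypergeometric_naive`, `uniform_below`), not the PR's code; `uniform_below(bound)` stands in for an RNG call returning a uniform integer in `[0, bound)`.

```rust
// Sketch of the obvious exact hypergeometric sampler: draw the sample
// one individual at a time without replacement, updating the remaining
// counts, and count how many draws carry the feature. Costs one uniform
// variate per draw, hence O(n) time and n variates for sample size n.
fn hypergeometric_naive(
    mut population: u64,                       // total population size N
    mut with_feature: u64,                     // successes K in the population
    sample_size: u64,                          // number of draws n
    mut uniform_below: impl FnMut(u64) -> u64, // uniform integer in [0, bound)
) -> u64 {
    assert!(sample_size <= population);
    let mut successes = 0;
    for _ in 0..sample_size {
        // Each draw succeeds with probability with_feature / population.
        if uniform_below(population) < with_feature {
            successes += 1;
            with_feature -= 1;
        }
        population -= 1;
    }
    successes
}

fn main() {
    // Deterministic "RNG" that always returns 0: every draw succeeds while
    // successes remain, so sampling 3 from a population with K = 2 yields 2.
    println!("{}", hypergeometric_naive(10, 2, 3, |_| 0)); // prints 2
}
```

Rejection-based samplers like `H2PEC` avoid the per-draw loop entirely, which is what makes them attractive for large sample sizes.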