Add geometric and hypergeometric distributions #1062

Merged: 16 commits merged into rust-random:master on Nov 21, 2020

Conversation

@teryror (Contributor) commented Oct 24, 2020

I needed both these distributions recently (for procedural generation and numerical simulation, respectively), and I figured I might as well contribute them for other people's convenience.

I use the obvious sampling algorithm for the hypergeometric distribution, which is exact, but runs in O(n) time and requires n uniform variates. That was good enough for my purposes, but a better, rejection-based algorithm called H2PEC apparently exists; I just couldn't find any material explaining it. The original 1988 paper seems to be lost to time, save for the first two pages. If someone can point me in the right direction, I'd be willing to do the rest of the work there.
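For reference, the "obvious" algorithm is just drawing the n items one at a time without replacement; here is a minimal sketch of the idea (the function name is made up, and this is not the exact code in this PR):

use rand::Rng;

// Sketch only: population of `total` items, `successes` of which are marked;
// draw `n` items without replacement and count how many marked ones we get.
fn hypergeometric_naive<R: Rng + ?Sized>(
    rng: &mut R,
    mut total: u64,
    mut successes: u64,
    n: u64,
) -> u64 {
    assert!(successes <= total && n <= total);
    let mut count = 0;
    for _ in 0..n {
        // The next draw is a success with probability
        // (remaining successes) / (remaining items).
        if rng.gen_bool(successes as f64 / total as f64) {
            count += 1;
            successes -= 1;
        }
        total -= 1;
    }
    count
}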

I also touched rand_distr/src/normal.rs so as to run the tests with --no-default-features.

@vks (Collaborator) commented Oct 24, 2020

Thanks, these look like good additions!

We are using a rejection algorithm by Kachitvichyanukul and Schmeiser from 1988 for sampling the binomial distribution, so we can hopefully find their hypergeometric variant as well.

@vks (Collaborator) commented Oct 24, 2020

I was able to find their article on hypergeometric sampling on ResearchGate. It seems the PDF is available there and complete.

@teryror (Contributor, Author) commented Oct 24, 2020

Awesome, thank you! I'll get to work on that tomorrow.

@newpavlov (Member) commented:

I think it's worth keeping the full name. Geo is a bit confusing, since it is usually used as a shorthand and prefix for "geographic"; plus, Geometric pairs better with Hypergeometric.

@teryror (Contributor, Author) commented Oct 24, 2020

No problem, I'll do the rename first thing in the morning. My thought process was that it would pair better with Exponential, and Geo seems to be a common shorthand in probability theory. Before I came to that conclusion, I considered putting both in the same mod geometric. Would that also be preferable to you?

By the way, do you folks think it's worth having an optimized implementation for Geometric(0.5)? Something like

use rand::{distributions::Distribution, Rng};

pub struct Geometric1in2; // WTH should this be called, though?

impl Distribution<u64> for Geometric1in2 {
    // Count leading 1-bits (failures) before the first 0-bit (success);
    // if a whole 64-bit word is all ones, continue with the next word.
    fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> u64 {
        let mut result = 0;
        loop {
            let x = rng.gen::<u64>().leading_ones() as u64;
            result += x;
            if x < 64 { break; }
        }
        result
    }
}

I didn't test or benchmark this yet, but I've used similar code before (without the loop, which is what makes this version exact), and in the vast majority of cases this is just a couple of extra cycles on top of the rng.gen() call, which seems pretty great compared to the wrapped Exp. There's a similar technique for binomial distributions with p = 0.5 and small n (how small should be determined by benchmark), using rng.gen::<u64>().count_ones().
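For the binomial case I mean something along these lines (untested sketch with a made-up name; the mask keeps exactly n fair coin flips, so this only covers 1 <= n <= 64):

use rand::Rng;

// Sketch only: Binomial(n, 0.5) for 1 <= n <= 64, by counting set bits
// among `n` random bits (each bit is an independent fair coin).
fn binomial_half<R: Rng + ?Sized>(rng: &mut R, n: u32) -> u32 {
    assert!(n >= 1 && n <= 64);
    let mask = u64::MAX >> (64 - n); // keep exactly `n` random bits
    (rng.gen::<u64>() & mask).count_ones()
}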

I'd be happy to add both of those as well. Seemed like too much scope creep for an unprompted PR, though. (And again, I'm not sure it's actually worth having, so...)

@peteroupc commented Oct 26, 2020

For the geometric distribution with small values of p, there is an exact and efficient algorithm published in Bringmann, K. and Friedrich, T. (2013), "Exact and efficient generation of geometric random variates and random graphs", International Colloquium on Automata, Languages, and Programming, pp. 267-278. However, the algorithm is not exactly trivial, and its description there is not necessarily programmer-friendly.

EDIT (Oct. 27): Nevertheless, here is my implementation of this algorithm in Python.

@vks (Collaborator) commented Oct 26, 2020

By the way, do you folks think it's worth having an optimized implementation for Geometric(0.5)?

Depends on whether the performance benefits are worth the increased API surface. Maybe you could call it StandardGeometric in analogy to StandardNormal.

@teryror (Contributor, Author) commented Oct 27, 2020

Well, that took a bit longer to implement than I expected. There are at least two misprints in the paper: one causing an infinite loop, and one incomplete expression. The tests pass now, but I had to allow for much larger errors in the mean and especially the variance; not sure if that's due to an unlucky seed, or if there's a subtle bug left in there. It could also be a problem with the algorithm itself; the paper doesn't include these measurements. Algorithm HIN is fine; the problem is only with the tests exercising algorithm H2PE.

EDIT: As a quick test, I removed steps 4.1 and 4.2, which are alternate acceptance conditions, purely intended as optimizations. The error of the measured mean goes way down to acceptable levels, but the variance is still out of whack.

EDIT 2: That helped narrow it down. Found the errors: one >= where a <= should have gone, and two calls to ln_of_factorial that should have just been ln. Misread that one in the paper. :|

The AppVeyor build seems to have failed due to unrelated network issues.

@peteroupc My implementation just truncates samples from an Exp, which I thought was both exact and efficient for all values of p. Is that not true?
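(Concretely, by "truncates samples from an Exp" I mean roughly the following; a minimal sketch of the idea with a made-up name, not the exact code in the PR.)

use rand::Rng;
use rand_distr::{Distribution, Exp};

// Sketch only: if X ~ Exp(lambda) with lambda = -ln(1 - p), then floor(X)
// follows the geometric distribution counting failures before the first
// success, since P(floor(X) >= k) = exp(-lambda * k) = (1 - p)^k.
fn geometric_by_truncation<R: Rng + ?Sized>(rng: &mut R, p: f64) -> u64 {
    assert!(p > 0.0 && p < 1.0);
    let lambda = -(1.0 - p).ln();
    let exp = Exp::new(lambda).expect("lambda is positive and finite");
    exp.sample(rng).floor() as u64
}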

As for Geometric vs. StandardGeometric: I just ran some benchmarks, and the optimized code is indeed around 7 times faster on my machine (Ubuntu 18.04.5 LTS, Intel i7-4770K @ 3.5GHz x 8). Seems significant enough to me, but I'll wait on the go-ahead before pushing that code.

Out of curiosity, I also compared Exp and Exp1 (because the latter had no representation in the benchmarks), and Exp1 turns out to actually be slower by about 25%. You may want to consider deprecating that, after benchmarking on more platforms, of course.

@peteroupc commented Oct 27, 2020

@peteroupc My implementation just truncates samples from an Exp, which I thought was both exact and efficient for all values of p. Is that not true?

It is exact in theory, assuming that computers can store, process, and generate any real number of any precision (even infinite precision), but not exact in practice, especially where floating-point number formats with a fixed precision are involved. On real-life computers, exact sampling can generally be achieved only by rejection sampling and/or arbitrary-precision arithmetic. See the Bringmann paper, especially Appendices A and B, for details, but again, it isn't trivial since it involves (approximations of) arbitrary-precision logarithms.

On efficiency: In the case of the geometric distribution, when p is at 1/3 or greater, the trivial algorithm of drawing Bernoulli(p) trials until a success happens "is probably difficult to beat in any programming environment" (Devroye, L., "Non-Uniform Random Variate Generation", 1986, p. 498). (You can see that already with your optimized geometric(1/2) sampler.) This trivial algorithm, though, is not necessarily efficient when p is close to 0, unlike your implementation that employs inversion (by truncating exponentials), which is efficient for any parameter p (again with the assumption above). Translating this inversion algorithm to be exact on real-life computers will likewise lead to an efficient algorithm in practice (in fact this is one of two geometric sampling algorithms in the Bringmann paper; this one is in Appendix A because the paper also showcases another with an optimal time complexity compared to this one).

@teryror (Contributor, Author) commented Oct 27, 2020

@peteroupc Ah, I see my error now. I thought that this would be exact given an exact sampling algorithm for the exponential distribution. I failed to consider possible errors in the calculation of lambda as a function of p.

I can sort of make sense of the algorithm given in the paper: choose k such that 1/2 <= p' = (1 - p)^(2^k) < 1, and use the trivial algorithm to sample D from Geo(p'), where D represents a sample from Geo(p) / 2^k, then use rejection sampling for the remainder M from Geo(p) % 2^k. Return 2^k * D + M.
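In other words, something like the following rough sketch (plain f64 arithmetic, so not exact, and the name is made up; here q = 1 - p, and D counts failures before a success of probability 1 - q^(2^k)):

use rand::Rng;

// Sketch only, overflow handling omitted: sample G ~ Geo(p) as 2^k * D + M,
// where D = G >> k is itself geometric and M = G & (2^k - 1) is a truncated
// geometric sampled by rejection from a uniform proposal.
fn geometric_small_p_sketch<R: Rng + ?Sized>(rng: &mut R, p: f64) -> u64 {
    assert!(p > 0.0 && p < 1.0);
    let q = 1.0 - p;

    // Largest k (capped at 63) such that q^(2^k) >= 1/2.
    let mut k = 0u32;
    let mut q_pow = q; // invariant: q_pow == q^(2^k)
    while q_pow * q_pow >= 0.5 && k < 63 {
        q_pow *= q_pow;
        k += 1;
    }

    // D: failures before the first success, with failure probability q^(2^k).
    let mut d = 0u64;
    while rng.gen_bool(q_pow) {
        d += 1;
    }

    // M: value in [0, 2^k) with weight proportional to q^m, via rejection from
    // a uniform proposal; the choice of k keeps the acceptance rate high.
    let mask = (1u64 << k) - 1;
    let m = loop {
        let candidate = rng.gen::<u64>() & mask;
        if rng.gen_bool(q.powf(candidate as f64)) {
            break candidate;
        }
    };

    (d << k) + m
}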

What I don't get is why the bitwise sampling of M should be necessary (if I'm reading the paper right, it's not?) - I'm guessing you're doing that because Python uses variable-width integers for everything, right? Since I'm dealing with 64-bit integers exclusively, shouldn't it be more efficient to just generate k random bits at once?

Now that I think about it, I guess the docs should mention that it's technically the bounded geometric distribution with bound = u64::MAX.

@peteroupc commented:

What I don't get is why the bitwise sampling of M should be necessary (if I'm reading the paper right, it's not?) - I'm guessing you're doing that because Python uses variable-width integers for everything, right? Since I'm dealing with 64-bit integers exclusively, shouldn't it be more efficient to just generate k random bits at once?

Indeed, sampling M bit by bit is not necessary for correctness; this merely has the advantage of reducing the expected number of random bits used by the geometric sampler, especially when k is high. However, this bit-by-bit sampling complicates the algorithm's implementation (compare the two helper functions in my implementation).

@vks (Collaborator) commented Oct 27, 2020

Bitwise sampling is out of scope for now, because the current RNG API does not support this use case very well. See #1014 and #1031 for more details and possible API extensions.

(Review thread on rand_distr/src/geometric.rs: outdated, resolved.)
@vks (Collaborator) commented Oct 27, 2020

@teryror Could you please add the two new distributions to the list in the docs in lib.rs? They should probably go next to Binomial.

@teryror (Contributor, Author) commented Oct 27, 2020

@teryror Could you please add the two new distributions to the list in the docs in lib.rs? They should probably go next to Binomial.

I already did that here and updated it during the rename. I listed Hypergeometric under the misc. distributions, though.

@teryror (Contributor, Author) commented Oct 28, 2020

Alright, I implemented the suggested exact algorithm (minus the bitwise sampling), and with that, I think I'm done here, unless anyone finds issues for me to fix.

(Review thread on rand_distr/src/geometric.rs: outdated, resolved.)
@vks (Collaborator) commented Oct 28, 2020

This looks good! The only possibly important issue I found when comparing to the reference was a use of g instead of 1.

@teryror (Contributor, Author) commented Oct 28, 2020

All done!

@dhardy (Member) commented Nov 2, 2020

Thanks for the PR @teryror. @vks I trust you'll merge this when ready? I don't have anything to add.

@vks (Collaborator) commented Nov 20, 2020 via email

@vks (Collaborator) left a review comment:

Great, thanks!

@vks merged commit 82229c6 into rust-random:master on Nov 21, 2020
@dhardy (Member) commented Dec 7, 2020

@vks we now have 16 commits in the git log not specifically linked to this PR. I'd prefer we didn't "rebase merge" with more than one (or maybe two) commits. Merge commits are the best in this case, or squashing if there's no desire to keep individual commits in the log (they remain visible in the PR in any case). CC @newpavlov

@newpavlov (Member) commented:

This is why I strongly prefer the "squash and merge" option. :) Having one commit per PR also helps with updating changelogs before a release. In most of the RustCrypto repositories I have even set it as the only available option for merging PRs.

@dhardy (Member) commented Dec 7, 2020

"Merge commit" also works fine IMO, providing you're using tools like qgit or appropriate log options. E.g. (using a different repo with merge commits):
(Screenshot: qgit log view of a repository with merge commits.)

@vks (Collaborator) commented Dec 7, 2020

@dhardy Sure!
