Add geometric and hypergeometric distributions #1062

Merged: 16 commits merged into rust-random:master on Nov 21, 2020

Conversation

@teryror (Contributor) commented Oct 24, 2020

I needed both these distributions recently (for procedural generation and numerical simulation, respectively), and I figured I might as well contribute them for other people's convenience.

I use the obvious sampling algorithm for the hypergeometric distribution, which is exact, but runs in O(n) time and requires n uniform variates. That was good enough for my purposes, but a better, rejection-based algorithm called H2PEC apparently exists; I just couldn't find any material explaining it. The original 1988 paper seems to be lost to time, save for the first two pages. If someone can point me in the right direction, I'd be willing to do the rest of the work there.
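For reference, the "obvious" algorithm is just drawing the n items one at a time without replacement; here is a minimal sketch of the idea (the function name is made up, and this is not the exact code in this PR):

use rand::Rng;

// Sketch only: population of `total` items, `successes` of which are marked;
// draw `n` items without replacement and count how many marked ones we get.
fn hypergeometric_naive<R: Rng + ?Sized>(
    rng: &mut R,
    mut total: u64,
    mut successes: u64,
    n: u64,
) -> u64 {
    assert!(successes <= total && n <= total);
    let mut count = 0;
    for _ in 0..n {
        // The next draw is a success with probability
        // (remaining successes) / (remaining items).
        if rng.gen_bool(successes as f64 / total as f64) {
            count += 1;
            successes -= 1;
        }
        total -= 1;
    }
    count
}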

I also touched rand_distr/src/normal.rs so as to run the tests with --no-default-features.

@vks (Collaborator) commented Oct 24, 2020

Thanks, these look like good additions!

We are using a rejection algorithm by Kachitvichyanukul and Schmeiser from 1988 for sampling the binomial distribution, so we can hopefully find their hypergeometric variant as well.

@vks (Collaborator) commented Oct 24, 2020

I was able to find their article on hypergeometric sampling on ResearchGate. It seems the PDF is available there and complete.

@teryror (Contributor, Author) commented Oct 24, 2020

Awesome, thank you! I'll get to work on that tomorrow.

@newpavlov (Member) commented:

I think it's worth keeping the full name. Geo is a bit confusing, since it is usually used as a shorthand and prefix for "geographic"; plus, Geometric pairs better with Hypergeometric.

@teryror (Contributor, Author) commented Oct 24, 2020

No problem, I'll do the rename first thing in the morning. My thought process was that it would pair better with Exponential, and Geo seems to be a common shorthand in probability theory. Before I came to that conclusion, I considered putting both in the same mod geometric. Would that also be preferable to you?

By the way, do you folks think it's worth having an optimized implementation for Geometric(0.5)? Something like

use rand::{distributions::Distribution, Rng};

pub struct Geometric1in2; // WTH should this be called, though?

impl Distribution<u64> for Geometric1in2 {
    // Count leading 1-bits (failures) before the first 0-bit (success);
    // if a whole 64-bit word is all ones, continue with the next word.
    fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> u64 {
        let mut result = 0;
        loop {
            let x = rng.gen::<u64>().leading_ones() as u64;
            result += x;
            if x < 64 { break; }
        }
        result
    }
}

I didn't test or benchmark this yet, but I've used similar code before (without the loop, which is what makes this version exact), and in the vast majority of cases this is just a couple of extra cycles on top of the rng.gen() call, which seems pretty great compared to the wrapped Exp. There's a similar technique for binomial distributions with p = 0.5 and small n (how small should be determined by benchmark), using rng.gen::<u64>().count_ones().
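For the binomial case I mean something along these lines (untested sketch with a made-up name; the mask keeps exactly n fair coin flips, so this only covers 1 <= n <= 64):

use rand::Rng;

// Sketch only: Binomial(n, 0.5) for 1 <= n <= 64, by counting set bits
// among `n` random bits (each bit is an independent fair coin).
fn binomial_half<R: Rng + ?Sized>(rng: &mut R, n: u32) -> u32 {
    assert!(n >= 1 && n <= 64);
    let mask = u64::MAX >> (64 - n); // keep exactly `n` random bits
    (rng.gen::<u64>() & mask).count_ones()
}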

I'd be happy to add both of those as well. Seemed like too much scope creep for an unprompted PR, though. (And again, I'm not sure it's actually worth having, so...)

@peteroupc commented Oct 26, 2020

For the geometric distribution with small values of p, there is an exact and efficient algorithm published in Bringmann, K. and Friedrich, T. (2013), "Exact and efficient generation of geometric random variates and random graphs", International Colloquium on Automata, Languages, and Programming, pp. 267-278. However, the algorithm is not exactly trivial, and its description there is not necessarily programmer-friendly.

EDIT (Oct. 27): Nevertheless, here is my implementation of this algorithm in Python.

@vks (Collaborator) commented Oct 26, 2020

By the way, do you folks think it's worth having an optimized implementation for Geometric(0.5)?

Depends on whether the performance benefits are worth the increased API surface. Maybe you could call it StandardGeometric in analogy to StandardNormal.

@teryror (Contributor, Author) commented Oct 27, 2020

Well, that took a bit longer to implement than I expected. There are at least two misprints in the paper: one causing an infinite loop, and one incomplete expression. The tests pass now, but I had to allow for much larger errors in the mean and especially the variance; not sure if that's due to an unlucky seed, or if there's a subtle bug left in there. It could also be a problem with the algorithm itself; the paper doesn't include these measurements. Algorithm HIN is fine; the problem is only with the tests exercising algorithm H2PE.

EDIT: As a quick test, I removed steps 4.1 and 4.2, which are alternate acceptance conditions, purely intended as optimizations. The error of the measured mean goes way down to acceptable levels, but the variance is still out of whack.

EDIT 2: That helped narrow it down. Found the errors: one >= where a <= should have gone, and two calls to ln_of_factorial that should have just been ln. Misread that one in the paper. :|

The AppVeyor build seems to have failed due to unrelated network issues.

@peteroupc My implementation just truncates samples from an Exp, which I thought was both exact and efficient for all values of p. Is that not true?
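(Concretely, by "truncates samples from an Exp" I mean roughly the following; a minimal sketch of the idea with a made-up name, not the exact code in the PR.)

use rand::Rng;
use rand_distr::{Distribution, Exp};

// Sketch only: if X ~ Exp(lambda) with lambda = -ln(1 - p), then floor(X)
// follows the geometric distribution counting failures before the first
// success, since P(floor(X) >= k) = exp(-lambda * k) = (1 - p)^k.
fn geometric_by_truncation<R: Rng + ?Sized>(rng: &mut R, p: f64) -> u64 {
    assert!(p > 0.0 && p < 1.0);
    let lambda = -(1.0 - p).ln();
    let exp = Exp::new(lambda).expect("lambda is positive and finite");
    exp.sample(rng).floor() as u64
}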

As for Geometric vs. StandardGeometric: I just ran some benchmarks, and the optimized code is indeed around 7 times faster on my machine (Ubuntu 18.04.5 LTS, Intel i7-4770K @ 3.5GHz x 8). Seems significant enough to me, but I'll wait on the go-ahead before pushing that code.

Out of curiosity, I also compared Exp and Exp1 (because the latter had no representation in the benchmarks), and Exp1 turns out to actually be slower by about 25%. You may want to consider deprecating that, after benchmarking on more platforms, of course.

@peteroupc commented Oct 27, 2020

@peteroupc My implementation just truncates samples from an Exp, which I thought was both exact and efficient for all values of p. Is that not true?

It is exact in theory, assuming that computers can store, process, and generate any real number of any precision (even infinite precision), but not exact in practice, especially where floating-point number formats with a fixed precision are involved. On real-life computers, exact sampling can generally be achieved only by rejection sampling and/or arbitrary-precision arithmetic. See the Bringmann paper, especially Appendices A and B, for details, but again, it isn't trivial since it involves (approximations of) arbitrary-precision logarithms.

On efficiency: In the case of the geometric distribution, when p is at 1/3 or greater, the trivial algorithm of drawing Bernoulli(p) trials until a success happens "is probably difficult to beat in any programming environment" (Devroye, L., "Non-Uniform Random Variate Generation", 1986, p. 498). (You can see that already with your optimized geometric(1/2) sampler.) This trivial algorithm, though, is not necessarily efficient when p is close to 0, unlike your implementation that employs inversion (by truncating exponentials), which is efficient for any parameter p (again with the assumption above). Translating this inversion algorithm to be exact on real-life computers will likewise lead to an efficient algorithm in practice (in fact this is one of two geometric sampling algorithms in the Bringmann paper; this one is in Appendix A because the paper also showcases another with an optimal time complexity compared to this one).

@teryror (Contributor, Author) commented Oct 27, 2020

@peteroupc Ah, I see my error now. I thought that this would be exact given an exact sampling algorithm for the exponential distribution. I failed to consider possible errors in the calculation of lambda as a function of p.

I can sort of make sense of the algorithm given in the paper: choose k such that 1/2 <= p' = (1 - p)^(2^k) < 1, and use the trivial algorithm to sample D from Geo(p'), where D represents a sample from Geo(p) / 2^k, then use rejection sampling for the remainder M from Geo(p) % 2^k. Return 2^k * D + M.
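In other words, something like the following rough sketch (plain f64 arithmetic, so not exact, and the name is made up; here q = 1 - p, and D counts failures before a success of probability 1 - q^(2^k)):

use rand::Rng;

// Sketch only, overflow handling omitted: sample G ~ Geo(p) as 2^k * D + M,
// where D = G >> k is itself geometric and M = G & (2^k - 1) is a truncated
// geometric sampled by rejection from a uniform proposal.
fn geometric_small_p_sketch<R: Rng + ?Sized>(rng: &mut R, p: f64) -> u64 {
    assert!(p > 0.0 && p < 1.0);
    let q = 1.0 - p;

    // Largest k (capped at 63) such that q^(2^k) >= 1/2.
    let mut k = 0u32;
    let mut q_pow = q; // invariant: q_pow == q^(2^k)
    while q_pow * q_pow >= 0.5 && k < 63 {
        q_pow *= q_pow;
        k += 1;
    }

    // D: failures before the first success, with failure probability q^(2^k).
    let mut d = 0u64;
    while rng.gen_bool(q_pow) {
        d += 1;
    }

    // M: value in [0, 2^k) with weight proportional to q^m, via rejection from
    // a uniform proposal; the choice of k keeps the acceptance rate high.
    let mask = (1u64 << k) - 1;
    let m = loop {
        let candidate = rng.gen::<u64>() & mask;
        if rng.gen_bool(q.powf(candidate as f64)) {
            break candidate;
        }
    };

    (d << k) + m
}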

What I don't get is why the bitwise sampling of M should be necessary (if I'm reading the paper right, it's not?) - I'm guessing you're doing that because Python uses variable-width integers for everything, right? Since I'm dealing with 64-bit integers exclusively, shouldn't it be more efficient to just generate k random bits at once?

Now that I think about it, I guess the docs should mention that it's technically the bounded geometric distribution with bound = u64::MAX.

@peteroupc commented:

What I don't get is why the bitwise sampling of M should be necessary (if I'm reading the paper right, it's not?) - I'm guessing you're doing that because Python uses variable-width integers for everything, right? Since I'm dealing with 64-bit integers exclusively, shouldn't it be more efficient to just generate k random bits at once?

Indeed, sampling M bit by bit is not necessary for correctness; this merely has the advantage of reducing the expected number of random bits used by the geometric sampler, especially when k is high. However, this bit-by-bit sampling complicates the algorithm's implementation (compare the two helper functions in my implementation).

@vks (Collaborator) commented Oct 27, 2020

Bitwise sampling is out of scope for now, because the current RNG API does not support this use case very well. See #1014 and #1031 for more details and possible API extensions.

(Review thread on rand_distr/src/geometric.rs: outdated, resolved.)
@vks (Collaborator) commented Oct 27, 2020

@teryror Could you please add the two new distributions to the list in the docs in lib.rs? They should probably go next to Binomial.

@teryror (Contributor, Author) commented Oct 27, 2020

@teryror Could you please add the two new distributions to the list in the docs in lib.rs? They should probably go next to Binomial.

I already did that here and updated it during the rename. I listed Hypergeometric under the misc. distributions, though.

@teryror (Contributor, Author) commented Oct 28, 2020

Alright, I implemented the suggested exact algorithm (minus the bitwise sampling), and with that, I think I'm done here, unless anyone finds issues for me to fix.

(Review thread on rand_distr/src/geometric.rs: outdated, resolved.)
@vks (Collaborator) commented Oct 28, 2020

This looks good! The only possibly important issue I found when comparing to the reference was a use of g instead of 1.

@teryror (Contributor, Author) commented Oct 28, 2020

All done!

@dhardy (Member) commented Nov 2, 2020

Thanks for the PR @teryror. @vks I trust you'll merge this when ready? I don't have anything to add.

@vks (Collaborator) commented Nov 20, 2020 via email

@vks (Collaborator) left a review comment:

Great, thanks!

@vks merged commit 82229c6 into rust-random:master on Nov 21, 2020
@dhardy (Member) commented Dec 7, 2020

@vks we now have 16 commits in the git log not specifically linked to this PR. I'd prefer we didn't "rebase merge" with more than one (or maybe two) commits. Merge commits are the best in this case, or squashing if there's no desire to keep individual commits in the log (they remain visible in the PR in any case). CC @newpavlov

@newpavlov (Member) commented:

This is why I strongly prefer the "squash and merge" option. :) Having one commit per PR also helps with updating changelogs before a release. In most of the RustCrypto repositories I have even set it as the only available option for merging PRs.

@dhardy (Member) commented Dec 7, 2020

"Merge commit" also works fine IMO, providing you're using tools like qgit or appropriate log options. E.g. (using a different repo with merge commits):
(Screenshot: qgit log view of a repository with merge commits.)

@vks (Collaborator) commented Dec 7, 2020

@dhardy Sure!
