Compromising accuracy and speed [distributions] #494
Comments
**Suggestion 1: we just choose defaults with reasonable precision**

Since computing power is inherently limited, sufficiently small biases are unlikely to be noticeable. But how small is good enough? Modern CPUs run at up to around 2^32 cycles/second, and a year has around 2^25 seconds, for around 2^57 cycles/year/core. The top supercomputers now have in excess of a million cores (say around 2^24) with around 100 PFlop/s, so 10^17 × 3×10^7 ≈ 3×10^24 (approx 2^81) floating-point operations per year.

What does this tell us? A bias as small as 1 in 2^128 is unlikely to be detectable, ever (at least without quantum computers), while a bias of 1 in 2^53 may in principle be detectable. Can we say anything else? Once below around 1 in 2^20, bias is hard to detect without specifically looking for it, and below 1 in 2^40 doing so requires looking quite hard; hence biases this small may be good enough for most purposes, though certainly not all.

(Sorry @pitdicker for paraphrasing a lot of what you said a while back; I forget where.)
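A quick sanity check on these thresholds (my own framing, not from the comment): by standard 1/√N statistics, a relative bias of 2^-k only rises above sampling noise after roughly 2^(2k) samples.

```rust
/// Back-of-envelope estimate: the standard error of a frequency estimate
/// after n samples shrinks like 1/sqrt(n), so a relative bias of 2^-k needs
/// on the order of 2^(2k) samples before it becomes distinguishable.
fn samples_to_detect_bias(k: u32) -> f64 {
    (2.0_f64).powi(2 * k as i32)
}
```

On these numbers, a 2^-20 bias needs roughly 2^40 samples (hours of CPU time), while a 2^-64 bias needs roughly 2^128, far beyond the yearly compute budget estimated above.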
**Suggestion 2: use feature flags to control precision**

If a bias of around 1 in 2^20–2^40 is acceptable to most users, then what if we accept such choices as defaults, but add a `high_precision` feature flag to opt in to more accurate algorithms?

There is a potential catch here: activation of higher precision becomes application-wide, and if any lib in the application enables Rand's `high_precision` feature, every user of Rand in that build gets it.

Further, this is only going to be a good choice if the high-precision alternatives are still reasonably fast, since users can't pick which algorithm to use at each use-site.

This would probably not affect compilation time or code size much, since the algorithms would be behind feature flags, thus unused code can be eliminated early.

Summary: keeps the API clean and simple; Rand still gets a little more complex (multiple implementations); limited control.
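A minimal sketch of how such a flag could select an implementation at compile time. The `high_precision` feature name comes from the thread; the helper function and the fallback body are invented for illustration.

```rust
/// Default conversion: 53 random bits mapped to [0, 1). Compiled out when the
/// hypothetical `high_precision` feature is enabled.
#[cfg(not(feature = "high_precision"))]
fn bits_to_f64(x: u64) -> f64 {
    (x >> 11) as f64 * (1.0 / (1u64 << 53) as f64)
}

/// Stand-in for a slower, more precise conversion (e.g. one drawing extra
/// random bits to fill the mantissa of small results).
#[cfg(feature = "high_precision")]
fn bits_to_f64(x: u64) -> f64 {
    // A real implementation would differ; this sketch only shows the
    // compile-time selection mechanism.
    (x >> 11) as f64 * (1.0 / (1u64 << 53) as f64)
}
```

The catch described above is visible here: Cargo unifies features across the dependency graph, so one crate enabling `high_precision` flips this function for every user of Rand in the build.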
**Suggestion 3: multiple implementations, as necessary**

This is basically the current plan regarding high-precision float sampling (e.g. the proposed `HighPrecision01` distribution, #372): keep fast defaults, and add separate distributions where users need something else.

Summary: flexible but with extra API complexity and sometimes worse ergonomics.
**Suggestion 4: generic, over some configuration type**

What if we added dummy marker types (say, `HighPrecision` and a fast counterpart) and made distributions generic over them?

This would probably be quite bad for ergonomics (if it even worked as designed), since presumably users could not omit the generic parameter any more.

In my opinion the poor ergonomics and extra API complexity are unacceptable; as it is now Rand is already confusing some users.
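A sketch of what the marker-type approach might look like, with invented names (`Fast`, `HighPrecision`, `Uniform01`). A default type parameter softens, but does not remove, the ergonomics problem described above: any explicit turbofish still has to spell the parameter out.

```rust
use std::marker::PhantomData;

/// Invented marker types standing in for the "configuration type".
pub struct Fast;
pub struct HighPrecision;

pub trait Precision {
    fn bits_to_f64(bits: u64) -> f64;
}

impl Precision for Fast {
    fn bits_to_f64(bits: u64) -> f64 {
        // 53 random bits mapped to [0, 1).
        (bits >> 11) as f64 * (1.0 / (1u64 << 53) as f64)
    }
}

impl Precision for HighPrecision {
    fn bits_to_f64(bits: u64) -> f64 {
        // Stand-in: a real version would consume more random bits.
        Fast::bits_to_f64(bits)
    }
}

/// A distribution generic over the precision marker; `Fast` is the default.
pub struct Uniform01<P: Precision = Fast>(PhantomData<P>);

impl<P: Precision> Uniform01<P> {
    pub fn new() -> Self {
        Uniform01(PhantomData)
    }
    pub fn sample_from_bits(&self, bits: u64) -> f64 {
        P::bits_to_f64(bits)
    }
}
```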
There's a variant of Suggestion 3, which is that rather than make the precision a property of which API is used, we make it a property of the RNG. I.e. something like a CSPRNG could indicate that it wants high-precision sampling, while a plain PRNG would get the fast algorithms.

Similarly, we might even be able to create wrappers which wrap an existing RNG and flip that choice.

I don't know if there's any good way to do this which doesn't introduce runtime cost though. It'd work to stick a flag on the RNG and branch on it, but then every sample pays for the check.

Also this approach would make the behaviour of a given piece of sampling code depend on which RNG it is handed, which could surprise users.
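A sketch of this RNG-property variant. All names here are invented (the trait stands in for rand's `RngCore`); it shows both the wrapper idea and the per-sample branch that is the runtime-cost worry.

```rust
/// Toy stand-in for rand's `RngCore`, extended with a hypothetical flag.
trait Rng {
    fn next_u64(&mut self) -> u64;
    /// Hypothetical: does this generator ask for high-precision sampling?
    fn use_high_precision_sampling(&self) -> bool {
        false
    }
}

/// Toy PRNG (a 64-bit LCG); defaults to fast sampling.
struct FastRng(u64);
impl Rng for FastRng {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

/// Wrapper that re-labels an existing RNG as wanting high precision.
struct HighPrecision<R: Rng>(R);
impl<R: Rng> Rng for HighPrecision<R> {
    fn next_u64(&mut self) -> u64 {
        self.0.next_u64()
    }
    fn use_high_precision_sampling(&self) -> bool {
        true
    }
}

/// The distribution branches at runtime -- the per-sample cost noted above.
/// In this sketch both paths share one conversion.
fn sample_f64<R: Rng>(rng: &mut R) -> f64 {
    let convert = |bits: u64| (bits >> 11) as f64 * (1.0 / (1u64 << 53) as f64);
    if rng.use_high_precision_sampling() {
        // Stand-in for a slower, more precise conversion.
        convert(rng.next_u64())
    } else {
        convert(rng.next_u64())
    }
}
```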
Ultimately I think Suggestion 1 is the most reasonable. Though it scares me a little to think that someone might use a CSPRNG together with a biased algorithm...
That is a very good summary, @dhardy! We seem to be in a bit of a difficult position, as Rand is used for some very different use cases:
In my opinion, we should strongly optimize for the 'common programming problems' (security does not really apply here). It seems reasonable to expect anyone who writes multiple-month simulations to spend an hour reading docs, or to know the limitations of the algorithms they are using. I like suggestion 1 most, but would also like to see 3 when reasonable. Specifically:
Letting the accuracy depend on the RNG as @sicking suggests is also an interesting idea. But the distinction should not be between normal PRNGs and CSPRNGs. Accuracy is not really a security-related problem. PRNGs that are suitable for simulations are also the ones that are suitable for many common problems. I would say the suitability depends more on where the function is used than on the RNG.
I'm not sure how @sicking's suggestion could be implemented other than by adding a method like `fn use_high_precision_sampling(&self) -> bool` to `RngCore` (specialisation might allow a different approach based on whether an additional trait, `HighPrecisionSamplingRng`, was implemented, but not yet). Ultimately I'm not so keen on the idea anyway because, as @pitdicker says, accuracy is more a use-case problem and less a security problem (though arguably enabling it globally within an application is a reasonable solution, hence suggestion 2).

We could also modify suggestion 2 by adding a `low_precision` feature in addition to (or instead of) the `high_precision` one.

Other than that, I just wanted to say that talking about `Bernoulli` as having 53/64-bit accuracy is a bit misleading: it currently supports probabilities down to 2^-64, but 53-bit (and even 24-bit) precise floats can represent far smaller positive numbers (`MIN_POSITIVE` is approx 10^-38 for `f32` and 10^-308 for `f64`).
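The point about representable probabilities can be checked directly (a small illustration, not rand code):

```rust
/// Smallest non-zero probability resolvable by comparing a u64 against a
/// 64-bit threshold, as a 64-bit `Bernoulli` comparison does.
fn min_p_via_u64_compare() -> f64 {
    (2.0_f64).powi(-64) // about 5.4e-20
}
```

Both `f32::MIN_POSITIVE` (about 1.2e-38) and `f64::MIN_POSITIVE` (about 2.2e-308) are far below this threshold, which is the gap being pointed out.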
Just a quick way to say the accuracy lies between 53 and 64 bits :-). Is a chance smaller than, say, 1 in 2^50 really a chance we see as a reasonable use?

B.t.w., I have been a bit absent the past week, but will try to get back to things this evening.
Another thing I want to say is that I'd prefer to err on the side of caution (i.e. more precision than necessary) than on the side of performance (for default choices). That said, I think some trade-offs can still be made.

This is really more political than it is technical: people are far more likely to criticise incorrect and prematurely-optimised approaches than slow ones; also, since in many cases performance will not be a real issue anyway, I don't think it makes sense to "over-optimise". Because of this, I'm not happy to skip the zone checks for range sampling by default. I therefore think that these range sampling optimisations (and perhaps also the 32-bit variants) should be opt-in.

Good point @pitdicker that anyone writing serious simulations will likely look over the docs closely, so slightly less obvious/ergonomic solutions for people needing high precision are fine (i.e. solution 3), though suggestion 2 may still be easier.

I guess the wrap-up question is: would any of these solutions make our lives easier (i.e. more obvious answers to speed-vs-quality compromises) vs harder (more code/complexity)?

No problem @pitdicker 😉
I'm skeptical for the following reasons:
I would like to propose a variation of suggestion 3.

**Suggestion 5:** split up the distributions into two modules:
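A sketch of what such a split might look like. Module and function names here are invented: the fast module trades a tiny bias (at most n/2^64) for a single widening multiply, while the precise module uses rejection.

```rust
/// Hypothetical fast module: widening-multiply range sampling.
mod fast {
    /// Map 64 random bits onto [0, n). Unless n divides 2^64, some results
    /// occur one extra time in 2^64 -- the "small bias" under discussion.
    pub fn uniform(bits: u64, n: u64) -> u64 {
        ((bits as u128 * n as u128) >> 64) as u64
    }
}

/// Hypothetical precise module: rejection sampling, unbiased but may retry.
mod precise {
    pub fn uniform(mut bits: impl FnMut() -> u64, n: u64) -> u64 {
        // Largest v such that [0, v] contains a whole number of copies of
        // each residue class modulo n; values above it are rejected.
        let zone = u64::MAX - (u64::MAX - n + 1) % n;
        loop {
            let v = bits();
            if v <= zone {
                return v % n;
            }
        }
    }
}
```

One consequence of this layout: after a `use`, the module prefix disappears from call-sites, so the two same-named items are only distinguishable at the import.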
Nice suggestion @vks! Two niggles:
With that in mind, perhaps the module split could work. One thing though: should we really make both variants available under the same type names? I guess we could have three variants of the module: the default plus the two precision-specific ones.
Not sure how they could be any shorter. I think the alternative suggestions except suggestion 1 are more unwieldy.
Yes, it should be safe for that use.
That's what I suggested: you would have to choose between the two modules. I don't think it is over-designed (unless providing more than one implementation is already over-designed), because it is backwards compatible and nicely allows opting into the desired implementation.
I mean the combined names (module plus type).
I see. Okay, sounds like a good plan, though it leaves two things out:
Yeah, I didn't mean to suggest otherwise. Mainly I was thinking that someone specifically requesting a cryptographically strong RNG likely wouldn't want to use an algorithm with bias as a default. I.e. if you're generating a cryptographically strong security token for authentication, I was thinking you likely don't want any bias creeping in. But I'd want to check with someone that knows crypto better than me before being sure about that.
I think the distributions are the relatively easy part to solve. Separate structs for fast vs. unbiased sampling would, I think, make for a clear and easy-to-implement API. Whether we do it by sticking the distributions in separate modules, or by using more expressive names, is not something I have an opinion about. The bigger problem is all the other APIs, i.e. convenience methods like `gen_range`.
I agree with @pitdicker that we should optimize for "common programming problems", which I feel means that sufficiently small biases are OK. Though I'm not sure what optimizing for "common programming problems" means for performance. Arguably it means that if making things faster comes at a cost of code complexity (which is often the case), we should only optimize the code if it's likely to show up in real-world profiles, rather than worry too much about squeezing everything out of our micro-benchmarks. But I'm happy to defer to others, especially in cases where the implementation strategy doesn't affect the public API and so is something that we can easily change over time.
They trade flexibility for convenience. Making them more flexible defeats their purpose, so I'm not sure what the alternative would be? I'm not a big fan of any of those methods TBH. This is different for the algorithms operating on sequences (e.g. `choose`, `shuffle`).

I'm not sure what "common programming problems" means, because it depends so strongly on the programs in question. Either of the alternatives @pitdicker mentioned may be "common" in certain domains. Maybe it would be more actionable to look at typical implementations used in the popular randomness libraries that have been used a lot without anyone complaining about biases.

Please note that "long-running simulations" are not necessarily "long" in terms of real time but rather in CPU time. On most supercomputers there is an upper limit on the runtime, ranging from a day up to a week or two.
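To make the sequence point concrete: methods like `shuffle` are built on integer range sampling, so they inherit whatever bias/speed trade-off that sampler makes. A toy Fisher-Yates sketch (not rand's code), parameterised over the sampler:

```rust
/// Fisher-Yates shuffle; `rand_below(n)` must return a value in [0, n).
/// Any bias in `rand_below` becomes bias over the resulting permutations.
fn shuffle<T>(items: &mut [T], mut rand_below: impl FnMut(u64) -> u64) {
    for i in (1..items.len()).rev() {
        let j = rand_below(i as u64 + 1) as usize; // j in 0..=i
        items.swap(i, j);
    }
}
```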
Why would it be more difficult than for the others? In any case, I think they should live in the same crate. It might make sense to move the distributions to their own crate. However, at the moment they are tied to some methods of `Rng`.
Agreed; in part also because trying to tie down an exact level of bias which is acceptable in general is difficult. Two things about @vks' suggestion still need finalising:
I don't really see it as difficult or vague. When we start to talk in terms of "this will take 1+ year of CPU-time before the tiniest bias becomes visible while doing nothing but calling this function", we are clearly outside the "common programming problems" territory.
That is always a good idea 😄.

I really don't think suggestion 2 is a good idea. The choice of precision really depends on how something is used. Say a bit more accuracy is wanted in one place: a feature flag would affect every other function, also in every other dependency.

I don't know what to think of suggestion 5. If we only consider `Uniform` it may be workable.
Seems like a good alternative to keep in mind. We already had the same discussion around PRNG algorithms. In my opinion it is a good thing for Rand to be opinionated in the functionality it includes. If we can't make good choices, or offer multiple variants where really needed, how do we expect users to choose?
Another approach occurred to me:

**Suggestion 6: use the type of the generated data to determine precision**

I noticed that `Rng::gen` already dispatches on the requested output type. We could use a similar approach for choosing between high-performance vs. low-bias algorithms, i.e. wrap the output type in a marker such as `Precise<f64>`.

This would unfortunately only work for APIs like `gen` which are generic over the output type.

I'm not sure that I particularly like this solution, but I wanted to add it for completeness.
In some cases this "wrapping type" notation can be used easily, but to be honest I found it awkward:

```rust
let Precise(x) = rng.gen();
// but the type of `x` must be specified somewhere, so you may require:
let Precise(x): Precise<f64> = rng.gen();
```
I now agree most with @pitdicker's view: that in most cases we can just pick a sensible default, and in a few we can add extra implementations (i.e. Suggestion 3). Proposal:
This seems the simplest solution, and also provides uniform names, without the confusion of shadowed names in sub-modules (I don't like the idea of having two types with the same name, distinguished only by import path).

Is everyone happy enough with this proposal?
Why do you think it is confusing? To me, the module path makes the distinction clear.
Because the normal way to use the name is to import it directly, after which the distinguishing module path is no longer visible at the use-site.
You are free to prefer one naming style over the other.
FWIW, I think there's relatively little value in adding additional distributions. I think the real value is in whether we can provide faster implementations of the "default" functionality like `gen_range`.

Which really comes down to having a faster, but "good enough", implementation of integer range sampling. Defining "good enough" is hard, but we've been able to do it for float sampling, so hopefully we can do it for integers too.

Additionally, having a "perfectly unbiased" distribution like today's `Uniform` still seems worthwhile for those that want it.
@sicking I am open to revisiting the question of bias when sampling small integer ranges, though I don't know exactly what "confidence" parameters we should have.

One of the issues though is allowing smart choices of bias/performance trade-offs. For example, allowing small bias in integer range sampling has a relatively small performance impact, vs allowing small bias in floating-point range sampling (IIUC). So does it make sense only to have a biased option for the integer case?
I'm also not sure what confidence interval to use. My thinking was to use 50%, as in "more likely than not that there's bias". I think the choice of load makes at least as big of a difference, i.e. the choice between "do nothing but generate random numbers on every core of a modern MacBook Pro" vs. "generate one random number every microsecond".

I was able to get about a 2x perf improvement on integer sampling by accepting a small bias for 8-bit, 16-bit and 32-bit integers. Smaller than the 6x difference between the current `Uniform` and the code in #531, but the same order of magnitude.
If we do anything at all, then yeah, I think that's what I would do. Though I don't know if I would put "unbiased integer sampling" in the same distribution as "maximum precision float sampling". I feel like bias and precision are related but not the same. But I'm not sure.
This problem has gone unanswered for a long time now. Since the above was written, many distributions have been moved to the `rand_distr` crate.

Related problems:
Possible solution: add a new module housing the high-precision implementations.
By now we have found "nice defaults" (solution 1) for most cases and spun off many implementations to the `rand_distr` crate.
My suggestion now is that we stick with "solution 1" (reasonable defaults) for the `rand` crate itself.
A lot of this discussion is based on float sampling precision.

Fast, potentially biased integer sampling is a distinct problem: potentially producing some values too often relative to others. It is behaviourally distinct: for small numbers of samples the bias is statistically non-detectable, while with sufficiently large numbers of samples the bias could be visible for pretty much any usage. For this I think only solution 1 or 2 is appropriate: if you have a reason to need an unbiased sampler, you likely need it across the whole application. Having an exact `Bernoulli` is a related question.

The multiple impls of high-precision float sampling could reasonably live outside the main crate.

I propose the following actions:
I agree that the high-precision variants could live in their own crate. I'm also not sure whether there are even use cases for them, and a common criticism of rand is too much complexity/flexibility, so it makes sense to split them off. (We should probably decrease our API surface instead of increasing it.)
I don't see an extra crate as an alternative to solution 2: the biased/unbiased question comes down to "is our simulation likely to have noticeable bias due to the algorithms used" (given the 2^48 threshold used, anything involving less than ~1 year of CPU time is unaffected). For the few cases where the answer is yes, all range sampling within the application should switch to unbiased algorithms, which is what a feature flag achieves.

In my opinion the only relevant question is between solution 1 (one set of algorithms for everything) or 2 (crate configuration), and I don't see a strong rationale either way.

(Also relevant: is 5-40% overhead on uniform sampling even an issue, even in things like games? Only if done a lot, as in the ridiculous micro-benchmarks I've been using.)
This is an issue that has been discussed in part several times already in other contexts. But I think it's now worth making an issue specifically about this to get extra feedback.
There are several cases within distribution algorithms where we've had to make decisions about speed-vs-accuracy trade-offs:

- `Standard` now uses 24/53 bits precision (Future-proofing 24/53-bit precision for f32/f64 generation #416); we have discussion around a high-precision variant (Implement HighPrecision01 distribution #372)
- `Uniform` (for ranges) uses a rejection zone to avoid bias; as noted here we can sometimes use faster approaches if we don't care about small amounts of bias
- `gen_bool` has at least 3 possible implementations with varying accuracy and speed (Implement Bernoulli distribution #411)
- `gen_ratio` is a nice addition to `gen_bool` with at least two different approaches (Add API for getting a bool with chance of exactly 1-in-10 or 2-in-3 #491)

So far, other than with the suggested addition of a `HighPrecision` distribution (#372) (and the other float sampling methods `Open01` and `OpenClosed01`), we have not given users any choice over this speed-vs-accuracy trade-off but have simply made the choice for users.
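For illustration, the two styles of `gen_ratio` implementation alluded to for #491 might look like this (toy code with an invented RNG, not rand's implementation):

```rust
/// Toy SplitMix64 step, standing in for a real RNG.
fn next_u64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E3779B97F4A7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// Exact: returns true with probability exactly n/d, via unbiased integer
/// range sampling (rejection loop, like `Uniform`'s zone check).
fn gen_ratio_exact(state: &mut u64, n: u64, d: u64) -> bool {
    let zone = u64::MAX - (u64::MAX - d + 1) % d;
    loop {
        let v = next_u64(state);
        if v <= zone {
            return v % d < n;
        }
    }
}

/// Approximate: rounds n/d to the nearest f64 first, so the probability is
/// only exact to 53 bits -- faster, but a `gen_bool`-style compromise.
fn gen_ratio_float(state: &mut u64, n: u64, d: u64) -> bool {
    let x = (next_u64(state) >> 11) as f64 * (1.0 / (1u64 << 53) as f64);
    x < n as f64 / d as f64
}
```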