Comparing distributions #560

Closed
peterhj opened this issue Jul 18, 2018 · 8 comments
@peterhj commented Jul 18, 2018

Given an arbitrary distribution w/ parameters, e.g. Uniform or Normal, we might want to compare its parameters to those of another distribution of the same type. Depending on how we come across these structs, we might not easily have access to the underlying distribution parameters. I've thought of a few considerations:

  1. Is it worth implementing some way of comparing the distributions in rand?

  2. How should comparison be implemented? Float types generally only implement PartialEq, so it doesn't necessarily make sense for Distribution to require Eq, only PartialEq. Alternatively, one could introduce a new trait with a method fn dist_eq(&self, other: &Self) -> bool to be implemented by comparable distributions. More simply still, one could expose the distribution parameters on a per-distribution basis, e.g. add Normal::mean and Normal::std_dev functions, and leave comparison to the user.

  3. Related to (2), it might be possible that distributions with the same "public" parameters may be implemented with different "private" parameters that determine how samples are drawn from an RNG. Would this complicate the notion of comparing distributions?
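The dist_eq idea in (2) could be sketched roughly like this (the DistEq trait and the stand-in Normal struct here are hypothetical, not part of rand; they just illustrate the shape of the API):

```rust
/// Hypothetical trait from option (2): types opt in to parameter comparison.
/// Like PartialEq, but explicit that it compares the *public* parameters,
/// not any internal sampling state.
trait DistEq {
    fn dist_eq(&self, other: &Self) -> bool;
}

/// Stand-in distribution that keeps its public parameters around.
struct Normal {
    mean: f64,
    std_dev: f64,
}

impl DistEq for Normal {
    fn dist_eq(&self, other: &Self) -> bool {
        // Bitwise-exact comparison of the input parameters; NaN parameters
        // compare unequal, matching PartialEq semantics on f64.
        self.mean == other.mean && self.std_dev == other.std_dev
    }
}
```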

Thoughts?

@dhardy (Member) commented Jul 18, 2018

Well,

  1. What's the use-case? I can see some use to exposing the CDF (as a function), but it's kind of orthogonal to sampling. We could just derive PartialEq, but I don't see a lot of need for it (though it's easy to add I think).
  2. mean and stddev are not quite so simple; e.g. the mean of a log-normal is usually not the input parameter, though it isn't hard to calculate; some distributions (e.g. Cauchy) don't have a mean at all. If instead you are talking about exposing the parameters used to parameterise the distribution, I'm not really keen: the internal parameters may be derived from the input, and it's possible for internal implementations to change.
  3. Are you talking about multiple different implementations of the same distribution? Yes, that's possible, but I don't see why we would care (LogNormal1(μ, σ) and LogNormal2(μ, σ) might be equivalent distributions but generate different output from the same random values; why even try to compare them?)
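The log-normal point in (2) can be made concrete: for LogNormal(μ, σ) parameterised by the underlying normal's mean and std-dev, the distribution mean is exp(μ + σ²/2), not μ itself, so a generic mean() accessor is more than just echoing back an input parameter. A minimal sketch (the helper name is hypothetical):

```rust
/// Mean of LogNormal(mu, sigma), where mu and sigma are the mean and
/// std-dev of the underlying normal: exp(mu + sigma^2 / 2).
fn log_normal_mean(mu: f64, sigma: f64) -> f64 {
    (mu + sigma * sigma / 2.0).exp()
}
```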

Not really sure what your use-case is.

Also bear in mind that distributions could change significantly #290

@peterhj (Author) commented Jul 18, 2018

  1. The particular use case I bumped into is recording and replaying a stream of random numbers for bitwise reproducibility/determinism, e.g. when comparing random numbers generated from two similar but different applications in possibly different languages (I'm actually comparing something written in numpy to the equivalent using rust libs).

    Because the "recording" application might not easily be able to save the actual random bits from the RNG, it may instead need to resort to recording "Normal(0.0, 0.1)" (i.e. both the name of the distribution and its params) followed by a stream of f64 data. Then, when the "replaying" application presumably calls Normal::new(0.0, 0.1), it would be wise to dynamically check that the mean and std are the same as the params recorded in the stream.

    If distributions are comparable in some way, it's possible to do this with minimal changes to either the recording application or the replaying application. A Rust implementation of this idea is at https://github.com/peterhj/rngtape where ReplayTapeRng is mostly a drop-in replacement for any other Rng, except right now it's kind of limited in what kind of data it can read from the recorded random number stream.

  2. I was going for the second idea about parameters in general, not mean and stddev specifically. I do think that the input params containing the user's "intent" may be valuable to keep around, but ultimately the distribution that is sampled is based on the internal params which I agree may be better not to expose. Also the input params might not be unique anyway.

  3. answered by (2), I think -- sounds better to ignore the effect of private/internal distribution params.
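The runtime check in (1) could be sketched as follows (this is not rngtape's actual API; the TapeRecord struct and check_record helper are hypothetical illustrations of the idea):

```rust
/// Hypothetical tape record: distribution name plus the parameters the
/// recording application used, followed by the drawn values themselves.
struct TapeRecord {
    dist_name: String,
    params: Vec<f64>,
    values: Vec<f64>,
}

/// Runtime check on the "replaying" side: before handing back recorded
/// values, verify the caller's distribution matches what was recorded.
fn check_record(rec: &TapeRecord, dist_name: &str, params: &[f64]) -> Result<(), String> {
    if rec.dist_name != dist_name || rec.params.as_slice() != params {
        return Err(format!(
            "tape mismatch: recorded {}{:?}, caller asked for {}{:?}",
            rec.dist_name, rec.params, dist_name, params
        ));
    }
    Ok(())
}
```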

Also re: #290, I noticed that distributions in stat-rs do implement PartialEq, however it doesn't seem like there was discussion there about comparing distributions, which is what we're doing here :).

@pitdicker (Contributor) commented

> 1. The particular use case I bumped into is recording and replaying a stream of random numbers for bitwise reproducibility/determinism, e.g. when comparing random numbers generated from two similar but different applications in possibly different languages (I'm actually comparing something written in numpy to the equivalent using rust libs).

This is a nice idea in theory, but I don't see it as something that is going to work. You would need to have exactly the same RNG algorithm, with the same seed (and initialization), with the same algorithms to convert to f64 and sample with the given distribution. Rand is too different from numpy.

As you and @dhardy say, not all input values are kept around; they get converted to internal parameters. Keeping these around is something not many users are going to need. And because of the fun that floating point is, it is also not always possible to exactly recalculate the input values from the internal parameters.

Implementing PartialEq should not be difficult, but I also have some trouble seeing how/when you would use it.

@peterhj (Author) commented Jul 19, 2018

> This is a nice idea in theory, but I don't see it as something that is going to work. You would need to have exactly the same RNG algorithm, with the same seed (and initialization), with the same algorithms to convert to f64 and sample with the given distribution. Rand is too different from numpy.

I don't think I explained clearly the point of what I'm trying to do. Suppose you have 2 programs, one program is in numpy, the other program is in rust using rand. Yes, both RNGs and distributions in rand and numpy are quite different. But, both programs are supposed to be "similar" in that the random number sequence they each draw are nominally "equivalent": for example, this could be encoded as ["draw 5 f64's from Normal(mean=0.0, std=0.1)", "draw 10 i64's from Uniform(low=0, high=50)", ...]. At a high level, both programs also do analogous computations with those random numbers.

Now, you first run the numpy program using its Mersenne twister RNG and its distributions, and this run generates a "tape" or "trace" of the actual np.float64's (or other types) that were generated, along with a description of the distributions they were drawn from. Then this tape or trace is dumped to a file, which can be loaded by a special Rng in rust to "replay" the exact sequence of random numbers through calling rng.sample(dist). Comparable distributions are useful at this point in order to check, at runtime, that the dist being sampled from in rust is nominally the same as the distribution encoded in the trace/tape file and loaded by the special Rng, without having to modify the distributions in the rust program.

The key point is that we do not want to re-implement the numpy RNG/distributions in rust, we just want to replay (in rust) the recorded sequence of random numbers (from numpy) for the purpose of reproducibility, particularly to facilitate the debugging and testing of numerical code by eliminating a source of non-determinism at the level of bits.
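The shared "nominally equivalent" draw sequence described above could be encoded as a small sum type; a hedged sketch (the Draw enum and describe helper are hypothetical, just to make the encoding concrete):

```rust
/// Hypothetical encoding of one entry in the shared draw sequence,
/// e.g. "draw 5 f64's from Normal(mean=0.0, std=0.1)".
enum Draw {
    NormalF64 { mean: f64, std_dev: f64, count: usize },
    UniformI64 { low: i64, high: i64, count: usize },
}

/// Render an entry in the human-readable form used in the description above.
fn describe(d: &Draw) -> String {
    match d {
        Draw::NormalF64 { mean, std_dev, count } => {
            format!("draw {} f64's from Normal(mean={}, std={})", count, mean, std_dev)
        }
        Draw::UniformI64 { low, high, count } => {
            format!("draw {} i64's from Uniform(low={}, high={})", count, low, high)
        }
    }
}
```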

> As you and @dhardy say, not all input values are kept around; they get converted to internal parameters. Keeping these around is something not many users are going to need. And because of the fun that floating point is, it is also not always possible to exactly recalculate the input values from the internal parameters.

Yes, this is my takeaway as well. It sounds like comparing distributions may be better served by special types like AbstractNormal { pub mean: f64, pub stddev: f64 }, which by default get translated behind the scenes to Normal, except for a specialized impl with the "replaying RNG" mentioned above (not entirely sure how this will work, though).

@dhardy (Member) commented Jul 23, 2018

It would be easy to make an AbstractNormal wrapper type which creates a Normal internally and re-implements sample around that. You can do this in your own code of course.
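Such a wrapper might look roughly like this. To keep the sketch self-contained it defines minimal stand-ins for rand's Rng and Distribution traits and uses a placeholder transform instead of a real normal sampler; the real thing would implement rand's actual traits:

```rust
/// Minimal stand-ins for rand's traits, so this sketch compiles on its own.
trait Rng {
    fn next_f64(&mut self) -> f64;
}
trait Distribution<T> {
    fn sample<R: Rng>(&self, rng: &mut R) -> T;
}

/// Inner distribution; in rand this may precompute internal parameters.
struct Normal {
    mean: f64,
    std_dev: f64,
}
impl Distribution<f64> for Normal {
    fn sample<R: Rng>(&self, rng: &mut R) -> f64 {
        // Placeholder affine transform, NOT a real normal sampler.
        self.mean + self.std_dev * rng.next_f64()
    }
}

/// Wrapper that keeps the user-facing parameters around for comparison
/// and delegates sampling to an internally-constructed Normal.
#[derive(PartialEq, Debug)]
struct AbstractNormal {
    pub mean: f64,
    pub std_dev: f64,
}
impl Distribution<f64> for AbstractNormal {
    fn sample<R: Rng>(&self, rng: &mut R) -> f64 {
        let inner = Normal { mean: self.mean, std_dev: self.std_dev };
        inner.sample(rng)
    }
}

/// Deterministic Rng stub for exercising the delegation.
struct FixedRng(f64);
impl Rng for FixedRng {
    fn next_f64(&mut self) -> f64 {
        self.0
    }
}
```

With this shape, AbstractNormal derives PartialEq on the public parameters for free, while the internal Normal stays free to change its representation.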

Otherwise, in theory it is possible to convert whatever internal parameters are used to a mean / mode / stddev / ... (excepting that not all measures make sense on all distributions). However, we should expect some accuracy to be lost (thus exact equality not being useful).
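Since such round-trips (input → internal → derived measure) can lose a few ULPs, any comparison of derived measures would want a tolerance rather than exact equality; a minimal sketch of a hypothetical helper:

```rust
/// Relative-tolerance comparison, with an absolute floor of 1.0 on the
/// scale so values near zero are compared absolutely.
fn approx_eq(a: f64, b: f64, rel_tol: f64) -> bool {
    let scale = a.abs().max(b.abs()).max(1.0);
    (a - b).abs() <= rel_tol * scale
}
```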

@vks (Collaborator) commented Jul 23, 2018

It seems to me like you want to compare samples from arbitrary distributions and determine how likely it is that they are from the same distribution. Is that correct? In this case you could use something like the Kolmogorov–Smirnov test. It might make sense to have that in Rand, although we currently only implement sampling (also see #290).
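For reference, the two-sample Kolmogorov–Smirnov statistic is just the maximum gap between the two empirical CDFs; a minimal sketch (computing only the statistic, not a p-value, which would additionally need the KS distribution):

```rust
/// Two-sample KS statistic: max |F_a(x) - F_b(x)| over the empirical CDFs.
/// Sorts both samples in place, then merges them, advancing past ties in
/// both samples together so equal samples yield a statistic of 0.
fn ks_statistic(a: &mut [f64], b: &mut [f64]) -> f64 {
    a.sort_by(|x, y| x.partial_cmp(y).unwrap());
    b.sort_by(|x, y| x.partial_cmp(y).unwrap());
    let (na, nb) = (a.len() as f64, b.len() as f64);
    let (mut i, mut j) = (0usize, 0usize);
    let mut stat = 0.0f64;
    while i < a.len() && j < b.len() {
        if a[i] < b[j] {
            i += 1;
        } else if a[i] > b[j] {
            j += 1;
        } else {
            // Tie: consume the equal value from both samples.
            let v = a[i];
            while i < a.len() && a[i] == v { i += 1; }
            while j < b.len() && b[j] == v { j += 1; }
        }
        let gap = (i as f64 / na - j as f64 / nb).abs();
        if gap > stat { stat = gap; }
    }
    stat
}
```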

@peterhj (Author) commented Jul 23, 2018

> It would be easy to make an AbstractNormal wrapper type which creates a Normal internally and re-implements sample around that. You can do this in your own code of course.

Yes this is how I've ended up doing it now. Also in light of #290 and related issues, it seems the state of distributions in rand may be a bit in flux, so in my own code I'll likely mess around with some slightly alternative distribution sampling APIs, including things like AbstractNormal, and if it works out will report back.

> It seems to me like you want to compare samples from arbitrary distributions and determine how likely it is that they are from the same distribution. Is that correct?

Not quite. I'm interested in (1) type-wise comparison of distributions (e.g. is it a Normal or a Uniform or something else?), (2) bit-wise comparison of distributional "parameters," when they are well defined (e.g. the mean and std_dev of an AbstractNormal above), and (3) bit-wise comparison of the stream of generated random numbers. The goal is trying to guarantee bit-wise reproducibility at run-time.

@dhardy (Member) commented Jul 31, 2018

It seems like you have a solution to your problem, and we still have a lot of work to do with distributions (contributions welcome in #290!), so I'm going to close this now.
