Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empirical distribution refactor #245

Open
YeungOnion opened this issue Jun 12, 2024 · 2 comments
Open

Empirical distribution refactor #245

YeungOnion opened this issue Jun 12, 2024 · 2 comments

Comments

@YeungOnion
Copy link
Contributor

YeungOnion commented Jun 12, 2024

Was considering making a change to Empirical, albeit mostly out of interest (not because I've benched and find it lacking).

The current implementation of Empirical uses BTreeMap where entries have a float key and a observation count value.
I thought it might be cool to leverage the structure of the tree with the fact that the CDF would have the same ordering as the keys (from $S$), i.e.

$$ \begin{align} \textrm{ if } F(y) - F(x) \geq \, &0 \left\{ x,y \in S \right\} \\\ \textrm{ then } (y-x) \geq \,&0 \\\ \end{align} $$

INLINE_EDIT: after thinking that map is a function (even if not one I could write down better than a table of sorts), I realize that this is a Tree Map with a monotone function for the map. This is much more clearly stated and easier to google about.

By making an entry's value to instead be a cumulative count of observations in lesser child nodes + observation count in the entry, then

  • CDF computations become accumulation over Tree search. Cost of search, could work without stack from recursion.
  • inverse CDF computations become a search where the termination condition is with the mapped value, but traversal works with key ordering. Nearly cost of search (assuming cost value comparison is same as cost of key comparison and memory for value comparison negligible)
  • PMF computations is difference of CDF, if search can emit traversal to found entry, then this could easily be optimized to better than two CDF calcuations.
  • key insertion will get more expensive, will have to modify all values that are ≥key in a BTree Node for each increasing depth branch step, rotations and node separation are trickier. If anyone has some good resources for writing proofs about B-trees, I need the support of better notation before I can get to a proof.

See also: prefix sums, Fenwick Tree Wiki

Other things that came to mind that make more visible changes

  • precompute mode at insertion as {count: u64, values: Vec<T>}
    • note that our Mode trait allows only single mode distributions, which is infeasible to guarantee for Empirical, perhaps we should switch some of the statistics traits to use associated types for their returns.
  • use ordered_float::NotNan
@YeungOnion
Copy link
Contributor Author

@dhardy would you happen to know of an implementation for prefix sums over a BTree? (I may also post this question to the rust lang user forum)

The motive beyond interest is in sampling from Empirical, the naive option to sample uniform through inverse cdf would be very expensive since inverse cdf is done via bisection and cdf requires visiting ≈1/2 the tree's nodes. (Improvement could certainly be made to memoize within a single call to inverse cdf).

Fenwick tree is great if I'd have a way to keep a positional index for data in the tree, but for the cost to update the prefix sum, $\mathcal O (\log N)$, I could also search for the key in the BTree and do the updates along larger values as nodes are traversed, $\mathcal O (k \log N)$, i.e. comparable. Would you be able to guess if this is a case where an implementation likely won't make those gains?

@dhardy
Copy link

dhardy commented Jun 14, 2024

@YeungOnion you can just iterate over a BTree. Efficient key-order iteration is one of its key properties.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants