Empirical distribution refactor #245

YeungOnion · 2024-06-12T20:50:57Z

Was considering making a change to Empirical, albeit mostly out of interest (not because I've benched and find it lacking).

The current implementation of Empirical uses BTreeMap where entries have a float key and a observation count value.
I thought it might be cool to leverage the structure of the tree with the fact that the CDF would have the same ordering as the keys (from $S$), i.e.

$$ \begin{align} \textrm{ if } F(y) - F(x) \geq \, &0 \left\{ x,y \in S \right\} \\\ \textrm{ then } (y-x) \geq \,&0 \\\ \end{align} $$

INLINE_EDIT: after thinking that map is a function (even if not one I could write down better than a table of sorts), I realize that this is a Tree Map with a monotone function for the map. This is much more clearly stated and easier to google about.

By making an entry's value to instead be a cumulative count of observations in lesser child nodes + observation count in the entry, then

CDF computations become accumulation over Tree search. Cost of search, could work without stack from recursion.
inverse CDF computations become a search where the termination condition is with the mapped value, but traversal works with key ordering. Nearly cost of search (assuming cost value comparison is same as cost of key comparison and memory for value comparison negligible)
PMF computations is difference of CDF, if search can emit traversal to found entry, then this could easily be optimized to better than two CDF calcuations.
key insertion will get more expensive, will have to modify all values that are ≥key in a BTree Node for each increasing depth branch step, rotations and node separation are trickier. If anyone has some good resources for writing proofs about B-trees, I need the support of better notation before I can get to a proof.

See also: prefix sums, Fenwick Tree Wiki

Other things that came to mind that make more visible changes

precompute mode at insertion as {count: u64, values: Vec<T>}
- note that our Mode trait allows only single mode distributions, which is infeasible to guarantee for Empirical, perhaps we should switch some of the statistics traits to use associated types for their returns.
use ordered_float::NotNan

The text was updated successfully, but these errors were encountered:

YeungOnion · 2024-06-14T05:42:37Z

@dhardy would you happen to know of an implementation for prefix sums over a BTree? (I may also post this question to the rust lang user forum)

The motive beyond interest is in sampling from Empirical, the naive option to sample uniform through inverse cdf would be very expensive since inverse cdf is done via bisection and cdf requires visiting ≈1/2 the tree's nodes. (Improvement could certainly be made to memoize within a single call to inverse cdf).

Fenwick tree is great if I'd have a way to keep a positional index for data in the tree, but for the cost to update the prefix sum, $\mathcal O (\log N)$, I could also search for the key in the BTree and do the updates along larger values as nodes are traversed, $\mathcal O (k \log N)$, i.e. comparable. Would you be able to guess if this is a case where an implementation likely won't make those gains?

dhardy · 2024-06-14T09:28:07Z

@YeungOnion you can just iterate over a BTree. Efficient key-order iteration is one of its key properties.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Empirical distribution refactor #245

Empirical distribution refactor #245

YeungOnion commented Jun 12, 2024 •

edited

Loading

YeungOnion commented Jun 14, 2024

dhardy commented Jun 14, 2024

Empirical distribution refactor #245

Empirical distribution refactor #245

Comments

YeungOnion commented Jun 12, 2024 • edited Loading

YeungOnion commented Jun 14, 2024

dhardy commented Jun 14, 2024

YeungOnion commented Jun 12, 2024 •

edited

Loading