Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider using succinct data structures for read-only trees #107

Open
matklad opened this issue Jul 3, 2021 · 2 comments
Open

Consider using succinct data structures for read-only trees #107

matklad opened this issue Jul 3, 2021 · 2 comments

Comments

@matklad
Copy link
Member

matklad commented Jul 3, 2021

I've recently been reading on succinct data structures literature, and I am wondering if they might be applicable to rowan. The tagline of SD is to represent data using approximately as few bits as entropy allows, but with keeping all the operations fast.

In particular, for trees, we generally use 2 * n * sizeof<usize> bytes (parent/children pointers). SD allows using roughly 2n bits, while still allowing for efficient parent/child queries. This works due to the following tricks:

  • bitvectors can be represented very efficiently (you can pack 64 bits into a single word)
  • working with bitmasks is fast (popcount + precomputed tables)
  • by spending a few extra bytes, it's possible to augment bitvector with fast prefix-sum datastructures (rank & select)
  • it's possible to encode a tree as a bitvec, using couple of bits per node, where parent/child relations are expressible via rank/select

It would be interesting to see if it makes sense to use something like this for rowan.

Some reasons to do this:

  • using fewer bits for storing tree topology means that larger trees will fit into CPU caches, potentially improving performance
  • trees are heavy-weight, reducing RAM seems like a good idea
  • algorithmic complexity of operations becomes independent of the topology of the tree. Ie, the tree can be arbitrary deep or arbitrary wide

Some reasons not to do this:

  • performance will probably be worse
  • works only for read-only trees
  • high complexity of the implementation
@CAD97
Copy link
Collaborator

CAD97 commented Aug 15, 2021

Rowan is such nerd catnip for me.

https://doi.org/10.1145/2601073

Fully Functional Static and Dynamic Succinct Trees

For the dynamic case, where insertion/deletion (indels) of nodes is allowed, the existing data structures support a very limited set of operations. Our data structure builds on the range min-max tree to achieve 2n + O(n/log n) bits of space and O(log n) time for all operations supported in the static scenario, plus indels. We also propose an improved data structure using 2n + O(nlog log n/log n) bits and improving the time to the optimal O(log n/log log n) for most operations. We extend our support to forests, where whole subtrees can be attached to or detached from others, in time O(log1+ϵ n) for any ϵ > 0. Such operations had not been considered before.

The forest operations seem especially relevant to rust-analyzer/rowan use cases, since the main mutation use case is in grafting subtrees. It's definitely worth looking into... and I'm going to think about this until I experiment some with it... I sense a sorbus 0.2 coming some time in the future 😅

An interesting constraint that Rowan has over pure academic trees is the actual data storage on each node (in a side table for succinct data structures, typically, AIUI). Both interior nodes (for length offsets of children, for sublinear position search) and terminal nodes (for the actual text) want to themselves be dynamically sized, so size analysis is obviously more complicated, along with node deduplication.

@CAD97
Copy link
Collaborator

CAD97 commented Aug 15, 2021

One property of Rowan that I'm fairly sure sure succinct trees would lose is the independence of nodes. Currently, if you have GreenNode, you keep just that node and its children alive. With a succinct tree, however, I'm fairly certain that the tree is one unit w.r.t. ownership. (Though I need to dig into how exactly the forest operations work.)

Is this restriction problematic for what rust-analyzer wants? Node payloads would still be able to be cached and deduplicated separately from the trees themselves.

If we're okay giving up partial sub-ownership of a green tree, we can reduce memory usage even without a succinct tree by using indices into an arena rather than pointers (as well as unlocking parent pointers in the green tree, if we want those).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants