This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Write about the chain cover a little. #13602

Merged
merged 7 commits into from
Aug 23, 2022
Changes from 2 commits
1 change: 1 addition & 0 deletions changelog.d/13602.doc
@@ -0,0 +1 @@
Improve the description of the ["chain cover index"](https://matrix-org.github.io/synapse/latest/auth_chain_difference_algorithm.html) used internally by Synapse.
40 changes: 33 additions & 7 deletions docs/auth_chain_difference_algorithm.md
@@ -34,13 +34,38 @@ the process of indexing it).
## Chain Cover Index

Synapse computes auth chain differences by pre-computing a "chain cover" index
for the auth chain in a room, allowing efficient reachability queries like "is
event A in the auth chain of event B". This is done by assigning every event a
*chain ID* and *sequence number* (e.g. `(5,3)`), and having a map of *links*
between chains (e.g. `(5,3) -> (2,4)`) such that A is reachable by B (i.e. `A`
for the auth chain in a room, allowing us to efficiently make reachability queries
like "is event `A` in the auth chain of event `B`?". We could do this with an index
that tracks all pairs `(A, B)` such that `A` is in the auth chain of `B`. However, this
would be prohibitively large, scaling poorly as the room accumulates more state
events.
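
To illustrate why a pairwise index grows so quickly, here is a minimal sketch. It is not Synapse code: the function name `naive_reachability_index` and the toy `auth_events` mapping (event ID to the set of its direct auth events) are assumptions made for illustration only.

```python
from typing import Dict, Set, Tuple


def naive_reachability_index(
    auth_events: Dict[str, Set[str]]
) -> Set[Tuple[str, str]]:
    """Return every pair (A, B) such that A is in the auth chain of B.

    `auth_events` is a hypothetical mapping from an event ID to the IDs of
    its direct auth events.
    """
    pairs: Set[Tuple[str, str]] = set()
    for event, direct in auth_events.items():
        seen: Set[str] = set()
        stack = list(direct)
        # Walk the auth graph depth-first, recording every ancestor of `event`.
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            pairs.add((ancestor, event))
            stack.extend(auth_events.get(ancestor, ()))
    return pairs


# A toy room whose auth graph is a single line: E0 <- E1 <- ... <- E99.
n = 100
linear = {f"E{i}": ({f"E{i - 1}"} if i > 0 else set()) for i in range(n)}
index = naive_reachability_index(linear)
print(len(index))  # n * (n - 1) / 2 pairs for a linear graph of n events
```

In this toy case the index holds `n * (n - 1) / 2` pairs, so doubling the number of events roughly quadruples the index.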

Instead, we break down the graph into *chains*. A chain is a subset of a DAG
with the following property: for any pair of events `E` and `F` in the chain,
the chain contains a path `E -> F` or a path `F -> E`. If we ensure that each
persisted event belongs to exactly one chain, we can answer reachability
queries by tracking how the chains are connected to one another. Doing so
uses less storage than tracking this on an event-by-event basis, particularly
when we have fewer and longer chains. See

> Jagadish, H. (1990). [A compression technique to materialize transitive closure](https://doi.org/10.1145/99935.99944).
> *ACM Transactions on Database Systems (TODS)*, *15*(4), 558-598.

for the original idea or

> Y. Chen, Y. Chen, [An efficient algorithm for answering graph
> reachability queries](https://doi.org/10.1109/ICDE.2008.4497498),
> in: 2008 IEEE 24th International Conference on Data Engineering, April 2008,
> pp. 893–902. (PDF available via [Google Scholar](https://scholar.google.com/scholar?q=Y.%20Chen,%20Y.%20Chen,%20An%20efficient%20algorithm%20for%20answering%20graph%20reachability%20queries,%20in:%202008%20IEEE%2024th%20International%20Conference%20on%20Data%20Engineering,%20April%202008,%20pp.%20893902.).)

for a more modern take.

In practical terms, the chain cover assigns every event a
*chain ID* and *sequence number* (e.g. `(5,3)`), and maintains a map of *links*
between chains (e.g. `(5,3) -> (2,4)`) such that `A` is reachable from `B` (i.e. `A`
is in the auth chain of `B`) if and only if either:

1. A and B have the same chain ID and `A`'s sequence number is less than `B`'s
1. `A` and `B` have the same chain ID and `A`'s sequence number is less than `B`'s
sequence number; or
2. there is a link `L` between `B`'s chain ID and `A`'s chain ID such that
`L.start_seq_no` <= `B.seq_no` and `A.seq_no` <= `L.end_seq_no`.
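
The two conditions above can be sketched as a small predicate. This is an illustrative sketch rather than Synapse's implementation: the `Position` and `Link` classes, the `links` dictionary keyed by `(origin chain ID, target chain ID)`, and the function name `is_in_auth_chain` are all assumptions. Note that sequence numbers are only meaningful within their own chain, which is why a link carries one sequence number for each end.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Position:
    """Hypothetical (chain ID, sequence number) pair assigned to an event."""
    chain_id: int
    seq_no: int


@dataclass(frozen=True)
class Link:
    """Hypothetical link between two chains."""
    start_seq_no: int  # sequence number within the origin chain
    end_seq_no: int    # sequence number within the target chain


# Links keyed by (origin chain ID, target chain ID).
Links = Dict[Tuple[int, int], List[Link]]


def is_in_auth_chain(a: Position, b: Position, links: Links) -> bool:
    """Return True if event A is in the auth chain of event B."""
    # Condition 1: same chain, and A comes strictly before B.
    if a.chain_id == b.chain_id:
        return a.seq_no < b.seq_no
    # Condition 2: some link from B's chain to A's chain starts at or before
    # B and ends at or after A.
    return any(
        link.start_seq_no <= b.seq_no and a.seq_no <= link.end_seq_no
        for link in links.get((b.chain_id, a.chain_id), [])
    )


# The link (5,3) -> (2,4) from the text.
links: Links = {(5, 2): [Link(start_seq_no=3, end_seq_no=4)]}
print(is_in_auth_chain(Position(2, 1), Position(5, 7), links))
print(is_in_auth_chain(Position(2, 5), Position(5, 7), links))
```

With the single link `(5,3) -> (2,4)`, the event at `(2,1)` is in the auth chain of the event at `(5,7)`, whereas the event at `(2,5)` is not, because it sits after the link's end point in chain 2.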
Contributor

This bit's still a little unclear (but I did manage to understand it in the end).

Up until point 2, I wasn't sure whether sequence numbers were namespaced to chains and it was unclear why links included a sequence number.

Not sure how to improve it though.

Contributor Author

I've had a go at this. WDYT?

Contributor

Makes sense! Turns out I was really confused about sequence numbers. Somehow I thought that 2. implied they were global.

@@ -49,8 +74,9 @@ There are actually two potential implementations, one where we store links from
each chain to every other reachable chain (the transitive closure of the links
graph), and one where we remove redundant links (the transitive reduction of the
links graph) e.g. if we have chains `C3 -> C2 -> C1` then the link `C3 -> C1`
would not be stored. Synapse uses the former implementations so that it doesn't
need to recurse to test reachability between chains.
would not be stored. Synapse uses the former implementation so that it doesn't
need to recurse to test reachability between chains. This uses extra storage
in order to save CPU cycles and DB queries.
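
As a rough sketch of the difference between the two options, the snippet below works at the level of whole chains only and ignores sequence numbers. That is a simplification made for illustration, and the helper name `transitive_closure` is hypothetical. It expands a reduced links graph into its transitive closure, after which chain-to-chain reachability is a single lookup with no recursion.

```python
from typing import Dict, Set


def transitive_closure(links: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each chain to every chain reachable from it via one or more links."""
    closure: Dict[str, Set[str]] = {}
    for chain, direct in links.items():
        reachable: Set[str] = set()
        stack = list(direct)
        # Follow links until no new chains are discovered.
        while stack:
            target = stack.pop()
            if target not in reachable:
                reachable.add(target)
                stack.extend(links.get(target, ()))
        closure[chain] = reachable
    return closure


# Transitive reduction: only C3 -> C2 and C2 -> C1 are stored.
reduced = {"C3": {"C2"}, "C2": {"C1"}, "C1": set()}

# Transitive closure: the link C3 -> C1 is stored as well.
closed = transitive_closure(reduced)

print("C1" in reduced["C3"])  # a direct lookup misses C1 in the reduced graph
print("C1" in closed["C3"])   # a direct lookup finds C1 in the closure
```

The closure stores one more link than the reduction here, and in exchange the reachability test never needs to walk through `C2`.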

### Example
