Intention to remove repository history #1410
Replies: 3 comments 2 replies
-
@bnewbold could you elaborate on how those questions were answered? Specifically, we're trying to do a downstream migration from repov2 to repov3 and are running into some issues here, and would love to understand what the migration process looks like.
-
Calling the old https://bsky.social/xrpc/com.atproto.sync.listRepos endpoint returns exactly 1 entry and then stops. The new (supported or unsupported??) https://bsky.network/xrpc/com.atproto.sync.listRepos endpoint replies without CORS headers, so it is impossible to use from a browser context. The new bsky.network endpoint has two more bugs, by the way, but those are possible to work around.
-
Sounds reasonable, thanks @DavidBuchanan314! @bnewbold mentioned that bsky.network may have a few other edge-case bugs apart from the two I've reported, and that was my experience as well. Looks like for now bsky.social is more stable across the API set.
-
We don't have all the details of our proposed protocol change worked out, but we want to share the big picture early so that other developers have a heads-up about the protocol changes.
Motivation
A problematic area of the atproto repository data structure and repo event stream has been handling deletions of records. The current default behavior is that repositories contain a versioned history of changes via commit objects which reference the previous commit version via a content-addressed hash (CID Link). The current solution to full purges of deleted records has been to "rebase" the repository by creating a new commit which does not have a `prev` link to an old version. The problem with this is that rebases are "expensive" for all the downstream services subscribing to a repository. This results in rebases not happening frequently or automatically, which results in deleted content being available publicly via specially crafted API calls, which breaks human intents and expectations.
Another motivation is to make it "cheaper" to host repositories by not holding on to full history, particularly for intermediate MST nodes which are no longer part of the current MST tree.
Repository logical clocks
At a high level, we are planning to replace the `prev` pointer in commits with a `clock` value. The clock will act as an ever-increasing logical clock for the account repo. It is intentionally not a strong reference, and there may be arbitrary gaps between clock values. The clock value would be a signed part of the commit object, and there would be no "previous clock" values referenced in the repo structure itself.
The repository structure would change from a full-history Merkle DAG (where a commit references back to all previous versions, up to the most recent rebase/purge) to a snapshot-at-a-time model. The repo sync API would not provide a mechanism to fetch specific historical versions of the entire repo, though it would be possible to "catch up" from an older state to the current state of the repo most of the time.
There would no longer be a public, enumerable commit history.
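To make the clock idea concrete, here is a minimal sketch of a commit that carries a clock instead of a `prev` link, plus the ordering check a downstream service might apply. Everything in it is an assumption for illustration: the clock syntax is still an open question (an integer is used here), and the `Commit` fields and the `accepts` helper are hypothetical names, not the final commit schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Hypothetical commit shape: a clock replaces the `prev` link."""
    did: str          # account the repo belongs to
    data: str         # CID of the current MST root (a plain string here)
    clock: int        # assumed integer logical clock; real syntax undecided
    version: int = 3  # assumed format version bump from 2 to 3


def accepts(current: Commit, incoming: Commit) -> bool:
    """Accept a commit only if it is for the same repo and its clock is newer.

    Gaps between clock values are fine: the clock only has to increase,
    it does not have to increase by one.
    """
    return incoming.did == current.did and incoming.clock > current.clock


head = Commit(did="did:example:alice", data="cid-root-1", clock=5)
newer = Commit(did="did:example:alice", data="cid-root-2", clock=9)
print(accepts(head, newer))  # the jump from 5 to 9 is allowed
```

Note that nothing in `incoming` points back at `head`; the only relationship between the two snapshots is the ordering of their clock values.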
Diffs and Event Stream
The "diff" concept, which includes a list of "ops" and all of the record blocks and MST blocks needed to update an old version of the repo to the current version, would become "coarse", meaning that in theory a diff could span multiple intermediate commits (versions) of the repository. Diffs would be allowed to include extra blocks, which could help with server-side implementation efficiency in some cases.
The repo event stream would be updated to be a stream of such "diffs", with each diff indicating the older clock version it starts from, plus all the blocks and ops needed to construct the current commit of the repo. The event stream, whether it comes from a PDS or a BGS, would be continuous, meaning that a subscriber will always have the complete current version of a repo.
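As a rough illustration of how a subscriber might consume coarse diffs, the sketch below keeps a local mirror of one repo and decides, per event, whether to apply it, skip it, or fall back to a catch-up. The field names (`since`, `clock`, `ops`), the integer clocks, and the simplification of ops to idempotent "set or delete" operations are all assumptions made for this example; the actual event format has not been specified.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DiffEvent:
    since: int   # hypothetical: clock of the older version the diff starts from
    clock: int   # hypothetical: clock of the commit the diff produces
    # Record path -> new record CID, or None to delete the record.
    ops: Dict[str, Optional[str]]


@dataclass
class RepoMirror:
    clock: int = 0
    records: Dict[str, str] = field(default_factory=dict)

    def apply(self, event: DiffEvent) -> str:
        if event.clock <= self.clock:
            return "skipped"  # we already have this version or a newer one
        if event.since > self.clock:
            return "gap"      # diff starts after our state: needs catch-up
        # The diff starts at or before our state, so it covers every change
        # we are missing. Ops are assumed idempotent, so re-applying changes
        # we already have is harmless.
        for path, cid in event.ops.items():
            if cid is None:
                self.records.pop(path, None)
            else:
                self.records[path] = cid
        self.clock = event.clock
        return "applied"


mirror = RepoMirror()
print(mirror.apply(DiffEvent(since=0, clock=3, ops={"app.bsky.feed.post/a": "cid1"})))
print(mirror.apply(DiffEvent(since=5, clock=6, ops={})))
```

In this sketch, a diff spanning several intermediate commits is handled the same way as a single-commit diff, which is the point of allowing diffs to be "coarse".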
Detecting Stale Reads
A common problem with distributed systems with eventual consistency is knowing whether a request or query contains recently-written results. This is most obviously important for a client to know whether they are "reading their own writes". For example, if you follow somebody (a write to your PDS), and then fetch your own author feed (from an AppView), you don't know if the follow relationship has been processed and indexed by the AppView yet.
Repo clocks can help with this if they are returned as metadata with most requests. An end client will know their own repo's current clock value (if returned by the PDS after a repo mutation request), and can compare that with the AppView's current clock value for the requester's repo (returned as metadata in all logged-in/authenticated responses). This lets clients know if they are reading "stale" data or not.
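The comparison described above could look roughly like the following on the client side. This is only a sketch: the header name `atproto-repo-clock` is made up for the example (how the clock is returned is still an open question), clocks are assumed to be integers, and `read_clock` and `is_stale` are hypothetical helpers.

```python
from typing import Mapping, Optional

# Hypothetical header name; the real mechanism is not decided yet.
CLOCK_HEADER = "atproto-repo-clock"


def read_clock(headers: Mapping[str, str]) -> Optional[int]:
    """Return the repo clock from response headers, or None if absent."""
    for name, value in headers.items():
        if name.lower() == CLOCK_HEADER:  # header names are case-insensitive
            return int(value)
    return None


def is_stale(last_write_clock: int, appview_headers: Mapping[str, str]) -> bool:
    """True if the AppView has not yet indexed our most recent write.

    A missing clock is treated as stale, since freshness cannot be confirmed.
    """
    seen = read_clock(appview_headers)
    return seen is None or seen < last_write_clock


# The PDS reported clock 10 after our follow; the AppView is still at 9.
print(is_stale(10, {"Atproto-Repo-Clock": "9"}))
```

A client that gets a stale result could retry after a short delay, or merge its own pending write into the displayed data until the AppView catches up.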
Open Questions
What is the clock syntax, and semantics? still discussing logical clocks vs. (bounded) timestamps, and integer encodings vs string encodings
Will `prev` stick around as an optional feature? might be useful in some cases (eg, official accounts which leave a public record), but the extra protocol and implementation complexity may not be worth it
Repository version increment? this would probably involve an increment of the repo format version from 2 to 3, especially if the clock value is required
How and when to return current repo clock, to detect stale reads? probably as an HTTP header in API responses
What is the "catch up" API? not decided if we will support catch-up from an older clock value, or from an older commit version (CID), or using a generic MST synchronization scheme (eg, comparing top-level nodes in the tree)
Will this break firehose consumers? we hope this won't be too big of a change for most folks subscribing to the repo event stream (firehose), if they aren't working with history. if it isn't a seamless transition, we'll try to give a bit of time after the exact details are specified before we cut over the Bluesky prod firehose
Mechanism for in-repo record history? it might be helpful to (optionally) store old versions of records in the repo, along with the current version, for cases where a record was updated. for example, post edit history