Intention to remove repository history #1410

bnewbold (Maintainer) · 2023-07-31T22:52:24Z

We don't have all the details of our proposed protocol change worked out, but we want to share the big picture early so that other developers have a heads-up about upcoming protocol changes.

Motivation

Handling record deletions has been a problematic area of the atproto repository data structure and repo event stream. By default, repositories contain a versioned history of changes: each commit object references the previous commit via a content-addressed hash (CID Link). To fully purge deleted records, the current solution is to "rebase" the repository by creating a new commit which has no prev link to an older version. The problem is that rebases are "expensive" for all the downstream services subscribing to a repository. As a result, rebases do not happen frequently or automatically, so deleted content remains publicly available via specially crafted API calls, which violates users' intent and expectations.

Another motivation is to make it "cheaper" to host repositories by not holding on to full history, particularly intermediate MST nodes which are no longer part of the current MST.

Repository logical clocks

At a high level, we are planning to replace the prev pointer in commits with a clock value. The clock will act as an ever-increasing logical clock for the account repo. It is intentionally not a strong reference, and there may be arbitrary gaps between clock values. The clock value would be a signed part of the commit object, and there would be no "previous clock" values referenced in the repo structure itself.
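A minimal sketch of what "ever-increasing, gaps allowed, no back-reference" could look like for a consumer. The integer clock encoding, the `Commit` field names, and the `latest` helper are all assumptions for illustration; the clock syntax is still undecided at this point in the proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """Hypothetical commit shape: no prev pointer, only a clock value."""
    did: str    # account identifier
    data: str   # CID of the MST root (a placeholder string in this sketch)
    clock: int  # assumed integer encoding of the logical clock


def latest(current: Optional[Commit], candidate: Commit) -> Commit:
    """Keep whichever commit has the higher clock.

    Only ordering matters: the candidate does not have to be exactly
    "current + 1", so arbitrary gaps between clock values are fine.
    """
    if current is None or candidate.clock > current.clock:
        return candidate
    return current


# Commits may arrive with gaps or out of order; the highest clock wins.
head: Optional[Commit] = None
for c in (
    Commit("did:example:alice", "cid-a", 3),
    Commit("did:example:alice", "cid-b", 10),
    Commit("did:example:alice", "cid-old", 7),
):
    head = latest(head, c)
```

Because nothing in the commit points backwards, a host can drop every older commit as soon as a newer one is accepted.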

The repository structure would change from a full-history Merkle DAG (where a commit references back to all previous versions, up to the most recent rebase/purge) to a snapshot-at-a-time model. The repo sync API would not provide a mechanism to fetch specific historical versions of the entire repo, though most of the time it would be possible to "catch up" from an older state to the current state of the repo.

There would no longer be a public, enumerable commit history.

Diffs and Event Stream

The "diff" concept, which includes a list of "ops" and all of the record blocks and MST blocks needed to update an old version of the repo to the current version, would become "coarse", meaning that in theory a diff could span multiple intermediate commits (versions) of the repository. Diffs would be allowed to include extra blocks, which could help with server-side implementation efficiency in some cases.

The repo event stream would become a stream of such "diffs", each indicating the older clock version it builds on, plus all the blocks and ops needed to construct the current commit of the repo. The event stream, whether it comes from a PDS or a BGS, would be continuous, meaning that a subscriber will always have the complete current version of a repo.
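A rough sketch of how a subscriber might apply such coarse diffs, under heavy assumptions: integer clocks, a flat path-to-CID map standing in for the MST, and the `Op`, `Diff`, `RepoMirror`, and `GapError` names are all invented for this example. It also assumes the ops in a diff are the net changes since the indicated older clock, so re-applying them on top of a slightly newer state is harmless.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class GapError(Exception):
    """The diff builds on a state newer than ours; a catch-up is needed."""


@dataclass
class Op:
    action: str                # "create", "update", or "delete"
    path: str                  # e.g. "app.bsky.feed.post/abc"
    cid: Optional[str] = None  # record CID for create/update


@dataclass
class Diff:
    since: Optional[int]       # older clock this diff builds on (None = empty repo)
    clock: int                 # clock of the resulting commit
    ops: List[Op]
    blocks: Dict[str, bytes]   # may contain extra, unreferenced blocks


@dataclass
class RepoMirror:
    clock: Optional[int] = None
    records: Dict[str, str] = field(default_factory=dict)   # path -> cid
    blocks: Dict[str, bytes] = field(default_factory=dict)  # cid -> bytes

    def apply(self, diff: Diff) -> bool:
        """Apply a diff; return False if it is old or a duplicate."""
        if self.clock is not None and diff.clock <= self.clock:
            return False
        if diff.since is None:
            # Diff is relative to an empty repo: start over.
            self.records.clear()
            self.blocks.clear()
        elif self.clock is None or diff.since > self.clock:
            raise GapError(f"have {self.clock}, diff starts at {diff.since}")
        for op in diff.ops:
            old_cid = self.records.pop(op.path, None)
            if old_cid is not None:
                self.blocks.pop(old_cid, None)  # purge the superseded block
            if op.action in ("create", "update") and op.cid is not None:
                self.records[op.path] = op.cid
                if op.cid in diff.blocks:       # ignore extra blocks
                    self.blocks[op.cid] = diff.blocks[op.cid]
        self.clock = diff.clock
        return True
```

Deleted records leave no trace in the mirror once the diff is applied, which is the behavior the Motivation section asks for.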

Detecting Stale Reads

A common problem in eventually consistent distributed systems is knowing whether the response to a request or query reflects recently written data. This matters most obviously when a client wants to know whether it is "reading its own writes". For example, if you follow somebody (a write to your PDS), and then fetch your own author feed (from an AppView), you don't know if the follow relationship has been processed and indexed by the AppView yet.

Repo clocks can help with this if they are returned as metadata with most responses. An end client will know their own repo's current clock value (if returned by the PDS after a repo mutation request), and can compare that with the AppView's current clock value for the requester's repo (returned as metadata in all logged-in/authenticated responses). This lets clients know if they are reading "stale" data or not.
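A small sketch of the client-side comparison, assuming an integer clock and a made-up response header. The header name `x-example-repo-clock` and both function names are hypothetical; how the clock is actually returned is one of the open questions below.

```python
from typing import Mapping, Optional

# Hypothetical header name, used only for this illustration.
CLOCK_HEADER = "x-example-repo-clock"


def read_repo_clock(headers: Mapping[str, str]) -> Optional[int]:
    """Return the repo clock the AppView reports, if the header is present."""
    for name, value in headers.items():
        if name.lower() == CLOCK_HEADER:
            return int(value)
    return None


def is_stale(last_write_clock: int, headers: Mapping[str, str]) -> bool:
    """True if the AppView has not yet indexed our most recent write.

    A missing header is treated conservatively as "possibly stale".
    """
    seen = read_repo_clock(headers)
    return seen is None or seen < last_write_clock
```

A client that gets `True` back could retry the read after a short delay, or patch its own pending write into the displayed result.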

Open Questions

What is the clock syntax, and what are its semantics? We are still discussing logical clocks vs. (bounded) timestamps, and integer encodings vs. string encodings.

Will prev stick around as an optional feature? It might be useful in some cases (e.g., official accounts which leave a public record), but the extra protocol and implementation complexity may not be worth it.

Repository version increment? This would probably involve an increment of the repo format version from 2 to 3, especially if the clock value is required.

How and when to return the current repo clock, to detect stale reads? Probably as an HTTP header in API responses.

What is the "catch up" API? We have not decided whether we will support catch-up from an older clock value, from an older commit version (CID), or using a generic MST synchronization scheme (e.g., comparing top-level nodes in the tree).

Will this break firehose consumers? We hope this won't be too big of a change for most folks subscribing to the repo event stream (firehose), if they aren't working with history. If it isn't a seamless transition, we'll try to give a bit of time after the exact details are specified before we cut over the Bluesky prod firehose.

Mechanism for in-repo record history? It might be helpful to (optionally) store old versions of records in the repo, along with the current version, for cases where a record was updated. For example, post edit history.

viksit · 2023-10-20T19:56:52Z

@bnewbold could you elaborate on how those questions were answered?

Specifically, we're trying to do a downstream migration from repo v2 to repo v3 and are running into some issues, so we'd love to understand what the migration process looks like.

bnewbold Oct 26, 2023
Maintainer Author

In short:

- version field in commit nodes was incremented from 2 to 3
- new rev field, a string containing a TID, required on commit nodes
- sync endpoint works "since" a rev value
- no mechanism yet for in-repo record history
- prev is still around, but very vestigial and unused

We truncated history for all repos (eg, prev=null) during migration.
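To make the rev field concrete, here is a sketch of encoding a TID, assuming the format as commonly described for atproto: a 64-bit value (top bit zero, 53 bits of microseconds since the UNIX epoch, 10 bits of clock identifier) written as 13 characters of sortable base32. The `encode_tid` helper and the example commit dict are illustrative only, and the dict shows just the fields mentioned above; check the spec PR linked below for the authoritative definition.

```python
# Sortable base32 alphabet: characters are in ascending ASCII order,
# so comparing two TIDs as strings matches comparing them as integers.
B32_SORTABLE = "234567abcdefghijklmnopqrstuvwxyz"


def encode_tid(micros: int, clock_id: int) -> str:
    """Encode a microsecond timestamp and clock identifier as a 13-char TID."""
    value = ((micros & ((1 << 53) - 1)) << 10) | (clock_id & 0x3FF)
    chars = []
    for _ in range(13):
        chars.append(B32_SORTABLE[value & 31])  # lowest 5 bits first
        value >>= 5
    return "".join(reversed(chars))


older = encode_tid(1_690_000_000_000_000, 1)
newer = encode_tid(1_700_000_000_000_000, 0)

# Illustrative v3 commit fragment (other fields omitted in this sketch).
commit = {"version": 3, "rev": newer, "prev": None}
```

Since rev values sort lexicographically, "sync since a rev" can be a plain string comparison on the consumer side.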

Here is the PR diff in the specs repo: https://github.com/bluesky-social/atproto-website/pull/186/files

mihailik · 2023-12-16T13:41:08Z

I'm trying to call com.atproto.sync.listRepos, and ever since the mushroom division it has had 2 blocking issues:

- the old https://bsky.social/xrpc/com.atproto.sync.listRepos endpoint returns exactly 1 entry, and stops.

- the new (supported or unsupported??) https://bsky.network/xrpc/com.atproto.sync.listRepos endpoint replies without CORS headers, so it is impossible to use from a browser context.

The new bsky.network endpoint has two more bugs, by the way, but those are possible to work around.

DavidBuchanan314 Dec 16, 2023

I think it's vaguely expected that listRepos against bsky.social returns nothing useful, since there are no users left there since the migration (except for that single user, no idea what that's about!). I've also noticed that bsky.social is very slow to give its listRepos non-response, suggesting a suboptimal db query is involved. The missing CORS on bsky.network does seem like a bug though.

mihailik · 2023-12-16T20:44:51Z

Sounds reasonable, thanks @DavidBuchanan314!

@bnewbold mentioned that bsky.network may have a few other edge case bugs apart from the two I've reported, and it was my experience as well (like listRecords acting up)

Looks like for now bsky.social is more stable across the API set.
