Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add document TTL for SQLiteYStore #50

Merged
merged 6 commits into from
Nov 22, 2022

Conversation

dlqqq
Copy link
Contributor

@dlqqq dlqqq commented Nov 22, 2022

Right now, the SQLite database storing document updates continues to grows without bound as users edit files. We store all document updates because users may briefly lose connection during a session and our backend needs to deliver all the patches they missed during that interval.

However, we can be reasonably confident that if a document has not been updated after some duration of time (let's say 24 hours), then there are no users still editing that file. When this happens, we can safely delete the patches for that file, as the current contents are already persisted to disk. We term this interval the "time to live" (TTL), because the document updates will only be kept alive (i.e. persisted) if the last update was within the TTL.

This PR adds a class attribute (which we will probably want to make configurable later via traitlets) that stores the TTL for every document to SQLiteYStore, and checks this before every write to determine if we should delete all the previous document updates associated with this path. Adds a unit test covering both the pre-TTL and post-TTL cases.

Copy link
Collaborator

@davidbrochart davidbrochart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @dlqqq.

We store all document updates because users may briefly lose connection during a session and our backend needs to deliver all the patches they missed during that interval.

It is the current use case, but we will probably use updates to display a document timeline in JupyterLab.

This PR adds a class attribute (which we will probably want to make configurable later via traitlets) that stores the TTL for every document to SQLiteYStore, and checks this before every write to determine if we should delete all the previous document updates associated with this path.

I don't think we want to depend on traitlets in ypy-websocket, but probably in jupyter-server-ydoc.
Another approach would be to shrink the database based on its size, rather than on time. Indeed, there is no reason to delete updates if the database is small, even if updates are old. What do you think?

.gitignore Outdated Show resolved Hide resolved
tests/conftest.py Outdated Show resolved Hide resolved
ypy_websocket/ystore.py Show resolved Hide resolved
dlqqq and others added 2 commits November 22, 2022 10:24
Co-authored-by: David Brochart <[email protected]>
Co-authored-by: David Brochart <[email protected]>
@dlqqq
Copy link
Contributor Author

dlqqq commented Nov 22, 2022

Another approach would be to shrink the database based on its size, rather than on time. Indeed, there is no reason to delete updates if the database is small, even if updates are old. What do you think?

Yeah, deleting old patches is almost certainly going to be a temporary fix for collaborative users who have very limited disk space. I discussed this with Brian, and we think the ideal solution is to employ some kind of "update deque" approach, where we view each table as a deque with a configured fixed size (either measured in number of updates or actual disk usage in bytes). If a new update is added that causes this buffer to exceed this size, the oldest updates are merged together until the buffer fits within this size. This way the document history provides a complete timeline starting from an empty file. IOW, updates are never deleted, but only merged.

Problem with defining the size as the number of updates is that the size of each update is variable, so different document histories could have wildly different sizes, and we're not enforcing a bound on update history size / file.

Problem with defining the size as the actual disk usage in bytes is that then there is a limit on how large collaborative files grow. E.g. let's say we set the size to be 1 MB. Then if users gradually write to a file until it is 1 MB, then even merging all of the patches together still results in this size being reached.

  • Perhaps we make the update history size proportional to the size of the file on disk?

Lots of food for thought on how we tackle the disk usage problem while preserving history since document history. But for now, this solution will suffice.

@dlqqq
Copy link
Contributor Author

dlqqq commented Nov 22, 2022

BTW @davidbrochart could you run the CI workflow? I promise I didn't add a bitcoin miner in CI 😁

tests/conftest.py Outdated Show resolved Hide resolved
Co-authored-by: David Brochart <[email protected]>
"CREATE TABLE IF NOT EXISTS yupdates (path TEXT NOT NULL, yupdate BLOB, metadata BLOB, timestamp REAL NOT NULL)"
)
await db.execute(
"CREATE INDEX IF NOT EXISTS idx_yupdates_path_timestamp ON yupdates (path, timestamp)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you describe what this does?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a composite index that makes the SELECT ... WHERE path = ? ORDER BY timestamp query more efficient. The mental model is that this constructs a B-tree where records are first sorted by path, and then ties are resolved by the timestamp, which is the best data structure for this query.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -163,8 +170,21 @@ async def write(self, data: bytes) -> None:
await self.db_created.wait()
metadata = await self.get_metadata()
async with aiosqlite.connect(self.db_path) as db:
# first, determine time elapsed since last update
cursor = await db.execute(
"SELECT timestamp FROM yupdates WHERE path = ? ORDER BY timestamp DESC LIMIT 1",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need to order by timestamp, the updates are supposed to be already ordered, right?

Copy link
Contributor Author

@dlqqq dlqqq Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily, and even if it is, I think it's better to be explicit about the order this query requires, to avoid breaking this query if the table schema were to change in the future.

AFAIK, tables without a primary key are ordered simply by insertion order. Thus, the oldest update would be returned without the ORDER BY clause, which is not what we want. We want the most recent update.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this also probably comes with an extra cost. In our case, insertion order is already ordered by time. Isn't it possible to get the last row in the query?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this also probably comes with an extra cost.

You're right, but this is addressed by the composite index that you commented on earlier. I'm not expert in SQLite performance characteristics, but there are some justifications for this:

  1. Fetching a record by its index is not significantly slower than fetching a record by its primary key. In fact, most primary keys are actually just implemented with an implicit index in SQLite. IOW, this is about as fast as it gets.

  2. Reading an existing record from an index is about an order of magnitude slower than writing a new record. So the performance of this query is negligible relative to the write that's happening in the INSERT statement that follows in this method.

In our case, insertion order is already ordered by time. Isn't it possible to get the last row in the query?

No, ascending order is assumed if ORDER BY is not present. https://www.sqlite.org/lang_select.html#the_order_by_clause

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do keep in mind that yes, I don't have benchmarks, so these rationalizations could be completely false. However, I think there is plenty of good justification for this implementation, and we shouldn't let performance ambiguity steer us away.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the details David 👍

@davidbrochart
Copy link
Collaborator

Thanks @dlqqq !

@davidbrochart davidbrochart merged commit b8758c3 into y-crdt:main Nov 22, 2022
@dlqqq dlqqq deleted the add-document-ttl-sqlite branch November 22, 2022 22:21
@ellisonbg
Copy link

Am I understanding that the compaction/deletion logic is run on all inserts? We need updates to be as fast as possible so not clear on why what is needed. Also, wouldn't we want to delete entries during periods of inactivity?

@ellisonbg
Copy link

Also, I think this PR breaks things. It deletes the entire history and then inserts a single update. But if the history is deleted, we first need to read the document from disk in a single update, right @davidbrochart ?

@davidbrochart
Copy link
Collaborator

if the history is deleted, we first need to read the document from disk in a single update

Good point Brian, maybe we could compute a squashed update from the history and insert that before the new update?

@davidbrochart
Copy link
Collaborator

I opened #53.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants