Add document TTL for SQLiteYStore #50
Conversation
Thank you @dlqqq.
> We store all document updates because users may briefly lose connection during a session and our backend needs to deliver all the patches they missed during that interval.
That is the current use case, but we will probably also use updates to display a document timeline in JupyterLab.
> This PR adds a class attribute (which we will probably want to make configurable later via traitlets) that stores the TTL for every document to SQLiteYStore, and checks this before every write to determine if we should delete all the previous document updates associated with this path.
I don't think we want to depend on traitlets in ypy-websocket, but probably in jupyter-server-ydoc.
Another approach would be to shrink the database based on its size, rather than on time. Indeed, there is no reason to delete updates if the database is small, even if updates are old. What do you think?
Yeah, deleting old patches is almost certainly going to be a temporary fix for collaborative users who have very limited disk space. I discussed this with Brian, and we think the ideal solution is to employ some kind of "update deque" approach, where we view each table as a deque with a configured fixed size (measured either in number of updates or in actual disk usage in bytes). If a new update would cause this buffer to exceed that size, the oldest updates are merged together until the buffer fits again. This way the document history provides a complete timeline starting from an empty file. IOW, updates are never deleted, only merged.

The problem with defining the size as a number of updates is that the size of each update is variable, so different document histories could have wildly different sizes, and we would not be enforcing a bound on the update history size per file.

The problem with defining the size as actual disk usage in bytes is that it then limits how large collaborative files can grow. E.g. say we set the size to 1 MB. If users gradually write to a file until it reaches 1 MB, then even merging all of the patches together still leaves the history at that limit.
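A rough sketch of that deque idea, assuming y_py's `YDoc`, `apply_update`, and `encode_state_as_update` (functions this project already uses elsewhere); the `MAX_HISTORY_BYTES` cap, the `compact_history` helper, and the pairwise-merge strategy are all illustrative, not something this PR implements:

```python
import y_py as Y

MAX_HISTORY_BYTES = 1024 * 1024  # illustrative cap on a document's update history


def compact_history(rows: list[tuple[bytes, float]]) -> list[tuple[bytes, float]]:
    """Merge the oldest updates together until the history fits under the cap.

    `rows` is one document's history as (yupdate, timestamp) pairs in
    chronological order; the newest updates are preserved and the oldest
    ones are progressively replaced by a single squashed update.
    """
    while sum(len(u) for u, _ in rows) > MAX_HISTORY_BYTES and len(rows) > 1:
        # Apply the two oldest updates to a throwaway document...
        doc = Y.YDoc()
        (u0, _t0), (u1, t1) = rows[0], rows[1]
        Y.apply_update(doc, u0)
        Y.apply_update(doc, u1)
        # ...and re-encode its whole state as one equivalent update.
        merged = Y.encode_state_as_update(doc)
        rows = [(merged, t1)] + rows[2:]
    return rows
```

Note that this only helps when merged updates are smaller than the sum of their parts, which gets at the second caveat above: once the document itself is near the cap, merging cannot shrink the history any further.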
Lots of food for thought on how we tackle the disk usage problem while preserving document history. But for now, this solution will suffice.
BTW @davidbrochart could you run the CI workflow? I promise I didn't add a bitcoin miner in CI 😁
"CREATE TABLE IF NOT EXISTS yupdates (path TEXT NOT NULL, yupdate BLOB, metadata BLOB, timestamp REAL NOT NULL)" | ||
) | ||
await db.execute( | ||
"CREATE INDEX IF NOT EXISTS idx_yupdates_path_timestamp ON yupdates (path, timestamp)" |
Could you describe what this does?
This creates a composite index that makes the `SELECT ... WHERE path = ? ORDER BY timestamp` query more efficient. The mental model is that this constructs a B-tree where records are first sorted by path, and ties are then resolved by the timestamp, which is the best data structure for this query.
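For illustration, here is a small self-contained check (my sketch, not part of the PR) that asks SQLite which access path it picks for that query; with the schema and index above, the plan should mention the composite index rather than a full table scan (the exact plan text varies across SQLite versions):

```python
import sqlite3

# Throwaway in-memory database with the same schema and index as this PR.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE yupdates (path TEXT NOT NULL, yupdate BLOB, metadata BLOB, timestamp REAL NOT NULL)"
)
con.execute(
    "CREATE INDEX idx_yupdates_path_timestamp ON yupdates (path, timestamp)"
)

# Ask SQLite how it would execute the read used in write(); the plan should
# report a search using idx_yupdates_path_timestamp instead of "SCAN yupdates".
plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT timestamp FROM yupdates WHERE path = ? ORDER BY timestamp DESC LIMIT 1",
    ("notebook.ipynb",),
).fetchall()
for row in plan:
    print(row)
```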
👍
@@ -163,8 +170,21 @@ async def write(self, data: bytes) -> None:
        await self.db_created.wait()
        metadata = await self.get_metadata()
        async with aiosqlite.connect(self.db_path) as db:
            # first, determine time elapsed since last update
            cursor = await db.execute(
                "SELECT timestamp FROM yupdates WHERE path = ? ORDER BY timestamp DESC LIMIT 1",
I don't understand why we need to order by timestamp; the updates are supposed to be already ordered, right?
Not necessarily, and even if it is, I think it's better to be explicit about the order this query requires, to avoid breaking this query if the table schema were to change in the future.
AFAIK, tables without a primary key are ordered simply by insertion order. Thus, the oldest update would be returned without the `ORDER BY` clause, which is not what we want. We want the most recent update.
But this also probably comes with an extra cost. In our case, insertion order is already ordered by time. Isn't it possible to get the last row in the query?
> But this also probably comes with an extra cost.

You're right, but this is addressed by the composite index that you commented on earlier. I'm no expert in SQLite performance characteristics, but there are some justifications for this:

- Fetching a record via an index is not significantly slower than fetching a record by its primary key. In fact, most primary keys are just implemented with an implicit index in SQLite. IOW, this is about as fast as it gets.
- Reading an existing record via an index is about an order of magnitude faster than writing a new record. So the performance of this query is negligible relative to the write that happens in the `INSERT` statement that follows in this method.

> In our case, insertion order is already ordered by time. Isn't it possible to get the last row in the query?

No, ascending order is assumed if `ORDER BY` is not present. https://www.sqlite.org/lang_select.html#the_order_by_clause
Do keep in mind that yes, I don't have benchmarks, so these rationalizations could be completely false. However, I think there is plenty of good justification for this implementation, and we shouldn't let performance ambiguity steer us away.
Thanks for the details, David 👍
Thanks @dlqqq!
Am I understanding correctly that the compaction/deletion logic runs on every insert? We need updates to be as fast as possible, so I'm not clear on why that is needed. Also, wouldn't we want to delete entries during periods of inactivity?
Also, I think this PR breaks things. It deletes the entire history and then inserts a single update. But if the history is deleted, we first need to read the document from disk in a single update, right @davidbrochart?
Good point Brian, maybe we could compute a squashed update from the history and insert that before the new update?
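A rough sketch of what that suggestion could look like (my reading of it, not an actual fix from this PR): read the stale history, squash it into one baseline update with y_py, and re-insert that baseline before the new update is written. The `squash`/`squash_history` helpers, the empty metadata, and the surrounding names are assumptions; `db` is an open aiosqlite connection.

```python
import time

import y_py as Y


def squash(updates: list[bytes]) -> bytes:
    """Merge a list of Yjs updates into one equivalent update."""
    doc = Y.YDoc()
    for update in updates:
        Y.apply_update(doc, update)
    return Y.encode_state_as_update(doc)


async def squash_history(db, path: str) -> None:
    """Replace a document's stale history with a single baseline update.

    Meant to run inside write() right before the new update is inserted,
    instead of simply deleting every old row.
    """
    cursor = await db.execute(
        "SELECT yupdate FROM yupdates WHERE path = ? ORDER BY timestamp", (path,)
    )
    history = [row[0] for row in await cursor.fetchall()]
    await db.execute("DELETE FROM yupdates WHERE path = ?", (path,))
    if history:
        # Keep one squashed baseline so later reads still reconstruct the document.
        await db.execute(
            "INSERT INTO yupdates VALUES (?, ?, ?, ?)",
            (path, squash(history), b"", time.time()),
        )
```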
I opened #53.
Right now, the SQLite database storing document updates continues to grow without bound as users edit files. We store all document updates because users may briefly lose connection during a session and our backend needs to deliver all the patches they missed during that interval.
However, we can be reasonably confident that if a document has not been updated after some duration of time (let's say 24 hours), then there are no users still editing that file. When this happens, we can safely delete the patches for that file, as the current contents are already persisted to disk. We term this interval the "time to live" (TTL), because the document updates will only be kept alive (i.e. persisted) if the last update was within the TTL.
This PR adds a class attribute (which we will probably want to make configurable later via traitlets) to SQLiteYStore that stores the TTL for every document, and checks it before every write to determine whether we should delete all the previous document updates associated with this path. It also adds a unit test covering both the pre-TTL and post-TTL cases.
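For readers skimming the diff, the overall shape of the change is roughly the following (a condensed sketch, not the exact code; the `document_ttl` name, the 24-hour default, and the empty metadata are illustrative):

```python
import time

import aiosqlite


class SQLiteYStore:  # condensed; the real class has more machinery (metadata, db setup)
    db_path = "ystore.db"
    document_ttl = 24 * 60 * 60  # seconds; illustrative attribute name and default

    def __init__(self, path: str):
        self.path = path

    async def write(self, data: bytes) -> None:
        async with aiosqlite.connect(self.db_path) as db:
            # Fetch the timestamp of the most recent update for this document.
            cursor = await db.execute(
                "SELECT timestamp FROM yupdates WHERE path = ? "
                "ORDER BY timestamp DESC LIMIT 1",
                (self.path,),
            )
            row = await cursor.fetchone()
            now = time.time()
            # If the document has been idle longer than the TTL, drop its history.
            if row is not None and now - row[0] > self.document_ttl:
                await db.execute(
                    "DELETE FROM yupdates WHERE path = ?", (self.path,)
                )
            # Persist the new update as usual (metadata omitted for brevity).
            await db.execute(
                "INSERT INTO yupdates VALUES (?, ?, ?, ?)",
                (self.path, data, b"", now),
            )
            await db.commit()
```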