-
Notifications
You must be signed in to change notification settings - Fork 0
feat(db): use duckdb backend #509
base: main
Are you sure you want to change the base?
Conversation
@dtulga This is still very much a work in progress. I will clean it up and fix some issues with it, but I wanted to get some feedback on the general approach, since it's already a significant amount of code - particularly on This currently only includes partial functionality for indexing and a simple At the moment, the db api is a single class with a lot of
we could have
I realize this feels like creating our own ORM, but this an API that is higher-level than an ORM. We don't assume a relational backend or provide that level of granularity. The reason for creating |
9ca0d95
to
588e1d0
Compare
Running pylint seems to work properly, but running via pre-commit gives these false positives: ************* Module duckdb ldb/db/duckdb.py:6:0: W0406: Module import itself (import-self) ldb/db/duckdb.py:22:20: E1101: Module duckdb has no connect member (no-member) ldb/db/duckdb.py:313:23: E1101: Module duckdb has no ConstraintException member (no-member)
This currently includes a lot of code in order to provide the full context of the new db abstraction with Note duckdb isn't fully working yet. You should be able to index and run commands on
For manually testing you can:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good so far! Thanks for working on this!
CREATE TABLE IF NOT EXISTS dataset(id INTEGER PRIMARY KEY, name VARCHAR UNIQUE) | ||
""", | ||
) | ||
# The primary key is commented out for now due to lack of proper update support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is a dealbreaker, but this seems like a very major bug to me for this database! 😮
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't work with other OLAP databases, so I'm not sure how typical this is, but there's some explanation here: duckdb/duckdb#1631 (comment)
There's also a workaround mentioned here:
duckdb/duckdb#61 (comment)
For querying duckdb should be fine - possibly better than sqlite for certain scenarios, although until we benchmark, it's hard to say. Updates for us are mainly on global metadata like last_indexed
for data objects and filesystem meta info. Also on the root dataset entries. Adding new annotations and saving new dataset versions rely on insertions, not updates. So for the data that we do want to update, I can see the following options as possibilities:
- Remove unique constraints and apply update statements (relying on our logic to maintain consistency)
- Use unique constraints with the workaround mentioned in this duckdb issue comment where updates are done with an intermediate table. (Might also require us to implement our own locking mechanism - not sure if duckdb has it)
- Adjust our schema and indexing logic to be append-only (i.e. insertions, no updates). We might want to preserve a history of filesystem metadata and indexing times anyway
- switch to sqlite
duckdb_path = osp.join(self.ldb_dir, "duckdb", "index.db") | ||
if not self._db_type: | ||
if osp.isfile(duckdb_path): | ||
self._db_type = "duckdb" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is a more general question, but do we want to have a versioning system for the database schema? Especially if we are auto-detecting it (and might assume it is usable without being sure of the database's table structure). (This is also something we could look into in a future version, for sure.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good idea. I'm not sure right off how to implement it. I assume we'd store the version in the db and maybe have different DuckDB
classes/modules for different versions? I added an issue here: #511
Feel free to edit the description
if annotation is not None: | ||
if annotation_meta is None: | ||
raise ValueError( | ||
"If annotation is not None, then annotation_meta cannot be None" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a debugging error, or something that might be shown to users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our indexing logic should work so that annotation_meta
is None
if and only if annotation
is None
. So this enforces that and should only happen if there's a bug elsewhere. This method signature is a bit awkward right now and it might be better to have a single argument for annotation data
self.conn.execute( | ||
""" | ||
INSERT INTO annotation ( | ||
SELECT distinct on (id) id, value from annotation_df |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to imply we might have duplicates in the annotations being written (from annotation_df) - should we handle these outside the database call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it probably makes sense to have a dict
mapping id to value so that we're skipping duplicates as we go
result = self.conn.execute( | ||
""" | ||
SELECT * FROM data_object_annotation | ||
WHERE (data_object_id, annotation_id) in ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what kind of support DuckDB has, but if we can't make these primary keys, then at least an INDEX
might help speed up queries like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point
The goal for this first step is to replace our json file-based backend with duckdb, while leaving most of the existing logic and data structures as is