Skip to content
This repository has been archived by the owner on Feb 28, 2024. It is now read-only.

feat(db): use duckdb backend #509

Draft
wants to merge 32 commits into
base: main
Choose a base branch
from
Draft

feat(db): use duckdb backend #509

wants to merge 32 commits into from

Conversation

jonburdo
Copy link
Contributor

@jonburdo jonburdo commented Oct 3, 2022

The goal for this first step is to replace our json file-based backend with duckdb, while leaving most of the existing logic and data structures as is

@jonburdo jonburdo marked this pull request as draft October 3, 2022 13:42
@jonburdo jonburdo requested a review from dtulga October 3, 2022 13:42
@jonburdo
Copy link
Contributor Author

jonburdo commented Oct 3, 2022

@dtulga This is still very much a work in progress. I will clean it up and fix some issues with it, but I wanted to get some feedback on the general approach, since it's already a significant amount of code - particularly on ldb/db/duckdb.py.

This currently only includes partial functionality for indexing and a simple ldb list ds:root.

At the moment, the db api is a single class with a lot of add_* and get_* functions. It's a bit cluttered and I'm thinking of breaking it up a little more. For example, instead of

db.get_annotation
db.get_annotation_many
db.get_annotation_all

we could have

db.annotation.get
db.annotation.many
db.annotation.all

I realize this feels like creating our own ORM, but this an API that is higher-level than an ORM. We don't assume a relational backend or provide that level of granularity.

The reason for creating FileDB in ldb/db/file.py is to help with the transition and avoid regressions. It may also be useful to preserve this backend as an option

@jonburdo jonburdo linked an issue Oct 3, 2022 that may be closed by this pull request
@jonburdo jonburdo force-pushed the feat/duckdb-backend branch from 9ca0d95 to 588e1d0 Compare October 5, 2022 14:32
Running pylint seems to work properly, but running via pre-commit gives these false positives:

************* Module duckdb
ldb/db/duckdb.py:6:0: W0406: Module import itself (import-self)
ldb/db/duckdb.py:22:20: E1101: Module duckdb has no connect member (no-member)
ldb/db/duckdb.py:313:23: E1101: Module duckdb has no ConstraintException member (no-member)
@jonburdo jonburdo self-assigned this Oct 6, 2022
@jonburdo
Copy link
Contributor Author

jonburdo commented Oct 6, 2022

This currently includes a lot of code in order to provide the full context of the new db abstraction with AbstractDB, DuckDB, and FileDB under ldb.db.*. The queries still need more work, and this still doesn't cover dataset saving entirely. If the api looks okay, I can start by splitting off the abstract and file dbs into a PR and merging those. Then duckdb can come after.

Note duckdb isn't fully working yet. You should be able to index and run commands on ds:root. In order to run tests against the duckdb backend instead of file, use:

LDB_DATABASE=duckdb pytest

For manually testing you can:

export LDB_DATABASE
ldb init <path>

Copy link
Contributor

@dtulga dtulga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far! Thanks for working on this!

ldb/db/duckdb.py Show resolved Hide resolved
CREATE TABLE IF NOT EXISTS dataset(id INTEGER PRIMARY KEY, name VARCHAR UNIQUE)
""",
)
# The primary key is commented out for now due to lack of proper update support
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is a dealbreaker, but this seems like a very major bug to me for this database! 😮

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't work with other OLAP databases, so I'm not sure how typical this is, but there's some explanation here: duckdb/duckdb#1631 (comment)

There's also a workaround mentioned here:
duckdb/duckdb#61 (comment)

For querying duckdb should be fine - possibly better than sqlite for certain scenarios, although until we benchmark, it's hard to say. Updates for us are mainly on global metadata like last_indexed for data objects and filesystem meta info. Also on the root dataset entries. Adding new annotations and saving new dataset versions rely on insertions, not updates. So for the data that we do want to update, I can see the following options as possibilities:

  • Remove unique constraints and apply update statements (relying on our logic to maintain consistency)
  • Use unique constraints with the workaround mentioned in this duckdb issue comment where updates are done with an intermediate table. (Might also require us to implement our own locking mechanism - not sure if duckdb has it)
  • Adjust our schema and indexing logic to be append-only (i.e. insertions, no updates). We might want to preserve a history of filesystem metadata and indexing times anyway
  • switch to sqlite

ldb/command/init.py Outdated Show resolved Hide resolved
ldb/dataset.py Outdated Show resolved Hide resolved
ldb/dataset.py Outdated Show resolved Hide resolved
duckdb_path = osp.join(self.ldb_dir, "duckdb", "index.db")
if not self._db_type:
if osp.isfile(duckdb_path):
self._db_type = "duckdb"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is a more general question, but do we want to have a versioning system for the database schema? Especially if we are auto-detecting it (and might assume it is usable without being sure of the database's table structure). (This is also something we could look into in a future version, for sure.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good idea. I'm not sure right off how to implement it. I assume we'd store the version in the db and maybe have different DuckDB classes/modules for different versions? I added an issue here: #511
Feel free to edit the description

if annotation is not None:
if annotation_meta is None:
raise ValueError(
"If annotation is not None, then annotation_meta cannot be None"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a debugging error, or something that might be shown to users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our indexing logic should work so that annotation_meta is None if and only if annotation is None. So this enforces that and should only happen if there's a bug elsewhere. This method signature is a bit awkward right now and it might be better to have a single argument for annotation data

self.conn.execute(
"""
INSERT INTO annotation (
SELECT distinct on (id) id, value from annotation_df
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to imply we might have duplicates in the annotations being written (from annotation_df) - should we handle these outside the database call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it probably makes sense to have a dict mapping id to value so that we're skipping duplicates as we go

result = self.conn.execute(
"""
SELECT * FROM data_object_annotation
WHERE (data_object_id, annotation_id) in (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what kind of support DuckDB has, but if we can't make these primary keys, then at least an INDEX might help speed up queries like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point

ldb/db/file.py Outdated Show resolved Hide resolved
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

replace json file-based index with duckdb
2 participants