feat(db): use duckdb backend #509

jonburdo · 2022-10-03T13:42:16Z

The goal for this first step is to replace our json file-based backend with duckdb, while leaving most of the existing logic and data structures as is

jonburdo · 2022-10-03T13:47:54Z

@dtulga This is still very much a work in progress. I will clean it up and fix some issues with it, but I wanted to get some feedback on the general approach, since it's already a significant amount of code - particularly on ldb/db/duckdb.py.

This currently only includes partial functionality for indexing and a simple ldb list ds:root.

At the moment, the db api is a single class with a lot of add_* and get_* functions. It's a bit cluttered and I'm thinking of breaking it up a little more. For example, instead of

db.get_annotation
db.get_annotation_many
db.get_annotation_all

we could have

db.annotation.get
db.annotation.many
db.annotation.all

I realize this feels like creating our own ORM, but this an API that is higher-level than an ORM. We don't assume a relational backend or provide that level of granularity.

The reason for creating FileDB in ldb/db/file.py is to help with the transition and avoid regressions. It may also be useful to preserve this backend as an option

…n LDBClient

…nv var

Running pylint seems to work properly, but running via pre-commit gives these false positives: ************* Module duckdb ldb/db/duckdb.py:6:0: W0406: Module import itself (import-self) ldb/db/duckdb.py:22:20: E1101: Module duckdb has no connect member (no-member) ldb/db/duckdb.py:313:23: E1101: Module duckdb has no ConstraintException member (no-member)

jonburdo · 2022-10-06T16:59:30Z

This currently includes a lot of code in order to provide the full context of the new db abstraction with AbstractDB, DuckDB, and FileDB under ldb.db.*. The queries still need more work, and this still doesn't cover dataset saving entirely. If the api looks okay, I can start by splitting off the abstract and file dbs into a PR and merging those. Then duckdb can come after.

Note duckdb isn't fully working yet. You should be able to index and run commands on ds:root. In order to run tests against the duckdb backend instead of file, use:

LDB_DATABASE=duckdb pytest

For manually testing you can:

export LDB_DATABASE
ldb init <path>

dtulga

Looks good so far! Thanks for working on this!

ldb/db/duckdb.py

dtulga · 2022-10-03T18:35:15Z

ldb/db/duckdb.py

+            CREATE TABLE IF NOT EXISTS dataset(id INTEGER PRIMARY KEY, name VARCHAR UNIQUE)
+            """,
+        )
+        # The primary key is commented out for now due to lack of proper update support


Not sure if this is a dealbreaker, but this seems like a very major bug to me for this database! 😮

I haven't work with other OLAP databases, so I'm not sure how typical this is, but there's some explanation here: duckdb/duckdb#1631 (comment)

There's also a workaround mentioned here:
duckdb/duckdb#61 (comment)

For querying duckdb should be fine - possibly better than sqlite for certain scenarios, although until we benchmark, it's hard to say. Updates for us are mainly on global metadata like last_indexed for data objects and filesystem meta info. Also on the root dataset entries. Adding new annotations and saving new dataset versions rely on insertions, not updates. So for the data that we do want to update, I can see the following options as possibilities:

Remove unique constraints and apply update statements (relying on our logic to maintain consistency)

Use unique constraints with the workaround mentioned in this duckdb issue comment where updates are done with an intermediate table. (Might also require us to implement our own locking mechanism - not sure if duckdb has it)

Adjust our schema and indexing logic to be append-only (i.e. insertions, no updates). We might want to preserve a history of filesystem metadata and indexing times anyway

switch to sqlite

ldb/command/init.py

ldb/dataset.py

dtulga · 2022-10-06T23:14:20Z

ldb/core.py

+        duckdb_path = osp.join(self.ldb_dir, "duckdb", "index.db")
+        if not self._db_type:
+            if osp.isfile(duckdb_path):
+                self._db_type = "duckdb"


Maybe this is a more general question, but do we want to have a versioning system for the database schema? Especially if we are auto-detecting it (and might assume it is usable without being sure of the database's table structure). (This is also something we could look into in a future version, for sure.)

Yeah, good idea. I'm not sure right off how to implement it. I assume we'd store the version in the db and maybe have different DuckDB classes/modules for different versions? I added an issue here: #511
Feel free to edit the description

dtulga · 2022-10-06T23:15:48Z

ldb/db/abstract.py

+        if annotation is not None:
+            if annotation_meta is None:
+                raise ValueError(
+                    "If annotation is not None, then annotation_meta cannot be None"


Is this a debugging error, or something that might be shown to users?

Our indexing logic should work so that annotation_meta is None if and only if annotation is None. So this enforces that and should only happen if there's a bug elsewhere. This method signature is a bit awkward right now and it might be better to have a single argument for annotation data

dtulga · 2022-10-06T23:23:31Z

ldb/db/duckdb.py

+            self.conn.execute(
+                """
+                INSERT INTO annotation (
+                    SELECT distinct on (id) id, value from annotation_df


This seems to imply we might have duplicates in the annotations being written (from annotation_df) - should we handle these outside the database call?

Yeah, it probably makes sense to have a dict mapping id to value so that we're skipping duplicates as we go

dtulga · 2022-10-06T23:37:00Z

ldb/db/duckdb.py

+        result = self.conn.execute(
+            """
+            SELECT * FROM data_object_annotation
+            WHERE (data_object_id, annotation_id) in (


Not sure what kind of support DuckDB has, but if we can't make these primary keys, then at least an INDEX might help speed up queries like this.

Yeah, good point

ldb/db/file.py

jonburdo marked this pull request as draft October 3, 2022 13:42

jonburdo requested a review from dtulga October 3, 2022 13:42

jonburdo linked an issue Oct 3, 2022 that may be closed by this pull request

replace json file-based index with duckdb #501

Open

jonburdo added 13 commits October 5, 2022 10:27

add objects

1fce242

add file db

678e6b1

fix db code formatting

7676a52

start duckdb

e7b6da9

use file db in query application

ae4e363

duckdb backend for data_object_meta and annotation

bdc8ed7

add LDBClient and duckdb operations for most of the indexing logic

beeec38

add dataset insert

b7d60b3

use JSON duckdb fields and data object annotation pair queries

7d3d94d

get list working with ds:root

c9aab54

fix query options and eval command

7f41897

make FileDB api compatible with DuckDB api, allow choice of backend i…

1ebca1b

…n LDBClient

allow choosing database backend with init -d option or LDB_DATABASE e…

588e1d0

…nv var

jonburdo force-pushed the feat/duckdb-backend branch from 9ca0d95 to 588e1d0 Compare October 5, 2022 14:32

jonburdo added 12 commits October 5, 2022 12:13

add relevant methods to AbstractDB

8968c33

remove duplicate functions, fix data_object_annotation insert bug

54a35e9

fix function signatures and FileDB add functions

d61fe51

adjust indexing code to use new api, fix db functions

4f80a8c

fix ls_collection signature and cast generator to dict

51a8ad9

fix FileDB serialization functions

f685303

fix dataset_set typo and flake8 unused import

47c2842

fix type annotations

6842c69

add 'id' to allowed-redefined-builtins in pylint config

5ee0dfe

add import-outside-toplevel to global pylint disable list

7cfe22f

fix pylint

798bc65

fix pylint encoding issue

12dfcaa

jonburdo self-assigned this Oct 6, 2022

jonburdo added 3 commits October 6, 2022 13:50

fix duckdb import

e3be673

fix pylint similar lines

7b6609f

fix malformed JSON bug

aff5eaf

dtulga reviewed Oct 7, 2022

View reviewed changes

jonburdo added 3 commits October 10, 2022 12:09

fix db annotation store and minor fixes

b9f558f

add missing abstract methods and have AbstractDB inherit from ABC

22af1bc

move functions used by duckdb only from AbstractDB to DuckDB class

3cfb0e2

jonburdo mentioned this pull request Oct 11, 2022

feat(db): add AbstractDB and FileDB #513

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(db): use duckdb backend #509

feat(db): use duckdb backend #509

jonburdo commented Oct 3, 2022 •

edited

Loading

jonburdo commented Oct 3, 2022 •

edited

Loading

jonburdo commented Oct 6, 2022

dtulga left a comment

dtulga Oct 3, 2022

jonburdo Oct 7, 2022

dtulga Oct 6, 2022

jonburdo Oct 7, 2022

dtulga Oct 6, 2022

jonburdo Oct 7, 2022

dtulga Oct 6, 2022

jonburdo Oct 7, 2022

dtulga Oct 6, 2022

jonburdo Oct 7, 2022

feat(db): use duckdb backend #509

Are you sure you want to change the base?

feat(db): use duckdb backend #509

Conversation

jonburdo commented Oct 3, 2022 • edited Loading

jonburdo commented Oct 3, 2022 • edited Loading

jonburdo commented Oct 6, 2022

dtulga left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jonburdo commented Oct 3, 2022 •

edited

Loading

jonburdo commented Oct 3, 2022 •

edited

Loading