feat: add duckdb as DataSource - Fixes #14563 #19317

rwhaling · 2022-03-22T21:18:47Z

SUMMARY

Adds Duckdb as an embedded, in-process OLAP db engine.
Duckdb can directly query CSV or Parquet files on disk - eventually, we should be able to query Parquet files directly on S3 as well.
Supersedes #19265
Relying on https://github.com/Mause/duckdb_engine for the SQLAlchemy implementation, and building on top of @alitrack's original work.
Since DuckDB is, much like SQLite, an in-process, single-threaded engine, the error handling in TestConnectionDatabaseCommand.run feels a bit weird. Might want to work with @Mause on a narrow exception class for the threading blip.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

install duckdb_engine via pip install duckdb-engine
DuckDB appears in the add a database dropdown
Provide path to a duckdb database on disk in the url, e.g., duckdb:////Users/whoever/path/to/duck.db
in SQL lab, query any tables defined in your on-disk database, tables, schemas, and column defs all appear as normal
have not tested in-memory a ton but duckdb:///:memory: seems to work?
once you are in SQL Lab, you can directly query parquet or CSV files on disk: SELECT * from 'test.parquet' etc.

ADDITIONAL INFORMATION

Has associated issue: duckdb as DataSource #14563
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

needs the forked version of [duckdb-engine](https://github.com/alitrack/duckdb_engine)

update _time_grain_expressions

codecov · 2022-03-22T23:19:24Z

Codecov Report

Merging #19317 (4aa477b) into master (d645579) will decrease coverage by 0.03%.
The diff coverage is 74.77%.

❗ Current head 4aa477b differs from pull request most recent head d1e16c4. Consider uploading reports for the commit d1e16c4 to get more accurate results

@@            Coverage Diff             @@
##           master   #19317      +/-   ##
==========================================
- Coverage   66.53%   66.50%   -0.04%     
==========================================
  Files        1667     1672       +5     
  Lines       64360    64564     +204     
  Branches     6493     6493              
==========================================
+ Hits        42824    42936     +112     
- Misses      19854    19946      +92     
  Partials     1682     1682

Flag	Coverage Δ
hive	`?`
mysql	`81.65% <88.63%> (+0.16%)`	⬆️
postgres	`81.69% <88.63%> (+0.16%)`	⬆️
presto	`?`
python	`81.78% <88.63%> (-0.18%)`	⬇️
sqlite	`81.46% <88.63%> (+0.16%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...t-ui-chart-controls/src/operators/pivotOperator.ts	`100.00% <ø> (ø)`
...i-chart-controls/src/operators/resampleOperator.ts	`100.00% <ø> (ø)`
.../superset-ui-core/src/query/types/QueryFormData.ts	`100.00% <ø> (ø)`
...tend/packages/superset-ui-core/src/style/index.tsx	`100.00% <ø> (ø)`
...end/plugins/legacy-plugin-chart-chord/src/Chord.js	`0.00% <0.00%> (ø)`
...ns/legacy-plugin-chart-chord/src/transformProps.js	`0.00% <0.00%> (ø)`
.../legacy-plugin-chart-country-map/src/CountryMap.js	`0.00% <0.00%> (ø)`
...acy-plugin-chart-country-map/src/transformProps.js	`0.00% <0.00%> (ø)`
...ns/legacy-plugin-chart-histogram/src/Histogram.jsx	`0.00% <0.00%> (ø)`
...egacy-plugin-chart-histogram/src/transformProps.js	`0.00% <0.00%> (ø)`
... and 258 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d645579...d1e16c4. Read the comment docs.

zhaoyongjie

except for lint error, LGTM. BTW, I was impressed by DuckDB as a column base lite database.

villebro

LGTM - a few non-blocking style related comments

villebro · 2022-03-23T06:59:44Z

superset/databases/commands/test_connection.py

+                except RuntimeError:
+                    # Catches the equivalent single-threading error from duckdb.
+                    alive = engine.dialect.do_ping(conn)


nit: could we have these in the same except:

except (sqlite3.ProgrammingError, RuntimeError): # SQLite can't run on a separate thread, so ``func_timeout`` fails # RuntimeError catches the equivalent single-threading error from duckdb. alive = engine.dialect.do_ping(conn)

villebro · 2022-03-23T07:02:02Z

superset/db_engine_specs/duckdb.py

+
+    @classmethod
+    def get_table_names(
+        cls, database: "Database", inspector: Inspector, schema: Optional[str]


nit: if we add from __future__ import annotations in the beginning of the file, we can remove the quotes. See an example here:

superset/superset/common/query_context.py

Line 17 in b7ecb14

from __future__ import annotations

Suggested change

cls, database: "Database", inspector: Inspector, schema: Optional[str]

cls, database: Database, inspector: Inspector, schema: Optional[str]

rwhaling · 2022-03-23T13:19:13Z

Thanks for the review @villebro @zhaoyongjie - I'll incorporate your feedback and get the linter passing.

rwhaling · 2022-03-24T00:59:42Z

OK - I am about 90% confident that I am running the linter properly on my local and that all the tests will pass now - looks like I cannot trigger the CI myself, but I think if someone kicks that off we'll see the build go green.

zhaoyongjie · 2022-03-24T01:31:53Z

OK - I am about 90% confident that I am running the linter properly on my local and that all the tests will pass now - looks like I cannot trigger the CI myself, but I think if someone kicks that off we'll see the build go green.

@rwhaling, CI looks like waiting to finish other tasks. When the CI is all green, I will merge it. Thanks for the following up!

villebro · 2022-03-24T08:27:57Z

@rwhaling - One thing that I forgot while reviewing was if you would be able to open a PR that adds the relevant documentation for connecting to DuckDB? Something similar to the what this PR did for Trino (there's some unrelated changes, you can ignore those): #13171

rwhaling · 2022-03-24T11:58:41Z

@rwhaling - One thing that I forgot while reviewing was if you would be able to open a PR that adds the relevant documentation for connecting to DuckDB? Something similar to the what this PR did for Trino (there's some unrelated changes, you can ignore those): #13171

absolutely 👍

Alex-Monahan · 2022-03-24T13:30:33Z

Since DuckDB is, much like SQLite, an in-process, single-threaded engine, the error handling in TestConnectionDatabaseCommand.run feels a bit weird. Might want to work with @Mause on a narrow exception class for the

I had one small comment in case it is helpful! DuckDB is indeed embedded in your local process, but it is multi-threaded and can use as many CPU cores as you would like.

Thanks for building this connector!

* + duckdb support needs the forked version of [duckdb-engine](https://github.com/alitrack/duckdb_engine) * Update duckdb.py update _time_grain_expressions * removed superfluous get_all_datasource_names def in duckdb engine spec * added exception handling for duckdb single-threaded RuntimeError * fixed linter blips and other stylistic cleanup in duckdb.py * one last round of linter tweaks in test_connection.py for duckdb support Co-authored-by: Steven Lee <[email protected]> Co-authored-by: Richard Whaling <[email protected]> (cherry picked from commit 202e34a)

* + duckdb support needs the forked version of [duckdb-engine](https://github.com/alitrack/duckdb_engine) * Update duckdb.py update _time_grain_expressions * removed superfluous get_all_datasource_names def in duckdb engine spec * added exception handling for duckdb single-threaded RuntimeError * fixed linter blips and other stylistic cleanup in duckdb.py * one last round of linter tweaks in test_connection.py for duckdb support Co-authored-by: Steven Lee <[email protected]> Co-authored-by: Richard Whaling <[email protected]>

inakianduaga · 2022-06-21T06:06:02Z

Hi, testing this and Superset doesn't work when installing duckdb-engine, throws a 500 internal error on any page in the UI, backend says

module 'duckdb_engine' has no attribute 'name'
Traceback (most recent call last):                                                                                                                                                                                                                              File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/app/superset/utils/log.py", line 245, in wrapper
    value = f(*args, **kwargs)
  File "/app/superset/views/core.py", line 2677, in welcome
    "common": common_bootstrap_payload(),
  File "/app/superset/views/base.py", line 368, in common_bootstrap_payload
    available_specs = get_available_engine_specs()
  File "/app/superset/db_engine_specs/__init__.py", line 141, in get_available_engine_specs
    backend = dialect.name

Here's my full Dockerfile

FROM apache/superset

# Switching to root to install the required packages
USER root

# Example: installing the MySQL driver to connect to the metadata database
# if you prefer Postgres, you may want to use `psycopg2-binary` instead
RUN pip install psycopg2-binary

# Example: installing a driver to connect to Redshift
# Find which driver you need based on the analytics database
# you want to connect to here:
# https://superset.apache.org/installation.html#database-dependencies
# RUN pip install sqlalchemy-redshift
RUN pip install pyhive[hive]
RUN pip install pyhive[presto]

# DuckDB support
RUN pip install duckdb-engine

ADD ./config/superset_config.py /app/pythonpath/superset_config.py

# Switching back to using the `superset` user
USER superset

Mause · 2022-06-21T06:46:19Z

@inakianduaga heya, that's on me - unfortunately sqlalchemy doesn't really document what's expected from a dialect, so it's hard to keep up with the changes on their side. I'll push out a fixed version shortly

Mause · 2022-06-21T07:29:12Z

@inakianduaga it should work now if you try with duckdb_engine 0.1.11, sorry about that 😅

inakianduaga · 2022-06-21T11:10:44Z

ok thanks. It actually worked for me by going to the fixed superset:1.5.1 docker image instead of latest. Not sure that made the difference or it was simultaneously your fix. But either way, it works now 🙇

Mageswaran1989 · 2022-07-14T12:55:45Z

I wanted to access the S3 parquet files from Superset/SQL Editor. While I am able to use DuckDB to do the same in my python shell, I am wondering how to do it with the Superset and/or duckdb-engine

My python snippet to load parquet files from S3:

import duckdb
cursor = duckdb.connect()
cursor.execute("INSTALL httpfs;")
cursor.execute("LOAD httpfs;")
cursor.execute("SET s3_region='******'")
cursor.execute("SET s3_access_key_id=''**************")
cursor.execute("SET s3_secret_access_key='*****************************'")
cursor.execute("PRAGMA enable_profiling;")
cursor.execute("SELECT count(*) FROM read_parquet('s3://<bucket>/prefix/*.parquet'")

Mageswaran1989 · 2022-07-14T12:58:12Z

@Mause Could you please help with loading S3 parquet files with your engine?
Reference: https://duckdb.org/docs/guides/import/s3_import

Mause · 2022-07-14T13:01:51Z

@Mageswaran1989

if you're having an issue with

duckdb, please raise it over at https://github.com/duckdb/duckdb
duckdb-engine, please raise it over at https://github.com/Mause/duckdb_engine
superset, raise a new issue in this repo

PS. @zhaoyongjie any chance you could lock this conversation?

alitrack and others added 5 commits October 26, 2021 12:12

+ duckdb support

9129669

needs the forked version of [duckdb-engine](https://github.com/alitrack/duckdb_engine)

Update duckdb.py

fd43c56

update _time_grain_expressions

Merge branch 'apache:master' into master

b9c2199

removed superfluous get_all_datasource_names def in duckdb engine spec

9fa02d7

added exception handling for duckdb single-threaded RuntimeError

1952eaf

pull-request-size bot added the size/M label Mar 22, 2022

zhaoyongjie self-requested a review March 23, 2022 02:06

zhaoyongjie approved these changes Mar 23, 2022

View reviewed changes

villebro approved these changes Mar 23, 2022

View reviewed changes

Richard Whaling added 2 commits March 23, 2022 08:33

fixed linter blips and other stylistic cleanup in duckdb.py

de4792e

one last round of linter tweaks in test_connection.py for duckdb support

d1e16c4

zhaoyongjie closed this Mar 24, 2022

zhaoyongjie reopened this Mar 24, 2022

zhaoyongjie merged commit 202e34a into apache:master Mar 24, 2022

djayatillake mentioned this pull request Mar 24, 2022

feat: add duckdb as DataSource lightdash/lightdash#1585

Closed

villebro added the lts-v1 label Mar 29, 2022

Mause mentioned this pull request Jun 21, 2022

fix: restore name attribute Mause/duckdb_engine#274

Merged

Mageswaran1989 mentioned this pull request Jul 14, 2022

Unable to read S3 parquet files using DuckDB as Database Connector #20708

Closed

apache locked as off-topic and limited conversation to collaborators Jul 15, 2022

apache unlocked this conversation Jul 15, 2022

apache locked as resolved and limited conversation to collaborators Jul 15, 2022

mistercrunch added 🍒 1.5.0 🍒 1.5.1 🍒 1.5.2 🍒 1.5.3 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.0.0 labels Mar 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add duckdb as DataSource - Fixes #14563 #19317

feat: add duckdb as DataSource - Fixes #14563 #19317

rwhaling commented Mar 22, 2022 •

edited

Loading

codecov bot commented Mar 22, 2022 •

edited

Loading

zhaoyongjie left a comment •

edited

Loading

villebro left a comment

villebro Mar 23, 2022

villebro Mar 23, 2022

rwhaling commented Mar 23, 2022

rwhaling commented Mar 24, 2022

zhaoyongjie commented Mar 24, 2022

villebro commented Mar 24, 2022

rwhaling commented Mar 24, 2022

Alex-Monahan commented Mar 24, 2022 •

edited

Loading

inakianduaga commented Jun 21, 2022 •

edited

Loading

Mause commented Jun 21, 2022

Mause commented Jun 21, 2022

inakianduaga commented Jun 21, 2022

Mageswaran1989 commented Jul 14, 2022

Mageswaran1989 commented Jul 14, 2022

Mause commented Jul 14, 2022 •

edited

Loading

	cls, database: "Database", inspector: Inspector, schema: Optional[str]
	cls, database: Database, inspector: Inspector, schema: Optional[str]

feat: add duckdb as DataSource - Fixes #14563 #19317

feat: add duckdb as DataSource - Fixes #14563 #19317

Conversation

rwhaling commented Mar 22, 2022 • edited Loading

SUMMARY

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

codecov bot commented Mar 22, 2022 • edited Loading

Codecov Report

zhaoyongjie left a comment • edited Loading

Choose a reason for hiding this comment

villebro left a comment

Choose a reason for hiding this comment

villebro Mar 23, 2022

Choose a reason for hiding this comment

villebro Mar 23, 2022

Choose a reason for hiding this comment

rwhaling commented Mar 23, 2022

rwhaling commented Mar 24, 2022

zhaoyongjie commented Mar 24, 2022

villebro commented Mar 24, 2022

rwhaling commented Mar 24, 2022

Alex-Monahan commented Mar 24, 2022 • edited Loading

inakianduaga commented Jun 21, 2022 • edited Loading

Mause commented Jun 21, 2022

Mause commented Jun 21, 2022

inakianduaga commented Jun 21, 2022

Mageswaran1989 commented Jul 14, 2022

Mageswaran1989 commented Jul 14, 2022

Mause commented Jul 14, 2022 • edited Loading

rwhaling commented Mar 22, 2022 •

edited

Loading

codecov bot commented Mar 22, 2022 •

edited

Loading

zhaoyongjie left a comment •

edited

Loading

Alex-Monahan commented Mar 24, 2022 •

edited

Loading

inakianduaga commented Jun 21, 2022 •

edited

Loading

Mause commented Jul 14, 2022 •

edited

Loading