
Unable to read S3 parquet files using DuckDB as Database Connector #20708

Closed
Mageswaran1989 opened this issue Jul 14, 2022 · 23 comments
Labels
data:connect:duckdb Related to a specific database

Comments


Mageswaran1989 commented Jul 14, 2022

I was following the DuckDB setup as per this PR and was able to load the DB file and create charts.

As a next step I wanted to load S3 parquet files into Superset using the DuckDB in-memory option duckdb:///:memory:

Before trying with Superset, I used the Python code below to check DuckDB's S3 parquet loading and found it to be working:

import duckdb

cursor = duckdb.connect()
cursor.execute("INSTALL httpfs;")
cursor.execute("LOAD httpfs;")
cursor.execute("SET s3_region='******'")
cursor.execute("SET s3_access_key_id='**************'")
cursor.execute("SET s3_secret_access_key='*****************************'")
cursor.execute("PRAGMA enable_profiling;")
cursor.execute("SELECT count(*) FROM read_parquet('s3://<bucket>/prefix/*.parquet')")
# pip install duckdb-engine
import sqlalchemy as sa


engine = sa.create_engine("duckdb:///:memory:")

with engine.connect() as conn:
    conn.execute(
        """
            INSTALL httpfs;

            LOAD httpfs;

            SET s3_region='ap-south-1';
            SET s3_access_key_id='****************';
            SET s3_secret_access_key='*****************************';
        """
    )
    r = conn.execute(
        """
           SELECT count(*) FROM read_parquet('s3://<bucket>/prefix/*.parquet');
        """
    ).fetchall()
    for row in r:
      print(dict(row))

When I tried to SET the S3 configuration values in the SQL Editor, I got the error below in the UI:

Error:

DuckDB Error
duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"

Did you mean: "TimeZone"


This may be triggered by:
Issue 1002 - The database returned an unexpected error. 


Full Trace:

superset_cache          | 1:M 14 Jul 2022 14:18:00.250 * Background saving terminated with success
superset_app            | 127.0.0.1 - - [14/Jul/2022:14:18:24 +0000] "GET /health HTTP/1.1" 200 2 "-" "curl/7.74.0"
superset_app            | 172.20.0.1 - - [14/Jul/2022:14:18:34 +0000] "GET /api/v1/me/ HTTP/1.1" 200 150 "http://localhost:8088/superset/sqllab/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
superset_app            | 172.20.0.1 - - [14/Jul/2022:14:18:34 +0000] "GET /api/v1/me/ HTTP/1.1" 200 150 "http://localhost:8088/superset/sqllab/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
superset_app            | Triggering query_id: 60
superset_app            | 2022-07-14 14:18:35,809:INFO:superset.sqllab.command:Triggering query_id: 60
superset_app            | Query 60: Executing 1 statement(s)
superset_app            | 2022-07-14 14:18:35,942:INFO:superset.sql_lab:Query 60: Executing 1 statement(s)
superset_app            | Query 60: Set query to 'running'
superset_app            | 2022-07-14 14:18:35,943:INFO:superset.sql_lab:Query 60: Set query to 'running'
superset_app            | Query 60: Running statement 1 out of 1
superset_app            | 2022-07-14 14:18:36,083:INFO:superset.sql_lab:Query 60: Running statement 1 out of 1
superset_app            | Query 60: <class 'RuntimeError'>
superset_app            | Traceback (most recent call last):
superset_app            |   File "/app/superset/sql_lab.py", line 272, in execute_sql_statement
superset_app            |     db_engine_spec.execute(cursor, sql, async_=True)
superset_app            |   File "/app/superset/db_engine_specs/base.py", line 1261, in execute
superset_app            |     raise cls.get_dbapi_mapped_exception(ex)
superset_app            |   File "/app/superset/db_engine_specs/base.py", line 1259, in execute
superset_app            |     cursor.execute(query)
superset_app            |   File "/usr/local/lib/python3.8/site-packages/duckdb_engine/__init__.py", line 97, in execute
superset_app            |     raise e
superset_app            |   File "/usr/local/lib/python3.8/site-packages/duckdb_engine/__init__.py", line 85, in execute
superset_app            |     self.c.execute(statement)
superset_app            | RuntimeError: Catalog Error: unrecognized configuration parameter "s3_region"
superset_app            | 
superset_app            | Did you mean: "TimeZone"
superset_app            | 2022-07-14 14:18:36,157:ERROR:superset.sql_lab:Query 60: <class 'RuntimeError'>
superset_app            | Traceback (most recent call last):
superset_app            |   File "/app/superset/sql_lab.py", line 272, in execute_sql_statement
superset_app            |     db_engine_spec.execute(cursor, sql, async_=True)
superset_app            |   File "/app/superset/db_engine_specs/base.py", line 1261, in execute
superset_app            |     raise cls.get_dbapi_mapped_exception(ex)
superset_app            |   File "/app/superset/db_engine_specs/base.py", line 1259, in execute
superset_app            |     cursor.execute(query)
superset_app            |   File "/usr/local/lib/python3.8/site-packages/duckdb_engine/__init__.py", line 97, in execute
superset_app            |     raise e
superset_app            |   File "/usr/local/lib/python3.8/site-packages/duckdb_engine/__init__.py", line 85, in execute
superset_app            |     self.c.execute(statement)
superset_app            | RuntimeError: Catalog Error: unrecognized configuration parameter "s3_region"
superset_app            | 
superset_app            | Did you mean: "TimeZone"
superset_app            | [SupersetError(message='duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"\n\nDid you mean: "TimeZone"', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'DuckDB', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
superset_app            | Traceback (most recent call last):
superset_app            |   File "/app/superset/views/base.py", line 207, in wraps
superset_app            |     return f(self, *args, **kwargs)
superset_app            |   File "/app/superset/utils/log.py", line 244, in wrapper
superset_app            |     value = f(*args, **kwargs)
superset_app            |   File "/app/superset/views/core.py", line 2412, in sql_json
superset_app            |     command_result: CommandResult = command.run()
superset_app            |   File "/app/superset/sqllab/command.py", line 114, in run
superset_app            |     raise ex
superset_app            |   File "/app/superset/sqllab/command.py", line 98, in run
superset_app            |     status = self._run_sql_json_exec_from_scratch()
superset_app            |   File "/app/superset/sqllab/command.py", line 151, in _run_sql_json_exec_from_scratch
superset_app            |     raise ex
superset_app            |   File "/app/superset/sqllab/command.py", line 146, in _run_sql_json_exec_from_scratch
superset_app            |     return self._sql_json_executor.execute(
superset_app            |   File "/app/superset/sqllab/sql_json_executer.py", line 111, in execute
superset_app            |     raise SupersetErrorsException(
superset_app            | superset.exceptions.SupersetErrorsException: [SupersetError(message='duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"\n\nDid you mean: "TimeZone"', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'DuckDB', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
superset_app            | 2022-07-14 14:18:36,229:WARNING:superset.views.base:[SupersetError(message='duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"\n\nDid you mean: "TimeZone"', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'DuckDB', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
superset_app            | Traceback (most recent call last):
superset_app            |   File "/app/superset/views/base.py", line 207, in wraps
superset_app            |     return f(self, *args, **kwargs)
superset_app            |   File "/app/superset/utils/log.py", line 244, in wrapper
superset_app            |     value = f(*args, **kwargs)
superset_app            |   File "/app/superset/views/core.py", line 2412, in sql_json
superset_app            |     command_result: CommandResult = command.run()
superset_app            |   File "/app/superset/sqllab/command.py", line 114, in run
superset_app            |     raise ex
superset_app            |   File "/app/superset/sqllab/command.py", line 98, in run
superset_app            |     status = self._run_sql_json_exec_from_scratch()
superset_app            |   File "/app/superset/sqllab/command.py", line 151, in _run_sql_json_exec_from_scratch
superset_app            |     raise ex
superset_app            |   File "/app/superset/sqllab/command.py", line 146, in _run_sql_json_exec_from_scratch
superset_app            |     return self._sql_json_executor.execute(
superset_app            |   File "/app/superset/sqllab/sql_json_executer.py", line 111, in execute
superset_app            |     raise SupersetErrorsException(
superset_app            | superset.exceptions.SupersetErrorsException: [SupersetError(message='duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"\n\nDid you mean: "TimeZone"', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'DuckDB', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
@srinify added the data:connect:duckdb label on Jul 17, 2022
@Mageswaran1989 (Author)

Update: By patching the connect method of the DuckDB engine here, @sachintyagi22 was able to read the parquet files.

Something like this:

    def connect(self, *args: Any, **kwargs: Any) -> ConnectionWrapper:
        cursor = duckdb.connect(*args, **kwargs)
        cursor.execute("INSTALL httpfs;")
        cursor.execute("LOAD httpfs;")
        cursor.execute("SET s3_region='******'")
        cursor.execute("SET s3_access_key_id='**************'")
        cursor.execute("SET s3_secret_access_key='*****************************'")
        return ConnectionWrapper(cursor)

Without this patch, the DuckDB session is not able to recognize the SET commands.

@Mageswaran1989 (Author)

@Mause Could you please have a look at this issue?


Mause commented Jul 24, 2022

@Mageswaran1989 unfortunately I'm not a superset developer, and wouldn't know where to start investigating this

@alitrack (Contributor)

@Mageswaran1989 , @Mause

I tested with MinIO:

/usr/local/opt/minio/bin/minio server --config-dir=/usr/local/etc/minio --address=:9900 /usr/local/var/minio

and it works:

install 'httpfs';
load 'httpfs';
SET s3_endpoint='127.0.0.1:9900';
SET s3_access_key_id='minioadmin';
SET s3_secret_access_key='minioadmin';
SET s3_url_style = 'path';
SET s3_use_ssl=false;
select count(*) from 's3://ontime/*.parquet';

You also need to check "Allow DML" in the database settings.

@Mageswaran1989 (Author)

@alitrack Thanks for the response. I am able to read data from S3 as you mentioned.

But for each query I had to provide the S3 credentials.
Is there an option to set them once and use them across queries?

@alitrack (Contributor)

There is a dirty workaround:

append a method at the bottom of db_engine_specs/duckdb.py.

Here is an example:

    @classmethod
    def execute(cls, cursor: Any, query: str, **kwargs: Any) -> None:
        sql = """
        install 'httpfs';
        load 'httpfs';
        SET s3_endpoint='127.0.0.1:9900';
        SET s3_access_key_id='minioadmin';
        SET s3_secret_access_key='minioadmin';
        SET s3_url_style = 'path';
        SET s3_use_ssl=false;
        """
        cursor.execute(sql)

        return super().execute(cursor, query, **kwargs)

It overrides the execute method of BaseEngineSpec.


Mause commented Aug 11, 2022

If you can pass configuration to DuckDB through Superset (https://github.com/Mause/duckdb_engine/#configuration), I would recommend doing that instead.

@alitrack (Contributor)

Superset supports ENGINE PARAMETERS, but DuckDB needs to install and load the httpfs extension first; only then can the S3-related configuration be passed.

from sqlalchemy import create_engine

connect_args = {
    "config": {
        "s3_endpoint": "127.0.0.1:9900",
        "s3_access_key_id": "minioadmin",
        "s3_secret_access_key": "minioadmin",
        "s3_url_style": "path",
        "s3_use_ssl": 0,
    }
}
engine = create_engine("duckdb:///", connect_args=connect_args)

and I get this error:

duckdb.InvalidInputException: Invalid Input Error: Unrecognized configuration property "s3_endpoint"


Mause commented Aug 22, 2022

@alitrack for now, I've added experimental support to duckdb-engine (as of 0.6.0) for the preloading of extensions via engine parameters - https://github.com/Mause/duckdb_engine#preloading-extensions-experimental

@alitrack (Contributor)

@Mause
It does not work, even in plain Python code:

from sqlalchemy import create_engine

connect_args = {
    "preload_extensions": ["httpfs"],
    "config": {
        "s3_endpoint": "127.0.0.1:9900",
        "s3_access_key_id": "minioadmin",
        "s3_secret_access_key": "minioadmin",
        "s3_url_style": "path",
        "s3_use_ssl": False,
    },
}

engine = create_engine("duckdb:///", connect_args=connect_args)

import pandas as pd

def test_s3():
    df = pd.read_sql("""

    select count(*) from 's3://ontime/*.parquet'

    """, engine)
    print(df)
try:
    test_s3()
except Exception as e:
    print(e)
    engine.execute("""
    SET s3_endpoint='127.0.0.1:9900';
    SET s3_access_key_id='minioadmin';
    SET s3_secret_access_key='minioadmin';
    SET s3_url_style = 'path';
    SET s3_use_ssl=0;
    """)

test_s3()

By the way, you can try the MinIO Play account (it is public):

SET s3_endpoint='play.min.io:9000';
SET s3_access_key_id='Q3AM3UQ867SPQQA43P2F';
SET s3_secret_access_key='zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG';
SET s3_url_style = 'path';
SET s3_use_ssl=true;
SET s3_region = 'us-east-1';

select * from 's3://sales5m/sales_5m_0.parquet' limit 3;


Mause commented Aug 23, 2022

@alitrack what exception are you getting?

@alitrack (Contributor)

It works now; connect_args should be:

connect_args={ 
        "preload_extensions": ["httpfs"],
        "config": {
          "s3_endpoint":"127.0.0.1:9900",
          "s3_access_key_id":"minioadmin",
          "s3_secret_access_key":"minioadmin",
          "s3_url_style":"path",
          "s3_use_ssl":"False"
  }
}

and in Superset it should be:

{
    "connect_args": {
        "preload_extensions": [
            "httpfs"
        ],
        "config": {
            "s3_endpoint": "127.0.0.1:9900",
            "s3_access_key_id": "minioadmin",
            "s3_secret_access_key": "minioadmin",
            "s3_url_style": "path",
            "s3_use_ssl": "False"
        }
    }
}


Mause commented Aug 23, 2022

(quotes the connect_args from the previous comment)

Yeah, I realized the boolean bug after you posted; I'll push a patch release to fix that shortly.

@alitrack (Contributor)

Attention: the Superset config does not support booleans; the value must be a string.
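As an illustration of this pitfall, here is a small stdlib-only sketch (the helper name is mine, not part of Superset or duckdb-engine) that coerces boolean values to strings before the JSON is pasted into the ENGINE PARAMETERS field:

```python
import json

def stringify_booleans(obj):
    """Recursively replace Python booleans with their string form,
    since (per this thread) the Superset config rejects booleans."""
    if isinstance(obj, bool):  # check bool first: bool is a subclass of int
        return str(obj)        # True -> "True", False -> "False"
    if isinstance(obj, dict):
        return {k: stringify_booleans(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [stringify_booleans(v) for v in obj]
    return obj

connect_args = {
    "preload_extensions": ["httpfs"],
    "config": {"s3_url_style": "path", "s3_use_ssl": False},
}
print(json.dumps({"connect_args": stringify_booleans(connect_args)}, indent=4))
```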


Mause commented Aug 23, 2022

Attention: the Superset config does not support booleans; the value must be a string.

If it's JSON, it probably just needs to be false instead of False

@alitrack (Contributor)

Attention: the Superset config does not support booleans; the value must be a string.

If it's JSON, it probably just needs to be false instead of False

You got it.
With the latest version of duckdb-engine, a boolean false works too.


Mause commented Aug 24, 2022

I believe this issue is resolved then?


sapcode commented Dec 8, 2022

Dear @Mause,

we have installed Superset with the current Helm chart and also installed the DuckDB engine; it now shows up in the Database Connections dropdown. The Superset version in the right-hand menu is shown as "Version: 0.0.0-dev".

When we create a new DuckDB database without engine parameters but with "Allow DML" and open SQL Lab, we are able to use the connection settings by executing them as DML statements, and the data is delivered correctly:

install 'httpfs';
load 'httpfs';
SET s3_endpoint='argo-artifacts:9000';
SET s3_access_key_id='XXX';
SET s3_secret_access_key='XXX';
SET s3_url_style = 'path';
SET s3_use_ssl=false;
select count(*) from 's3://raw/*.parquet';

However, we don't want to provide the credentials in SQL Lab but would rather store them in the DuckDB connector's Advanced ENGINE PARAMETERS dialog. We have tested the following parameter JSON strings, but none of them seems to work:

Test1:
{ "engine_params": { "connect_args":{ "preload_extensions": ["httpfs"], "config": { "s3_endpoint": "argo-artifacts:9000", "s3_access_key_id": "XXX", "s3_secret_access_key": "XXX", "s3_url_style": "path", "s3_use_ssl": "False" } } } }

Test2:
{ "connect_args":{ "preload_extensions": ["httpfs"], "config": { "s3_endpoint": "argo-artifacts:9000", "s3_access_key_id": "XXX", "s3_secret_access_key": "XXX", "s3_url_style": "path", "s3_use_ssl": "False" } } }

We always get the same error:
An error occurred while fetching databases: "Connection failed, please check your connection settings"
In chrome debug we see:
PUT http://superset.mini.kube/api/v1/database/1 422 (UNPROCESSABLE ENTITY)
Response {type: 'basic', url: 'http://superset.mini.kube/api/v1/database/1'

When we remove the "engine_params" key, we also get this error:
File "/app/superset/sqllab/sql_json_executer.py", line 111, in execute superset raise SupersetErrorsException( superset superset.exceptions.SupersetErrorsException: [SupersetError(message="'str' object does not support item assignment", error_type=<SupersetErrorType.GENERIC_DB

Looking at core.py lines 394-405, it looks like "engine_params" is read first, then "connect_args":
https://github.com/apache/superset/blob/master/superset/models/core.py

    params = extra.get("engine_params", {})
    if nullpool:
        params["poolclass"] = NullPool

    connect_args = params.get("connect_args", {})
    if self.impersonate_user:
        self.db_engine_spec.update_impersonation_config(
            connect_args, str(sqlalchemy_url), effective_username
        )

    if connect_args:
        params["connect_args"] = connect_args
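To make the expected nesting concrete, here is a stdlib-only re-enactment of the quoted lookup (the sample dict is illustrative): if the JSON lacks the engine_params wrapper at the level this code reads from, connect_args comes back empty.

```python
# "extra" stands for the parsed extra/engine-parameters JSON of the database.
extra = {
    "engine_params": {
        "connect_args": {
            "preload_extensions": ["httpfs"],
            "config": {"s3_url_style": "path"},
        }
    }
}

# The two lookups from the quoted core.py snippet:
params = extra.get("engine_params", {})
connect_args = params.get("connect_args", {})

print(bool(connect_args))  # True only when the nesting matches
```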

In the connector's Advanced ENGINE PARAMETERS dialog, what is the specific connection string that needs to be passed?
Are there any restrictions on the use of double (") or single (') quotes in the JSON string?

Thank you very much!


Mause commented Dec 8, 2022

@sapcode if you have issues with duckdb-engine specifically, please raise them in the duckdb-engine repo. I cannot help with general superset issues, but would recommend that you create a new issue instead of piggy-backing on this one

@apache / @robdiciuccio I'd appreciate it if you closed and locked this issue

@rajivpatki

appends a method at the bottom of db_engine_specs/duckdb.py

This works well, but someone with SQL Lab access can read plaintext credentials with SELECT * FROM duckdb_settings();

If you can pass configuration to duckdb using superset (https://github.com/Mause/duckdb_engine/#configuration) I would recommend doing that instead

Using @Mause's suggestion: any admin with rights to edit the database connection can read the secrets in plain text.

Suggestion: override execute() for DuckDBEngineSpec in this manner:

    @classmethod
    def execute(cls, cursor: Any, query: str, **kwargs: Any) -> None:
        sql = """
        INSTALL aws;
        LOAD aws;
        INSTALL httpfs;
        LOAD httpfs;
        CALL load_aws_credentials();
        """

        if 's3://' in query and 'duckdb_settings()' not in query:
            cursor.execute(sql)
        return super().execute(cursor, query, **kwargs)

Note that CALL load_aws_credentials(); from DuckDB's aws extension has a significant impact on query performance. The above approach allows us to:

  • Bypass loading credentials and extensions when not required
  • Not have to specify credentials in plain text and instead use AWS Default Credentials Provider Chain
  • Plug a security hole where credentials can be displayed in plain text


include commented Feb 28, 2024

Hi all,

I'm almost there too. I'm facing another problem, but I've successfully connected to S3 via DuckDB to retrieve parquet files using the following setup.

Go to Advanced / Security in the (DuckDB) database connection settings and just drop in the connect_args, without the s3_endpoint.

{
    "connect_args": {
        "preload_extensions": ["httpfs"],
        "config": { 
            "s3_region": "eu-west-1",
            "s3_access_key_id": "FOO", 
            "s3_secret_access_key": "BAR"
        }
    }
}

Your query should be something like:

SELECT * FROM read_parquet('s3://bucket/reports/data/*');

My problem is that when I try to create a new dataset from this database, I get a weird Schema / Table list which does not represent anything (blank) from the parquet.

I've tried to create a temporary table, but without effect either, and I know it is reading the file correctly, because SELECT and DESCRIBE work just fine and I see the contents without issues. Perhaps there might be something in building the dataset that cannot find a "physical" table?

Would love to get your tips!

Cheers,
F

@rajivpatki

@include

your query should be something like:

SELECT * FROM read_parquet('s3://bucket/reports/data/*');

My problem is that when I try to create a new dataset from this database, I get a weird Schema / Table list which does not represet anything (blank) from the parquet.

I think you have not created a physical table in DuckDB. You're only running a query that reads an S3 parquet dataset. Create a table: CREATE OR REPLACE TABLE main.table_name AS SELECT * FROM read_parquet('s3://...'), or define the dataset as a view whose query is SELECT * FROM read_parquet('s3://...').

Temporary tables do not survive beyond the life of the connection/cursor: https://duckdb.org/docs/sql/statements/create_table.html#temporary-tables
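Spelled out (bucket paths are placeholders), the two options might look like this; the table is materialized into the DuckDB file, while the view re-reads S3 on every query:

```sql
-- Option 1: materialize a physical table (persists in the .duckdb file)
CREATE OR REPLACE TABLE main.table_name AS
SELECT * FROM read_parquet('s3://bucket/reports/data/*');

-- Option 2: a view that re-reads the parquet files from S3 on each query
CREATE OR REPLACE VIEW main.view_name AS
SELECT * FROM read_parquet('s3://bucket/reports/data/*');
```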


rusackas commented Jul 9, 2024

If I'm not mistaken, it sounds like all of the issues reported herein have been addressed, and this is safe to close. Thank you to all who participated in getting to the bottom of things! Let me know if this needs reopening for any reason, and we're happy to do so.

@rusackas closed this as completed on Jul 9, 2024