Unable to read S3 parquet files using DuckDB as Database Connector #20708
Update: by patching the `connect` method, something like this:

```python
def connect(self, *args: Any, **kwargs: Any) -> ConnectionWrapper:
    cursor = duckdb.connect(*args, **kwargs)
    cursor.execute("INSTALL httpfs;")
    cursor.execute("LOAD httpfs;")
    cursor.execute("SET s3_region='******'")
    cursor.execute("SET s3_access_key_id='**************'")
    cursor.execute("SET s3_secret_access_key='*****************************'")
    return ConnectionWrapper(cursor)
```

Without this patch, the DuckDB session is not able to recognize the S3 settings. |
@Mause Could you please have a look at this issue? |
@Mageswaran1989 unfortunately I'm not a superset developer, and wouldn't know where to start investigating this |
I tested with minio (`/usr/local/opt/minio/bin/minio server --config-dir=/usr/local/etc/minio --address=:9900 /usr/local/var/minio`) and it works:

```sql
install 'httpfs';
load 'httpfs';
SET s3_endpoint='127.0.0.1:9900';
SET s3_access_key_id='minioadmin';
SET s3_secret_access_key='minioadmin';
SET s3_url_style = 'path';
SET s3_use_ssl=false;
select count(*) from 's3://ontime/*.parquet';
```

you need to check |
@alitrack Thanks for the response. I am able to read data from S3 as you mentioned, but for each query I had to supply the S3 credentials. |
There is a dirty workaround: append a method at the bottom of db_engine_specs/duckdb.py. Here is an example:

```python
@classmethod
def execute(cls, cursor: Any, query: str, **kwargs: Any) -> None:
    sql = """
    install 'httpfs';
    load 'httpfs';
    SET s3_endpoint='127.0.0.1:9900';
    SET s3_access_key_id='minioadmin';
    SET s3_secret_access_key='minioadmin';
    SET s3_url_style = 'path';
    SET s3_use_ssl=false;
    """
    cursor.execute(sql)
    return super().execute(cursor, query, **kwargs)
```

It overrides the execute method of BaseEngineSpec. |
If you can pass configuration to duckdb using superset (https://github.com/Mause/duckdb_engine/#configuration) I would recommend doing that instead |
Superset supports ENGINE PARAMETERS, but DuckDB needs the httpfs extension installed and loaded first:

```python
from sqlalchemy import create_engine

connect_args = {
    'config': {
        's3_endpoint': '127.0.0.1:9900',
        's3_access_key_id': 'minioadmin',
        's3_secret_access_key': 'minioadmin',
        's3_url_style': 'path',
        's3_use_ssl': 0,
    }
}
engine = create_engine("duckdb:///", connect_args=connect_args)
```

and I get the issue, |
@alitrack for now, I've added experimental support to |
@Mause

```python
from sqlalchemy import create_engine
import pandas as pd

connect_args = {
    "preload_extensions": ["httpfs"],
    "config": {
        "s3_endpoint": "127.0.0.1:9900",
        "s3_access_key_id": "minioadmin",
        "s3_secret_access_key": "minioadmin",
        "s3_url_style": "path",
        "s3_use_ssl": False,
    }
}
engine = create_engine("duckdb:///", connect_args=connect_args)

def test_s3():
    df = pd.read_sql("""
        select count(*) from 's3://ontime/*.parquet'
    """, engine)
    print(df)

try:
    test_s3()
except Exception as e:
    print(e)
    engine.execute("""
        SET s3_endpoint='127.0.0.1:9900';
        SET s3_access_key_id='minioadmin';
        SET s3_secret_access_key='minioadmin';
        SET s3_url_style = 'path';
        SET s3_use_ssl=0;
    """)
    test_s3()
```

btw, you can try the Minio Play account (it is public):

```sql
SET s3_endpoint='play.min.io:9000';
SET s3_access_key_id='Q3AM3UQ867SPQQA43P2F';
SET s3_secret_access_key='zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG';
SET s3_url_style = 'path';
SET s3_use_ssl=true;
SET s3_region = 'us-east-1';
select * from 's3://sales5m/sales_5m_0.parquet' limit 3;
```
|
@alitrack what exception are you getting? |
works now:

```python
connect_args = {
    "preload_extensions": ["httpfs"],
    "config": {
        "s3_endpoint": "127.0.0.1:9900",
        "s3_access_key_id": "minioadmin",
        "s3_secret_access_key": "minioadmin",
        "s3_url_style": "path",
        "s3_use_ssl": "False",
    }
}
```

and in superset, it should be:

```json
{
    "connect_args": {
        "preload_extensions": ["httpfs"],
        "config": {
            "s3_endpoint": "127.0.0.1:9900",
            "s3_access_key_id": "minioadmin",
            "s3_secret_access_key": "minioadmin",
            "s3_url_style": "path",
            "s3_use_ssl": "False"
        }
    }
}
```
|
Yeah I realized the boolean bug after you posted, I'll push a patch release to fix that shortly |
attention, |
If it's JSON, it probably just needs to be |
you got, |
I believe this issue is resolved then? |
Dear @Mause, we have installed Superset with the current helm chart and also installed the DuckDB engine; it now shows in the Database Connections dropdown. The Superset version in the right menu is shown as "Version: 0.0.0-dev". When we create a new DuckDB database without engine parameters but with "Allow DML" and open SQL Lab, we are able to apply the connection settings by executing them as DML statements, and the data is delivered correctly: install 'httpfs'; However, we don't want to provide the credentials in SQL Lab but would like to store them in the DuckDB connector's Advanced ENGINE PARAMETERS dialog. We have tested the following parameter JSON strings, but none of them seems to work: Test1: Test2: We always get the same error: When we remove the tag "engine_params" we also get the error: Looking at core.py lines 394-405, it seems "engine_params" is searched first, then "connect_args".
In the Advanced ENGINE PARAMETERS dialog, what is the specific connection string that needs to be passed? Thank you very much! |
@sapcode if you have issues with duckdb-engine specifically, please raise them in the duckdb-engine repo. I cannot help with general superset issues, but would recommend that you create a new issue instead of piggy-backing on this one @apache / @robdiciuccio I'd appreciate it if you closed and locked this issue |
This works well, but someone with SQL Lab access can read plaintext credentials with
Using @Mause's suggestion: any admin with rights to edit the database connection can read the secrets in plain text. Suggestion: override
Note that
|
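The override suggestion above is cut off, but one plausible shape for it, sketched here with made-up environment variable names and not Superset's actual API, is to build the session-setup SQL from the deployment environment so credentials never sit in the connection metadata that admins can read:

```python
import os

def s3_setup_sql() -> str:
    """Build DuckDB session-setup statements from environment variables,
    so secrets live in the deployment environment rather than in the
    database connection. The variable names are hypothetical; the
    fallback values are the Minio placeholders used in this thread."""
    endpoint = os.environ.get("DUCKDB_S3_ENDPOINT", "127.0.0.1:9900")
    key_id = os.environ.get("DUCKDB_S3_KEY_ID", "minioadmin")
    secret = os.environ.get("DUCKDB_S3_SECRET", "minioadmin")
    return (
        "INSTALL 'httpfs'; LOAD 'httpfs'; "
        f"SET s3_endpoint='{endpoint}'; "
        f"SET s3_access_key_id='{key_id}'; "
        f"SET s3_secret_access_key='{secret}';"
    )

print(s3_setup_sql())
```

An engine-spec override like the `execute` patch shown earlier in this thread could run this SQL on the cursor before each query.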
Hi all, I'm almost there too. I am facing another problem, but I've successfully connected to S3 via DuckDB to retrieve parquet files using the following setup. Go to Advanced Database connections / (DuckDB) / settings / security and just drop the |
your query should be something like: |
My problem is that when I try to create a new dataset from this database, I get a weird Schema / Table list which does not represent anything (blank) from the parquet. I've tried to create a temporary table, but without effect too, and I know it is reading the file correctly because the | Would love to get your tips! Cheers, |
I think you have not created a physical table on DuckDB; you're only running a query that reads an S3 parquet dataset. Create a table: Temporary tables do not survive beyond the life of the connection/cursor: https://duckdb.org/docs/sql/statements/create_table.html#temporary-tables |
If I'm not mistaken, it sounds like all of the issues reported herein have been addressed, and this is safe to close. Thank you to all who participated in getting to the bottom of things! Let me know if this needs reopening for any reason, and we're happy to do so. |
I was following the DuckDB setup as per this PR and was able to load the DB file and create charts.
As a next step I wanted to load S3 parquet files into Superset using DuckDB's in-memory option:
duckdb:///:memory:
Before trying with Superset, I used the below Python code to check DuckDB's S3 parquet loading and found it to be working:
When I tried to `SET` the S3 environment values in the SQL Editor, I was getting the below error on the UI:
Error:
Full Trace: