[sqllab] Tableschemaview fails to load with presto using parquet format files #25636

Closed
3 tasks done
tullis opened this issue Oct 13, 2023 · 3 comments · Fixed by #26782 or #28653

Comments

@tullis

tullis commented Oct 13, 2023

Summary

On Superset version 3.1.0 the table schema previews fail to load on sqllab/ and dataset/add/ paths when using presto datasources and parquet format files.

The previews worked in Superset version 1.5.3, but have failed in every version tested from 2.0.1 up to and including 3.1.0.

Error condition on Superset version 3.1.0 with presto table using format=PARQUET

Conditions

  • This only occurs for Presto tables that use format = 'PARQUET'
  • Other database types such as Druid and MySQL do not exhibit this behaviour
  • This occurs when logged in as an Admin user, so I believe it is unlikely to be related to Missing TableSchemaView permissions for the sql_lab role #25451
  • When the presto table is using format = 'TEXTFILE' this error does not occur

How to reproduce the bug

  1. Go to Superset version 3.1.0 as an Admin user and navigate to either of:
    a) /sqllab
    b) /dataset/add
  2. Select a database that uses the presto connector type
  3. Select any schema
  4. Select any table that uses a format = 'PARQUET'

Expected results

I would expect the left-hand column to be populated with the column names from the selected table.

Actual results

Several error messages appear stating that there were errors fetching table metadata and the left-hand column is not populated.

Error messages in the server log

There are no relevant error messages in the server log.

We can see the pyhive presto command going through:

INFO:pyhive.presto:SHOW COLUMNS FROM "wmf"."aqs_hourly"
INFO:pyhive.presto:SHOW COLUMNS FROM "wmf"."aqs_hourly"
INFO:pyhive.presto:SHOW COLUMNS FROM "wmf"."aqs_hourly"
INFO:pyhive.presto:SELECT * FROM wmf."aqs_hourly$partitions"
ORDER BY year DESC, month DESC, day DESC, hour DESC
LIMIT 1
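
To check whether Presto itself behaves differently for PARQUET and TEXTFILE tables, the statements in the log above can be rebuilt and run by hand. The following is a minimal sketch only: the helper names are hypothetical (they are not Superset functions), and the partition columns year, month, day, hour are assumed from the ORDER BY clause in the log.

```python
from typing import List


def show_columns_query(schema: str, table: str) -> str:
    """Rebuild the SHOW COLUMNS statement seen in the pyhive log."""
    return f'SHOW COLUMNS FROM "{schema}"."{table}"'


def latest_partition_query(schema: str, table: str, partition_cols: List[str]) -> str:
    """Rebuild the latest-partition lookup seen in the pyhive log.

    partition_cols is an assumption: the log only shows the columns
    for wmf.aqs_hourly (year, month, day, hour).
    """
    order_by = ", ".join(f"{col} DESC" for col in partition_cols)
    return (
        f'SELECT * FROM {schema}."{table}$partitions"\n'
        f"ORDER BY {order_by}\n"
        "LIMIT 1"
    )


if __name__ == "__main__":
    print(show_columns_query("wmf", "aqs_hourly"))
    print(latest_partition_query("wmf", "aqs_hourly", ["year", "month", "day", "hour"]))
```

Each string can then be passed to a pyhive.presto cursor's execute() for one PARQUET table and one TEXTFILE table, which should show whether the difference originates in Presto or in Superset's handling of the result.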

We have a lot of DEBUG level messages from requests_kerberos.kerberos_ and urllib3.connectionpool and spnego._gss while the request is authenticated and processed, but these appear to show a 401 followed by a successful 200 response.
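
A 401 followed by a 200 is what an ordinary HTTP Negotiate (Kerberos/SPNEGO) exchange looks like: the first request is challenged and the retry carrying the token succeeds, so this pattern on its own does not indicate a failure. As a rough sketch, assuming the status codes have been pulled out of the DEBUG log into a list by hand, a check like the one below can confirm that no request ended in an unanswered challenge or an error. The function name is hypothetical.

```python
from typing import Sequence


def looks_like_successful_negotiate(statuses: Sequence[int]) -> bool:
    """Return True if every 401 challenge is immediately followed by a 2xx
    response and no other non-2xx status appears in the sequence."""
    if not statuses:
        return False
    for i, status in enumerate(statuses):
        if status == 401:
            # A challenge must be answered by a successful retry.
            has_next = i + 1 < len(statuses)
            if not has_next or not 200 <= statuses[i + 1] < 300:
                return False
        elif not 200 <= status < 300:
            # Any other error status means the exchange did not succeed.
            return False
    return True


if __name__ == "__main__":
    # Status codes as described in the report: a 401 followed by a 200.
    print(looks_like_successful_negotiate([401, 200]))
```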

There are no stack traces shown.

Additional Screenshots

Error condition on Superset version 2.1.1 with presto table using format=PARQUET

No error on Superset version 1.5.3 for the same table

No error on Superset version 2.1.1 with presto table using format=TEXTFILE

Environment

  • browser type and version: Firefox 118.0.1 (64-bit) on Linux, but this affects other browsers.
  • superset version: 3.1.0
  • python version: 3.9.2
  • node.js version: 16
  • any feature flags active:
    • ENABLE_TEMPLATE_PROCESSING
    • DASHBOARD_NATIVE_FILTERS
    • ENABLE_FILTER_BOX_MIGRATION
  • metadata caching, memcached

Metadata database: MariaDB 10.4
Presto version 0.283

Optional components:

pyhive[kerberos,presto]==0.7.0
gunicorn[gevent]
apache-superset[hive,presto,mysql,druid,trino,spark,postgres]
pylibmc==1.6.1

Checklist

Make sure to follow these steps before submitting your issue - thank you!

  • I have checked the superset logs for python stacktraces and included it here as text if there are any.
  • I have reproduced the issue with at least the latest released version of superset.
  • I have checked the issue tracker for the same issue and I haven't found one similar.

Additional context

@tullis
Author

tullis commented Jan 23, 2024

I have updated this issue based on my testing against Superset version 3.1.0

The issue is still present in this version and is preventing us from upgrading our production instance from 1.5.3 to 3.1.0.

wmfgerrit pushed a commit to wikimedia/analytics-superset-deploy that referenced this issue Jan 24, 2024
We update the build process so as to build from the upstream superset
source, rather than just installing packages from pypi.

Add the superset[trino,hive,postgresql,spark] extras in order to
generate a more complete set of dependencies.

In addition to this, we have temporarily applied the patch
to fix this issue: apache/superset#25636

Bug: T335356
Change-Id: I9a47ae07a106a1481fdffe03f670ced4c46b262d
@rusackas
Member

Assuming this is also present in 4.0 then, but that might be worth confirming.

I also wonder whether this has been addressed by @betodealmeida's catalog work and/or @dpgaspar's parquet refactoring, both of which were only recently merged to master (i.e. not yet released).

@john-bodley
Member

Re-opening because the fix was reverted in #28613.
