-
Notifications
You must be signed in to change notification settings - Fork 14.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: enable metadata sync for virtual tables #10645
feat: enable metadata sync for virtual tables #10645
Conversation
af7e124
to
4f1d41f
Compare
031df5d
to
0483a1d
Compare
Codecov Report
@@ Coverage Diff @@
## master #10645 +/- ##
==========================================
- Coverage 64.32% 64.18% -0.14%
==========================================
Files 784 784
Lines 36952 36955 +3
Branches 3529 3524 -5
==========================================
- Hits 23769 23721 -48
- Misses 13074 13126 +52
+ Partials 109 108 -1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
// Handle carefully when the schema is empty | ||
const endpoint = | ||
`/datasource/external_metadata/${ | ||
datasource.type || datasource.datasource_type | ||
}/${datasource.id}/` + | ||
`?db_id=${datasource.database.id}` + | ||
`&schema=${datasource.schema || ''}` + | ||
`&table_name=${datasource.datasource_name || datasource.table_name}`; | ||
const endpoint = `/datasource/external_metadata/${ | ||
datasource.type || datasource.datasource_type | ||
}/${datasource.id}/`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the simplification of the code in the endpoint, the slightly hackish query params (db_id
, schema
and table_name
) are now redundant.
else: | ||
db_dialect = self.database.get_dialect() | ||
cols = self.database.get_columns( | ||
self.table_name, schema=self.schema or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to add checking for empty strings here (schema=self.schema or None
) to get CI to pass, as I wasn't able to track down why CI was always creating empty schema names instead of None
on some of the examples databases. I later noticed that similar logic is being applied elsewhere (e.g. here and here), so I figured it's ok to solve the symptoms here with this simple workaround instead of ensuring undefined schemas are always None
.
try: | ||
datatype = db_engine_spec.column_datatype_to_string( | ||
col.type, db_dialect | ||
) | ||
except Exception as ex: # pylint: disable=broad-except | ||
datatype = "UNKNOWN" | ||
logger.error("Unrecognized data type in %s.%s", new_table, col.name) | ||
logger.exception(ex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This compilation step was moved to external_metadata()
which is called earlier in the method, hence making this step redundant.
elif datasource_type == "table": | ||
database = ( | ||
db.session.query(Database).filter_by(id=request.args.get("db_id")).one() | ||
) | ||
table_class = ConnectorRegistry.sources["table"] | ||
datasource = table_class( | ||
database=database, | ||
table_name=request.args.get("table_name"), | ||
schema=request.args.get("schema") or None, | ||
) | ||
else: | ||
raise Exception(f"Unsupported datasource_type: {datasource_type}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why the table was created like this instead of just using the available functionality that populates all fields like sql
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmmmh, oh it could b that we use this for tables that don't have associated datasets yet, for example in the SQL Lab code base where we show the schema of a table on the left panel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just checked and it looks like we call /api/v1/database/1/table/
and /superset/extra_table_metadata/
from SQL Lab, and i think this endpoint used to be called in place of /api/v1/database/1/table/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: looked a bit deeper and it turns out that /api/v1/database/1/table/
used for SQL Lab does much more, like getting index/comments and more. Eventually we could reuse that endpoint here and surface more of that metadata in this context as its somewhat relevant here too, but for now I think your approach is a better path forward.
superset/connectors/sqla/models.py
Outdated
parsed_query = ParsedQuery(self.sql) | ||
if not parsed_query.is_readonly(): | ||
raise SupersetSecurityException( | ||
SupersetError( | ||
error_type=SupersetErrorType.DATASOURCE_SECURITY_ACCESS_ERROR, | ||
message=_("Only `SELECT` statements are allowed"), | ||
level=ErrorLevel.ERROR, | ||
) | ||
) | ||
statements = parsed_query.get_statements() | ||
if len(statements) > 1: | ||
raise SupersetSecurityException( | ||
SupersetError( | ||
error_type=SupersetErrorType.DATASOURCE_SECURITY_ACCESS_ERROR, | ||
message=_("Only single queries supported"), | ||
level=ErrorLevel.ERROR, | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this checking isn't being done when rendering the query in get_sqla_query()
, I doubt anyone should be attempting DML or multiple queries here. It's possible that someone might be executing stored procedures or similar here along with a select on an engine that supports it, but we can deal with that later when the use case comes to light.
superset/db_engine_specs/base.py
Outdated
@@ -929,23 +929,25 @@ def _truncate_label(cls, label: str) -> str: | |||
|
|||
@classmethod | |||
def column_datatype_to_string( | |||
cls, sqla_column_type: TypeEngine, dialect: Dialect | |||
cls, column_type: Union[TypeEngine, str], dialect: Dialect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkyryliuk it appears the the get_columns()
method in the Presto spec is sometimes returning types as strings and not native Sql Alchemy type objects, which was causing my new tests to fail (see comment below). Which made me wonder how we hadn't bumped into this problem before, as this method should be called every time we add a new table.
schema="main", | ||
table_name="dummy_sql_table", | ||
database=get_example_database(), | ||
sql="select 123 as intcol, 'abc' as strcol", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkyryliuk this query was raising an exception, as either the type for intcol
or strcol
was returned as OTHER
, indicating that the type wasn't found in models/sql_types/presto_sql_types.py:type_map
. See https://github.com/apache/incubator-superset/blob/878f06d1339bb32f74a70e3c6c5d338c86a6f5c6/superset/db_engine_specs/presto.py#L337-L358
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d532fdb
to
0b5ecd5
Compare
) | ||
) | ||
with closing(engine.raw_connection()) as conn: | ||
with closing(conn.cursor()) as cursor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to have a single code path that does this. Is there a way we can refactor/share code with the sqllab modules here?
https://github.com/apache/incubator-superset/blob/master/superset/sql_lab.py#L337
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also wondering if this should run on an async worker when possible, but that makes more complex here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular case the work is very much synchronous, but I agree that the single code path is desirable (this solution was a compromise for quick delivery as I feel sql_lab.py
and result_set.py
are in need of more comprehensive refactoring outside the scope of this PR). I have a proposal in mind that should be a small step in the right direction without having to derail this PR too much. Will update this PR shortly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mistercrunch I looked into joining these code paths and making it possible to make it async, but came to the conclusion that that refactoring is best done once we start working on the async query framework. I added a todo with my name next to it stating that the metadata fetching should be merged with the SQL Lab code, and will be happy to do that once we have the necessary structures in place.
dda5494
to
a8749f0
Compare
superset/connectors/sqla/models.py
Outdated
col["type"] = "UNKNOWN" | ||
db_engine_spec = self.database.db_engine_spec | ||
if self.sql: | ||
engine = self.database.get_sqla_engine() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
engine = self.database.get_sqla_engine(schema=self.schema)
I wonder if there are missing parameters ?
superset/connectors/sqla/models.py
Outdated
) | ||
with closing(engine.raw_connection()) as conn: | ||
with closing(conn.cursor()) as cursor: | ||
query = statements[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
query = self.database.apply_limit_to_sql(query, limit)
Respect
I think it’s better to add a restriction, because here only a few data needs to be queried.
a8749f0
to
9948233
Compare
Thanks @WenQiangW for the review comments! |
Codecov Report
@@ Coverage Diff @@
## master #10645 +/- ##
==========================================
- Coverage 65.78% 61.60% -4.18%
==========================================
Files 838 838
Lines 39841 39843 +2
Branches 3655 3650 -5
==========================================
- Hits 26208 24544 -1664
- Misses 13532 15119 +1587
- Partials 101 180 +79
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
This has been rebased + comments addressed + added support for templating. If there are no further bugs here, I propose merging this as-is. |
Ping @robdiciuccio . IMO the refactoring proposed here is best taken care of when we start working on the async query framework, during which I assume we'll end up refactoring many parts of the SQL Lab codebase relevant to this functionality. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
everything looks good to me, only caveat is that we don't use virtual tables in dropbox - won't be able to test it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm supportive of merging this to fix the "dead end" issue that currently exists when following the explore flow (can't add columns to your query).
Thanks @mistercrunch and @bkyryliuk for your help pushing this across the finish line! 🏁 |
* feat: enable metadata sync for virtual tables * add migration and check for empty schema name * simplify request * truncate trailing column attributes for MySQL * add unit test * use db_engine_spec func to truncate collation and charset * Remove redundant migration * add more tests * address review comments and apply templating to query * add todo for refactoring * remove schema from tests * check column datatype
SUMMARY
This PR adds column metadata syncing support for SQL-based virtual tables (both legacy and React CRUD). In addition, the query is checked for DML and multiple statements which raise an exception. This PR is blocked by #10658 which fixes a bug that this PR exposes.
SCREENSHOTS
Syncing column metadata for virtual table:
Trying to execute a
DELETE FROM
query:Trying to execute multiple
SELECT
s:On legacy CRUD view, refreshing does the same:
TEST PLAN
ADDITIONAL INFORMATION