feat: enable metadata sync for virtual tables #10645

Merged
villebro merged 12 commits into apache:master from villebro/virtual-table-metadata
Oct 27, 2020

Member

@villebro villebro commented Aug 19, 2020

SUMMARY

This PR adds column metadata syncing support for SQL-based virtual tables (both legacy and React CRUD). In addition, the query is checked for DML and for multiple statements, either of which raises an exception. This PR is blocked by #10658, which fixes a bug that this PR exposes.

SCREENSHOTS

Syncing column metadata for virtual table:
sync2
Trying to execute a DELETE FROM query:
image
Trying to execute multiple SELECTs:
image
On legacy CRUD view, refreshing does the same:
image

TEST PLAN

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

@villebro villebro force-pushed the villebro/virtual-table-metadata branch 2 times, most recently from af7e124 to 4f1d41f Compare August 19, 2020 20:55
@pull-request-size pull-request-size bot added size/L and removed size/M labels Aug 19, 2020
@villebro villebro force-pushed the villebro/virtual-table-metadata branch 3 times, most recently from 031df5d to 0483a1d Compare August 21, 2020 07:04
@codecov-commenter

codecov-commenter commented Aug 21, 2020

Codecov Report

Merging #10645 into master will decrease coverage by 0.13%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10645      +/-   ##
==========================================
- Coverage   64.32%   64.18%   -0.14%     
==========================================
  Files         784      784              
  Lines       36952    36955       +3     
  Branches     3529     3524       -5     
==========================================
- Hits        23769    23721      -48     
- Misses      13074    13126      +52     
+ Partials      109      108       -1     
Flag Coverage Δ
#cypress 54.61% <0.00%> (+0.09%) ⬆️
#javascript 60.83% <100.00%> (+<0.01%) ⬆️
#python 59.56% <100.00%> (-0.23%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...erset-frontend/src/datasource/DatasourceEditor.jsx 73.01% <100.00%> (+0.33%) ⬆️
superset/connectors/sqla/models.py 90.28% <100.00%> (+0.55%) ⬆️
superset/views/datasource.py 94.82% <100.00%> (+1.38%) ⬆️
superset/db_engine_specs/presto.py 70.56% <0.00%> (-12.14%) ⬇️
superset/examples/world_bank.py 97.10% <0.00%> (-2.90%) ⬇️
superset/examples/birth_names.py 97.36% <0.00%> (-2.64%) ⬇️
superset/views/database/mixins.py 80.70% <0.00%> (-1.76%) ⬇️
superset/models/core.py 87.22% <0.00%> (-0.28%) ⬇️
...rontend/src/SqlLab/components/QueryAutoRefresh.jsx 72.72% <0.00%> (+6.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0f44d3e...104bda0. Read the comment docs.

Comment on lines -356 to +397
-    // Handle carefully when the schema is empty
-    const endpoint =
-      `/datasource/external_metadata/${
-        datasource.type || datasource.datasource_type
-      }/${datasource.id}/` +
-      `?db_id=${datasource.database.id}` +
-      `&schema=${datasource.schema || ''}` +
-      `&table_name=${datasource.datasource_name || datasource.table_name}`;
+    const endpoint = `/datasource/external_metadata/${
+      datasource.type || datasource.datasource_type
+    }/${datasource.id}/`;
Member Author

With the simplification of the code in the endpoint, the slightly hackish query params (db_id, schema and table_name) are now redundant.

else:
    db_dialect = self.database.get_dialect()
    cols = self.database.get_columns(
        self.table_name, schema=self.schema or None
Member Author

@villebro villebro Aug 21, 2020

I had to add a check for empty strings here (schema=self.schema or None) to get CI to pass, as I wasn't able to track down why CI was always creating empty schema names instead of None on some of the example databases. I later noticed that similar logic is being applied elsewhere (e.g. here and here), so I figured it's OK to treat the symptom here with this simple workaround instead of ensuring undefined schemas are always None.
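
The workaround described in this comment treats an empty schema string the same as a missing schema. A minimal sketch of that normalization follows; the helper name `normalize_schema` is hypothetical and does not exist in Superset, which applies the `or None` expression inline.

```python
from typing import Optional


def normalize_schema(schema: Optional[str]) -> Optional[str]:
    """Map an empty-string schema to None and pass real names through.

    `normalize_schema` is a hypothetical helper used only for illustration.
    """
    # An empty string is falsy, so `or` falls through to None.
    return schema or None


for candidate in ("", None, "main"):
    print(repr(candidate), "->", repr(normalize_schema(candidate)))
```

The trade-off noted in the comment still applies: this hides the empty strings at the call site rather than preventing them from being stored in the first place.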

Comment on lines -1319 to -1334
try:
    datatype = db_engine_spec.column_datatype_to_string(
        col.type, db_dialect
    )
except Exception as ex:  # pylint: disable=broad-except
    datatype = "UNKNOWN"
    logger.error("Unrecognized data type in %s.%s", new_table, col.name)
    logger.exception(ex)
Member Author

This compilation step was moved to external_metadata() which is called earlier in the method, hence making this step redundant.

Comment on lines -107 to -118
elif datasource_type == "table":
    database = (
        db.session.query(Database).filter_by(id=request.args.get("db_id")).one()
    )
    table_class = ConnectorRegistry.sources["table"]
    datasource = table_class(
        database=database,
        table_name=request.args.get("table_name"),
        schema=request.args.get("schema") or None,
    )
else:
    raise Exception(f"Unsupported datasource_type: {datasource_type}")
Member Author

I don't know why the table was created like this instead of just using the available functionality that populates all fields like sql.

Member

Mmmmh, oh, it could be that we use this for tables that don't have associated datasets yet, for example in the SQL Lab code base where we show the schema of a table on the left panel.

Member

I just checked, and it looks like we call /api/v1/database/1/table/ and /superset/extra_table_metadata/ from SQL Lab. I think this endpoint used to be called in place of /api/v1/database/1/table/.

Member

NOTE: I looked a bit deeper, and it turns out that the /api/v1/database/1/table/ endpoint used for SQL Lab does much more, such as fetching indexes and comments. Eventually we could reuse that endpoint here and surface more of that metadata in this context, as it's somewhat relevant here too, but for now I think your approach is a better path forward.

@villebro villebro changed the title [WIP] feat: enable metadata sync for virtual tables feat: enable metadata sync for virtual tables Aug 21, 2020
Comment on lines 643 to 659
parsed_query = ParsedQuery(self.sql)
if not parsed_query.is_readonly():
    raise SupersetSecurityException(
        SupersetError(
            error_type=SupersetErrorType.DATASOURCE_SECURITY_ACCESS_ERROR,
            message=_("Only `SELECT` statements are allowed"),
            level=ErrorLevel.ERROR,
        )
    )
statements = parsed_query.get_statements()
if len(statements) > 1:
    raise SupersetSecurityException(
        SupersetError(
            error_type=SupersetErrorType.DATASOURCE_SECURITY_ACCESS_ERROR,
            message=_("Only single queries supported"),
            level=ErrorLevel.ERROR,
        )
    )
Member Author

While this checking isn't being done when rendering the query in get_sqla_query(), I doubt anyone should be attempting DML or multiple queries here. It's possible that someone might be executing stored procedures or similar here along with a select on an engine that supports it, but we can deal with that later when the use case comes to light.
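
The two guards discussed here (reject anything that is not read-only, then reject more than one statement) can be illustrated with a standard-library-only sketch. It is a deliberate simplification and not Superset's `ParsedQuery`: the function names are hypothetical, the splitter only understands single-quoted string literals, and a query is treated as read-only when its leading keyword is `SELECT` or `WITH`, which would miss DML hidden inside a CTE on engines that allow it.

```python
from typing import List


def split_statements(sql: str) -> List[str]:
    """Split on semicolons that sit outside single-quoted string literals."""
    statements: List[str] = []
    current: List[str] = []
    in_string = False
    for char in sql:
        if char == "'":
            in_string = not in_string
        if char == ";" and not in_string:
            statements.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    statements.append("".join(current).strip())
    return [statement for statement in statements if statement]


def validate_virtual_table_sql(sql: str) -> str:
    """Return the single read-only statement, or raise ValueError."""
    statements = split_statements(sql)
    if not statements:
        raise ValueError("Empty query")
    for statement in statements:
        # Simplification: only the leading keyword is inspected.
        if statement.split(None, 1)[0].upper() not in {"SELECT", "WITH"}:
            raise ValueError("Only `SELECT` statements are allowed")
    if len(statements) > 1:
        raise ValueError("Only single queries supported")
    return statements[0]


print(validate_virtual_table_sql("select 123 as intcol, 'abc' as strcol;"))
```

The order of the checks mirrors the snippet under review: the read-only test runs first, so a mixed input such as a `DELETE` followed by a `SELECT` reports the DML problem rather than the statement count.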

@@ -929,23 +929,25 @@ def _truncate_label(cls, label: str) -> str:

     @classmethod
     def column_datatype_to_string(
-        cls, sqla_column_type: TypeEngine, dialect: Dialect
+        cls, column_type: Union[TypeEngine, str], dialect: Dialect
Member Author

@bkyryliuk it appears that the get_columns() method in the Presto spec sometimes returns types as strings rather than native SQLAlchemy type objects, which was causing my new tests to fail (see comment below). That made me wonder how we hadn't bumped into this problem before, as this method should be called every time we add a new table.
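
The widened signature means the function has to cope with two input shapes. A rough sketch of that branching is shown below, under stated assumptions: `FakeColumnType` is a hypothetical stand-in for a SQLAlchemy type object (only its `compile(dialect=...)` method is modelled), and upper-casing the result is an assumption made for the example, not something this thread confirms.

```python
from typing import Any, Union


class FakeColumnType:
    """Hypothetical stand-in for a SQLAlchemy type object."""

    def __init__(self, rendered: str) -> None:
        self.rendered = rendered

    def compile(self, dialect: Any = None) -> str:
        # A real type object would render itself for the given dialect.
        return self.rendered


def column_datatype_to_string(
    column_type: Union[FakeColumnType, str], dialect: Any = None
) -> str:
    """Return a type name whether the input is a string or a type object."""
    if isinstance(column_type, str):
        # Some drivers already hand back plain strings such as "varchar(3)".
        return column_type.upper()
    return column_type.compile(dialect=dialect).upper()


print(column_datatype_to_string("varchar(3)"))
print(column_datatype_to_string(FakeColumnType("integer")))
```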

schema="main",
table_name="dummy_sql_table",
database=get_example_database(),
sql="select 123 as intcol, 'abc' as strcol",
Member Author

@villebro villebro Aug 21, 2020

@bkyryliuk this query was raising an exception, as either the type for intcol or strcol was returned as OTHER, indicating that the type wasn't found in models/sql_types/presto_sql_types.py:type_map. See https://github.com/apache/incubator-superset/blob/878f06d1339bb32f74a70e3c6c5d338c86a6f5c6/superset/db_engine_specs/presto.py#L337-L358

Member

I see the same in our deployment:

image

Looks like pyhive / sqlalchemy is not happy with the varchar(3)

SHOW COLUMNS FROM bogdankyryliuk.bogdan_simple_test
--
intcol	integer		
strcol	varchar(3)		
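
To illustrate why a parameterized type such as varchar(3) can fall through to OTHER, here is a small sketch of a lookup that strips the parameter list before consulting a type map. Everything in it is hypothetical: `TYPE_MAP` is a tiny illustrative table rather than the contents of presto_sql_types.py, and stripping parameters is only one possible approach, not necessarily the fix adopted in this PR.

```python
import re

# Hypothetical, heavily truncated lookup table used only for illustration.
TYPE_MAP = {"integer": "INTEGER", "varchar": "VARCHAR", "double": "DOUBLE"}


def lookup_base_type(raw_type: str) -> str:
    """Look up a type by its base name, ignoring any trailing (...) parameters."""
    base = re.sub(r"\(.*\)$", "", raw_type.strip()).lower()
    return TYPE_MAP.get(base, "OTHER")


for raw in ("integer", "varchar(3)", "row(a integer)"):
    print(raw, "->", lookup_base_type(raw))
```

A lookup keyed on the full string "varchar(3)" would miss the "varchar" entry, which matches the symptom described in this thread.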

@villebro villebro force-pushed the villebro/virtual-table-metadata branch from d532fdb to 0b5ecd5 Compare August 21, 2020 11:22
    )
)
with closing(engine.raw_connection()) as conn:
    with closing(conn.cursor()) as cursor:
Member

@mistercrunch mistercrunch Aug 21, 2020

I'd like to have a single code path that does this. Is there a way we can refactor/share code with the sqllab modules here?
https://github.com/apache/incubator-superset/blob/master/superset/sql_lab.py#L337

Member

I'm also wondering if this should run on an async worker when possible, but that makes things more complex here.

Member Author

In this particular case the work is very much synchronous, but I agree that the single code path is desirable (this solution was a compromise for quick delivery as I feel sql_lab.py and result_set.py are in need of more comprehensive refactoring outside the scope of this PR). I have a proposal in mind that should be a small step in the right direction without having to derail this PR too much. Will update this PR shortly.

Member Author

@mistercrunch I looked into joining these code paths and making it possible to make it async, but came to the conclusion that that refactoring is best done once we start working on the async query framework. I added a todo with my name next to it stating that the metadata fetching should be merged with the SQL Lab code, and will be happy to do that once we have the necessary structures in place.
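
For readers unfamiliar with the synchronous pattern under discussion, the following sketch shows the same nested `closing(...)` structure using the standard-library `sqlite3` module as a stand-in for `engine.raw_connection()`. The function name `virtual_table_columns` is hypothetical, and inferring types from the first row is an assumption made because `sqlite3` only fills in column names in `cursor.description`; it is not how Superset derives types.

```python
import sqlite3
from contextlib import closing
from typing import Dict, List


def virtual_table_columns(sql: str) -> List[Dict[str, str]]:
    """Run a query synchronously and describe the columns it returns."""
    with closing(sqlite3.connect(":memory:")) as conn:
        with closing(conn.cursor()) as cursor:
            cursor.execute(sql)
            # The first item of each description entry is the column name.
            names = [description[0] for description in cursor.description]
            first_row = cursor.fetchone()
    if first_row is None:
        return [{"name": name, "type": "UNKNOWN"} for name in names]
    return [
        {"name": name, "type": type(value).__name__}
        for name, value in zip(names, first_row)
    ]


print(virtual_table_columns("select 123 as intcol, 'abc' as strcol"))
```

`closing` guarantees that both the cursor and the connection are closed even if the query raises, which is the main reason for the nested context managers.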

@villebro villebro changed the title feat: enable metadata sync for virtual tables [WIP] feat: enable metadata sync for virtual tables Aug 22, 2020
@villebro villebro force-pushed the villebro/virtual-table-metadata branch 3 times, most recently from dda5494 to a8749f0 Compare August 25, 2020 05:05
col["type"] = "UNKNOWN"
db_engine_spec = self.database.db_engine_spec
if self.sql:
    engine = self.database.get_sqla_engine()

engine = self.database.get_sqla_engine(schema=self.schema)

I wonder whether a parameter is missing here?

)
with closing(engine.raw_connection()) as conn:
    with closing(conn.cursor()) as cursor:
        query = statements[0]

query = self.database.apply_limit_to_sql(query, limit)

Respect

I think it's better to add a limit here, because only a few rows need to be queried.
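
The point of the suggestion is that only enough rows to inspect the result columns are needed. A generic sketch of limiting a query by wrapping it in a subquery follows. It is not the engine-specific logic behind `apply_limit_to_sql`: the function name `apply_limit` and the alias `limited_subquery` are hypothetical, and `sqlite3` is used only to show the wrapped query running.

```python
import sqlite3
from contextlib import closing


def apply_limit(sql: str, limit: int) -> str:
    """Wrap a query in a subquery and cap the number of rows returned."""
    inner = sql.strip().rstrip(";")
    # int() guards against a non-numeric limit being interpolated.
    return f"SELECT * FROM ({inner}) AS limited_subquery LIMIT {int(limit)}"


limited = apply_limit("select 1 as a union all select 2 as a;", 1)
with closing(sqlite3.connect(":memory:")) as conn:
    rows = conn.execute(limited).fetchall()
print(limited)
print(len(rows))
```

Wrapping works on most engines without parsing the inner query, at the cost of relying on the optimizer to avoid computing the full result first.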

@villebro villebro force-pushed the villebro/virtual-table-metadata branch from a8749f0 to 9948233 Compare October 21, 2020 07:33
@villebro
Member Author

Thanks @WenQiangW for the review comments!

@villebro villebro changed the title [WIP] feat: enable metadata sync for virtual tables feat: enable metadata sync for virtual tables Oct 21, 2020
@codecov-io

codecov-io commented Oct 21, 2020

Codecov Report

Merging #10645 into master will decrease coverage by 4.17%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #10645      +/-   ##
==========================================
- Coverage   65.78%   61.60%   -4.18%     
==========================================
  Files         838      838              
  Lines       39841    39843       +2     
  Branches     3655     3650       -5     
==========================================
- Hits        26208    24544    -1664     
- Misses      13532    15119    +1587     
- Partials      101      180      +79     
Flag Coverage Δ
#cypress ?
#javascript 62.64% <100.00%> (+<0.01%) ⬆️
#python 60.97% <100.00%> (+0.05%) ⬆️

Impacted Files Coverage Δ
...erset-frontend/src/datasource/DatasourceEditor.jsx 63.72% <100.00%> (-7.65%) ⬇️
superset/connectors/sqla/models.py 90.45% <100.00%> (+0.70%) ⬆️
superset/views/datasource.py 94.82% <100.00%> (+1.38%) ⬆️
superset-frontend/src/SqlLab/App.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/explore/App.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/dashboard/App.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/explore/index.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/dashboard/index.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/setup/setupColors.js 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/chart/ChartContainer.jsx 0.00% <0.00%> (-100.00%) ⬇️
... and 171 more

Powered by Codecov. Last update cae54ac...9cbf857. Read the comment docs.

@villebro
Member Author

This has been rebased + comments addressed + added support for templating. If there are no further bugs here, I propose merging this as-is.

@villebro villebro requested a review from mistercrunch October 21, 2020 10:56
@villebro
Member Author

Ping @robdiciuccio . IMO the refactoring proposed here is best taken care of when we start working on the async query framework, during which I assume we'll end up refactoring many parts of the SQL Lab codebase relevant to this functionality.

Member

@bkyryliuk bkyryliuk left a comment

Everything looks good to me. The only caveat is that we don't use virtual tables at Dropbox, so I won't be able to test it out.

Member

@mistercrunch mistercrunch left a comment

I'm supportive of merging this to fix the "dead end" issue that currently exists when following the explore flow (can't add columns to your query).

@villebro villebro merged commit ecdff72 into apache:master Oct 27, 2020
@villebro villebro deleted the villebro/virtual-table-metadata branch October 27, 2020 06:56
@villebro
Member Author

Thanks @mistercrunch and @bkyryliuk for your help pushing this across the finish line! 🏁

auxten pushed a commit to auxten/incubator-superset that referenced this pull request Nov 20, 2020
* feat: enable metadata sync for virtual tables

* add migration and check for empty schema name

* simplify request

* truncate trailing column attributes for MySQL

* add unit test

* use db_engine_spec func to truncate collation and charset

* Remove redundant migration

* add more tests

* address review comments and apply templating to query

* add todo for refactoring

* remove schema from tests

* check column datatype
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.0.0 labels Mar 12, 2024
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L 🚢 1.0.0
6 participants