perf: improve perf in SIP-68 migration #19416

betodealmeida · 2022-03-29T18:35:21Z

SUMMARY

This PR improves the performance of the SIP-68 migration script by using sqloxide (https://pypi.org/project/sqloxide/) to parse the SQL when extracting dependencies. If parsing fails, it falls back to using sqlparse.
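
A rough sketch of the parse-then-fall-back idea, not the code in this PR. It assumes the fast parser returns a nested dict/list tree in which table references sit under a `"Table"` key with a `"name"` list of `{"value": ...}` parts; that shape can differ between sqloxide versions. `extract_tables`, `find_nodes_by_key`, `sqloxide_parser` and the `fallback` callable are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Iterator, Optional, Set

try:  # sqloxide is treated as optional here
    from sqloxide import parse_sql
except ImportError:
    parse_sql = None


def sqloxide_parser(sql: str) -> Any:
    """Parse SQL with sqloxide using its generic dialect."""
    return parse_sql(sql=sql, dialect="generic")


def find_nodes_by_key(element: Any, target: str) -> Iterator[Any]:
    """Yield every value stored under `target` in a nested dict/list tree."""
    if isinstance(element, list):
        for child in element:
            yield from find_nodes_by_key(child, target)
    elif isinstance(element, dict):
        for key, value in element.items():
            if key == target:
                yield value
            else:
                yield from find_nodes_by_key(value, target)


def extract_tables(
    sql: str,
    fallback: Callable[[str], Set[str]],
    parser: Optional[Callable[[str], Any]] = sqloxide_parser if parse_sql else None,
) -> Set[str]:
    """Return table names from the fast parser, or from `fallback` on failure."""
    tree = None
    if parser is not None:
        try:
            tree = parser(sql)
        except Exception:  # any parse error sends us to the slower path
            tree = None
    if not tree:
        return fallback(sql)
    # Assumed tree shape: {"Table": {"name": [{"value": "..."}, ...]}}
    return {
        ".".join(part["value"] for part in node["name"])
        for node in find_nodes_by_key(tree, "Table")
    }
```

Here `fallback` stands in for the slower sqlparse-based extraction, which is not shown.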

It also addresses a few bugs:

- Tables were assigned incorrectly to datasets because of the lack of the database id in the predicate (also fixed in perf(alembic): paginize db migration for new dataset models #19406).
- Datasets were incorrectly flagged as virtual when their sql was an empty string.
- Datasets incorrectly flagged as virtual would have all tables associated with them, since the predicate was empty.
- Tables referenced in virtual datasets but not present as physical datasets were not being created.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A

TESTING INSTRUCTIONS

$ superset db downgrade 5afbb1a5849b
$ superset db upgrade

Migration still works and relationships are populated correctly. We can see all the datasets and the associated tables with this query:

SELECT
  sl_datasets.name AS dataset_name,
  sl_datasets.is_physical,
  sl_datasets.expression,
  ARRAY_AGG(sl_tables.name) AS table_names
FROM sl_datasets
JOIN sl_dataset_tables
  ON sl_datasets.id = sl_dataset_tables.dataset_id
JOIN sl_tables
  ON sl_dataset_tables.table_id = sl_tables.id
GROUP BY 1, 2, 3
ORDER BY 2 DESC;

And the results:

       dataset_name        | is_physical |                                                                                                expression                                                                                                 |           table_names
---------------------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------
 covid_vaccines            | t           | covid_vaccines                                                                                                                                                                                            | {covid_vaccines}
 bart_lines                | t           | bart_lines                                                                                                                                                                                                | {bart_lines}
 threads                   | t           | threads                                                                                                                                                                                                   | {threads}
 messages                  | t           | messages                                                                                                                                                                                                  | {messages}
 video_game_sales          | t           | video_game_sales                                                                                                                                                                                          | {video_game_sales}
 birth_france_by_region    | t           | birth_france_by_region                                                                                                                                                                                    | {birth_france_by_region}
 sf_population_polygons    | t           | sf_population_polygons                                                                                                                                                                                    | {sf_population_polygons}
 users                     | t           | users                                                                                                                                                                                                     | {users}
 users_channels            | t           | users_channels                                                                                                                                                                                            | {users_channels}
 long_lat                  | t           | long_lat                                                                                                                                                                                                  | {long_lat}
 wb_health_population      | t           | wb_health_population                                                                                                                                                                                      | {wb_health_population}
 FCC 2018 Survey           | t           | "FCC 2018 Survey"                                                                                                                                                                                         | {"FCC 2018 Survey"}
 birth_names               | t           | birth_names                                                                                                                                                                                               | {birth_names}
 flights                   | t           | flights                                                                                                                                                                                                   | {flights}
 channels                  | t           | channels                                                                                                                                                                                                  | {channels}
 exported_stats            | t           | exported_stats                                                                                                                                                                                            | {exported_stats}
 unicode_test              | t           | unicode_test                                                                                                                                                                                              | {unicode_test}
 channel_members           | t           | channel_members                                                                                                                                                                                           | {channel_members}
 users_channels-uzooNNtSRO | f           | SELECT uc1.name as channel_1, uc2.name as channel_2, count(*) AS cnt FROM users_channels uc1 JOIN users_channels uc2 ON uc1.user_id = uc2.user_id GROUP BY uc1.name, uc2.name HAVING uc1.name <> uc2.name+| {users_channels}
                           |             |                                                                                                                                                                                                           |
 new_members_daily         | f           | SELECT date, total_membership - lag(total_membership) OVER (ORDER BY date) AS new_members FROM exported_stats                                                                                             | {exported_stats}
 messages_channels         | f           | SELECT m.ts, c.name, m.text FROM messages m JOIN channels c ON m.channel_id = c.id                                                                                                                        | {messages,channels}
 members_channels_2        | f           | SELECT c.name AS channel_name, u.name AS member_name FROM channel_members cm JOIN channels c ON cm.channel_id = c.id JOIN users u ON cm.user_id = u.id                                                    | {channels,channel_members,users}
(22 rows)

ADDITIONAL INFORMATION

Has associated issue:
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

codecov · 2022-03-29T21:42:37Z

Codecov Report

Merging #19416 (f92077c) into master (816a2c3) will decrease coverage by 0.09%.
The diff coverage is 93.05%.

❗ Current head f92077c differs from pull request most recent head 96d513c. Consider uploading reports for the commit 96d513c to get more accurate results

@@            Coverage Diff             @@
##           master   #19416      +/-   ##
==========================================
- Coverage   66.48%   66.39%   -0.10%     
==========================================
  Files        1670     1670              
  Lines       63968    63824     -144     
  Branches     6512     6510       -2     
==========================================
- Hits        42531    42374     -157     
- Misses      19748    19761      +13     
  Partials     1689     1689

Flag	Coverage Δ
hive	`?`
mysql	`81.86% <92.68%> (+0.21%)`	⬆️
postgres	`81.91% <92.68%> (+0.21%)`	⬆️
presto	`?`
python	`82.00% <92.68%> (-0.13%)`	⬇️
sqlite	`81.67% <92.68%> (+0.20%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
...ntrols/src/components/CertifiedIconWithTooltip.tsx	`80.00% <ø> (ø)`
...d/packages/superset-ui-chart-controls/src/index.ts	`100.00% <ø> (ø)`
...omponents/ColumnConfigControl/ColumnConfigItem.tsx	`0.00% <ø> (ø)`
...tiveFilters/FiltersConfigModal/DraggableFilter.tsx	`71.87% <ø> (ø)`
...t/annotation_layers/annotations/commands/update.py	`88.23% <ø> (ø)`
superset/annotation_layers/annotations/schemas.py	`100.00% <ø> (ø)`
superset/cli/examples.py	`0.00% <ø> (ø)`
superset/cli/importexport.py	`80.00% <ø> (ø)`
superset/cli/main.py	`0.00% <ø> (ø)`
superset/cli/thumbnails.py	`0.00% <ø> (ø)`
... and 111 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 816a2c3...96d513c.

ktmud

Thanks for the quick turnaround. The table name extraction function seems to be a useful utility; can it be moved to a more shared place?

superset/migrations/versions/b8d3a24d9131_new_dataset_models.py

tests/unit_tests/migrations/versions/b8d3a24d9131_new_dataset_models_test.py

betodealmeida · 2022-03-30T00:00:10Z

scripts/benchmark_migration.py

+        try:
+            model = getattr(Base.classes, table)
+        except AttributeError:
+            continue


This is needed to run the benchmark migration script on the SIP-68 migration.
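
For context, a small self-contained sketch of the same skip pattern. It assumes `Base.classes` behaves like a namespace that raises `AttributeError` for tables with no mapped class (SQLAlchemy's automap, for instance, does not generate classes for tables without a primary key). `iter_mapped_models` is a hypothetical helper, not part of the script.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Iterator, Tuple


def iter_mapped_models(
    classes: Any, table_names: Iterable[str]
) -> Iterator[Tuple[str, Any]]:
    """Yield (table, model) pairs, skipping tables that have no mapped class."""
    for table in table_names:
        try:
            model = getattr(classes, table)
        except AttributeError:
            # No mapped class for this table, so there is nothing to benchmark.
            continue
        yield table, model


# Stand-in for Base.classes: only `users` has a mapped class here.
classes = SimpleNamespace(users=object)
mapped = list(iter_mapped_models(classes, ["users", "sl_dataset_tables"]))
```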

betodealmeida · 2022-03-30T00:01:56Z

superset/connectors/sqla/models.py

@@ -2278,8 +2285,7 @@ def write_shadow_dataset(  # pylint: disable=too-many-locals
            )

        # physical dataset
-        tables = []
-        if dataset.sql is None:
+        if not dataset.sql:


Some of our example datasets have .sql == '', which caused them to be marked as virtual during the migration. It's not a big deal, since they will still work when we switch to the new models (in the new Dataset model the difference between virtual and physical is greatly reduced).
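
A minimal illustration of why the condition changed, using hypothetical helper names (the real check lives inline in `write_shadow_dataset`):

```python
from typing import Optional


def is_physical_before(sql: Optional[str]) -> bool:
    """Old check: only a missing (None) SQL counts as physical."""
    return sql is None


def is_physical_after(sql: Optional[str]) -> bool:
    """New check: None and the empty string both count as physical."""
    return not sql


# An empty string slipped through the old check and was treated as virtual.
empty_before = is_physical_before("")
empty_after = is_physical_after("")
```

Note that `not sql` is still false for a whitespace-only string; whether such values occur in practice is not stated here.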

betodealmeida · 2022-03-30T00:03:07Z

superset/connectors/sqla/models.py

            )
-            tables = session.query(NewTable).filter(predicate).all()


We were only assigning tables that already exist. The load_or_create_tables function will create any tables that are referenced in the SQL but don't exist yet.
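
A sketch of the load-or-create idea under simplifying assumptions: tables live in an in-memory dict rather than a SQLAlchemy session, and the lookup key includes the database id so that same-named tables in different databases are not mixed up. `TableRef` and `load_or_create` are hypothetical names and do not mirror the signature of `load_or_create_tables`.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class TableRef(NamedTuple):
    database_id: int
    schema: Optional[str]
    name: str


def load_or_create(
    existing: Dict[TableRef, dict], refs: Iterable[TableRef]
) -> List[dict]:
    """Return one table record per reference, creating the missing ones."""
    tables = []
    for ref in refs:
        table = existing.get(ref)
        if table is None:
            # Referenced in the SQL but not known yet: create it.
            table = {
                "database_id": ref.database_id,
                "schema": ref.schema,
                "name": ref.name,
            }
            existing[ref] = table
        tables.append(table)
    return tables


known = {TableRef(1, None, "users"): {"database_id": 1, "schema": None, "name": "users"}}
result = load_or_create(known, [TableRef(1, None, "users"), TableRef(1, None, "channels")])
```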

* chore: improve perf in SIP-68 migration * Small fixes * Create tables referenced in SQL * Update logic in SqlaTable as well * Fix unit tests (cherry picked from commit 63b5e2e)

* chore: improve perf in SIP-68 migration * Small fixes * Create tables referenced in SQL * Update logic in SqlaTable as well * Fix unit tests

betodealmeida requested a review from a team as a code owner March 29, 2022 18:35

pull-request-size bot added the size/L label Mar 29, 2022

ktmud reviewed Mar 29, 2022

ktmud mentioned this pull request Mar 29, 2022

perf: refactor SIP-68 db migrations with INSERT SELECT FROM #19421

Merged

9 tasks

betodealmeida commented Mar 30, 2022

betodealmeida force-pushed the faster_parser_sip68 branch from 07e8169 to 413b157 Compare March 30, 2022 00:08

eschutho approved these changes Mar 30, 2022

betodealmeida requested a review from ktmud March 30, 2022 00:27

betodealmeida force-pushed the faster_parser_sip68 branch 3 times, most recently from bc59a0c to e863bb9 Compare March 30, 2022 01:41

betodealmeida added 4 commits March 29, 2022 18:50

chore: improve perf in SIP-68 migration

4d025ca

Small fixes

69dc87f

Create tables referenced in SQL

b3a84d1

Update logic in SqlaTable as well

f349047

betodealmeida force-pushed the faster_parser_sip68 branch from e863bb9 to 7d0de3c Compare March 30, 2022 01:52

Fix unit tests

8d2318c

betodealmeida force-pushed the faster_parser_sip68 branch from 7d0de3c to 8d2318c Compare March 30, 2022 02:03

betodealmeida merged commit 63b5e2e into apache:master Mar 30, 2022

villebro added the lts-v1 label Mar 30, 2022

betodealmeida mentioned this pull request Apr 6, 2022

fix: sqloxide optional #19570

Merged

9 tasks

mistercrunch added 🍒 1.5.0 🍒 1.5.1 🍒 1.5.2 labels Mar 13, 2024

mistercrunch added 🍒 1.5.3 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.0.0 labels Mar 13, 2024
