perf(alembic): paginize db migration for new dataset models #19406
Conversation
Force-pushed from bc69072 to 2196255.
    is_physical_table
    and (column.expression is None or column.expression == "")
),
type=column.type or "Unknown",
A-Z: keep these keyword arguments sorted alphabetically.
is_spatial=False,
is_temporal=False,
type="Unknown",  # figuring this out would require a type inferrer
warning_text=metric.warning_text,
Ditto A-Z
session = inspect(target).session
session: Session = inspect(target).session
database_id = target.database_id
is_physical_table = not target.sql
BaseDatasource checks whether a table is physical or virtual by checking whether table.sql is falsy. I'm changing all the target.sql is None checks in this script to not target.sql to keep it consistent with the current behavior.
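A minimal sketch of the distinction (hypothetical values, not code from the migration): the falsy check also treats an empty-string sql as a physical table, which an is None check would classify as virtual.

```python
# Toy illustration of the falsy check discussed above; not migration code.
def is_physical(sql):
    # Mirrors the BaseDatasource convention: a table is physical when its
    # `sql` is falsy, i.e. None or an empty string.
    return not sql

assert is_physical(None) is True
assert is_physical("") is True          # `sql is None` would call this virtual
assert is_physical("SELECT 1") is False
```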
predicate = or_(
    *[
        and_(
            NewTable.database_id == database_id,
Add database_id enforcement, as all three together (db + schema + table name) form a unique key.
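For illustration, a sketch of a predicate keyed on all three parts, using a stand-in model; the schema and table_name attribute names are assumptions for this sketch, not necessarily the real column names in the migration.

```python
# Illustrative sketch only: a stub model stands in for NewTable to show the
# (database_id, schema, table_name) predicate the comment asks for.
import sqlalchemy as sa
from sqlalchemy import and_, or_
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class NewTableStub(Base):
    __tablename__ = "new_table_stub"
    id = sa.Column(sa.Integer, primary_key=True)
    database_id = sa.Column(sa.Integer)
    schema = sa.Column(sa.String(255))
    table_name = sa.Column(sa.String(255))

database_id = 1  # example value
targets = [("public", "sales"), ("public", "users")]  # (schema, table name) pairs
predicate = or_(
    *[
        and_(
            NewTableStub.database_id == database_id,
            NewTableStub.schema == schema,
            NewTableStub.table_name == table_name,
        )
        for schema, table_name in targets
    ]
)
```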
We also need to update superset/connectors/sqla/models.py, where the original logic lives (it had to be copied here so that this migration would still work in the future).
batch_op.create_unique_constraint("uq_sl_datasets_uuid", ["uuid"]) | ||
batch_op.create_unique_constraint( | ||
"uq_sl_datasets_sqlatable_id", ["sqlatable_id"] | ||
) |
Move constraints to the end to improve readability.
👏
❗ Please consider rebasing your branch to avoid db migration conflicts.
Force-pushed from 2196255 to 3a6cee0.
Force-pushed from 3a6cee0 to bd18390.
@@ -373,65 +366,35 @@ def upgrade():
    # ExtraJSONMixin
    sa.Column("extra_json", sa.Text(), nullable=True),
    # ImportExportMixin
    sa.Column("uuid", UUIDType(binary=True), primary_key=False, default=uuid4),
    sa.Column(
        "uuid", UUIDType(binary=True), primary_key=False, default=uuid4, unique=True
I'm moving unique key and foreign key constraints inline.
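For illustration, a small sketch of that inline style with made-up table and column names (UUIDType comes from sqlalchemy_utils, as in the diff above):

```python
# Sketch: declare constraints inline on the columns instead of issuing separate
# batch_op.create_unique_constraint()/create_foreign_key() calls afterwards.
# Table and column names are examples, not the migration's real schema.
from uuid import uuid4

import sqlalchemy as sa
from sqlalchemy_utils import UUIDType

example = sa.Table(
    "example",
    sa.MetaData(),
    sa.Column("id", sa.Integer, primary_key=True),
    # unique=True replaces a later create_unique_constraint(...)
    sa.Column("uuid", UUIDType(binary=True), default=uuid4, unique=True),
    # sa.ForeignKey(...) replaces a later create_foreign_key(...)
    sa.Column("database_id", sa.Integer, sa.ForeignKey("dbs.id"), nullable=False),
)
```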
❗ Please consider rebasing your branch to avoid db migration conflicts.
Closing in favor of #19421.
SUMMARY
We ran into some scalability issues with the db migration script for SIP-68 (PR #17543). For context, we have more than 165k datasets, 1.9 million columns, and 345k metrics. Loading all of them into memory and converting them to the new tables in one giant commit, as the current implementation does, is impractical: it would kill the Python process if not the db connection.
This PR tries to optimize this migration script by:

- Paginating the reads of the original tables, since iter_next is impossible because the entities have changed.
- Adding lazy="selectin" to enable SELECT IN eager loading, which pulls related data in one SQL statement instead of three (a combined sketch of both steps follows below).

After this optimization, the migration for our 165k datasets took about 7 hours to finish. Still slow, but better than not being able to finish at all. Ideally, in the future, for large full-table migrations like this, the script should be written in raw SQL as much as possible for each dialect we support.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
ADDITIONAL INFORMATION