-
Notifications
You must be signed in to change notification settings - Fork 14.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(migration): Ensure the paginated update is deterministic #21778
fix(migration): Ensure the paginated update is deterministic #21778
Conversation
yield obj | ||
session.merge(obj) | ||
|
||
while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The option was to use either query.slice(...)
or session.execute(query).fetchmany(...)
. I opted for the later because:
- Otherwise one would need to include an
order_by(...)
condition and there's no guarantee that it would be defined. - Re-executing the query n times using a different OFFSET per query is likely neither efficient nor guarantees correct pagination given that the filter condition could change.
end = min(start + batch_size, count) | ||
for obj in query[start:end]: | ||
yield obj | ||
session.merge(obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to merge the record. The caller should handle this if required.
@@ -66,7 +66,6 @@ def upgrade(): | |||
state["anchor"] = state["hash"] | |||
del state["hash"] | |||
entry.value = pickle.dumps(value) | |||
session.commit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to commit as the paginated_update
handles it.
@@ -87,5 +86,3 @@ def downgrade(): | |||
state["hash"] = state["anchor"] | |||
del state["anchor"] | |||
entry.value = pickle.dumps(value) | |||
session.merge(entry) | |||
session.commit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to commit as the paginated_update
handles it.
@@ -87,5 +86,3 @@ def downgrade(): | |||
state["hash"] = state["anchor"] | |||
del state["anchor"] | |||
entry.value = pickle.dumps(value) | |||
session.merge(entry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to merge the existing entry. See here for details:
Used when you may have more than 1 in-memory objects which map to the same database record with some key.
1cd04bd
to
f8f44a5
Compare
Codecov Report
@@ Coverage Diff @@
## master #21778 +/- ##
==========================================
- Coverage 66.88% 66.87% -0.01%
==========================================
Files 1802 1802
Lines 68987 68988 +1
Branches 7345 7345
==========================================
Hits 46139 46139
- Misses 20951 20952 +1
Partials 1897 1897
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😱 Thanks for fixing this nasty bug! It must took a lot of time to debug.
Your comment reminded me, what if an object was updated in the loop in a way that makes it no longer matches the query's filtering condition---using the original offset after previous batch was committed would mean some rows will be inevitably skipped.... which might be a bigger problem. This paginated update should really be used with caution. Maybe before each iteration to each batch, we should check whether the total count of the filtering query changes.
Now that the query only executes once this isn't a problem. The result set in paginated rather than the query being sliced. This ensures the logic is indeed safe. Note my default records from MySQL et al. may be ordered consistently. Your comment actually highlights the issue where the migration mutates the record resulting in a different result set per iteration. This regression likely only impacted the visualization migrations which are actually optional. |
SUMMARY
This PR fixes an issue with the
paginated_update
method which is used in a number of migrations. The problem is the pagination was not deterministic, i.e., per iteration it slices the query via a SQL statement using anOFFSET
andLIMIT
. The issue is if the results are not ordered in a consistent way, i.e., by primary key, hence the ordering of sliced results is random meaning that a record may never be processed or processed multiple times.Here's is where the
paginated_update
method is called. It's used by theassign_uuids
method, updating thekey_value
table, as well as migrating visualizations. The later is likely the only ones of concern given that the migration changes the visualization type which is part of the query filter and thus it continually skips over eligible records. Thankfully these migrations are actually only optional as the legacy visualization type is not deprecated. A future migration will be necessary regardless when the legacy visualization types are actually deprecated.BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
Tested locally.
ADDITIONAL INFORMATION