gh-119105: difflib: improve recursion for degenerate cases #119131

Merged: 10 commits, May 19, 2024
Lib/difflib.py: 24 changes (19 additions & 5 deletions)
@@ -911,33 +911,47 @@ def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
 
         # don't synch up unless the lines have a similarity score of at
         # least cutoff; best_ratio tracks the best score seen so far
-        best_ratio, cutoff = 0.74, 0.75
+        # best_ratio is a tuple storing the best .ratio() seen so far, and
+        # a measure of how far the indices are from their index range
+        # midpoints. The latter is used to resolve ratio ties. Favoring
+        # indices near the midpoints tends to cut the ranges in half. Else,
+        # if there are many pairs with the best ratio, recursion can grow
+        # very deep, and runtime becomes cubic. See:
+        # https://github.com/python/cpython/issues/119105
+        best_ratio, cutoff = (0.74, 0), 0.75
         cruncher = SequenceMatcher(self.charjunk)
         eqi, eqj = None, None   # 1st indices of equal lines (if any)
 
         # search for the pair that matches best without being identical
         # (identical lines must be junk lines, & we don't want to synch up
         # on junk -- unless we have to)
+        amid = (alo + ahi - 1) / 2
+        bmid = (blo + bhi - 1) / 2
         for j in range(blo, bhi):
             bj = b[j]
             cruncher.set_seq2(bj)
+            weight_j = - abs(j - bmid)
             for i in range(alo, ahi):
                 ai = a[i]
                 if ai == bj:
                     if eqi is None:
                         eqi, eqj = i, j
                     continue
                 cruncher.set_seq1(ai)
+                # weight is used to balance the recursion by prioritizing
+                # i and j in the middle of their ranges
+                weight = weight_j - abs(i - amid)
                 # computing similarity is expensive, so use the quick
                 # upper bounds first -- have seen this speed up messy
                 # compares by a factor of 3.
                 # note that ratio() is only expensive to compute the first
                 # time it's called on a sequence pair; the expensive part
                 # of the computation is cached by cruncher
-                if cruncher.real_quick_ratio() > best_ratio and \
-                      cruncher.quick_ratio() > best_ratio and \
-                      cruncher.ratio() > best_ratio:
-                    best_ratio, best_i, best_j = cruncher.ratio(), i, j
+                if (cruncher.real_quick_ratio(), weight) > best_ratio and \
+                      (cruncher.quick_ratio(), weight) > best_ratio and \
+                      (cruncher.ratio(), weight) > best_ratio:
+                    best_ratio, best_i, best_j = (cruncher.ratio(), weight), i, j
+        best_ratio, _ = best_ratio
         if best_ratio < cutoff:
             # no non-identical "pretty close" pair
             if eqi is None:
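The tie-breaking rule in the hunk above can be illustrated in isolation. The sketch below is not part of the patch; pick_best and its candidate list are hypothetical helpers. It shows how comparing (ratio, weight) tuples, with weight measuring closeness to the range midpoints, selects a near-middle pair among equally good matches, which is what keeps the recursion balanced.

    # Minimal sketch of the tie-breaking rule used in the patch above.
    # pick_best and the candidates list are hypothetical, for illustration only.

    def pick_best(candidates, alo, ahi, blo, bhi):
        """candidates: iterable of (ratio, i, j). Return the (i, j) with the
        highest ratio, breaking ties in favor of indices nearest the midpoints
        of [alo, ahi) and [blo, bhi)."""
        amid = (alo + ahi - 1) / 2
        bmid = (blo + bhi - 1) / 2
        best = ((-1.0, float("-inf")), None, None)   # ((ratio, weight), i, j)
        for ratio, i, j in candidates:
            # larger weight means closer to the middle of both ranges
            weight = -abs(i - amid) - abs(j - bmid)
            if (ratio, weight) > best[0]:
                best = ((ratio, weight), i, j)
        return best[1], best[2]

    # Three pairs tie on ratio; the one nearest the range midpoints wins,
    # so each recursive call splits the ranges roughly in half.
    print(pick_best([(0.9, 0, 0), (0.9, 5, 5), (0.9, 9, 9)], 0, 10, 0, 10))
    # -> (5, 5)

With the old strict ">" comparison on bare ratios, a tying pair never displaced the current best, so the first pair examined always won and the split could be maximally lopsided at every level.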
@@ -0,0 +1 @@
+``difflib.Differ`` is much faster for some cases of diffs where many pairs of lines are equally similar.
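The NEWS entry above refers to inputs where many line pairs tie on similarity. A rough way to exercise such a case is sketched below; the line shapes and sizes are arbitrary illustrations, not taken from the PR or the linked issue.

    # Illustrative benchmark sketch: every line in a is equally similar to
    # every line in b (identical long prefix, disjoint short suffixes), so
    # all pairs tie on SequenceMatcher.ratio().
    import difflib
    import itertools
    import time

    prefix = "x" * 40
    a = [prefix + "".join(s) + "\n" for s in itertools.product("abcde", repeat=3)]
    b = [prefix + "".join(s) + "\n" for s in itertools.product("vwxyz", repeat=3)]

    start = time.perf_counter()
    delta = list(difflib.Differ().compare(a, b))
    print(f"{len(delta)} delta lines in {time.perf_counter() - start:.3f}s")

Because every pair scores the same ratio, the pre-patch code always synched on the first pair it saw, recursing to a depth proportional to the number of lines and doing cubic total work; the midpoint tie-break cuts the ranges roughly in half instead.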