GH-102613: Fast recursive globbing in `pathlib.Path.glob()` #104512

barneygale · 2023-05-15T18:18:17Z

This PR introduces a 'walk-and-match' strategy for handling glob patterns that include a non-terminal ** wildcard, such as **/*.py. For this example, the previous implementation recursively walked directories using os.scandir() when it expanded the ** component, and then scanned those same directories again when it expanded the *.py component. This is wasteful.

In the new implementation, any components following a ** wildcard are used to build a re.Pattern object, which is used to filter the results of the recursive walk. A pattern like **/*.py uses half the number of os.scandir() calls; a pattern like **/*/*.py a third, etc.
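To illustrate the idea, here is a minimal sketch of walk-and-match, not the code in this PR. The names `component_regex` and `walk_and_match` are hypothetical, the sketch only understands `*` and `?` wildcards, and it assumes symlinked directories are not followed. Each directory is scanned once, and every entry's relative path is tested against a regex built from the components that follow `**`.

```python
import os
import re


def component_regex(component):
    # Translate one pattern component; '*' and '?' never cross a '/'.
    escaped = re.escape(component)
    return escaped.replace(r"\*", "[^/]*").replace(r"\?", "[^/]")


def walk_and_match(root, tail):
    """Yield paths relative to root that match '**/<tail>'.

    Hypothetical helper: one os.scandir() call per directory, with the
    components after '**' applied as a regex filter on the walk results.
    """
    tail_re = "/".join(component_regex(c) for c in tail.split("/"))
    # '(?:.*/)?' stands in for '**': zero or more leading directories.
    # DOTALL lets '.' match newlines, which are legal in filenames.
    match = re.compile(r"(?:.*/)?" + tail_re, re.DOTALL).fullmatch
    stack = [""]
    while stack:
        rel_dir = stack.pop()
        with os.scandir(os.path.join(root, rel_dir)) as entries:
            for entry in entries:
                rel = f"{rel_dir}/{entry.name}" if rel_dir else entry.name
                if match(rel):
                    yield rel
                # Assumption of this sketch: do not descend into symlinks.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(rel)
```

Calling `walk_and_match(".", "*.py")` corresponds roughly to `**/*.py`, and `walk_and_match(".", "*/*.py")` to `**/*/*.py`. In both cases the number of scans equals the number of directories walked, no matter how many components follow the `**`.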

This new algorithm does not apply if either:

- The follow_symlinks argument is set to None (its default), or
- The pattern contains .. components.

In these cases we fall back to the old implementation.

This PR also replaces selector classes with selector functions. These generators directly yield results rather than calling through to their successors. A new internal Path._glob() method takes care to chain these generators together, which simplifies the lazy algorithm and slightly improves performance. It should also be easier to understand and maintain.
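A rough sketch of what chaining selector generators can look like is below. It is a simplified illustration rather than the actual `Path._glob()` code: `select_children`, `select_recursive` and `glob_sketch` are hypothetical names, paths are plain strings, and symlinked directories are assumed not to be followed. Each selector consumes an iterable of paths and yields paths, so the driver only has to feed one generator into the next.

```python
import os
from fnmatch import fnmatchcase


def select_children(paths, pattern):
    """Yield children of each path whose name matches pattern."""
    for path in paths:
        try:
            with os.scandir(path) as entries:
                names = [entry.name for entry in entries]
        except OSError:
            continue  # Not a directory, or not readable: nothing to yield.
        for name in names:
            if fnmatchcase(name, pattern):
                yield os.path.join(path, name)


def select_recursive(paths):
    """Yield each path followed by every directory beneath it."""
    for path in paths:
        stack = [path]
        while stack:
            current = stack.pop()
            yield current
            try:
                with os.scandir(current) as entries:
                    for entry in entries:
                        # Assumption of this sketch: skip symlinked dirs.
                        if entry.is_dir(follow_symlinks=False):
                            stack.append(entry.path)
            except OSError:
                continue


def glob_sketch(root, pattern):
    """Chain one selector generator per pattern component."""
    paths = iter([root])
    for part in pattern.split("/"):
        if part == "**":
            paths = select_recursive(paths)
        else:
            paths = select_children(paths, part)
    return paths
```

Nothing touches the filesystem until the returned generator is iterated, so the chain stays lazy. Note that this sketch still scans directories once per component, so it shows only the chaining, not the walk-and-match optimisation.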

Performance for the original #102613 repro case, with 400 nested a/ directories, and matching treatment of symlinks and hidden files:

$ ../python -m timeit -s 'import glob' 'print(glob.glob("**/*", recursive=True, include_hidden=True))'
5 loops, best of 5: 66.2 msec per loop
$ ../python -m timeit -s 'from pathlib import Path' 'print(list(Path(".").rglob("**/*", follow_symlinks=True)))'
10 loops, best of 5: 22.7 msec per loop  # before this PR
10 loops, best of 5: 16.5 msec per loop  # after this PR

These results were from an SSD. The improvement will be greater for slow storage (e.g. network-mounted volumes).
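For anyone wanting to reproduce the timings, a tree of nested a/ directories can be built with something like the helper below. This is a sketch: `make_nested_tree` is a hypothetical name, and the exact layout used for the original #102613 report is assumed to be a single chain of directories all named `a`.

```python
import os


def make_nested_tree(root, depth=400, name="a"):
    """Create root/a/a/.../a with `depth` levels and return the deepest path."""
    deepest = os.path.join(root, *([name] * depth))
    os.makedirs(deepest)
    return deepest
```

Run it once against an empty directory, then run the timeit commands above from inside that directory. On platforms with short path length limits the default depth may need to be reduced.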

Issue: Path.rglob performance issues in deeply nested directories compared to glob.glob(recursive=True) #102613

This commit replaces selector classes with selector functions. These generators directly yield results rather than calling through to their successors. A new internal `Path._glob()` method takes care to chain these generators together, which simplifies the lazy algorithm and slightly improves performance.

barneygale · 2023-05-31T21:17:54Z

@zooba here's the promised walk-and-match implementation!

zooba

This seems fine. There are potentially a few simplifications, but it looks like a great improvement over the existing code.

Do we need any new tests to specifically trigger anything that behaves differently?

Lib/pathlib.py

barneygale · 2023-06-01T19:00:24Z

Thanks for the review! I've added a few more tests exercising .. and ** segments.

`..` components are resolved lexically, rather than after symlinks.

barneygale added the performance and topic-pathlib labels May 15, 2023

bedevere-bot added the awaiting core review label May 15, 2023

bedevere-bot mentioned this pull request May 15, 2023

Path.rglob performance issues in deeply nested directories compared to glob.glob(recursive=True) #102613

Closed

barneygale added 2 commits May 15, 2023 19:37

Speed up matching *

d5b1836

Add comments, docstrings.

d5c86c6

barneygale mentioned this pull request May 17, 2023

GH-77609: Add follow_symlinks argument to pathlib.Path.glob() #102616

Merged

barneygale added 2 commits May 18, 2023 19:06

Merge branch 'main' into pythongh-102613-remove-selector-classes

53dcb79

Merge branch 'main' into pythongh-102613-remove-selector-classes

6da6a83

barneygale marked this pull request as draft May 30, 2023 17:39

bedevere-bot removed the awaiting core review label May 30, 2023

barneygale added 3 commits May 30, 2023 23:20

Merge branch 'main' into pythongh-102613-remove-selector-classes

73ed81d

Add support for matching files recursively.

4005619

Implement walk-and-match algorithm.

9401d36

barneygale changed the title ~~GH-102613: Simplify implementation of pathlib.Path.glob()~~ GH-102613: Fast recursive globbing in pathlib.Path.glob() May 31, 2023

barneygale added 3 commits May 31, 2023 20:56

Fix up docs, news blurb.

b217587

Fix comment

de7e857

Fix handling of newlines in filenames.

d1023c7

barneygale marked this pull request as ready for review May 31, 2023 20:36

bedevere-bot added the awaiting core review label May 31, 2023

zooba reviewed Jun 1, 2023

barneygale added 5 commits June 1, 2023 18:12

Speed up recursive selection

ad33eec

Exclude self from walk-and-match matching.

064efdb

Optimize walk-and-match logic.

14c6a58

Consume adjacent '**' segments before considering use of matching.

04720bd

Add some more tests for complex patterns.

9c6b44f

Drop test case that doesn't work on Windows.

4cfb836

barneygale requested a review from zooba June 6, 2023 18:25

zooba approved these changes Jun 6, 2023

bedevere-bot added awaiting merge and removed awaiting core review labels Jun 6, 2023

barneygale merged commit 24af451 into python:main Jun 6, 2023

bedevere-bot removed the awaiting merge label Jun 6, 2023

barneygale mentioned this pull request Feb 6, 2024

Speed up pathlib.Path.glob() by removing redundant regex matching #115060

Closed
