GH-102613: Fast recursive globbing in pathlib.Path.glob()
#104512
This commit replaces selector classes with selector functions. These generators directly yield results rather than calling through to their successors. A new internal `Path._glob()` takes care to chain these generators together, which simplifies the lazy algorithm and slightly improves performance.
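To illustrate the idea of chaining generator-based selectors, here is a minimal sketch. It is not the code from this PR: the names `select_literal`, `select_wildcard` and `chain_selectors` are hypothetical, and the sketch assumes patterns made only of literal names and single-component wildcards (no `**`). Each selector consumes an iterable of paths and yields results directly, so a small driver can feed the output of one selector into the next while the whole pipeline stays lazy.

```python
import fnmatch
import os
import re


def select_literal(paths, name):
    """Yield path/name for every incoming path where that entry exists."""
    for path in paths:
        candidate = os.path.join(path, name)
        if os.path.lexists(candidate):
            yield candidate


def select_wildcard(paths, match):
    """Yield children of each incoming directory whose name satisfies match()."""
    for path in paths:
        try:
            with os.scandir(path) as scandir_it:
                entries = list(scandir_it)
        except OSError:
            # Not a directory, or unreadable: nothing to yield from here.
            continue
        for entry in entries:
            if match(entry.name):
                yield entry.path


def chain_selectors(root, parts):
    """Hypothetical driver: wrap one generator around the previous one per part."""
    paths = iter([root])
    for part in parts:
        if any(ch in part for ch in "*?["):
            # Note: unlike some glob implementations, this also matches dotfiles.
            match = re.compile(fnmatch.translate(part)).match
            paths = select_wildcard(paths, match)
        else:
            paths = select_literal(paths, part)
    return paths
```

Nothing touches the filesystem until the returned generator is iterated, e.g. `for p in chain_selectors(".", ["src", "*.py"]): ...`.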
@zooba here's the promised walk-and-match implementation!
This seems fine, there's potentially a few simplifications, but it looks like a great improvement over the existing code.
Do we need any new tests to specifically trigger anything that behaves differently?
Thanks for the review! I've added a few more tests exercising |
`..` components are resolved lexically, rather than after symlinks.
This PR introduces a 'walk-and-match' strategy for handling glob patterns that include a non-terminal `**` wildcard, such as `**/*.py`. For this example, the previous implementation recursively walked directories using `os.scandir()` when it expanded the `**` component, and then scanned those same directories again when it expanded the `*.py` component. This is wasteful.

In the new implementation, any components following a `**` wildcard are used to build a `re.Pattern` object, which is used to filter the results of the recursive walk. A pattern like `**/*.py` uses half the number of `os.scandir()` calls; a pattern like `**/*/*.py` a third, etc.

This new algorithm does not apply if either:
- […] `None` (its default), or
- […] `..` components.

In these cases we fall back to the old implementation.
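The walk-and-match idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in this PR: `walk_and_match` and `_component_regex` are hypothetical names, only `*` and `?` wildcards are translated, symlinked directories are not followed, and hidden files get no special treatment. The point is that each directory is scanned with `os.scandir()` exactly once, and the components after `**` are applied as a single compiled regex to each entry's relative path.

```python
import os
import re


def _component_regex(part):
    """Translate one glob component; wildcards never cross a '/' separator."""
    pieces = []
    for ch in part:
        if ch == "*":
            pieces.append(r"[^/]*")
        elif ch == "?":
            pieces.append(r"[^/]")
        else:
            pieces.append(re.escape(ch))
    return "".join(pieces)


def walk_and_match(root, tail_parts):
    """Yield paths under root that match '**/' + '/'.join(tail_parts)."""
    # '**' may consume zero or more leading directories of the relative path.
    tail = "/".join(_component_regex(part) for part in tail_parts)
    pattern = re.compile(r"(?:.*/)?" + tail, re.DOTALL)
    stack = [(root, "")]
    while stack:
        dirpath, rel_prefix = stack.pop()
        try:
            # One scandir() call per directory, regardless of pattern length.
            with os.scandir(dirpath) as scandir_it:
                entries = list(scandir_it)
        except OSError:
            continue
        for entry in entries:
            rel = rel_prefix + entry.name
            if pattern.fullmatch(rel):
                yield entry.path
            if entry.is_dir(follow_symlinks=False):
                stack.append((entry.path, rel + "/"))
```

For example, `walk_and_match(".", ["*.py"])` corresponds to the pattern `**/*.py`, and `walk_and_match(".", ["*", "*.py"])` to `**/*/*.py`; in both cases the number of directory scans equals the number of directories visited.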
This PR also replaces selector classes with selector functions. These generators directly yield results rather than calling through to their successors. A new internal `Path._glob()` method takes care to chain these generators together, which simplifies the lazy algorithm and slightly improves performance. It should also be easier to understand and maintain.

Performance for the original #102613 repro case, with 400 nested `a/` directories, and matching treatment of symlinks and hidden files:

These results were from an SSD. The improvement will be greater for slow storage (e.g. network-mounted volumes).