Speed up `pathlib.Path.glob()` by removing redundant regex matching #115060
… regex matching When expanding and filtering paths for a `**` wildcard segment, build an `re.Pattern` object from the subsequent pattern parts, rather than the entire pattern. Also skip compiling a pattern when expanding a `*` wildcard segment.
… matching (#115061) When expanding and filtering paths for a `**` wildcard segment, build an `re.Pattern` object from the subsequent pattern parts, rather than the entire pattern, and match against the `os.DirEntry` object prior to instantiating a path object. Also skip compiling a pattern when expanding a `*` wildcard segment.
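A minimal sketch of the approach described in this commit message, not the actual pathlib code: the regex is compiled only from the parts that follow `**`, and each `os.DirEntry` is matched as a string before any path object is created. The helper names `compile_tail`, `walk_and_filter` and `demo` are hypothetical, and only the `*` wildcard is handled.

```python
import os
import re
import tempfile
from pathlib import Path


def compile_tail(parts):
    """Compile a regex from the pattern parts that follow a `**` segment.

    Hypothetical helper: supports only `*` inside a segment.
    """
    sep = re.escape(os.sep)
    segments = [
        "".join(f"[^{sep}]*" if ch == "*" else re.escape(ch) for ch in part)
        for part in parts
    ]
    # `**` matches zero or more directory levels, so any directory
    # prefix (or none) may precede the tail.
    return re.compile(rf"(?:.*{sep})?" + sep.join(segments), re.DOTALL)


def walk_and_filter(root, tail_parts):
    """Yield Path objects under `root` that match `**/<tail_parts>`.

    Assumes `root` has no trailing separator.
    """
    root = os.fspath(root)
    match = compile_tail(tail_parts).fullmatch
    prefix_len = len(root) + 1  # length of "root" plus one separator
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # Filter on the DirEntry's string path first; a Path is
                # only built for entries that pass.
                if match(entry.path[prefix_len:]):
                    yield Path(entry.path)
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)


def demo():
    """Build a small temporary tree and expand `**/file*` beneath it."""
    with tempfile.TemporaryDirectory() as root:
        for rel in ("dirA/sub/file1.txt", "dirA/other.txt", "file2.txt"):
            target = Path(root, rel)
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch()
        found = walk_and_filter(root, ["file*"])
        return sorted(p.relative_to(root).as_posix() for p in found)
```

Calling `demo()` walks the temporary tree once and returns the relative paths whose tail matches `file*`; no regex is built for anything that precedes the `**` segment.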
Re-opening: there's another optimization possible. Previous versions of pathlib used […] If we can somehow check that the underlying filesystem is case-sensitive, and the user sets […] Alternatively, we could add a `normalize_case` argument to […] Alternatively alternatively, we could enable this behaviour specifically when `case_sensitive` is set to […] I need to think about this more.
… scanning. For ordinary literal pattern segments (e.g. `foo/bar` in `foo/bar/../**`), skip calling `_scandir()` on each segment, and instead call `exists()` or `is_dir()` as necessary to exclude missing paths. This only applies when *case_sensitive* is `None` (the default); otherwise we can't guarantee case sensitivity or realness with this approach. If *follow_symlinks* is `False` we also need to exclude symlinks from intermediate segments. This restores an optimization that was removed in da1980a by some eejit. It's actually even faster because we don't `stat()` intermediate directories, and in some cases we can skip all filesystem access when expanding a literal part (e.g. when it's followed by a non-recursive wildcard segment).
Closed because I'm planning to make pathlib use the […]
Re-opening because I'm a prize eejit.
Move pathlib globbing implementation to a new module and class: `pathlib._glob.Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects. In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`. This sets the stage for two more improvements: - pythonGH-115060: Query non-wildcard segments with `lstat()` - pythonGH-116380: Move `pathlib._glob` to `glob` (unify implementations).
…17589) Move pathlib globbing implementation into a new private class: `glob._Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects. In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`. This sets the stage for two more improvements: - GH-115060: Query non-wildcard segments with `lstat()` - GH-116380: Unify `pathlib` and `glob` implementations of globbing. No change to the implementations of `glob.glob()` and `glob.iglob()`.
…al parts Don't bother calling `os.scandir()` to scan for literal pattern segments, like `foo` in `foo/*.py`. Instead, append the segment(s) as-is and call through to the next selector with `exists=False`, which signals that the path might not exist. Subsequent selectors will call `os.scandir()` or `os.lstat()` to filter out missing paths as needed.
…ts (#117732) Don't bother calling `os.scandir()` to scan for literal pattern segments, like `foo` in `foo/*.py`. Instead, append the segment(s) as-is and call through to the next selector with `exists=False`, which signals that the path might not exist. Subsequent selectors will call `os.scandir()` or `os.lstat()` to filter out missing paths as needed.
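The `exists=False` hand-off described above could look roughly like the following. This is a sketch under assumptions, not the real `glob._Globber` code: the selector functions, their signatures and the `demo` helper are hypothetical and only illustrate how a literal segment can be appended without touching the filesystem while later selectors verify existence.

```python
import os
import tempfile


def literal_selector(part, next_selector):
    """Append a literal segment without any filesystem access."""
    def select(path, exists):
        # The joined path might not exist, so pass exists=False and let
        # a later selector check it if that turns out to be necessary.
        yield from next_selector(os.path.join(path, part), False)
    return select


def wildcard_selector(match, next_selector):
    """Expand a non-recursive wildcard segment with os.scandir()."""
    def select(path, exists):
        try:
            with os.scandir(path) as it:
                entries = list(it)
        except OSError:
            return  # missing or unreadable directory: nothing to yield
        for entry in entries:
            if match(entry.name):
                # scandir() returned this entry, so it is known to exist.
                yield from next_selector(entry.path, True)
    return select


def terminal_selector(path, exists):
    """End of the pattern: confirm existence only if still unknown."""
    if not exists:
        try:
            os.lstat(path)
        except OSError:
            return
    yield path


def demo():
    """Expand hand-built chains for `foo/*.py`, `missing/*.py` and `missing`."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "foo"))
        for name in ("a.py", "b.txt"):
            open(os.path.join(root, "foo", name), "w").close()

        def is_py(name):
            return name.endswith(".py")

        found = literal_selector("foo", wildcard_selector(is_py, terminal_selector))
        absent = literal_selector("missing", wildcard_selector(is_py, terminal_selector))
        literal_only = literal_selector("missing", terminal_selector)
        return (
            [os.path.relpath(p, root) for p in found(root, True)],
            list(absent(root, True)),
            list(literal_only(root, True)),
        )
```

In this sketch, `foo/*.py` costs one `os.scandir()` call and no separate check of `foo` itself, while a pattern consisting only of a literal part falls back to a single `os.lstat()` at the end of the chain.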
Sorry for the close/open spam, I keep spotting things to do.
…stat()` Since 6258844, paths that might not exist can be fed into pathlib's globbing implementation, which will call `os.scandir()` / `os.lstat()` only when strictly necessary. This allows us to drop an initial `self.is_dir()` call, which saves a `stat()`.
#117831) Since 6258844, paths that might not exist can be fed into pathlib's globbing implementation, which will call `os.scandir()` / `os.lstat()` only when strictly necessary. This allows us to drop an initial `self.is_dir()` call, which saves a `stat()`. Co-authored-by: Shantanu
In #104512 we made `pathlib.Path.glob()` use a "walk-and-filter" strategy for expanding `**` wildcards in patterns: when we encounter a `**` segment, we immediately consume subsequent segments and use them to build a regex that is used to filter results. This saves a bunch of `scandir()` calls.

However! We actually build a regex for the entire pattern given to `glob()`, rather than just the segments following `**` wildcards. And so when evaluating a pattern like `dir*/**/file*`, the `dir*` part is needlessly matched twice against each path. @zooba noted this in a review comment at the time.

We should be able to improve performance by building an `re.Pattern` only for segments following `**` wildcards, and not the entire `glob()` pattern.

Linked PRs
- `pathlib.Path.glob()` by removing redundant regex matching #115061
- `pathlib.Path.glob()` by skipping directory scanning #116152
- `pathlib.Path.glob()` by not scanning literal parts #117732
- `pathlib.Path.glob()` by omitting initial `stat()` #117831
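The redundancy described in the issue body can be illustrated with a string-only sketch. It assumes POSIX-style `/` separators and a hypothetical `translate` helper that understands only `*` and `**`; it is not the translation pathlib actually uses. Every candidate path comes from a directory that was already selected by expanding `dir*`, so filtering with the full-pattern regex and filtering the relative remainder with a tail-only regex give the same result.

```python
import re


def translate(parts):
    """Turn pattern parts into a compiled regex (hypothetical helper)."""
    out = []
    for part in parts:
        if part == "**":
            # Zero or more whole directory levels.
            out.append(r"(?:[^/]+/)*")
        else:
            body = "".join("[^/]*" if c == "*" else re.escape(c) for c in part)
            out.append(body + "/")
    regex = "".join(out)
    # Drop the separator added after the final non-recursive part.
    return re.compile(regex[:-1] if regex.endswith("/") else regex)


def demo():
    # Relative paths found by walking below each directory that already
    # matched the leading `dir*` segment.
    walked = {
        "dir1": ["file_a", "sub/file_b", "sub/other"],
        "dir2": ["notes", "x/y/file_c"],
    }
    full = translate(["dir*", "**", "file*"])  # re-checks `dir*` on every path
    tail = translate(["**", "file*"])          # checks only what follows `dir*`

    via_full = [
        f"{d}/{rel}"
        for d, rels in walked.items()
        for rel in rels
        if full.fullmatch(f"{d}/{rel}")
    ]
    via_tail = [
        f"{d}/{rel}"
        for d, rels in walked.items()
        for rel in rels
        if tail.fullmatch(rel)
    ]
    return via_full, via_tail
```

Both lists returned by `demo()` are identical, which is why the work spent re-matching `dir*` in the full-pattern regex is wasted.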