Speed up `pathlib.Path.glob()` by removing redundant regex matching #115060

barneygale · 2024-02-06T03:53:20Z

In #104512 we made pathlib.Path.glob() use a "walk-and-filter" strategy for expanding ** wildcards in patterns: when we encounter a ** segment, we immediately consume subsequent segments and use them to build a regex that is used to filter results. This saves a bunch of scandir() calls.

However! We actually build a regex for the entire pattern given to glob(), rather than just the segments following ** wildcards. And so when evaluating a pattern like dir*/**/file*, the dir* part is needlessly matched twice against each path. @zooba noted this in a review comment at the time.

We should be able to improve performance by building an re.Pattern only for segments following ** wildcards, and not the entire glob() pattern.

Linked PRs

The text was updated successfully, but these errors were encountered:

… regex matching When expanding and filtering paths for a `**` wildcard segment, build an `re.Pattern` object from the subsequent pattern parts, rather than the entire pattern. Also skip compiling a pattern when expanding a `*` wildcard segment.

… matching (#115061) When expanding and filtering paths for a `**` wildcard segment, build an `re.Pattern` object from the subsequent pattern parts, rather than the entire pattern, and match against the `os.DirEntry` object prior to instantiating a path object. Also skip compiling a pattern when expanding a `*` wildcard segment.

… regex matching (python#115061) When expanding and filtering paths for a `**` wildcard segment, build an `re.Pattern` object from the subsequent pattern parts, rather than the entire pattern, and match against the `os.DirEntry` object prior to instantiating a path object. Also skip compiling a pattern when expanding a `*` wildcard segment.

barneygale · 2024-02-26T19:06:19Z

Re-opening: there's another optimization possible.

Previous versions of pathlib used path.exists() or path.is_dir() to check whether non-wildcard segments exist, rather than calling path._scandir() on the parent and filtering through a regex. This was dropped when we added the case_sensitive argument.

If we can somehow check that the underlying filesystem is case-sensitive, and the user sets case_sensitive=True (or leaves it at the default of case_sensitive=None on POSIX), then we can restore this more direct check for existence. And we can do better! For patterns like */foo/bar.py, we can skip checking whether foo is a directory, because we're going to check foo/bar.py exists anyway. Likewise for foo/*, where the _scandir() call failing makes an is_dir() check irrelevant.

Alternatively, we could add a normalize_case argument to glob(), defaulting to True. If set to False, and case_sensitive is effectively False, then literal segments in results wouldn't be case-normalized.

Alternatively alternatively, we could enable this behaviour specifically when case_sensitive is set to None (the default), and so users would need to set case_sensitive=False to get fully case-normalized results.

I need to think about this more.

… scanning. For ordinary literal pattern segments (e.g. `foo/bar` in `foo/bar/../**`), skip calling `_scandir()` on each segment, and instead call `exists()` or `is_dir()` as necessary to exclude missing paths. This only applies when *case_sensitive* is `None` (the default); otherwise we can't guarantee case sensitivity or realness with this approach. If *follow_symlinks* is `False` we also need to exclude symlinks from intermediate segments. This restores an optimization that was removed in da1980a by some eejit. It's actually even faster because we don't `stat()` intermediate directories, and in some cases we can skip all filesystem access when expanding a literal part (e.g. when it's followed by a non-recursive wildcard segment).

barneygale · 2024-04-06T16:13:50Z

Closed because I'm planning to make pathlib use the glob module for globbing, see #116392

barneygale · 2024-04-06T19:18:27Z

Re-opening because I'm a prize eejit.

Move pathlib globbing implementation to a new module and class: `pathlib._glob.Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects. In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`. This sets the stage for two more improvements: - pythonGH-115060: Query non-wildcard segments with `lstat()` - pythonGH-116380: Move `pathlib._glob` to `glob` (unify implementations).

…17589) Move pathlib globbing implementation into a new private class: `glob._Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects. In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`. This sets the stage for two more improvements: - GH-115060: Query non-wildcard segments with `lstat()` - GH-116380: Unify `pathlib` and `glob` implementations of globbing. No change to the implementations of `glob.glob()` and `glob.iglob()`.

…al parts Don't bother calling `os.scandir()` to scan for literal pattern segments, like `foo` in `foo/*.py`. Instead, append the segment(s) as-is and call through to the next selector with `exists=False`, which signals that the path might not exist. Subsequent selectors will call `os.scandir()` or `os.lstat()` to filter out missing paths as needed.

…ts (#117732) Don't bother calling `os.scandir()` to scan for literal pattern segments, like `foo` in `foo/*.py`. Instead, append the segment(s) as-is and call through to the next selector with `exists=False`, which signals that the path might not exist. Subsequent selectors will call `os.scandir()` or `os.lstat()` to filter out missing paths as needed.

barneygale · 2024-04-13T00:43:38Z

Sorry for the close/open spam, I keep spotting things to do.

…stat()` Since 6258844, paths that might not exist can be fed into pathlib's globbing implementation, which will call `os.scandir()` / `os.lstat()` only when strictly necessary. This allows us to drop an initial `self.is_dir()` call, which saves a `stat()`.

#117831) Since 6258844, paths that might not exist can be fed into pathlib's globbing implementation, which will call `os.scandir()` / `os.lstat()` only when strictly necessary. This allows us to drop an initial `self.is_dir()` call, which saves a `stat()`. Co-authored-by: Shantanu <[email protected]>

…gs (python#117589) Move pathlib globbing implementation into a new private class: `glob._Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects. In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`. This sets the stage for two more improvements: - pythonGH-115060: Query non-wildcard segments with `lstat()` - pythonGH-116380: Unify `pathlib` and `glob` implementations of globbing. No change to the implementations of `glob.glob()` and `glob.iglob()`.

…al parts (python#117732) Don't bother calling `os.scandir()` to scan for literal pattern segments, like `foo` in `foo/*.py`. Instead, append the segment(s) as-is and call through to the next selector with `exists=False`, which signals that the path might not exist. Subsequent selectors will call `os.scandir()` or `os.lstat()` to filter out missing paths as needed.

…stat()` (python#117831) Since 6258844, paths that might not exist can be fed into pathlib's globbing implementation, which will call `os.scandir()` / `os.lstat()` only when strictly necessary. This allows us to drop an initial `self.is_dir()` call, which saves a `stat()`. Co-authored-by: Shantanu <[email protected]>

barneygale added performance Performance or resource usage topic-pathlib labels Feb 6, 2024

bedevere-app bot mentioned this issue Feb 6, 2024

GH-115060: Speed up pathlib.Path.glob() by removing redundant regex matching #115061

Merged

barneygale closed this as completed Feb 10, 2024

barneygale reopened this Feb 26, 2024

bedevere-app bot mentioned this issue Feb 29, 2024

GH-115060: Speed up pathlib.Path.glob() by skipping directory scanning #116152

Closed

barneygale closed this as completed Apr 6, 2024

barneygale reopened this Apr 6, 2024

barneygale mentioned this issue Apr 6, 2024

GH-117586: Speed up pathlib.Path.glob() by working with strings #117589

Merged

bedevere-app bot mentioned this issue Apr 10, 2024

GH-115060: Speed up pathlib.Path.glob() by not scanning literal parts #117732

Merged

barneygale closed this as completed Apr 12, 2024

barneygale reopened this Apr 13, 2024

bedevere-app bot mentioned this issue Apr 13, 2024

GH-115060: Speed up pathlib.Path.glob() by omitting initial stat() #117831

Merged

barneygale closed this as completed Apr 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up `pathlib.Path.glob()` by removing redundant regex matching #115060

Speed up `pathlib.Path.glob()` by removing redundant regex matching #115060

barneygale commented Feb 6, 2024 •

edited by bedevere-app bot

Loading

barneygale commented Feb 26, 2024 •

edited

Loading

barneygale commented Apr 6, 2024

barneygale commented Apr 6, 2024

barneygale commented Apr 13, 2024

Speed up pathlib.Path.glob() by removing redundant regex matching #115060

Speed up pathlib.Path.glob() by removing redundant regex matching #115060

Comments

barneygale commented Feb 6, 2024 • edited by bedevere-app bot Loading

Linked PRs

barneygale commented Feb 26, 2024 • edited Loading

barneygale commented Apr 6, 2024

barneygale commented Apr 6, 2024

barneygale commented Apr 13, 2024

Speed up `pathlib.Path.glob()` by removing redundant regex matching #115060

Speed up `pathlib.Path.glob()` by removing redundant regex matching #115060

barneygale commented Feb 6, 2024 •

edited by bedevere-app bot

Loading

barneygale commented Feb 26, 2024 •

edited

Loading