Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up pathlib.Path.glob() by removing redundant regex matching #115060

Closed
barneygale opened this issue Feb 6, 2024 · 4 comments
Closed

Speed up pathlib.Path.glob() by removing redundant regex matching #115060

barneygale opened this issue Feb 6, 2024 · 4 comments
Labels
performance Performance or resource usage topic-pathlib

Comments

@barneygale
Copy link
Contributor

barneygale commented Feb 6, 2024

In #104512 we made pathlib.Path.glob() use a "walk-and-filter" strategy for expanding ** wildcards in patterns: when we encounter a ** segment, we immediately consume subsequent segments and use them to build a regex that is used to filter results. This saves a bunch of scandir() calls.

However! We actually build a regex for the entire pattern given to glob(), rather than just the segments following ** wildcards. And so when evaluating a pattern like dir*/**/file*, the dir* part is needlessly matched twice against each path. @zooba noted this in a review comment at the time.

We should be able to improve performance by building an re.Pattern only for segments following ** wildcards, and not the entire glob() pattern.

Linked PRs

@barneygale barneygale added performance Performance or resource usage topic-pathlib labels Feb 6, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Feb 6, 2024
… regex matching

When expanding and filtering paths for a `**` wildcard segment, build an
`re.Pattern` object from the subsequent pattern parts, rather than the
entire pattern.

Also skip compiling a pattern when expanding a `*` wildcard segment.
barneygale added a commit that referenced this issue Feb 10, 2024
… matching (#115061)

When expanding and filtering paths for a `**` wildcard segment, build an `re.Pattern` object from the subsequent pattern parts, rather than the entire pattern, and match against the `os.DirEntry` object prior to instantiating a path object. Also skip compiling a pattern when expanding a `*` wildcard segment.
fsc-eriker pushed a commit to fsc-eriker/cpython that referenced this issue Feb 14, 2024
… regex matching (python#115061)

When expanding and filtering paths for a `**` wildcard segment, build an `re.Pattern` object from the subsequent pattern parts, rather than the entire pattern, and match against the `os.DirEntry` object prior to instantiating a path object. Also skip compiling a pattern when expanding a `*` wildcard segment.
@barneygale
Copy link
Contributor Author

barneygale commented Feb 26, 2024

Re-opening: there's another optimization possible.

Previous versions of pathlib used path.exists() or path.is_dir() to check whether non-wildcard segments exist, rather than calling path._scandir() on the parent and filtering through a regex. This was dropped when we added the case_sensitive argument.

If we can somehow check that the underlying filesystem is case-sensitive, and the user sets case_sensitive=True (or leaves it at the default of case_sensitive=None on POSIX), then we can restore this more direct check for existence. And we can do better! For patterns like */foo/bar.py, we can skip checking whether foo is a directory, because we're going to check foo/bar.py exists anyway. Likewise for foo/*, where the _scandir() call failing makes an is_dir() check irrelevant.

Alternatively, we could add a normalize_case argument to glob(), defaulting to True. If set to False, and case_sensitive is effectively False, then literal segments in results wouldn't be case-normalized.

Alternatively alternatively, we could enable this behaviour specifically when case_sensitive is set to None (the default), and so users would need to set case_sensitive=False to get fully case-normalized results.

I need to think about this more.

@barneygale barneygale reopened this Feb 26, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Feb 29, 2024
… scanning.

For ordinary literal pattern segments (e.g. `foo/bar` in `foo/bar/../**`),
skip calling `_scandir()` on each segment, and instead call `exists()` or
`is_dir()` as necessary to exclude missing paths. This only applies when
*case_sensitive* is `None` (the default); otherwise we can't guarantee case
sensitivity or realness with this approach. If *follow_symlinks* is `False`
we also need to exclude symlinks from intermediate segments.

This restores an optimization that was removed in da1980a by some eejit.
It's actually even faster because we don't `stat()` intermediate
directories, and in some cases we can skip all filesystem access when
expanding a literal part (e.g. when it's followed by a non-recursive
wildcard segment).
@barneygale
Copy link
Contributor Author

Closed because I'm planning to make pathlib use the glob module for globbing, see #116392

@barneygale
Copy link
Contributor Author

Re-opening because I'm a prize eejit.

@barneygale barneygale reopened this Apr 6, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Apr 6, 2024
Move pathlib globbing implementation to a new module and class:
`pathlib._glob.Globber`. This class implements fast string-based globbing.
It's called by `pathlib.Path.glob()`, which then converts strings back to
path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that
works with `PathBase` objects rather than strings, and calls user-defined
path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Move `pathlib._glob` to `glob` (unify implementations).
barneygale added a commit that referenced this issue Apr 10, 2024
…17589)

Move pathlib globbing implementation into a new private class: `glob._Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- GH-115060: Query non-wildcard segments with `lstat()`
- GH-116380: Unify `pathlib` and `glob` implementations of globbing.

No change to the implementations of `glob.glob()` and `glob.iglob()`.
barneygale added a commit to barneygale/cpython that referenced this issue Apr 10, 2024
…al parts

Don't bother calling `os.scandir()` to scan for literal pattern segments,
like `foo` in `foo/*.py`. Instead, append the segment(s) as-is and call
through to the next selector with `exists=False`, which signals that the
path might not exist. Subsequent selectors will call `os.scandir()` or
`os.lstat()` to filter out missing paths as needed.
barneygale added a commit that referenced this issue Apr 12, 2024
…ts (#117732)

Don't bother calling `os.scandir()` to scan for literal pattern segments,
like `foo` in `foo/*.py`. Instead, append the segment(s) as-is and call
through to the next selector with `exists=False`, which signals that the
path might not exist. Subsequent selectors will call `os.scandir()` or
`os.lstat()` to filter out missing paths as needed.
@barneygale barneygale reopened this Apr 13, 2024
@barneygale
Copy link
Contributor Author

Sorry for the close/open spam, I keep spotting things to do.

barneygale added a commit to barneygale/cpython that referenced this issue Apr 13, 2024
…stat()`

Since 6258844, paths that might not exist can be fed into pathlib's
globbing implementation, which will call `os.scandir()` / `os.lstat()` only
when strictly necessary. This allows us to drop an initial `self.is_dir()`
call, which saves a `stat()`.
barneygale added a commit that referenced this issue Apr 13, 2024
#117831)

Since 6258844, paths that might not exist can be fed into pathlib's
globbing implementation, which will call `os.scandir()` / `os.lstat()` only
when strictly necessary. This allows us to drop an initial `self.is_dir()`
call, which saves a `stat()`.

Co-authored-by: Shantanu <[email protected]>
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…gs (python#117589)

Move pathlib globbing implementation into a new private class: `glob._Globber`. This class implements fast string-based globbing. It's called by `pathlib.Path.glob()`, which then converts strings back to path objects.

In the private pathlib ABCs, add a `pathlib._abc.Globber` subclass that works with `PathBase` objects rather than strings, and calls user-defined path methods like `PathBase.stat()` rather than `os.stat()`.

This sets the stage for two more improvements:

- pythonGH-115060: Query non-wildcard segments with `lstat()`
- pythonGH-116380: Unify `pathlib` and `glob` implementations of globbing.

No change to the implementations of `glob.glob()` and `glob.iglob()`.
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…al parts (python#117732)

Don't bother calling `os.scandir()` to scan for literal pattern segments,
like `foo` in `foo/*.py`. Instead, append the segment(s) as-is and call
through to the next selector with `exists=False`, which signals that the
path might not exist. Subsequent selectors will call `os.scandir()` or
`os.lstat()` to filter out missing paths as needed.
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…stat()` (python#117831)

Since 6258844, paths that might not exist can be fed into pathlib's
globbing implementation, which will call `os.scandir()` / `os.lstat()` only
when strictly necessary. This allows us to drop an initial `self.is_dir()`
call, which saves a `stat()`.

Co-authored-by: Shantanu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-pathlib
Projects
None yet
Development

No branches or pull requests

2 participants
@barneygale and others