GH-116380: Speed up `glob.glob()` by removing some system calls #116392

barneygale · 2024-03-05T23:13:48Z

Speed up glob.glob() and glob.iglob() by reducing the number of system calls made.

This unifies the implementations of globbing in the glob and pathlib modules.

Depends on

Filtered recursive walk

Expanding a recursive ** segment entails walking the entire directory tree, so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example, glob.glob("foo/**/*.py", recursive=True) recursively walks foo/ with os.scandir(), and then filters paths through a regex based on "**/*.py", with no further filesystem access needed.
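
To illustrate the idea, here is a minimal sketch, not the code in this PR. It walks a tree once with os.scandir() and then filters the collected relative paths through a hand-written regex standing in for "**/*.py". The names walk(), PY_TAIL and filtered_walk() are hypothetical, and the sketch ignores hidden-file handling, symlinked directories and dir_fd.

```python
import os
import re


def walk(root):
    """Yield the path of every entry below root, relative to root.

    Uses an explicit stack instead of recursion; each directory is
    listed exactly once with os.scandir().
    """
    stack = [""]
    while stack:
        rel_dir = stack.pop()
        with os.scandir(os.path.join(root, rel_dir)) as entries:
            for entry in entries:
                rel_path = os.path.join(rel_dir, entry.name)
                yield rel_path
                if entry.is_dir(follow_symlinks=False):
                    stack.append(rel_path)


# Hand-written stand-in for the pattern tail "**/*.py":
# zero or more directory components, then a name ending in ".py".
_SEP = re.escape(os.sep)
PY_TAIL = re.compile(rf"(?:[^{_SEP}]+{_SEP})*[^{_SEP}]*\.py")


def filtered_walk(root, regex):
    """Return the sorted relative paths under root that fully match regex."""
    return sorted(p for p in walk(root) if regex.fullmatch(p))
```

Once the walk has produced the paths, the filtering step touches only strings, which is where the saved system calls come from.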

This solves #104269 as a side-effect.

Tracking path existence

We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern:

Certain special pattern segments ("", "." and "..") leave the flag unchanged
Literal pattern segments (e.g. foo/bar) set the flag to false
Wildcard pattern segments (e.g. */*.py) set the flag to true (because children are found via os.scandir())
Recursive pattern segments (e.g. **) leave the flag unchanged for the root path, and set it to true for descendants discovered via os.scandir().

If the flag is false at the end, we call lstat() on each path to filter out missing paths.
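
The rules above can be sketched as a small function that follows the flag for a single path. This is an illustration under assumptions, not the PR's code: final_flag() and is_wildcard() are hypothetical names, the initial flag value is passed in because the description does not state it, and a ** segment is modelled only for the root path of the walk (descendants found via os.scandir() would get True).

```python
SPECIAL_SEGMENTS = {"", ".", ".."}


def is_wildcard(segment):
    """Return True if the segment contains a glob metacharacter."""
    return any(ch in segment for ch in "*?[")


def final_flag(segments, exists):
    """Return the 'guaranteed to exist' flag after processing all segments."""
    for segment in segments:
        if segment in SPECIAL_SEGMENTS:
            continue          # special segment: flag unchanged
        if segment == "**":
            continue          # root path of the recursive walk: flag unchanged
        if is_wildcard(segment):
            exists = True     # children were listed by the OS, so they exist
        else:
            exists = False    # literal segment: nothing has checked the path
    return exists


def needs_lstat(segments, exists):
    """A final lstat() is only needed when the flag ends up False."""
    return not final_flag(segments, exists)
```

For example, a pattern ending in a wildcard needs no final lstat(), while one ending in a literal segment does.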

Minor speed-ups

We:

Exclude paths that don't match a non-terminal non-recursive wildcard pattern prior to calling is_dir().
Use a stack rather than recursion to implement recursive wildcards.
- Addresses #89727 ("Change shutil.rmtree and os.walk to support very deep hierarchies") for the glob module.
Pre-compile regular expressions and pre-join literal pattern segments.
Convert to/from bytes (a minor use-case) in iglob() rather than supporting bytes throughout. This particularly simplifies the code needed to handle relative bytes paths with dir_fd.
Avoid calling os.path.join(); instead we keep paths in a normalized form and append trailing slashes when needed.
Avoid calling os.path.normcase(); instead we use case-insensitive regex matching.
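
As an illustration of the last item, the sketch below compiles a pattern segment once and lets the regex flags handle case, rather than normalizing every candidate name. It uses fnmatch.translate() as a stand-in for the pattern translation; compile_segment() is a hypothetical helper, not a function from this PR.

```python
import fnmatch
import re


def compile_segment(pattern, case_sensitive):
    """Compile one pattern segment to a regex, matching case via flags."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(fnmatch.translate(pattern), flags)


# Compile once, then reuse the bound match method for every directory entry.
match_py = compile_segment("*.PY", case_sensitive=False).match
```

The compiled matcher can then be applied to raw entry names without calling os.path.normcase() on each one.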

Implementation notes

Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are:

Support for dir_fd
Support for include_hidden
Support for generating paths relative to root_dir

Results

Speedups measured with python -m timeit -s "from glob import glob" "glob(pattern, recursive=True, include_hidden=True)", run from the CPython source directory on Linux:

pattern	speedup
`Lib/*`	1.48x
`Lib/*/`	1.82x
`Lib/*.py`	1.15x
`Lib/**`	4.98x
`Lib/**/`	1.31x
`Lib/*/`	1.82x
`Lib//`	14.9x
`Lib/*//`	2.25x
`Lib/*/.py`	1.81x
`Lib/**/__init__.py`	1.08x
`Lib/*//*.py`	2.38x
`Lib/*//__init__.py`	1.19x

Issue: Speed up glob.glob() by reducing number of system calls made #116380

barneygale · 2024-03-05T23:53:36Z

Needs a fix for #116377 to land.

gpshead

Nice work!

serhiy-storchaka

Please do not hurry to merge. This is old code. The main advantage of the initial code was its simplicity, but since then it has been complicated by new features and optimizations. In particular, the use of os.scandir() instead of os.listdir() significantly improved performance. The new implementation should be benchmarked with different test cases: deep and wide trees, and trees dominated by files or by directories.

barneygale · 2024-03-07T12:01:44Z

Thanks Serhiy! We use os.scandir() if either:

We're expanding a recursive wildcard (we need to distinguish directories in order to recurse)
We're expanding a non-final non-recursive wildcard (we need to select only directories)

If neither of these is true, then we don't need to stat() the children, so os.listdir() is actually a little faster, I think. But I will test this on a few machines to be sure!

edit: to further illustrate what I mean, here's where os.listdir() is used:

	non-recursive part	recursive part
non-terminal part	`os.scandir()`	`os.scandir()`
terminal part	`os.listdir()` <--	`os.scandir()`
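
A minimal sketch of that choice, assuming a hypothetical helper child_names() rather than the actual selector code in this PR:

```python
import os


def child_names(path, dirs_only):
    """Return names of the children of path.

    dirs_only=True corresponds to a non-terminal wildcard: the entry type
    matters, so os.scandir() is used and only directories are kept.
    dirs_only=False corresponds to a terminal non-recursive wildcard: the
    entry type is irrelevant, so plain os.listdir() is enough.
    """
    if dirs_only:
        with os.scandir(path) as entries:
            return [entry.name for entry in entries if entry.is_dir()]
    return os.listdir(path)
```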

barneygale · 2024-03-10T18:49:54Z

The new implementation should be benchmarked with different test cases: deep and wide trees, and trees dominated by files or by directories.

I've been looking into this! The randomfiletree project is helpful: it can repeatedly walk a tree and create child files/folders according to a Gaussian distribution, which seems to me like a good approximation of an average "shallow and wide" filesystem structure, and it can be tweaked to favour files or folders.

It's difficult to produce "deep and narrow" trees this way, as the file/folder probability would need to change with the depth (I think?). I've been considering writing a tree generator that works this way, e.g.:

At depth==0, generate 100 subdirectories
At 0 < depth < 50, generate 1 subdirectory
At depth==50, generate 100 files

... but is that overly arbitrary? Is there a better way? Or do I just need to come up with a bunch of test cases along those lines?
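
One possible reading of that scheme, as a rough sketch: make_deep_tree() and the directory and file names are made up for illustration, and the three numbers are parameters so that smaller trees can be generated for a quick check.

```python
import os


def make_deep_tree(root, fanout=100, depth=50, files=100):
    """Create a 'deep and narrow' tree below root.

    The root (depth 0) gets `fanout` subdirectories. Each directory at
    0 < depth' < depth gets exactly one subdirectory, so every chain ends
    at level `depth`, where `files` empty files are created.
    """
    for i in range(fanout):
        path = os.path.join(root, f"d{i}")   # level 1
        for _ in range(1, depth):            # levels 2 .. depth
            path = os.path.join(path, "sub")
        os.makedirs(path)
        for j in range(files):
            open(os.path.join(path, f"f{j}"), "w").close()
```

With the default values this creates 5,000 directories and 10,000 files.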

barneygale · 2024-03-15T01:18:37Z

A test of 100 nested directories named "deep" from my Linux machine:

pattern	speedup
`deep/**`	3.86x
`deep/**/`	4.03x
`deep/*/`	4.92x
`deep/*//`	4.93x

barneygale · 2024-08-28T16:36:43Z

@picnixz to address some of your comments on using map() rather than looping and yield: I did this so that calling close() on the iglob(dir_fd=blah) generator causes os.close() to be called on all open file descriptors, which seems to work with a stack of for loops but not map(). But I didn't add a test case - I'll do that now :)
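
For readers unfamiliar with the mechanism: calling close() on a generator raises GeneratorExit at the suspended yield, so any enclosing finally block (or with statement) in that generator runs immediately. The sketch below shows only that general behaviour with a log standing in for file descriptors. It is not the glob code, it does not reproduce the map() case, and iter_items() is a hypothetical name.

```python
def iter_items(resources, log):
    """Yield (resource, item) pairs, 'opening' and 'closing' each resource."""
    for name in resources:
        log.append(f"open {name}")       # stands in for os.open()
        try:
            for item in range(3):
                yield name, item
        finally:
            log.append(f"close {name}")  # stands in for os.close()


log = []
gen = iter_items(["a", "b"], log)
first = next(gen)   # suspends inside the try block for resource "a"
gen.close()         # GeneratorExit is raised at the yield; finally runs
```

After close(), the log records that resource "a" was released even though iteration stopped early, and resource "b" was never opened.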

picnixz · 2024-08-28T16:44:23Z

That's... an interesting functionality I wasn't aware of :) If someone could explain to me the reason I'd be happy. Anyway, let's keep your loops.

Co-authored-by: Bénédikt Tran <[email protected]>

picnixz

The journey was a bit long but I have nothing else to say here! Thank you, as always, for addressing my nitpicking comments and considering my suggestions!

I'd be more comfortable if @serhiy-storchaka can have a final look at it just to see whether I've missed some logic.

barneygale · 2024-10-31T00:25:00Z

Thank you very much for the thorough review. Your comments were very helpful as ever, thank you for being patient.

pythonGH-116380: Make glob.glob() twice as fast

db3c620

barneygale added the performance (Performance or resource usage) label Mar 5, 2024

bedevere-app bot added the awaiting core review label Mar 5, 2024

Use os.listdir() if we don't need to check entry type.

9e1f059

barneygale added 8 commits March 6, 2024 01:21

A few small speedups.

10432df

Simplify prefix removal

7e389e2

Re-implement glob0(), glob1(), and has_magic().

8680a0a

Fix errant StopIteration.

3bf3124

Skip compiling pattern for consecutive ** segments.

f8fb992

Clarify regex/path building in literal and recursive selectors.

50ef080

Simplify code to ignore root_dir.

ccefacd

Fix possible Windows separator issue.

fa951f6

gpshead reviewed Mar 6, 2024

Privat33r-dev reviewed Mar 6, 2024

Privat33r-dev reviewed Mar 6, 2024

serhiy-storchaka self-requested a review March 6, 2024 17:12

barneygale added 5 commits March 6, 2024 21:20

Address some review feedback.

0aec12c

Use assignment expressions in a couple of places

72691ba

Replace lambda with operator.not_.

c58dd21

Merge branch 'main' into pythongh-116380

c361ec9

Speed up _add_trailing_slash()

22b30db

barneygale added 2 commits March 7, 2024 02:10

Speed up select_literal()

83b70bd

Speed up select_recursive()

1d32d14

serhiy-storchaka reviewed Mar 7, 2024

barneygale added 5 commits May 31, 2024 22:22

Make _relative_glob() a generator.

f9f9a8d

Simplify skipping empty string

24a9ee4

Merge branch 'main' into pythongh-116380

d05d58d

Merge branch 'main' into pythongh-116380

27c463e

Make _GlobberBase fully abstract.

a94f2a7

eryksun reviewed Jun 7, 2024

picnixz reviewed Jun 9, 2024

barneygale added 5 commits June 9, 2024 21:12

Address review feedback

d19bb89

Typo fix

1677588

Speed up pattern parsing.

539f044

Add test for globbing above recursion limit.

70a1b42

Merge branch 'main' into pythongh-116380

1560712

picnixz self-requested a review August 28, 2024 12:29

picnixz reviewed Aug 28, 2024

barneygale and others added 6 commits September 1, 2024 15:52

Apply suggestions from code review

099e86e

Co-authored-by: Bénédikt Tran <[email protected]>

Test that iglob().close() closes file descriptors.

ee76faf

Address some review feedback

4cf8a4d

Merge branch 'main' into pythongh-116380

8a118a7

Address more review comments

3ad9367

Drop parse_entry

66af33d

barneygale requested a review from picnixz October 27, 2024 23:27

picnixz reviewed Oct 28, 2024

barneygale added 2 commits October 28, 2024 18:56

Address review feedback

ce74ef1

Add comment.

a69a060

barneygale requested a review from picnixz October 31, 2024 00:18

picnixz approved these changes Oct 31, 2024

Merge branch 'main' into pythongh-116380

a10a1e0
