Trailrunner uses pathlib to iteratively walk the tree. Internally pathlib uses os.scandir but, crucially, pathlib throws away all of the extra metadata and just returns the Path objects.
That means that calls to child.is_file() and child.is_dir() will perform extra syscalls.
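To make the difference concrete, here is a minimal sketch (not Trailrunner's code; the helper names are made up for illustration) that classifies the same directory both ways. The results are identical, but each Path.is_file() call typically issues its own stat, whereas os.DirEntry.is_file() can usually answer from the type information the directory scan already returned.

```python
import os
import tempfile
from pathlib import Path


def files_via_pathlib(directory):
    # Hypothetical helper: every child.is_file() here typically costs a stat call.
    return sorted(child.name for child in Path(directory).iterdir() if child.is_file())


def files_via_scandir(directory):
    # Hypothetical helper: DirEntry.is_file() usually reuses the entry type
    # cached by the scan, so non-symlink entries need no extra syscall.
    with os.scandir(directory) as entries:
        return sorted(entry.name for entry in entries if entry.is_file())


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.py").write_text("")
    Path(tmp, "b.txt").write_text("")
    Path(tmp, "pkg").mkdir()
    pathlib_files = files_via_pathlib(tmp)
    scandir_files = files_via_scandir(tmp)
```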
If we switch to using os.walk then we can eliminate the additional is_file()/is_dir() syscalls, but we still incur the cost of an additional syscall to check whether an entry is a symlink (2 per file).
If we switch to using os.scandir directly and ignore symlinks altogether, then we're able to hit 1 syscall per file.
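One way to make "ignore symlinks altogether" explicit is to pass follow_symlinks=False to the DirEntry checks. The following is a sketch under that assumption, with a hypothetical function name; it is not the function that was benchmarked, and it uses an explicit stack instead of recursion.

```python
import os


def iter_py_files_no_symlinks(root_path):
    # Hypothetical name. Walks the tree with an explicit stack of directories.
    stack = [root_path]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # follow_symlinks=False means a symlink is never reported as a
                # directory or a regular file, so it is skipped entirely.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False) and entry.name.endswith(".py"):
                    yield entry.path
```

Because symlinks are never followed, a link that points back up the tree cannot cause an infinite walk.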
In my testing on a tree containing 41111 Python files (in directories nested 6 deep), returning only Python files, I get the following results (the exact syscall counts are arbitrary; the relative magnitudes are what matter):
Using Path.iterdir: 0.38s (user) @ 399980 file system syscalls
Using os.walk: 0.21s @ 104018 file system syscalls
Using os.scandir (ignoring symlinks!): 0.10s @ 64877 file system syscalls
To measure syscalls on macOS I'm using: sudo ktrace trace -S -f C3 -c python <thetest>.py | wc -l.
Ultimately, does this matter? Probably not for most people. Switching to scandir is between 3x and 4x faster, but unless you're on a system where syscalls are expensive (e.g. remote shares) and you have a large directory structure, 0.3 of a second isn't much to write home about.
The test functions themselves:
    import os
    from pathlib import Path


    def pathlib_iter(root_path):
        root = Path(root_path).resolve()

        def gen(children):
            for child in children:
                # Each is_file()/is_dir() call on a Path performs its own stat.
                if child.is_file():
                    if child.suffix == ".py":
                        yield child
                elif child.is_dir():
                    yield from gen(child.iterdir())

        yield from gen([root])


    def os_walk(root_path):
        for root, dirs, files in os.walk(root_path):
            for file in files:
                if file.endswith(".py"):
                    yield os.path.join(root, file)


    def os_scandir(root_path):
        scandir = os.scandir

        def scan(dirs):
            for dir in dirs:
                subdirs = []
                scanner = scandir(dir)
                for dir_entry in scanner:
                    # DirEntry reuses the type information from the scan.
                    if dir_entry.is_file():
                        if dir_entry.name.endswith(".py"):
                            yield dir_entry.path
                    elif dir_entry.is_dir():
                        subdirs.append(dir_entry.path)
                # Descend only after the current directory is fully scanned.
                yield from scan(subdirs)

        yield from scan([root_path])
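For anyone who wants to repeat the comparison, a small harness along these lines could drive each function. This is only a sketch: time_walker and example_walker are hypothetical names, and it measures wall-clock time with time.perf_counter rather than the user time quoted above, so the numbers will not be directly comparable.

```python
import os
import time


def time_walker(walker, root_path):
    # Exhaust the generator so the whole tree is actually walked,
    # then report how many paths it produced and how long it took.
    start = time.perf_counter()
    count = sum(1 for _ in walker(root_path))
    elapsed = time.perf_counter() - start
    return count, elapsed


def example_walker(root_path):
    # Stand-in walker for the demo; any of the test functions could be
    # passed to time_walker instead.
    for root, _dirs, files in os.walk(root_path):
        for name in files:
            if name.endswith(".py"):
                yield os.path.join(root, name)


if __name__ == "__main__":
    count, elapsed = time_walker(example_walker, ".")
    print(f"{count} files in {elapsed:.3f}s")
```

Running each walker a few times and discarding the first run helps separate the cost of the walk itself from a cold file system cache.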