Consider using os.walk or os.scandir rather than pathlib #116

Open
jarshwah opened this issue Jun 6, 2024 · 0 comments
jarshwah commented Jun 6, 2024

Trailrunner uses pathlib to walk the tree iteratively. Internally, pathlib uses os.scandir but, crucially, it throws away all of the extra metadata that scandir collects and returns only the Path objects.

That means that calls to child.is_file() and child.is_dir() will perform extra syscalls.

If we switch to using os.walk, we can eliminate the additional is_file()/is_dir() syscalls for each child, but we still incur one additional syscall to check whether a file is a symlink (2 syscalls per file).

If we switch to using os.scandir directly and ignore symlinks altogether, then we're able to hit 1 syscall per file.
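As a minimal sketch of what "ignoring symlinks altogether" could look like, the variant below passes `follow_symlinks=False` to the `DirEntry` checks, so symlinked files and directories are skipped rather than resolved. The function name `iter_py_files_no_symlinks` is hypothetical and not part of trailrunner, and whether the checks avoid a `stat()` call depends on the platform and filesystem reporting the entry type in the directory listing.

```python
import os

def iter_py_files_no_symlinks(root_path):
    """Yield paths of .py files under root_path, never following symlinks."""
    stack = [root_path]
    while stack:
        current = stack.pop()
        # The context manager closes the scandir iterator promptly.
        with os.scandir(current) as entries:
            for entry in entries:
                # With follow_symlinks=False a symlink is neither a file nor
                # a directory here, so it is skipped entirely. In most cases
                # the entry type comes from the directory listing itself.
                if entry.is_file(follow_symlinks=False):
                    if entry.name.endswith(".py"):
                        yield entry.path
                elif entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
```

Note that this differs from the `os_scandir` test function in this issue, which calls `is_file()` and `is_dir()` with their default arguments.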

Testing on a tree containing 41,111 Python files (in nested directories 6 deep), and returning only Python files, I get the following results (the exact syscall counts are arbitrary; the relative magnitude is what matters):

  1. Using Path.iterdir: 0.38s (user) @ 399980 file system syscalls
  2. Using os.walk: 0.21s @ 104018 file system syscalls
  3. Using os.scandir (ignoring symlinks!): 0.10s @ 64877 file system syscalls
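For anyone wanting to reproduce user-time figures like these, here is one possible harness. It is only a sketch: `time_walker` and `walk_py` are hypothetical names, and it reads CPU time from `os.times()`, which may not be how the numbers above were collected. The resolution of `os.times()` can be coarse, so it is only useful on large trees.

```python
import os

def time_walker(walker, root_path):
    """Run walker(root_path) to exhaustion.

    Returns (file_count, user_seconds, system_seconds) for this process.
    """
    before = os.times()
    count = sum(1 for _ in walker(root_path))
    after = os.times()
    return count, after.user - before.user, after.system - before.system

def walk_py(root_path):
    """Example walker: yield .py file paths found by os.walk."""
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for name in filenames:
            if name.endswith(".py"):
                yield os.path.join(dirpath, name)

if __name__ == "__main__":
    count, user, system = time_walker(walk_py, ".")
    print(f"{count} files, user={user:.2f}s, system={system:.2f}s")
```

Any generator function that takes a root path can be passed as `walker`, so each approach can be measured in the same way.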

To measure syscalls on macOS I'm using: sudo ktrace trace -S -f C3 -c python <thetest>.py | wc -l.

Ultimately, does this matter? Probably not for most people. Switching to scandir is roughly 3-4x faster, but unless you're on a system where syscalls are expensive (such as remote filesystems or network shares) and you have a large directory structure, 0.3 seconds isn't much to write home about.

The test functions themselves:

import os
from pathlib import Path

def pathlib_iter(root_path):
    root = Path(root_path).resolve()
    def gen(children):
        for child in children:
            if child.is_file():
                if child.suffix == ".py":
                    yield child
            elif child.is_dir():
                yield from gen(child.iterdir())
    yield from gen([root])

def os_walk(root_path):
    for root, dirs, files in os.walk(root_path):
        for file in files:
            if file.endswith(".py"):
                yield os.path.join(root, file)
    
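def os_scandir(root_path):
    scandir = os.scandir  # local alias avoids repeated attribute lookups
    def scan(directories):
        for directory in directories:
            subdirs = []
            # The context manager closes the scandir iterator promptly.
            with scandir(directory) as scanner:
                for dir_entry in scanner:
                    # DirEntry answers these checks from cached data where it can.
                    if dir_entry.is_file():
                        if dir_entry.name.endswith(".py"):
                            yield dir_entry.path
                    elif dir_entry.is_dir():
                        subdirs.append(dir_entry.path)
            yield from scan(subdirs)
    yield from scan([root_path])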