Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

E0401 (import-error) checks perform a lot of repeated stat calls #9310

Closed
correctmost opened this issue Dec 18, 2023 · 2 comments · Fixed by pylint-dev/astroid#2408
Closed
Labels
Enhancement ✨ Improvement to a component Needs PR This issue is accepted, sufficiently specified and now needs an implementation performance
Milestone

Comments

@correctmost
Copy link
Contributor

correctmost commented Dec 18, 2023

Bug description

I run pylint on a repo that is mounted via SSHFS, which leads to slow I/O speeds.

While profiling a run, I noticed that the import-error checks perform a lot of repeated stat calls because they check for the presence of various .py, .pyc, .so, .cpython-311-x86_64-linux-gnu.so, etc. files.

Many of these presence checks are repeated, so I'm wondering if it would be possible to improve performance by eliminating repeated checks or caching the results of previous calls.

I have prepared a repo that illustrates the issue. (The example repo contains ~60 files, whereas the repo I noticed the performance issue with contains ~2000 files.)

I noticed that pylint's performance can be improved by adding "missing" __init__.py files to the repo, but I'm hoping pylint itself can be tuned to increase performance even further.

Configuration

[MAIN]
jobs=1

[MESSAGES CONTROL]
disable=all
enable=E0401

[REPORTS]
reports=no
score=no

Command used

Steps to reproduce

git clone --branch import-error-stats https://github.com/correctmost/pylint-corpus.git
cd pylint-corpus

python ./profile_pylint.py
head -n 20 profiler_stats

Analysis

Notice that one of the top results is for posix.stat:

--> 27668    0.185    0.000    0.187    0.000 {built-in method posix.stat}

posix.stat is called by isfile, which is called most often by find_module in astroid:

--> <frozen genericpath>:27(isfile) <-   15128    0.044    1.282  astroid/interpreter/_import/spec.py:129(find_module)

There is evidence of repeated stats from strace:

$ strace -e trace=%%stat python profile_pylint.py 2>&1 | sort | uniq -c | sort -nr | less

1314 newfstatat(AT_FDCWD, "pylint-corpus/src/__init__.py", {st_mode=S_IFREG|0644, st_size=0, ...}, 0) = 0
 904 newfstatat(AT_FDCWD, "pylint-corpus/src/resources/__init__.pyc", 0x7ffd4b370690, 0) = -1 ENOENT (No such file or directory)
 904 newfstatat(AT_FDCWD, "pylint-corpus/src/resources/__init__.py", 0x7ffd4b370690, 0) = -1 ENOENT (No such file or directory)
 811 newfstatat(AT_FDCWD, "pylint-corpus/src/resources", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
 710 newfstatat(AT_FDCWD, "pylint-corpus", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
 553 newfstatat(AT_FDCWD, "pylint-corpus/src/sites/hierarchy/cat1/subcat1", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
 552 newfstatat(AT_FDCWD, "pylint-corpus/src/__init__.so", 0x7ffd4b370690, 0) = -1 ENOENT (No such file or directory)
 552 newfstatat(AT_FDCWD, "pylint-corpus/src/__init__.cpython-311-x86_64-linux-gnu.so", 0x7ffd4b370690, 0) = -1 ENOENT (No such file or directory)
 552 newfstatat(AT_FDCWD, "pylint-corpus/src/__init__.abi3.so", 0x7ffd4b370690, 0) = -1 ENOENT (No such file or directory)
 550 newfstatat(AT_FDCWD, "pylint-corpus/src/sites/hierarchy/cat1/subcat1/src.so", 0x7ffd4b370690, 0) = -1 ENOENT (No such file or directory)
 550 newfstatat(AT_FDCWD, "pylint-corpus/src/sites/hierarchy/cat1/subcat1/src.pyc", 0x7ffd4b370690, 0) = -1 ENOENT (No such file or directory)
 550 newfstatat(AT_FDCWD, "pylint-corpus/src/sites/hierarchy/cat1/subcat1/src.py", 0x7ffd4b370690, 0) = -1 ENOENT (No such file or directory)
 550 newfstatat(AT_FDCWD, "pylint-corpus/src/sites/hierarchy/cat1/subcat1/src/__init__.pyi", 0x7ffd4b370690, 0) = -1 ENOENT (No such file or directory)
 550 newfstatat(AT_FDCWD, "pylint-corpus/src/sites/hierarchy/cat1/subcat1/src/__init__.pyc", 0x7ffd4b370690, 0) = -1 ENOENT (No such file or directory)
 550 newfstatat(AT_FDCWD, "pylint-corpus/src/sites/hierarchy/cat1/subcat1/src/__init__.py", 0x7ffd4b370690, 0) = -1 ENOENT (No such file or directory)
 550 newfstatat(AT_FDCWD, "pylint-corpus/src/sites/hierarchy/cat1/subcat1/src.cpython-311-x86_64-linux-gnu.so", 0x7ffd4b370690, 0) = -1 ENOENT (No such file or directory)
 550 newfstatat(AT_FDCWD, "pylint-corpus/src/sites/hierarchy/cat1/subcat1/src.abi3.so", 0x7ffd4b370690, 0) = -1 ENOENT (No such file or directory)

Pylint output

There is no output, just reduced performance

Expected behavior

Improved performance via caching or reduced file-presence checks

Pylint version

pylint 3.0.3
astroid 3.0.2
Python 3.11.6 (main, Nov 14 2023, 09:36:21) [GCC 13.2.1 20230801]

OS / Environment

Arch Linux

Additional dependencies

No response

@correctmost correctmost added the Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling label Dec 18, 2023
@Pierre-Sassoulas Pierre-Sassoulas added Enhancement ✨ Improvement to a component performance Needs PR This issue is accepted, sufficiently specified and now needs an implementation and removed Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling labels Dec 18, 2023
@Pierre-Sassoulas
Copy link
Member

Thank you for analyzing the problem and opening an issue.

@crazybolillo
Copy link
Contributor

crazybolillo commented Mar 29, 2024

Ran the commands as specified, seems like I was able to reproduce it. Here is an excerpt:

135887165 function calls (125297436 primitive calls) in 416.767 seconds
Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 42330150   54.812    0.000   54.916    0.000 {built-in method builtins.isinstance}
    46521   36.663    0.001   69.141    0.001 pylint-corpus/venv/lib/python3.11/site-packages/astroid/interpreter/_import/spec.py:336(_get_zipimporters)
1471365/1303788   24.992    0.000   89.879    0.000 pylint-corpus/venv/lib/python3.11/site-packages/astroid/transforms.py:59(_transform)
  2170870   17.717    0.000   39.384    0.000 /usr/lib/python3.11/tokenize.py:433(_tokenize)
2694089/2794   15.479    0.000  126.435    0.045 pylint-corpus/venv/lib/python3.11/site-packages/astroid/transforms.py:106(_visit_generic)
1471365/1385   11.403    0.000  126.860    0.092 pylint-corpus/venv/lib/python3.11/site-packages/astroid/transforms.py:78(_visit)
943457/1151    9.901    0.000  193.783    0.168 pylint-corpus/venv/lib/python3.11/site-packages/pylint/utils/ast_walker.py:72(walk)
  7190960    9.836    0.000    9.836    0.000 pylint-corpus/venv/lib/python3.11/site-packages/astroid/brain/brain_numpy_utils.py:64(name_looks_like_numpy_member)
  1992340    8.433    0.000   13.762    0.000 pylint-corpus/venv/lib/python3.11/site-packages/astroid/brain/brain_builtin_inference.py:174(_builtin_filter_predicate)
  2549321    7.740    0.000    7.740    0.000 {method 'match' of 're.Pattern' objects}
  1013403    7.578    0.000   15.724    0.000 <frozen posixpath>:71(join)
     2357    6.710    0.003    6.711    0.003 {built-in method builtins.compile}
  4427955    5.813    0.000    5.813    0.000 {method 'get' of 'dict' objects}
  2138556    5.791    0.000    8.817    0.000 <string>:1(<lambda>)
1530664/28481    5.540    0.000   40.758    0.001 pylint-corpus/venv/lib/python3.11/site-packages/astroid/rebuilder.py:472(visit)
552142/197    5.164    0.000   13.760    0.070 pylint-corpus/venv/lib/python3.11/site-packages/pylint/utils/file_state.py:56(_set_state_on_block_lines)
485780/274829    4.348    0.000   13.787    0.000 /usr/lib/python3.11/functools.py:981(__get__)
  1000703    4.048    0.000    4.094    0.000 {built-in method posix.stat}
  2934432    4.009    0.000    4.016    0.000 pylint-corpus/venv/lib/python3.11/site-packages/astroid/brain/brain_numpy_utils.py:72(attribute_looks_like_numpy_member)
   123173    3.937    0.000   40.614    0.000 pylint-corpus/venv/lib/python3.11/site-packages/astroid/interpreter/_import/spec.py:128(find_module)

Can't promise anything but will take a look into it. Besides comparing results against this test repository itself, I wonder if there are a set of standard repos to use for performance analysis? Was thinking of double checking any changes against some repos used in the primer 🤔

crazybolillo added a commit to crazybolillo/astroid that referenced this issue Apr 4, 2024
Certain checkers upstream on pylint like import-error heavily use
find_spec. This method is IO intensive as it looks for files
across several search paths to return a ModuleSpec.

Since imports across files may repeat themselves it makes sense to cache
this method in order to speed up the linting process.

Local testing shows that caching reduces the total amount of calls to
find_module methods (used by find_spec) by about 50%. Linting the test
repository in the related issue goes from 439 to 351 seconds.

Closes pylint-dev/pylint#9310.
crazybolillo added a commit to crazybolillo/astroid that referenced this issue Apr 5, 2024
Certain checkers upstream on pylint like import-error heavily use
find_spec. This method is IO intensive as it looks for files
across several search paths to return a ModuleSpec.

Since imports across files may repeat themselves it makes sense to cache
this method in order to speed up the linting process.

Local testing shows that caching reduces the total amount of calls to
find_module methods (used by find_spec) by about 50%. Linting the test
repository in the related issue goes from 439 to 351 seconds.

Closes pylint-dev/pylint#9310.
crazybolillo added a commit to crazybolillo/astroid that referenced this issue Apr 5, 2024
Certain checkers upstream on pylint like import-error heavily use
find_spec. This method is IO intensive as it looks for files
across several search paths to return a ModuleSpec.

Since imports across files may repeat themselves it makes sense to cache
this method in order to speed up the linting process.

Local testing shows that caching reduces the total amount of calls to
find_module methods (used by find_spec) by about 50%. Linting the test
repository in the related issue goes from 40 seconds to 37 seconds. This
was on a NVME disk and after warmup, so timing gains may be bigger on
slower file systems like the one mentioned in the referenced issue.

Closes pylint-dev/pylint#9310.
crazybolillo added a commit to crazybolillo/astroid that referenced this issue Apr 9, 2024
Certain checkers upstream on pylint like import-error heavily use
find_spec. This method is IO intensive as it looks for files
across several search paths to return a ModuleSpec.

Since imports across files may repeat themselves it makes sense to cache
this method in order to speed up the linting process.

Local testing shows that caching reduces the total amount of calls to
find_module methods (used by find_spec) by about 50%. Linting the test
repository in the related issue goes from 40 seconds to 37 seconds. This
was on a NVME disk and after warmup, so timing gains may be bigger on
slower file systems like the one mentioned in the referenced issue.

Closes pylint-dev/pylint#9310.
@Pierre-Sassoulas Pierre-Sassoulas added this to the 3.2.0 milestone Apr 16, 2024
crazybolillo added a commit to crazybolillo/astroid that referenced this issue Apr 22, 2024
Certain checkers upstream on pylint like import-error heavily use
find_spec. This method is IO intensive as it looks for files
across several search paths to return a ModuleSpec.

Since imports across files may repeat themselves it makes sense to cache
this method in order to speed up the linting process.

Local testing shows that caching reduces the total amount of calls to
find_module methods (used by find_spec) by about 50%. Linting the test
repository in the related issue goes from 58 seconds to 52 seconds. This
was on a NVME disk and after warmup, so timing gains may be bigger on
slower file systems like the one mentioned in the referenced issue.

Closes pylint-dev/pylint#9310.
crazybolillo added a commit to crazybolillo/astroid that referenced this issue Apr 29, 2024
Certain checkers upstream on pylint like import-error heavily use
find_spec. This method is IO intensive as it looks for files
across several search paths to return a ModuleSpec.

Since imports across files may repeat themselves it makes sense to cache
this method in order to speed up the linting process.

Local testing shows that caching reduces the total amount of calls to
find_module methods (used by find_spec) by about 50%. Linting the test
repository in the related issue goes from 58 seconds to 52 seconds. This
was on a NVME disk and after warmup, so timing gains may be bigger on
slower file systems like the one mentioned in the referenced issue.

Closes pylint-dev/pylint#9310.
crazybolillo added a commit to crazybolillo/pylint that referenced this issue May 4, 2024
If possible it is desirable to look for modules with no context file as
it results in no search paths being given to astroid's find_spec(). This
makes call to it more uniform and opens up the possibility of effective
caching.

Related to pylint-dev#9310.
crazybolillo added a commit to crazybolillo/pylint that referenced this issue May 4, 2024
If possible it is desirable to look for modules with no context file as
it results in no search paths being given to astroid's find_spec(). This
makes call to it more uniform and opens up the possibility of effective
caching.

Refs pylint-dev#9310.
crazybolillo added a commit to crazybolillo/pylint that referenced this issue May 4, 2024
If possible it is desirable to look for modules with no context file as
it results in no search paths being given to astroid's find_spec(). This
makes calls to it more uniform and opens up the possibility of effective
caching.

Refs pylint-dev#9310.
crazybolillo added a commit to crazybolillo/astroid that referenced this issue May 4, 2024
Certain checkers upstream on pylint like import-error heavily use
find_spec. This method is IO intensive as it looks for files
across several search paths to return a ModuleSpec.

Since imports across files may repeat themselves it makes sense to cache
this method in order to speed up the linting process.

Closes pylint-dev/pylint#9310.
crazybolillo added a commit to crazybolillo/astroid that referenced this issue May 4, 2024
Certain checkers upstream on pylint like import-error heavily use
find_spec. This method is IO intensive as it looks for files
across several search paths to return a ModuleSpec.

Since imports across files may repeat themselves it makes sense to cache
this method in order to speed up the linting process.

Closes pylint-dev/pylint#9310.
crazybolillo added a commit to crazybolillo/pylint that referenced this issue May 4, 2024
jacobtylerwalls pushed a commit that referenced this issue May 4, 2024
* Avoid search paths for ImportChecker when possible

If possible it is desirable to look for modules with no context file as
it results in no search paths being given to astroid's find_spec(). This
makes calls to it more uniform and opens up the possibility of effective
caching.

Refs #9310.
crazybolillo added a commit to crazybolillo/astroid that referenced this issue May 4, 2024
Certain checkers upstream on pylint like import-error heavily use
find_spec. This method is IO intensive as it looks for files
across several search paths to return a ModuleSpec.

Since imports across files may repeat themselves it makes sense to cache
this method in order to speed up the linting process.

Closes pylint-dev/pylint#9310.
jacobtylerwalls pushed a commit to pylint-dev/astroid that referenced this issue May 4, 2024
Certain checkers upstream on pylint like import-error heavily use
find_spec. This method is IO intensive as it looks for files
across several search paths to return a ModuleSpec.

Since imports across files may repeat themselves it makes sense to cache
this method in order to speed up the linting process.

Closes pylint-dev/pylint#9310.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ✨ Improvement to a component Needs PR This issue is accepted, sufficiently specified and now needs an implementation performance
Projects
None yet
3 participants