Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long lines of spaces cause exponential runtime #4062

Closed
jkrinke opened this issue Feb 3, 2021 · 6 comments
Closed

Long lines of spaces cause exponential runtime #4062

jkrinke opened this issue Feb 3, 2021 · 6 comments
Labels
Enhancement ✨ Improvement to a component performance

Comments

@jkrinke
Copy link

jkrinke commented Feb 3, 2021

Any line that consists of spaces will take exponential runtime over the length of the line. Extremely long lines will take forever to process.

Steps to reproduce

  1. perl -e 'print " " x 10000, "\n"' > y.py; time python -m pylint y.py

By varying the length one can observe that the runtime doubles every 100 spaces or so.

Current behavior

pylint's runtime is larger than 24h

Expected behavior

pylint's runtime should be sub-second.

pylint --version output

pylint 2.6.0
astroid 2.4.2
Python 3.9.1 (default, Jan 2 2021, 18:55:28)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]

@Pierre-Sassoulas Pierre-Sassoulas added Enhancement ✨ Improvement to a component performance labels Feb 6, 2021
@PCManticore
Copy link
Contributor

Thanks for the report. There seems to be an issue in the formatting checkers.

@jkrinke
Copy link
Author

jkrinke commented Feb 9, 2021

Similar to issue #4041 I would not consider this enhancement/topic-performance.

In a CI/CD setting such long runtimes are equivalent to non-termination.

@matusvalo
Copy link
Collaborator

I did quick profile and it seems that majority of time is spent in tokenize module.

(astroid37) matus@matus-debian:~/dev/pylint$ python -m cProfile -s time /home/matus/pyenvironments/astroid37/bin/pylint y.py 
         733801 function calls (721966 primitive calls) in 2.567 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    1.728    0.432    1.731    0.433 tokenize.py:487(_tokenize)
(astroid37) matus@matus-debian:~/dev/pylint$ python -m cProfile -s time /home/matus/pyenvironments/astroid37/bin/pylint y.py | head
         733801 function calls (721966 primitive calls) in 19.159 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4   15.726    3.932   15.762    3.941 tokenize.py:487(_tokenize)
      610    1.561    0.003    1.561    0.003 {method 'match' of 're.Pattern' objects}
        8    0.478    0.060    0.478    0.060 {built-in method typed_ast._ast3._parse}
        6    0.348    0.058    0.348    0.058 {method 'readline' of '_io.BufferedReader' objects}
        1    0.175    0.175    0.211    0.211 misc.py:128(process_module)
       # output omitted
(astroid37) matus@matus-debian:~/dev/pylint$ python -m cProfile -s time /home/matus/pyenvironments/astroid37/bin/pylint y.py
         733801 function calls (721966 primitive calls) in 19.159 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4   15.726    3.932   15.762    3.941 tokenize.py:487(_tokenize)
      610    1.561    0.003    1.561    0.003 {method 'match' of 're.Pattern' objects}
        8    0.478    0.060    0.478    0.060 {built-in method typed_ast._ast3._parse}
        6    0.348    0.058    0.348    0.058 {method 'readline' of '_io.BufferedReader' objects}
        1    0.175    0.175    0.211    0.211 misc.py:128(process_module)
       # output omitted

@matusvalo
Copy link
Collaborator

matusvalo commented Feb 21, 2021

I have doublechecked the tokenizer and it seems that there is root cause of the issue:

(astroid37) matus@matus-debian:~/dev/pylint$ perl -e 'print " " x 10000000, "\n"' > y.py
(astroid37) matus@matus-debian:~/dev/pylint$ time python -m tokenize y.py 
0,0-0,0:            ENCODING       'utf-8'        
1,10000000-1,10000001:NL             '\n'           
2,0-2,0:            ENDMARKER      ''             

real	0m1.659s
user	0m1.639s
sys	0m0.020s
(astroid37) matus@matus-debian:~/dev/pylint$ perl -e 'print " " x 100000000, "\n"' > y.py
(astroid37) matus@matus-debian:~/dev/pylint$ time python -m tokenize y.py 
0,0-0,0:            ENCODING       'utf-8'        
1,100000000-1,100000001:NL             '\n'           
2,0-2,0:            ENDMARKER      ''             

real	0m16.184s
user	0m16.038s
sys	0m0.141s
(astroid37) matus@matus-debian:~/dev/pylint$ 

btw. there were optimizations done lately in tokenize.py - maybe can be checked against python master:
python/cpython#24311

@DanielNoord
Copy link
Collaborator

I have checked this both on 3.10 and 3.8 but I can't seem to replicate times of 24h+ on either version with the latest main. In fact, I'm getting times of around 1.5 seconds for a line with 11 million+ spaces.

Looking at my own profiler output and that reported by @matusvalo I see similar results although my machine is clearly faster.
Considering we're only calling tokenize 4 times I don't think we're necessarily doing anything wrong here..

@jkrinke I'm not sure if you're still interested in this issue but could you report back how long your example takes on the latest main or 2.12.2?

Otherwise I think we could close this issue as it might have been solved with improvements to the tokenize module itself.


Profiler output:

         951656 function calls (940963 primitive calls) in 3.528 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    1.648    0.412    1.650    0.413 tokenize.py:429(_tokenize)
      128    0.225    0.002    0.225    0.002 {built-in method builtins.compile}
      482    0.218    0.000    0.218    0.000 {method 'match' of 're.Pattern' objects}
      381    0.181    0.000    0.181    0.000 {method 'read' of '_io.BufferedReader' objects}
       23    0.145    0.006    0.146    0.006 {built-in method _imp.create_dynamic}
     1685    0.086    0.000    0.086    0.000 {built-in method posix.stat}
      379    0.070    0.000    0.070    0.000 {built-in method io.open_code}

@Pierre-Sassoulas
Copy link
Member

I'm going to close as completed, 1.5 seconds for a line with 11 million+ spaces. seems good enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ✨ Improvement to a component performance
Projects
None yet
Development

No branches or pull requests

5 participants