Making pycodestyle faster #754

nickdrozd · 2018-04-15T23:25:03Z

I've been on a Python performance optimization kick recently (see
pylint-dev/astroid#497), and I'm a pycodestyle
user, so I figured I would give it a look and see if its performance
could be improved at all.

Of course, pycodestyle is already pretty fast, so to give it some
stress I'm testing it out by pointing it right at a large repo, namely
Zulip (https://github.com/zulip/zulip). In particular, my test command
is time ~/pycodestyle/pycodestyle.py -qq ~/zulip.

Here are the times from three runs of master:

real	2m10.664s
user	0m55.333s
sys	1m15.249s

real	2m11.376s
user	0m55.545s
sys	1m15.745s

real	2m11.588s
user	0m55.500s
sys	1m16.044s

I used the yappi profiling library to see if there were any hotspots
in the code. There were. Take a look at the graph below. The brighter
the box, the hotter the spot. In more detail, each box represents a
function and has three numbers: 1) the percentage of total CPU time
spent in that function, 2) the percentage of total CPU time spent in
that function but not its subcalls, and 3) the number of times the
function was called.

The red box that sticks out is Checker.run_check. It is called well
over two million times, and 27.7 of total CPU time is spent there,
almost all over which is in the function itself. This seems like an
awful lot considering how short the function is:

    def run_check(self, check, argument_names):
        """Run a check plugin."""
        arguments = []
        for name in argument_names:
            arguments.append(getattr(self, name))
        return check(*arguments)

So why does it suck up so much time?

I think I've worked out how it goes. When a check is registered (with
register_check), its arguments are extracted with the inspect
library and stored as a list of strings. When a check is run,
run_check iterates over its associated list of arguments,
dynamically accesses those attributes of the Checker, and then
passes those values to the check to actually run.

The problem here is that dynamic attribute access is slow, and doing
it in tight loops is really slow (see
pylint-dev/astroid#497 for a harrowing cautionary
tale on this subject). My idea was to see if there was a way to do
away with the dynamic attribute access, basically by "compiling" the
attribute access into the code.

It turns out that this can be accomplished by passing the checker
instance into the check as an argument, and then call the attributes
directly on the checker. Implementing this change involves a
large-scale shuffling of arguments and strings, but other than that
not much changes. register_check has to take the check's argument
names as arguments now, since they are no longer the actual arguments.
run_check itself can also be done away with, since all it would have
to do would be to call the check with the checker as an argument, and
that can be done inline.

This change resulted in a substantial speedup:

real	1m28.057s
user	0m40.340s
sys	0m47.669s

real	1m27.843s
user	0m39.910s
sys	0m47.901s

real	1m28.258s
user	0m40.379s
sys	0m47.849s

Here is the resulting yappi graph:

This graph is a lot more colorful than the last one. This means that
the work is spread out more evenly among the various functions and
there isn't one overwhelmingly critical hotspot.

One function that stuck out to me was Checker.init_checker_state.
After some experimentation, it appeared that despite taking up almost
6% of total CPU time, the function didn't do much. Cutting it provided
a non-negligible speed improvement:

real	1m19.463s
user	0m36.857s
sys	0m42.565s

real	1m19.837s
user	0m36.469s
sys	0m43.329s

real	1m19.521s
user	0m36.462s
sys	0m43.026s

A little further poking around revealed that run_check and
init_checker_state were the only consumers of the "argument names",
so I cut those out too. This led to some nice code simplification and
an ever-so-slight speedup:

real	1m19.686s
user	0m36.516s
sys	0m43.129s

real	1m18.466s
user	0m36.196s
sys	0m42.227s

real	1m19.063s
user	0m36.188s
sys	0m42.846s

Here is the yappi graph after these changes:

The major hotspot is now tokenize.tokenize, which is part of the
standard library. This is good, as it suggests that pycodestyle is
nearing the point of being as fast as it can be. After that, the next
most expensive functions are

check_logical,
generate_tokens,
build_tokens_line,
check_all,
maybe_check_physical, and
_is_eol_token_.

These functions all feel to me like they are doing something
inefficiently, but I don't understand them well enough to say what.

These measurements were all taken running master with

Python 3.6.5 (default, Mar 30 2018, 06:42:10)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin

`run_check` works by iterating over a list of strings, then dynamically accessing the appropriate checker attributes. Dynamic attribute access is slow, and doing it in a tight loop is really slow. Speed can be improved significantly by passing the checker in as an argument to the check and calling the attributes directly. This requires some argument shuffling.

`init_checker_state` gets called a lot, but why? Performance can be improved by doing it just once.

Checks were being stored along with their "argument names", but only `run_check` and `init_checker_state` were being using them, so they can be cut.

sigmavirus24 · 2018-04-16T14:26:38Z

While this is a fantastic improvement in performance, many tools use pycodestyle as a library with these checks as public APIs. That means this is a significant breaking change that will need to be very carefully reviewed and considered.

nickdrozd · 2018-04-17T04:12:58Z

Well, it looks like the times in that post are all bullshit. It turns
out that profiling a program with yappi causes it to run slower than
it normally would. Thus the numbers I gave before are farcically high.

Here are the real numbers:

Before

real	0m13.617s
user	0m13.401s
sys	0m0.136s

real	0m13.329s
user	0m13.266s
sys	0m0.051s

real	0m13.169s
user	0m13.106s
sys	0m0.050s

After

real	0m10.631s
user	0m10.563s
sys	0m0.054s

real	0m10.620s
user	0m10.561s
sys	0m0.048s

real	0m10.708s
user	0m10.648s
sys	0m0.049s

Whoops!

But hey, that's still 23% faster, right? I don't know, maybe
pycodestyle is plenty fast already and a little bit of performance
improvement isn't worth the breaking API changes. It's probably still
worth getting rid of init_checker_state, and that can be done
independently of getting rid of run_check.

Oh well, I still had fun spending the weekend poking around in the
pycodestyle code :)

sigmavirus24 · 2018-04-17T11:58:41Z

Could we get some of the the performance benefit by using a dictionary instead of getattr?

nickdrozd · 2018-05-03T22:51:08Z

Sorry to take so long to reply, I have been busy!

Regarding the API change, I do think that this new API is simpler than
the old one. In the new API, a new check only has to take a single
argument (I called it self in the PR, but it would probably be
clearer to call it checker), while at present a check has to take as
arguments every attribute of the checker it will access, and its first
argument must be the type of check being performed, even if this is
not used as an argument in the function. This latter requirement can
prove confusing to linters. For instance, pylint gives this warning
several times running against pycodestyle.py: Unused argument 'logical_line' (unused-argument). So apart from the speed
improvements, the API change might by itself be considered a feature
rather than a bug.

Maybe a change like this would be appropriate for a 3.0 release?

Then again, I have no idea how much outside code relies on this API.
Do you? I know autopep8 uses it (@hhatto). The changes needed to
update such code would be relatively straightforward I think. I would
be happy to help convert it if we decide that's worth doing.

nickdrozd · 2018-05-03T22:52:18Z

@sigmavirus24 I think I tried something like that, but I can't remember the details. What do you have in mind?

nickdrozd added 3 commits April 15, 2018 17:48

Cut init_checker_state

37db846

`init_checker_state` gets called a lot, but why? Performance can be improved by doing it just once.

Cut check argument names

994f71f

Checks were being stored along with their "argument names", but only `run_check` and `init_checker_state` were being using them, so they can be cut.

sigmavirus24 added undecided needs review labels Apr 16, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Making pycodestyle faster #754

Making pycodestyle faster #754

nickdrozd commented Apr 15, 2018

sigmavirus24 commented Apr 16, 2018

nickdrozd commented Apr 17, 2018

sigmavirus24 commented Apr 17, 2018

nickdrozd commented May 3, 2018

nickdrozd commented May 3, 2018

Making pycodestyle faster #754

Are you sure you want to change the base?

Making pycodestyle faster #754

Conversation

nickdrozd commented Apr 15, 2018

sigmavirus24 commented Apr 16, 2018

nickdrozd commented Apr 17, 2018

Before

After

sigmavirus24 commented Apr 17, 2018

nickdrozd commented May 3, 2018

nickdrozd commented May 3, 2018