FindModuleCache: leverage BuildSourceSet #9478
# another 'bar' module, because it's a waste of time and even in the
# unlikely event that we did find one that matched, it probably would
# be completely unrelated and undesirable
return ModuleNotFoundReason.NOT_FOUND
This is probably the most controversial behavior change but, unfortunately, also the key ingredient in achieving any measurable speedup, because most calls to find_module will fail, e.g. from typing import Dict triggers a call to find_module('typing.Dict')
...
04bc7f6
to
ff60027
Compare
Interesting. I still have to get comfortable with this, but I tried to understand what you're doing and review the code and comments for style and clarity.
I wonder if there's a different way to speed things up, e.g. change fscache.py to use presence or absence in the listdir cache for a directory to decide whether a file exists, rather than call stat() (maybe only call stat() if it does exist).
Also, this is pretty subtle code, and I think you discovered this while you were working on this! Even though this probably gets exercised quite a bit by other unit tests, maybe you should add some tests that specifically try to at least test that all the code paths (both successes and failures) work? (If a bug were to be introduced here, it would be pretty nasty to track it down based on failures in other tests.)
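The listdir-based idea suggested above could look roughly like the sketch below. It is only an illustration: ListdirCache is a hypothetical helper invented here, not mypy's fscache API, and it only shows how a cached directory listing can answer "does not exist" without a stat() call.

```python
import os
from typing import Dict, Set


class ListdirCache:
    """Hypothetical helper (not mypy's fscache): answers isfile() queries,
    skipping stat() when the name is absent from a cached directory listing."""

    def __init__(self) -> None:
        # Maps a directory path to the set of entry names found in it.
        self._entries: Dict[str, Set[str]] = {}

    def _listdir(self, directory: str) -> Set[str]:
        if directory not in self._entries:
            try:
                names = set(os.listdir(directory))
            except OSError:
                # Missing or unreadable directory: treat it as empty.
                names = set()
            self._entries[directory] = names
        return self._entries[directory]

    def isfile(self, path: str) -> bool:
        directory, name = os.path.split(path)
        if name not in self._listdir(directory or '.'):
            # Absent from the listing, so no stat() is needed.
            return False
        # Present in the listing: stat() only now, to rule out directories.
        return os.path.isfile(path)
```

Since most find_module calls fail, the cheap negative answer is the common case in this sketch. It does not, however, reduce the number of search paths that are visited.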
mypy/modulefinder.py
Outdated
# fast path for any modules in the current source set
# this is particularly important when there are a large number of search
# paths which share the first (few) component(s) due to the use of namespace
# packages, for instance
Punctuation.
-# fast path for any modules in the current source set
-# this is particularly important when there are a large number of search
-# paths which share the first (few) component(s) due to the use of namespace
-# packages, for instance
+# Fast path for any modules in the current source set.
+# This is particularly important when there are a large number of search
+# paths which share the first (few) component(s) due to the use of namespace
+# packages, for instance:
+#
(The other paragraphs below are also missing periods.)
mypy/modulefinder.py
Outdated
@@ -92,6 +93,33 @@ def __repr__(self) -> str:
self.base_dir)


class BuildSourceSet:
    """Efficiently test a file's membership in the set of build sources."""
This docstring has always bothered me.
-"""Efficiently test a file's membership in the set of build sources."""
+"""Helper to efficiently test membership in a set of build sources."""
mypy/modulefinder.py
Outdated
# __init__.py
# baz/
#
# mypy gets [foo/company/foo, foo/company/bar, foo/company/baz, ...] as input
Also, is this supposed to say bar/company/bar, baz/company/baz?
Yes, good catch.
mypy/modulefinder.py
Outdated
# 2. foo.bar.py[i] is in the source set
# 3. foo.bar is not a directory
Got your slashes and dots mixed up?
-# 2. foo.bar.py[i] is in the source set
-# 3. foo.bar is not a directory
+# 2. foo/bar.py[i] is in the source set
+# 3. foo/bar is not a directory
Or perhaps "3. foo.bar is not a package"?
Also, I find it hard to match these three conditions with the three tests actually being made. Point (1) is not really a condition, it just sets the context (id == 'foo.bar.baz', id[:idx] == 'foo.bar', and parent is either 'foo/bar.py' or 'foo/bar/__init__.py[i]'). Point (2) is the check for __init__.py. Point (3) is the isdir() check. Together points (2) and (3) verify that foo.bar is a module but not a package. Have I got that right?
Maybe you can work the example from your GitHub comment into the source comment here to clarify things?
mypy/modulefinder.py
Outdated
# otherwise we might have false positives compared to slow path
d = os.path.dirname(p)
for i in range(id.count('.')):
    if not self.fscache.isfile(os.path.join(d, '__init__.py')):
Shouldn't this check for __init__.pyi too? That makes a package too.
Good catch!
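For illustration, the ancestor check discussed in this thread, extended to accept __init__.pyi as well, might be sketched as follows. The function name, the isfile parameter, and the loop body after the check are assumptions made for this example; only the first lines of the loop appear in the diff. The sketch also assumes path points at a plain module file, not at an __init__ file.

```python
import os
from typing import Callable

# Either file turns a directory into a regular (non-namespace) package.
INIT_NAMES = ('__init__.py', '__init__.pyi')


def all_ancestors_are_packages(
    path: str,
    module_id: str,
    isfile: Callable[[str], bool] = os.path.isfile,
) -> bool:
    """Hypothetical helper: walk up one directory per dot in module_id and
    require an __init__.py or __init__.pyi at every level."""
    d = os.path.dirname(path)
    for _ in range(module_id.count('.')):
        if not any(isfile(os.path.join(d, name)) for name in INIT_NAMES):
            return False
        d = os.path.dirname(d)
    return True
```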
mypy/modulefinder.py
Outdated
return p

idx = id.rfind('.')
if idx != - 1:
PEP8
-if idx != - 1:
+if idx != -1:
mypy/modulefinder.py
Outdated
parent = self.find_module_via_source_set(id[:idx])
if (
    parent and isinstance(parent, str)
    and not parent.endswith('__init__.py')
This feels dicey -- what if there's a module named foo__init__.py? Also, (again) why not check for __init__.pyi?
Hmm, yes, I suppose this check should include a path separator. And, yes, it should check for pyi as well.
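One way to make the check robust against names like foo__init__.py is to compare the basename instead of using endswith(). The helper below is a hypothetical sketch for illustration, not the code that ended up in the PR.

```python
import os


def is_package_init(path: str) -> bool:
    """Hypothetical helper: true only when the final path component is
    exactly __init__.py or __init__.pyi."""
    return os.path.basename(path) in ('__init__.py', '__init__.pyi')


# A module whose name merely ends in "__init__.py" is not a package marker.
print(is_package_init(os.path.join('foo', 'bar', '__init__.py')))
print(is_package_init('foo__init__.py'))
```

Comparing the basename covers both concerns raised above: it implies the path separator, and it handles the .pyi stub variant in the same test.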
mypy/modulefinder.py
Outdated
    parent and isinstance(parent, str)
    and not parent.endswith('__init__.py')
    and not self.fscache.isdir(os.path.splitext(parent)[0])
):
We don't really use this black-like formatting in mypy.
mypy/modulefinder.py
Outdated
if idx != - 1:
    parent = self.find_module_via_source_set(id[:idx])
    if (
        parent and isinstance(parent, str)
Do you need the "parent and"? I think that if parent is None it's not an instance of str, and if it's a string, it won't be empty. If you do need it, I'd move it to a separate line (to emphasize that there really are four conditions being checked) and, if you can, change it to "parent is not None".
I've considered a couple other approaches but this seemed to me the most elegant and efficient. I'm open to other ideas, but, as I mentioned in my first comment, the key ingredient to the speedup is cutting short the search when we've found a valid parent module, and I don't think filesystem cache can really address that.
It is indeed pretty subtle. It is covered quite extensively by other tests, but you're right that tracking bugs via such indirect coverage is tricky. What general form do you think would be best for tests to most directly cover this area?
ff60027
to
4cfdab1
Compare
Rebasing this on top of 0.931. The impact is even larger now. With vanilla 0.931, our codebase takes 29min to typecheck; with this patch, the time drops under 8min (both mypyc-enabled builds). I believe I have addressed all of @gvanrossum's comments. I have also taken the extra step to gate the fast path behind a command line switch to assuage concerns about it causing subtle breakages in the wild.
Gated behind a command line flag to assuage concerns about subtle issues in module lookup being introduced by this fast path.
4cfdab1
to
4ab2035
Compare
I'm sorry I haven't had time to look into this. I still have some hope that I'll find the time before the next mypy release.
Apologies for the accidental branch deletion. Re-opened as #12616 (rebased on master).
Given a large codebase with folder hierarchy of the form
with >100 toplevel folders, the time spent in load_graph is dominated by find_module because this operation is itself O(n), where n is the number of input files. Overall this ends up being O(n**2), because find_module is called for every import statement in the codebase and, the way find_module works, it will always scan through each and every one of those toplevel directories for each and every import statement of company.*
Introduce a fast path that leverages the fact that for imports within the code being typechecked, we already have a mapping of module import path to file path in BuildSourceSet.
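A minimal sketch of the idea, under stated assumptions: sources stands in for the module-to-path mapping held by BuildSourceSet, NOT_FOUND stands in for ModuleNotFoundReason.NOT_FOUND, and the function name is made up for this example. Unlike the PR, which resolves the parent recursively, the sketch looks the parent up directly in the mapping to stay short.

```python
import os
from typing import Callable, Dict, Optional, Union

NOT_FOUND = object()  # stand-in for ModuleNotFoundReason.NOT_FOUND
INIT_NAMES = ('__init__.py', '__init__.pyi')


def lookup_in_sources(
    module_id: str,
    sources: Dict[str, str],
    isdir: Callable[[str], bool] = os.path.isdir,
) -> Union[str, object, None]:
    """Return a file path, NOT_FOUND, or None (meaning: use the slow path)."""
    path: Optional[str] = sources.get(module_id)
    if path is not None:
        return path  # O(1) hit, no search path scan

    parent_id, dot, _ = module_id.rpartition('.')
    if dot:
        parent = sources.get(parent_id)  # simplification: not recursive
        if (parent is not None
                and os.path.basename(parent) not in INIT_NAMES
                and not isdir(os.path.splitext(parent)[0])):
            # The parent is a plain module in the source set, so the id
            # names an attribute such as a class, not a submodule.
            return NOT_FOUND
    return None
```

The short-circuit on a plain-module parent is what avoids scanning every toplevel directory for lookups of imported names that are not modules, which the discussion above identifies as the majority of calls.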
In a real-world codebase with ~13k files split across hundreds of packages, this brings load_graph from ~180s down to ~48s, with profiling showing that parse is now taking the vast majority of the time, as expected.