Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FindModuleCache: leverage BuildSourceSet #9478

Closed
wants to merge 1 commit into from

Conversation

huguesb
Copy link
Contributor

@huguesb huguesb commented Sep 24, 2020

Given a large codebase with folder hierarchy of the form

foo/
    company/
        __init__.py
        foo/
bar/
    company/
        __init__.py
        bar/
baz/
    company/
        __init__.py
        baz/
    ...

with >100 toplevel folders, the time spent in load_graph
is dominated by find_module because this operation is
itself O(n) where n is the number of input files, which
ends up being O(n**2) because it is called for every import
statement in the codebase and the way find_module work,
it will always scan through each and every one of those
toplevel directories for each and every import statement
of company.*

Introduce a fast path that leverages the fact that for
imports within the code being typechecked, we already
have a mapping of module import path to file path in
BuildSourceSet

In a real-world codebase with ~13k files split across
hundreds of packages, this brings load_graph from
~180s down to ~48s, with profiling showing that parse
is now taking the vast majority of the time, as expected.

# another 'bar' module, because it's a waste of time and even in the
# unlikely event that we did find one that matched, it probably would
# be completely unrelated and undesirable
return ModuleNotFoundReason.NOT_FOUND
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably the most controversial behavior change, but unfortunately,also the key ingredient in achieving any measurable speedup, because most calls to find_module will fail, e.g. from typing import Dict triggers a call to find_module('typing.Dict')...

Copy link
Member

@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I still have to get comfortable with this, but I tried to understand what you're doing and review the code and comments for style and clarity.

I wonder if there's a different way to speed things up, e.g. change fscache.py to use presence or absence in the listdir cache for a directory to decide whether a file exists, rather than call stat() (maybe only call stat() if it does exist).

Also, this is pretty subtle code, and I think you discovered this while you were working on this! Even though this probably gets exercised quite a bit by other unit tests, maybe you should add some tests that specifically try to at least test that all the code paths (both successes and failures) work? (If a bug were to be introduced here, it would be pretty nasty to track it down based on failures in other tests.)

Comment on lines 286 to 289
# fast path for any modules in the current source set
# this is particularly important when there are a large number of search
# paths which share the first (few) component(s) due to the use of namespace
# packages, for instance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Punctuation.

Suggested change
# fast path for any modules in the current source set
# this is particularly important when there are a large number of search
# paths which share the first (few) component(s) due to the use of namespace
# packages, for instance
# Fast path for any modules in the current source set.
# This is particularly important when there are a large number of search
# paths which share the first (few) component(s) due to the use of namespace
# packages, for instance:
#

(The other paragraphs below are also missing periods.)

@@ -92,6 +93,33 @@ def __repr__(self) -> str:
self.base_dir)


class BuildSourceSet:
"""Efficiently test a file's membership in the set of build sources."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring has always bothered me.

Suggested change
"""Efficiently test a file's membership in the set of build sources."""
"""Helper to efficiently test membership in a set of build sources."""

mypy/modulefinder.py Show resolved Hide resolved
# __init__.py
# baz/
#
# mypy gets [foo/company/foo, foo/company/bar, foo/company/baz, ...] as input
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also is this supposed to say bar/company/par, baz/company/baz?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch.

Comment on lines 181 to 182
# 2. foo.bar.py[i] is in the source set
# 3. foo.bar is not a directory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got your slashes and dots mixed up?

Suggested change
# 2. foo.bar.py[i] is in the source set
# 3. foo.bar is not a directory
# 2. foo/bar.py[i] is in the source set
# 3. foo/bar is not a directory

Or perhaps "3. foo.bar is not a package"?

Also, I find it hard to match these three conditions with the three tests actually being made. Point (1) is not really a condition, it just sets the context: id == 'foo.bar.baz', id[:idx] == 'foo.bar', and parent is either 'foo/bar.py' or foo/bar/__init__.py[i]). Point (2) is the check for __init__.py. Point (3) is the isdir() check. Together points (2) and (3) verify that foo.bar is a module but not a package. Have I got that right?

Maybe you can work the example from your GitHub comment into the source comment here to clarify things?

# otherwise we might have false positives compared to slow path
d = os.path.dirname(p)
for i in range(id.count('.')):
if not self.fscache.isfile(os.path.join(d, '__init__.py')):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this check for __init__.pyi too? That makes a package too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

return p

idx = id.rfind('.')
if idx != - 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8

Suggested change
if idx != - 1:
if idx != -1:

parent = self.find_module_via_source_set(id[:idx])
if (
parent and isinstance(parent, str)
and not parent.endswith('__init__.py')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels dicey -- what if there's a module named foo__init__.py? Also, (again) why not check for __init__.pyi?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes, I suppose this check should include a path separator. And, yes, checking for pyi as well

parent and isinstance(parent, str)
and not parent.endswith('__init__.py')
and not self.fscache.isdir(os.path.splitext(parent)[0])
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really use this black-like formatting in mypy.

if idx != - 1:
parent = self.find_module_via_source_set(id[:idx])
if (
parent and isinstance(parent, str)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need the parent and? I think that if parent is None it's not an instance of str, and if it's a string, it won't be empty. If you do need it, I'd move it to a separate line (to emphasize that there really are four conditions being checked) and if you can change it to parent is not None.

@huguesb
Copy link
Contributor Author

huguesb commented Oct 18, 2020

I wonder if there's a different way to speed things up,

I've considered a couple other approaches but this seemed to me the most elegant and efficient. I'm open to other ideas, but, as I mentioned in my first comment, the key ingredient to the speedup is cutting short the search when we've found a valid parent module, and I don't think filesystem cache can really address that.

Also, this is pretty subtle code, and I think you discovered this while you were working on this! Even though this probably gets exercised quite a bit by other unit tests, maybe you should add some tests that specifically try to at least test that all the code paths (both successes and failures) work? (If a bug were to be introduced here, it would be pretty nasty to track it down based on failures in other tests.)

It is indeed pretty subtle. It is covered quite extensively by other tests, but you're right that tracking bugs via such indirect coverage is tricky. What general form do you think would be best for tests to most directly cover this area?

@huguesb
Copy link
Contributor Author

huguesb commented Jan 21, 2022

Rebasing this on top of 0.931

The impact is even larger now. With vanilla 0.931, our codebase takes 29min to typecheck, with this patch, the time drops under 8min (both mypyc-enabled build).

I believe I have addressed all of @gvanrossum 's comments. I have also taken the extra step to gate the fast path behind a command line switch to assuage concerns about it causing subtle breakages in the wild.

Gated behind a command line flag to assuage concerns about subtle
issues in module lookup being introduced by this fast path.
@gvanrossum
Copy link
Member

I'm sorry I haven't had time to look into this. I still have some hope that I'll find the time before the next mypy release.

@huguesb huguesb closed this Apr 18, 2022
@huguesb huguesb deleted the pr-module-finder branch April 18, 2022 05:47
@huguesb
Copy link
Contributor Author

huguesb commented Apr 18, 2022

Apologies for the accidental branch deletion. Re-opened as #12616 (rebased on master)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants