Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FindModuleCache: optionally leverage BuildSourceSet #12616

Merged
merged 2 commits into from
May 20, 2022

Conversation

huguesb
Copy link
Contributor

@huguesb huguesb commented Apr 18, 2022

Given a large codebase with folder hierarchy of the form

foo/
    company/
        __init__.py
        foo/
bar/
    company/
        __init__.py
        bar/
baz/
    company/
        __init__.py
        baz/
    ...

with >100 toplevel folders, the time spent in load_graph
is dominated by find_module because this operation is
itself O(n) where n is the number of input files, which
ends up being O(n**2) because it is called for every import
statement in the codebase and the way find_module work,
it will always scan through each and every one of those
toplevel directories for each and every import statement
of company.*

Introduce a fast path that leverages the fact that for
imports within the code being typechecked, we already
have a mapping of module import path to file path in
BuildSourceSet

In a real-world codebase with ~13k files split across
hundreds of packages, this brings load_graph from
~180s down to ~48s, with profiling showing that parse
is now taking the vast majority of the time, as expected.

@huguesb
Copy link
Contributor Author

huguesb commented Apr 18, 2022

This is a duplicate of #9478, necessary due to an accidental branch deletion.

@huguesb huguesb mentioned this pull request Apr 18, 2022
@github-actions

This comment has been minimized.

Copy link
Collaborator

@JukkaL JukkaL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why this feature cannot/should not be enabled by default? Would some tests fail? I'm sympathetic to improving mypy performance in your use case, and even if the new logic can't fully replace the default logic, it may still be reasonable to include it behind a (at least somewhat hidden) flag.

It would still be great to have at least some test coverage for this feature, even if it's not enabled by default. Otherwise it could easily regress when somebody changes the related code.

add_invertible_flag(
'--fast-module-lookup', default=False,
help="Enable fast path for finding modules within input sources",
group=code_group)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this flag doesn't have good test coverage, I'd prefer to not advertise it in the mypy --help output, since it could easily break between releases. Instead, this could be discussed in the documentation, with some caveats about being an experimental feature.

You can use help=argparse.SUPPRESS for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you switch to help=argparse.SUPPRESS?

@@ -158,6 +188,50 @@ def clear(self) -> None:
self.initial_components.clear()
self.ns_ancestors.clear()

def find_module_via_source_set(self, id: str) -> Optional[ModuleSearchResult]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add docstring. Also mention that this is not used by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@huguesb
Copy link
Contributor Author

huguesb commented Apr 20, 2022 via email

@JukkaL
Copy link
Collaborator

JukkaL commented Apr 20, 2022

In the previous PR Guido expressed some concerns about possible subtle
breakages so I figured putting this behind a switch would make it more
palatable.

I agree with this. This is something we might possibly enable by default in the future, but it seems too dangerous to do it now.

Another possible option might be to run the test suite in both modes in CI to ensure adequate coverage.

This is not a practical option, since it would slow down our CI significantly. A better option would be to have at least a handful of tests (possibly copy from existing tests and only add the flag) that would cover at least the most commonly used code paths.

@huguesb
Copy link
Contributor Author

huguesb commented Apr 20, 2022

Another possible option might be to run the test suite in both modes in CI to ensure adequate coverage.

This is not a practical option, since it would slow down our CI significantly. A better option would be to have at least a handful of tests (possibly copy from existing tests and only add the flag) that would cover at least the most commonly used code paths.

To clarify: I meant configuring CI to have a separate parallel run of the whole test suite with the flag enabled, which takes more compute but should leave overall latency unchanged. This means that local test runs would not exercise this feature unless explicitly enabled, which seems fine if the feature is not publicly exposed.

@huguesb
Copy link
Contributor Author

huguesb commented Apr 21, 2022

A better option would be to have at least a handful of tests (possibly copy from existing tests and only add the flag) that would cover at least the most commonly used code paths.
I have modified the test suite to make it easy to run a subset of test files with multiple combinations of flags without duplicating the actual test cases. I have it configured to run all module-related typechecking files with both with and without the --fast-module-lookup flag.

I also added a handful of extra tests to verify that the fast path has similar behavior in some edge cases.

Finally, I added a section in the docs to explain the purpose of the new flag, and its experimental status.

@github-actions

This comment has been minimized.

1 similar comment
@github-actions

This comment has been minimized.

@huguesb
Copy link
Contributor Author

huguesb commented Apr 22, 2022

A better option would be to have at least a handful of tests (possibly copy from existing tests and only add the flag) that would cover at least the most commonly used code paths.
I have modified the test suite to make it easy to run a subset of test files with multiple combinations of flags without duplicating the actual test cases. I have it configured to run all module-related typechecking files with both with and without the --fast-module-lookup flag.

I also added a handful of extra tests to verify that the fast path has similar behavior in some edge cases.

Finally, I added a section in the docs to explain the purpose of the new flag, and its experimental status.

@JukkaL are you comfortable shipping this feature given those adjustments?

@huguesb huguesb requested a review from JukkaL April 26, 2022 04:04
@huguesb
Copy link
Contributor Author

huguesb commented May 3, 2022

Anything more I can do to get this into master / next release?

@huguesb
Copy link
Contributor Author

huguesb commented May 10, 2022

Anything more I can do to get this into master / next release?

Ping @JukkaL ?

This is a really big deal for us. We've deployed this fix onto a custom build of 0.790 because the performance impact was so significant and having to backport the fix is a big impediment to our ability to upgrade mypy, because of the additional complication of creating mypyc-enabled builds for all supported architectures. It would be very valuable to be able to start using mainline mypy again.

@huguesb huguesb mentioned this pull request May 18, 2022
Copy link
Collaborator

@JukkaL JukkaL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few additional comments.

mypy/test/testcheck.py Outdated Show resolved Hide resolved
add_invertible_flag(
'--fast-module-lookup', default=False,
help="Enable fast path for finding modules within input sources",
group=code_group)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you switch to help=argparse.SUPPRESS?

@@ -516,6 +516,32 @@ same directory on the search path, only the stub file is used.
(However, if the files are in different directories, the one found
in the earlier directory is used.)

If a namespace package is spread across many distinct folders, for
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move the explanation of the new flag to command_line.rst? I don't think that we need to mention the option here, at least as long as we don't know how common it is to have large namespace packages that would be helped by the option. This page is more of an introduction to module lookup logic and doesn't need to cover every flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Gated behind a command line flag to assuage concerns about subtle
issues in module lookup being introduced by this fast path.
@huguesb
Copy link
Contributor Author

huguesb commented May 19, 2022

Rebased on master and addressed comments.

@github-actions
Copy link
Contributor

According to mypy_primer, this change has no effect on the checked open source code. 🤖🎉

Copy link
Collaborator

@JukkaL JukkaL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks good now, and just in time for the 0.960 release.

@JukkaL JukkaL merged commit a6166b2 into python:master May 20, 2022
JukkaL pushed a commit that referenced this pull request May 20, 2022
Given a large codebase with folder hierarchy of the form

```
foo/
    company/
        __init__.py
        foo/
bar/
    company/
        __init__.py
        bar/
baz/
    company/
        __init__.py
        baz/
    ...
```

with >100 toplevel folders, the time spent in load_graph
is dominated by find_module because this operation is
itself O(n) where n is the number of input files, which
ends up being O(n**2) because it is called for every import
statement in the codebase and the way find_module work,
it will always scan through each and every one of those
toplevel directories for each and every import statement
of company.*

Introduce a fast path that leverages the fact that for
imports within the code being typechecked, we already
have a mapping of module import path to file path in
BuildSourceSet

Gated behind a command line flag (--fast-module-lookup) to 
assuage concerns about subtle issues in module lookup being 
introduced by this fast path.
@gvanrossum
Copy link
Member

Great!

@aviau
Copy link
Contributor

aviau commented May 30, 2022

I saw this in the changelog and I decided to test. We use namespaces a lot (with the exact same layout as above) and we have a 1.8k files codebase.

We can't notice the speed difference.

(we run mypy twice -- once on the whole codebase and once on individual scripts with other flags and another cache folder)

With --fast-module-lookup:

$ time ./bin/mypy
Success: no issues found in 1807 source files
Success: no issues found in 85 source files

real	0m15.235s
user	0m13.303s
sys	0m1.836s

Without:

$ time ./bin/mypy
Success: no issues found in 1807 source files
Success: no issues found in 85 source files

real	0m14.976s
user	0m13.090s
sys	0m1.814s

The difference is really small and changes depending on the run.

However we have only 19 toplevel folders. (interestingly almost proportional to the size of our codebase compared to yours!).

@huguesb
Copy link
Contributor Author

huguesb commented May 30, 2022

I saw this in the changelog and I decided to test. We use namespaces a lot (with the exact same layout as above) and we have a 1.8k files codebase.

...

The difference is really small and changes depending on the run.

However we have only 19 toplevel folders. (interestingly almost proportional to the size of our codebase compared to yours!).

Thanks for trying it out and reporting your results!

For reference, as of today the codebase that motivated this PR has 21650 files split over 982 distinct subtrees that all share the same toplevel python package name.

The number of packages sharing a toplevel namespace is the most significant factor that will determine how much of a speedup you may experience. Another relevant factor is the use of from x import y statements where y is not a module (but a function or variable inside a module). iirc, if all your import statements target a module the impact of the fast path will be smaller. For reference, we have on the order of 150k such import statements in our codebase.

Also note that the alternate module lookup logic only takes effect for files that are part of the input sources, so depending on how you invoke mypy it may or may not trigger. We explicitly pass every toplevel folder in the input command line and in MYPYPATH and I have not done extensive testing of alternate invocation modes to see which ones actually trigger the fast path.

@huguesb huguesb deleted the pr-module-finder branch June 10, 2022 08:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants