Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul file-system operations to use FileSystemCache #4623

Merged
merged 19 commits into from
Mar 14, 2018
Merged

Conversation

msullivan
Copy link
Collaborator

@msullivan msullivan commented Feb 23, 2018

This overhauls our file-system accesses to always go through a FileSystemCache when stat()ing or reading source files. (We don't use it for reading cache json files, since we seem to only do a single read on each anyways and so caching doesn't help us any.) This eliminates a number of potential bugs having to do with seeing inconsistent file-system state and reduces the number of file reads+decodes in fine-grained incremental mode.

This also allows us to eliminate a number of caches maintained by find_module that duplicated functionality of FileSystemCache. Since I was reluctant to make the FileSystemCache global, I took the opportunity to make find_module's caches non-global as well.

@msullivan
Copy link
Collaborator Author

Argh: on windows python generates different error messages in the FileNotFoundError for open and os.stat, and this changes things to do os.stat first.

@msullivan msullivan changed the base branch from load_graph_opt to master March 1, 2018 19:03
They'll still merge conflict, but the resolution will be trivial now
(I'm trying avoid making one depend on the other)
Copy link
Collaborator

@JukkaL JukkaL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, this should improve performance and this will make it easier to reason about correctness when there are concurrent file system updates. Also thanks for refactoring things a little while you were are it!

I have just a few minor things (also you need to fix a merge conflict). Finally, before merging, I'd like a few more steps:

  1. Measure the performance impact to a trivial fine-grained incremental update (preferably on macOS, since FS operations tend to be slow on macs).
  2. Manually verify that file system changes are still correctly picked up in some common scenarios (both with daemon and without).

@@ -499,13 +505,14 @@ def get_module_to_path_map(manager: BuildManager) -> Dict[str, str]:
for module, node in manager.modules.items()}


def get_sources(modules: Dict[str, str],
def get_sources(fscache: FileSystemCache,
modules: Dict[str, str],
changed_modules: List[Tuple[str, str]]) -> List[BuildSource]:
# TODO: Race condition when reading from the file system; we should only read each
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this TODO comment be removed now?

mypy/build.py Outdated
@@ -544,6 +548,23 @@ def find_config_file_line_number(path: str, section: str, setting_name: str) ->
return -1


class FindModuleCache:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add docstring. Mention that this never flushes the FileSystemCache and caller is responsible for that.

mypy/build.py Outdated


def find_module(id: str, lib_path_arg: Iterable[str]) -> Optional[str]:
def find_module(cache: FindModuleCache, id: str, lib_path_arg: Iterable[str]) -> Optional[str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a method of FindModuleCache? If not, document in FindModuleCache that this function populates the data structures and is tightly coupled with the class.

mypy/build.py Outdated
dirs = []
for pathitem in lib_path:
# e.g., '/usr/lib/python3.4/foo/bar'
isdir = find_module_isdir_cache.get((pathitem, dir_chain))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this block of code removed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's maintaining an isdir cache, which is handled by the fscache stat cache.

@msullivan
Copy link
Collaborator Author

OK, all of the other changes I was making that conflicted with this are in now, so I've updated it all and made the requested cleanups.

Doesn't affect the performance of a trivial fine-grained incremental update at all.

Manual testing seems to pick up changes fine.

@gvanrossum
Copy link
Member

@ethanhs Note that this will most likely affect your PEP 561 PR (#4693) severely.

@emmatyping
Copy link
Collaborator

emmatyping commented Mar 11, 2018

@gvanrossum, indeed. I was originally hoping to get my old PR merged before this one was opened, but that of course didn't happen. The good news is my implementation in #4693 was in a single commit, so I should be able to revert that, rebase on master and re-apply the changes (with slight modification) without too much headache. Thanks for the heads up!

PS: I originally interpreted your comment to mean that this is getting merged before my PR, is that the case?

@emmatyping emmatyping mentioned this pull request Mar 11, 2018
9 tasks
Copy link
Collaborator

@JukkaL JukkaL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few additional remaining non-cached os operations in mypy.build (also see comments):

  • getmtime (os.path.getmtime)
  • remove_cwd_prefix_from_path (os.path.isfile)

Should these be updated to use the cache as well?

mypy/build.py Outdated

def list_dir(path: str) -> Optional[List[str]]:
"""Return a cached directory listing.
Module locations and some intermediate results are cached internally
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit: Indents seem off unless github diff view is acting up. Just to be extra clear, indents should be like this:

class Foo:
    """First line.

    Second line (4 space indent).
    """

mypy/build.py Outdated
# use hits to avoid adding it a second time when we see x.pyi.
# This also avoids both x.py and x.pyi when x/ was seen first.
hits = set() # type: Set[str]
for item in sorted(os.listdir(os.path.dirname(module_path))):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use FileSystemCache here and below?

manager = result.manager
graph = result.graph
self.fine_grained_manager = FineGrainedBuildManager(manager, graph)
self.fine_grained_manager = mypy.server.update.FineGrainedBuildManager(result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the imports still have from mypy.server.update import FineGrainedBuildManager so why the module prefix?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improperly resolved a merge conflict. When this patch first went up it used a prefix, but I changed it to not in another patch and must have undone it while merging.

mypy/main.py Outdated
@@ -527,7 +527,7 @@ def add_invertible_flag(flag: str,
.format(special_opts.package))
options.build_type = BuildType.MODULE
lib_path = [os.getcwd()] + build.mypy_path()
targets = build.find_modules_recursive(special_opts.package, lib_path)
targets = build.FindModuleCache().find_modules_recursive(special_opts.package, lib_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of sad that this creates a separate instance of all the caches, and introduces an opportunity for a race condition. Maybe add a TODO (since it was a pre-existing condition)?

mypy/fscache.py Outdated


class FileSystemCache:
def __init__(self, pyversion: Tuple[int, int]) -> None:
def __init__(self, pyversion: Optional[Tuple[int, int]] = None) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there shouldn't be two classes here. A base class that has no python_version and no read_with_python_encoding() method, and a subclass that has both. This would avoid the Optional and the assert (I had to go on a hunt for all instantiations of this class to get comfortable with your current version, and it seems all but one call site would be happy with the proposed base class.)

mypy/build.py Outdated
lib_path = tuple(lib_path_arg)
fscache = self.fscache

def find() -> Optional[str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're refactoring, I wonder if this should be made a regular method (with a longer name) and passed in whatever it needs from find_module(). The nested function always bothered me, and a nested function that references self bothers me more.

mypy/build.py Outdated


def verify_module(id: str, path: str) -> bool:
def verify_module(fscache: FileSystemCache, id: str, path: str) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could become a method on FileSystemCache?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not add more knowledge about python-specific things to FileSystemCache

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

mypy/build.py Outdated
@@ -1399,14 +1325,6 @@ def delete_cache(id: str, path: str, manager: BuildManager) -> None:
d. from P import M; checks filesystem whether module P.M exists in
filesystem.

e. Race conditions, where somebody modifies a file while we're
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe keep the bullet with the first sentence and replace the rest with "Solved by FileSystemClass"?

except IOError as ioerr:
# ioerr.strerror differs for os.stat failures between Windows and
# other systems, but os.strerror(ioerr.errno) does not, so we use that.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe explain that we want the error to be platform-independent so tests can have predictable output?

mypy/fscache.py Outdated
return stat.S_ISREG(st.st_mode)
try:
st = self.stat(path)
return stat.S_ISREG(st.st_mode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly move this out of the try/except -- this can't raise OSError.


On case-insensitive filesystems (like Mac or Windows) this returns
False if the case of the path's last component does not exactly
match the case found in the filesystem.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this means that importing wrongly-cased packages will not be flagged, since we only compare the filename (which would be __init__.py for a package). Not actionable except perhaps add a TODO comment; this was a pre-existing problem. (Python itself gets it right so that's a fallback.)

@msullivan
Copy link
Collaborator Author

I updated remove_cwd_prefix_from_path to use the cache, but getmtime is only used for cache json files, which we don't use the fscache for.

@msullivan msullivan merged commit 5ac30ea into master Mar 14, 2018
@msullivan msullivan deleted the use_fscache branch March 14, 2018 00:41
@gvanrossum
Copy link
Member

Yay!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants