Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change importlib to first try to import modules using the standard mechanism #11997

Merged
merged 13 commits into from
Mar 3, 2024

Conversation

nicoddemus
Copy link
Member

@nicoddemus nicoddemus commented Feb 17, 2024

As detailed in #11475 (comment), currently with --import-mode=importlib pytest will try to import every file by using a unique module name, regardless if that module could be imported using the normal import mechanism without touching sys.path.

This has the consequence that non-test modules available in sys.path (via other mechanism, such as being installed into a virtualenv, PYTHONPATH, etc) would end up being imported as standalone modules, instead of imported with their expected module names.

To illustrate:

.env/
  lib/
    site-packages/
      anndata/
        core.py 

Given anndata is installed into the virtual environment, python -c "import anndata.core" works, but pytest with importlib mode would import that module as a standalone module named ".env.lib.site-packages.anndata.core", because importlib module was designed to import test files which are not reachable from sys.path, but now it is clear that normal modules should be imported using the standard mechanisms if possible.

Now imporlib mode will first try to import the module normally, without changing sys.path, and if that fails it falls back to importing the module as a standalone module.

This supersedes #11931.

Fix #11475
Close #11931

@nicoddemus nicoddemus changed the title 11475 importlib Change importlib to first try to import modules using the standard mechanism Feb 17, 2024
testing/test_pathlib.py Outdated Show resolved Hide resolved
@nicoddemus
Copy link
Member Author

@flying-sheep I believe the changes here are the correct approach for the importlib in general. With this PR the tests in your pytest-doctest-import-mismatch pass (including the doctest), the only change being required is to also exclude the conftest.py from being installed -- it should always be located near the test files, as the fixtures defined in conftest.py files are only available for tests in the same directory or below.

@nicoddemus nicoddemus marked this pull request as ready for review February 17, 2024 20:13
@nicoddemus
Copy link
Member Author

@flying-sheep would appreciate if you can test this PR using your original project to ensure it works as intended (if it is simple and you have the time, of course). 👍

Copy link
Contributor

@flying-sheep flying-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful! If I understand correctly, this is basically the more efficient way to do what I did in my PR by brute forcing. I have a thought regarding the changed wording for --import-mode=importlib:

doc/en/explanation/pythonpath.rst Outdated Show resolved Hide resolved
@nicoddemus
Copy link
Member Author

nicoddemus commented Feb 18, 2024

Beautiful! If I understand correctly, this is basically the more efficient way to do what I did in my PR by brute forcing.

Indeed. The previous semantics were:

  • Ignore sys.path entirely, and try to import everything by importing it directly, deriving a unique module name.

This has now changed to:

  • Try to import the path normally with the current sys.path. If that fails, import it directly deriving a unique module name.

In retrospect this is what we should have implemented since the beginning, but the thought when this was initially implemented was that it would be used to import test modules only, but we missed the case of importing normal modules too (for doctesting for example).

@flying-sheep
Copy link
Contributor

we missed the case of importing normal modules too (for doctesting for example).

An alternative way of solving this would have been to only collect doctests from roots using --pyargs, then always use sys.path for importing the doctests’ parent modules. But that ship has sailed I guess.

Thank you for spending time solving this!

@flying-sheep
Copy link
Contributor

I tested this with our package, and it seems to indeed work flawlessly!

@nicoddemus
Copy link
Member Author

An alternative way of solving this would have been to only collect doctests from roots using --pyargs, then always use sys.path for importing the doctests’ parent modules. But that ship has sailed I guess.

That would solve doctests, but I think the problem is more general and best solved in import_path directly, so any code importing a module will have the same behavior.

I tested this with our package, and it seems to indeed work flawlessly!

Thanks!

@bluetech
Copy link
Member

I haven't reviewed the changes yet, but regarding

Now imporlib mode will first try to import the module normally, without changing sys.path, and if that fails it falls back to importing the module as a standalone module.

Just to make sure, IIRC one of the reasons for adding importlib mode was to deal with the non-package-conftest problem where it's like

a/conftest.py
b/conftest.py

and then we try to load conftest module twice. Or it may be not with conftest but something similar, hopefully you get the idea.

If so, doesn't "first try to import the module normally" still run into a problem here, that a module with the same name but different path is already loaded, and we'll return it?

@nicoddemus
Copy link
Member Author

If so, doesn't "first try to import the module normally" still run into a problem here, that a module with the same name but different path is already loaded, and we'll return it?

Assuming tests is at the root of the repository and is not importable:

tests/
  a/
    conftest.py
  b/
    conftest.py

Here we will try to import tests/a/conftest.py and it will fail, so we fallback to import this module directly using importlib.import_module and derive a new name for it based on the rootdir: tests.a.conftest. Same for b/conftest.py, so they will end up with different module names and everything will work.

I will improve the tests to also cover this scenario so there is no doubt this works as intended.

@nicoddemus
Copy link
Member Author

Hmm I realize now probably the wording "standalone" is confusing; I meant "not part of a package", but I can see that one can understands this as "using the module's base name only".

Appreciate any suggestions.

Copy link
Member

@bluetech bluetech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import_module might import (= execute) modules that were not intended. I don't think that we can assume that arbitrary names (names of test files) are safe to import.

E.g. consider I'm running pytest x.py and I have an x module in my site-packages, then the x module in site-packages will get executed, this is quite unexpected and possibly dangerous. (The code as written also has a bug, where even in the case the if mod.__file__ and Path(mod.__file__) == path condition is false, the module from the site-packages will still be returned, because of the return sys.modules[module_name] just below).

The Python importlib stuff confuses me greatly but it seems to me that the existing importlib mode code basically has the right idea -- it first finds the module spec and only executes it afterwards. Perhaps we can just tweak/extend the existing sys.meta_path search code to work, and keep something like the mod.__file__ and Path(mod.__file__) == path check, but with the module spec, not the actual module?

@nicoddemus
Copy link
Member Author

nicoddemus commented Feb 18, 2024

E.g. consider I'm running pytest x.py and I have an x module in my site-packages, then the x module in site-packages will get executed, this is quite unexpected and possibly dangerous

Ouch, you are definitely right. Will have to think about this some more then.

@nicoddemus
Copy link
Member Author

nicoddemus commented Feb 18, 2024

(The code as written also has a bug, where even in the case the if mod.file and Path(mod.file) == path condition is false, the module from the site-packages will still be returned, because of the return sys.modules[module_name] just below).

Note that the module_name will be different, not the one we tried originally before the check.

@nicoddemus
Copy link
Member Author

nicoddemus commented Feb 18, 2024

E.g. consider I'm running pytest x.py and I have an x module in my site-packages, then the x module in site-packages will get executed, this is quite unexpected and possibly dangerous

Great catch here @bluetech, I really dropped the ball on this one. 😓

Using the specs for that first import, which crucially also uses the module location to import the file and not only the bare module name as my previous attempt, nicely solves this.

I added another test to ensure we are covering the situation you described, where pytest /path/to/x.py will only import /path/to/x.py, not some other x.py that just happens to be reachable from sys.path.

src/_pytest/pathlib.py Outdated Show resolved Hide resolved
src/_pytest/pathlib.py Outdated Show resolved Hide resolved
Copy link
Member

@bluetech bluetech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't review yet, only looked at the resolve_pkg_root_and_module_name function.

src/_pytest/pathlib.py Outdated Show resolved Hide resolved

Passing the full path to `models.py` will yield Path("src") and "app.core.models".

If consider_ns_packages is True, then we additionally check if the top-level directory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this handles the case where there is a python package (directory structure with __init__.py, like app above, and not a single file) nested inside a namespace package (like let's say mycompany). I have two questions/comments:

  1. Why only handle python packages (__init__.py) and not the case where app.core.models doesn't have __init__.py's either?

  2. Namespace packages can have multiple levels, e.g. com.mycompany.app.core.models not just mycompany.app.core.models. IIUC the current code only handles one level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yeah that's something we can revisit later, pytest never supported packages without a __init__ correctly (is there a name for it?), I recall we had several discussions with other package-related authors (pip maintainers, Jason Coombs, etc) on how to support that, but the discussion stalled. I would be happy to tackle that in a separate PR though, as this requires more tests in order to make sure the support is sound.

  2. Do they? I always thought it was a mechanism just to share the top-level, based on the use cases out there and the docs at https://packaging.python.org/en/latest/guides/packaging-namespace-packages/#native-namespace-packages. I will double check if this is really supported, and if so update the code accordingly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About 2, you are right, it does support multiple levels (TIL). 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant on adding multiple level support for namespace packages, let me explain my reasoning.

I can see situations (might be contrived I admit) where we identify a module as part of a namespace package incorrectly, just because the directory is in sys.path somehow.

For example:

temp/
  some/
    folder/
      myproject/
        src/
          com/
            company/
              myapp/
                __init__.py
                core.py

If we look upwards from core.py in order to find if it is in a namespace package, we must keep looking upwards until we find a directory in sys.path or until we reach the root of the file system and give up. Suppose for some reason temp/ is in sys.path (possibly due to a misconfiguration), this means we could end up determining that core.py is be part of the package temp.some.folder.myproject.src.com.company.myapp.core, which is incorrect, depending on the order directories appear in sys.path, if temp/ appears first: temp/ or temp/some/folder/myproject/src.

The namespace package rules are easy to follow when you try to figure out a file path starting from a module name: to see if com.company.myapp.core as it appears in the source code is valid , we need to through each sys.path entry, join it with com/company/myapp/core.py and see if the file exists: if it does, double check that there are no __init__.py files present, then import it. This is what Python needs to do and why it works unambiguously.

But the situation is not so simple when you need to derive a module name starting from a file path, because you need to search recursively upwards and see if you can build a valid module name using that full path, but even if we do build a valid module name we don't know if that was intentional or not, and we might end up with a module named temp.some.folder.myproject.src.com.company.myapp.core; in the previous case, going from module name to path, the user is telling us the module name, so there is no ambiguity.

I recall that was the central point when this matter was being discussed with other package-related authors in the past (I cannot find the thread, perhaps @RonnyPfannschmidt can recall at least which forum that took place?).

Hope I was able to explain the situation, sorry if the explanation is lacking.

Btw, I only introduced this change regarding namespace packages because 1 specific test in doctests that is meant to import from namespace packages, but AFAICT it was working with namespace packages by accident, due to how we were building the module name from the path, not due to any proper namespace package support -- so much that I changed it slighly the test to use a src layout, while keeping the namespace package configuration, and then the code in main could not find it anymore. The change implemented here made we at least recognize namespace packages with one level.

I might be overthinking this too, so would like to hear your opinions on this.

Unless my assessment is incorrect, I think a compromise would be to:

  1. Support just one level for now; up until now we did not support any level at all, so this is a small test to provide full support.
  2. Keep the status quo of "pytest does not yet support namespace packages", remove the partial support I added here, and mark the test as xfail.
  3. Implement full support in this PR.

I'm OK with 3), just wanted to present the full picture and see if this is the direction we want to go.

There might also a simpler/safer way to determine if a file path is part of a full namespace package that I'm not seeing, in which case much of my concerns here are moot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To aid in the discussion I went ahead and implemented it in 1232d19.

The implementation is simple, I'm a bit hesitant if this can have undesired side effects in the wild.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your analysis is correct, particularly the difference between going module path => filesystem path (what Python does) and the reverse (what we want to do).

I think the partial support is too confusing, so IMO we shouldn't do option 1 (I prefer 2 over 1).

I totally understand the concern with option 3. However, I think we can try it and see. We may end up with these long unintended module paths, but I don't think it would be actively dangerous (=> actually importing unintended stuff) because intermediate paths in namespace packages are not executing anything (there is no __init__.py to import).

Longer term, and this has already been brought by several people, I do think we need to add some concept of "workspace roots", which are Directory collector nodes directly below Session, and which put a strict bound on filesystem traversals (never going beyond them). (Also, initial conftests only collected from workspace roots. The confcutdir concept is gone. --pypkg args become workspace roots. All nodeids are rooted in roots, can't have <rootdir>/../bla.py or /foo/bar.py nodeids anymore). Hopefully I can put up a proper proposal sometime soon.


Suppose for some reason temp/ is in sys.path (possibly due to a misconfiguration), this means we could end up determining that core.py is be part of the package temp.some.folder.myproject.src.com.company.myapp.core, which is incorrect, depending on the order directories appear in sys.path, if temp/ appears first: temp/ or temp/some/folder/myproject/src.

Since we're iterating pkg_path.parents in order, I don't think there is ambiguity -- we'll always try com.company.myapp.core in temp/some/folder/myproject/src first, so that's the one that would be found, even if temp/ is in sys.path.


Regarding the initial implementation, in the loop for parent in pkg_path.parents:, is there some special handling needed if one of the parents has a __init__.py? Can namespace packages be nested in regular packages? Or the other way around?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're iterating pkg_path.parents in order, I don't think there is ambiguity -- we'll always try com.company.myapp.core in temp/some/folder/myproject/src first, so that's the one that would be found, even if temp/ is in sys.path.

Yes, you are right.

Regarding the initial implementation, in the loop for parent in pkg_path.parents:, is there some special handling needed if one of the parents has a init.py?

Good catch, indeed if there is an __init__.py it is no longer a namespace package. Added a specific test for this.


Went ahead with option 3 then, please take a look.

I also double checked and the pytest-doctest-import-mismatch (the reproducer for the issue) still passes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, indeed if there is an init.py it is no longer a namespace package. Added a specific test for this.

So I've just seen in the link you provided https://packaging.python.org/en/latest/guides/packaging-namespace-packages/#native-namespace-packages a regular package nested inside of a namespace package... So I assume it is possible?

doc/en/explanation/pythonpath.rst Outdated Show resolved Hide resolved
doc/en/explanation/pythonpath.rst Outdated Show resolved Hide resolved
doc/en/explanation/pythonpath.rst Outdated Show resolved Hide resolved
doc/en/explanation/pythonpath.rst Outdated Show resolved Hide resolved
testing/test_pathlib.py Show resolved Hide resolved

Passing the full path to `models.py` will yield Path("src") and "app.core.models".

If consider_ns_packages is True, then we additionally check if the top-level directory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, indeed if there is an init.py it is no longer a namespace package. Added a specific test for this.

So I've just seen in the link you provided https://packaging.python.org/en/latest/guides/packaging-namespace-packages/#native-namespace-packages a regular package nested inside of a namespace package... So I assume it is possible?

Copy link
Member

@bluetech bluetech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't review the code yet, just some light comments. Will review a bit later.

@nicoddemus
Copy link
Member Author

nicoddemus commented Feb 25, 2024

So I've just seen in the link you provided https://packaging.python.org/en/latest/guides/packaging-namespace-packages/#native-namespace-packages a regular package nested inside of a namespace package... So I assume it is possible?

(I can't seem to reply inline to your original comment)

Sub-packages can have __init__.py files, but not the directories that make up the top level of the namespace package.

Consider this:

dist1/
  com/
    company/
      app/
        core/
          __init__.py
        __init__.py
dist2/
  com/
    company/
      calc/
        algo/
          __init__.py
        __init__.py

In the example above, com.company is the namespace package (which is shared), while app.core and calc.algo are the sub-packages.

Using this simple script to test it:

import sys

sys.path.append(r"e:\Temp\src\dist1")
sys.path.append(r"e:\Temp\src\dist2")

import com.company.app.core.models
import com.company.calc.algo.algorithms
print(com.company.app.core.models)
print(com.company.calc.algo.algorithms)

This prints:

<module 'com.company.app.core.models' from 'e:\\Temp\\src\\dist1\\com\\company\\app\\core\\models.py'>
<module 'com.company.calc.algo.algorithms' from 'e:\\Temp\\src\\dist2\\com\\company\\calc\\algo\\algorithms.py'>

app.core and calc.algo are free to have __init__ files, but adding an __init__.py file to com/ or company/ will break the logic, and then com.company will no longer be considered a namespace package.

To prove this, we create dist1/com/__init__.py and run the script again:

λ py -3.12 foo.py
Traceback (most recent call last):
  File "e:\Temp\src\foo.py", line 10, in <module>
    import com.company.calc.algo.algorithms
ModuleNotFoundError: No module named 'com.company.calc'

But as I mentioned, the sub-packages can have __init__.py, but they are optional: removing all of them gets the same results as the first example. However this case (no __init__.py) is not supported under the current code.

This is why resolve_pkg_root_and_module_name will bail if it finds a __init__.py file while moving upwards in the herarchy.

Copy link
Member

@bluetech bluetech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went over the code now, it looks good to me in general! I left a few comments in addition to the previous ones.

Also for your consideration, I've split the PR into commits, I think it will help with reviewing/bisecting/reverting compared to one big commit. https://github.com/bluetech/pytest/commits/11475-importlib/

src/_pytest/pathlib.py Outdated Show resolved Hide resolved
src/_pytest/pathlib.py Outdated Show resolved Hide resolved
src/_pytest/pathlib.py Outdated Show resolved Hide resolved
src/_pytest/pathlib.py Outdated Show resolved Hide resolved
*,
mode: Union[str, ImportMode] = ImportMode.prepend,
root: Path,
consider_namespace_packages: bool,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use a default here?

I prefer to force everywhere that calls it to pass the new option explicitly, lest we miss passing the option in production code by mistake.

@nicoddemus nicoddemus requested a review from bluetech March 2, 2024 15:55
Copy link
Member

@bluetech bluetech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

testing/test_pathlib.py Outdated Show resolved Hide resolved
src/_pytest/pathlib.py Outdated Show resolved Hide resolved
The test seems wrong, and we haven't been able to figure out what it's
trying to test (the original issue is lost in time).
Ensure the implementation isn't changed to trigger such a bug.
…name`

This applies to `append` and `prepend` import modes; support for
`importlib` mode will be added in a separate change.
…chanism

As detailed in
pytest-dev#11475 (comment),
currently with `--import-mode=importlib` pytest will try to import every
file by using a unique module name, regardless if that module could be
imported using the normal import mechanism without touching `sys.path`.

This has the consequence that non-test modules available in `sys.path`
(via other mechanism, such as being installed into a virtualenv,
PYTHONPATH, etc) would end up being imported as standalone modules,
instead of imported with their expected module names.

To illustrate:

```
.env/
  lib/
    site-packages/
      anndata/
        core.py
```

Given `anndata` is installed into the virtual environment, `python -c
"import anndata.core"` works, but pytest with `importlib` mode would
import that module as a standalone module named
`".env.lib.site-packages.anndata.core"`, because importlib module was
designed to import test files which are not reachable from `sys.path`,
but now it is clear that normal modules should be imported using the
standard mechanisms if possible.

Now `imporlib` mode will first try to import the module normally,
without changing `sys.path`, and if that fails it falls back to
importing the module as a standalone module.

This also makes `importlib` respect namespace packages.

This supersedes pytest-dev#11931.

Fix pytest-dev#11475
Close pytest-dev#11931
Test with namespace packages support even when it will not find namespace packages to ensure it will at least not give weird results or crashes.
@nicoddemus nicoddemus merged commit 89ee449 into pytest-dev:main Mar 3, 2024
24 checks passed
@flying-sheep
Copy link
Contributor

flying-sheep commented Mar 4, 2024

/edit: moved to #11475 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants