Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sdk] Containerized Python Component module not found error #8385

Closed
connor-mccarthy opened this issue Oct 21, 2022 · 18 comments
Closed

[sdk] Containerized Python Component module not found error #8385

connor-mccarthy opened this issue Oct 21, 2022 · 18 comments
Assignees
Labels
area/sdk kind/bug lifecycle/stale The issue / pull request is stale, any activities remove this label.

Comments

@connor-mccarthy
Copy link
Member

connor-mccarthy commented Oct 21, 2022

There is a bug when building a containerized Python component that happens (at least) in the case when the longest path of the import graph ending at the component involves >2 modules.

Environment

KFP SDK 2.0.0-beta.6

Steps to reproduce

For example:

# component.py
from module_one import one
from kfp import dsl

@dsl.component
def comp(): ...
# module_one.py
from module_two import two
one = 1
# module_two.py
two = 2

Then: kfp component build .

You get a No module named error.

Expected result

Should build without an error.

Materials and Reference

Related: #8353

@Davidnet
Copy link
Member

Hi @connor-mccarthy I am trying to follow the docs and I ended up with the same error as #8353 how can I help to close this issue?

@Davidnet
Copy link
Member

Davidnet commented Jan 31, 2023

I can reproduce it with even one just module:

working directory:

├── main.py
├── poetry.lock
├── poetry.toml
├── pyproject.toml
└── utils.py

Where:

main.py

from kfp import dsl
from .utils import return_date

@dsl.component(
    base_image='python:3.7',
    target_image='gcr.io/my-project/my-component:v1',
)
def main():
    x = return_date()

utils.py

from datetime import datetime

def return_date():
    return datetime.now().strftime('%Y-%m-%d')

Results:

╰─>$ kfp component build .
Building component using KFP package path: kfp==2.0.0-beta.11
attempted relative import with no known parent package

Alternatively:

>$ kfp component build bug/
Building component using KFP package path: kfp==2.0.0-beta.11
attempted relative import with no known parent package

Adding a __init__.py which is not in the docs:

have the same results

Moving the files to a src folder and trying to replicate the docs I cannot seem to have the same results as the docs

Update:

Even with changing the utils.py as:

def return_date():
    return "today"

still get the same error.

My poetry just have the package installation kfp = {version = "2.0.0b11", extras = ["all"]}

@connor-mccarthy
Copy link
Member Author

@Davidnet, thank you for the detailed reproduction notes. This is on our backlog.

how can I help to close this issue?

If you're interested, you're welcome to address this bug and submit a PR!

@Davidnet
Copy link
Member

@connor-mccarthy Awesome, thanks for the response any ideas on where could I start looking?

@connor-mccarthy
Copy link
Member Author

Thanks, @Davidnet. Some links:

@b4sus
Copy link
Contributor

b4sus commented Mar 16, 2023

Hey guys,
I also had a look on this issue as it is quite annoying and I think I might have found the cause for No module named problem.
When running kfp command, this resolves (which kfp) to a script in (my case, depending on your python installation it will be different) mamba environment bin directory ~/mambaforge/envs/kfp-env/bin/kfp.
The file looks like this:

#!/home/b4sus/mambaforge/envs/kfp-env/bin/python3.11
# -*- coding: utf-8 -*-
import re
import sys
from kfp.cli.__main__ import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

I patched the file by adding print(sys.path) to see the path where python will be checking for modules. The thing is that current directory (where the kfp is executed from) is not there. That one is added only when executing python directly (do not quote me on this, but see here) and not through this shell approach.

Now comes the funny part, bare with me here :)
The sdk will check the provided directory in search for components. It will list python files by using Path.glob and try to load each module (here).
This may or may not yield the error. It all depends on the order of files returned by the glob method.
For example, following the files from @connor-mccarthy setup, if the glob would return files in order [module_two.py, module_one.py, component.py], the whole thing would work. Loading module_two.py would be no problem (no imports), loading module_one.py would be no problem (import of module_two would work because it was already loaded in), loading of component.py would be no problem (import of module_one would work because it was already loaded in).

In other words, python would never need to check sys.path for loading the modules.

However, if the returned order from glob would be for example [module_one.py, module_two.py, component.py], loading would fail during load of module_one.py which is importing module_two. This time python doesn't have module_two cached and needs to check sys.path and it cannot find it -> No module named error.

As a quickfix, I added following to the kfp file (~/mambaforge/envs/kfp-env/bin/kfp):

    import os
    sys.path.append(os.getcwd())

This should work when all the component python files are in the current directory where you execute kfp.

It also works when using slightly more of a structure, eg:

my_project
    app
        components
            __init__.py
            my_component.py (having absolute import app.utils.helper
        utils
            __init__.py
            helper.py

Then from my_project directory execute:

kfp component build . --component-filepattern app/components/my_comp*

@connor-mccarthy
Copy link
Member Author

Thanks for this thorough investigation, @b4sus. This makes sense based on the errors we've observed.

As a quickfix, I added following to the kfp file (~/mambaforge/envs/kfp-env/bin/kfp):

Is this quickfix a final fix? Or is there a more robust fix you have in mind? If final, are you interested in submitting a PR? I would be happy to review promptly.

@b4sus
Copy link
Contributor

b4sus commented Mar 17, 2023

Hey @connor-mccarthy,
some more considerations could be necessary, including having some recommended structure for the project. Additionally, I am no expert on python import system, so that's something as well :)

Let's assume standard (I think) python project (let's call it myproject) layout:

project_dir/
    myproject/
        __init__.py
        components/
            __init__py.py
            my_comp.py
        pipelines/
            __init__py.py
            my_pipelines.py
    pyproject.toml
    .gitignore

Now (considering my fix) you have to run kfp from project_dir like

kfp component build .

This will work, but there are caveats:

  1. you have to use absolute imports (generally recommended afaik)
  2. you have to use . as input for kfp - it will be recursive - checking subdirectories as well - desirable I'd say
  3. if somewhere you have also lightweight components it will fail (because of missing base_image) - so you have to separate those to different packages - eg myproject/components/lightweight and myproject/components/cont - and then use myproject/components/cont and then use --component-filepattern like kfp component build . --component-filepattern "myproject/components/cont/*.py"
  4. complete content of project_dir will end up in docker image
  5. input for kfp component build (the . above) is basically usesless - you cannot use anything else - otherwise Dockerfile and other files will be generated elsewhere and the docker image will not have correct structure copied into
  6. possibly something else

All this is fine for me, but needs to be considered.

@connor-mccarthy
Copy link
Member Author

connor-mccarthy commented Mar 17, 2023

@b4sus

including having some recommended structure for the project

I agree with this. I'm working on a v2 docs refresh currently and will keep this in mind.

but there are caveats

Thank you for laying out these considerations. In general, as long as there are no regressions (all existing user code that uses Containerized Python Components will still work), I'm happy to eagerly merge a better but not perfect fix. It sounds like some of these constraints may be been introduced by the proposed fix, however (such as the requirement to use the cwd . as the path). If that's the case, perhaps we ought to wait.

If we do wait, I'll take a stab at this soon and certainly leverage your investigation. Thank you again for the thorough writeup -- this helps a lot.

@Mithil467
Copy link
Contributor

Thanks for the issue breakdown @b4sus and @connor-mccarthy.

I have created a PR #9157 to essentially add the module directory to the sys.path within the load_module function.

@b4sus
Copy link
Contributor

b4sus commented Apr 25, 2023

Just tested the fix with b15, nice to see progress in the topic 👍
However :), it is not working for me - I have similar setup as mentioned above:

project_dir/
    myproject/
        __init__.py
        components/
            __init__py.py
            my_comp.py
        pipelines/
            __init__py.py
            my_pipelines.py
        utils.py
    pyproject.toml
    .gitignore

and using the absolute imports. From component module I import some util function from other module (via absolute import, eg in file myproject/components/my_comp.py using import from myproject.utils import xyz) -> the root package is not visible - No module named 'myproject' - what makes sense as it was never added to sys.path.
I guess with relative imports only it's working, but it would be great to also support absolute.

@connor-mccarthy
Copy link
Member Author

Thanks for testing and updating this, @b4sus. Reopening.

rd-pong pushed a commit to rd-pong/pipelines that referenced this issue Apr 26, 2023
…nents. Fixes kubeflow#8385 (kubeflow#9157)

* Add module directory to sys.path

* Add nested module imports unit test

* Add release note to release.md
@pbalm
Copy link

pbalm commented May 15, 2023

This bug seems to have been introduced in kfp==1.8.12

@connor-mccarthy
Copy link
Member Author

@b4sus, I've been looking into a fix for this and my sense is that we may not want to support absolute imports at this time, since it requires that the KFP component executor pip install the the user's package (in your example, the myproject package), which requires that KFP know something about the user's directory structure.

In a bit more detail, the No module named 'myproject' can be resolved at component build/compile time by running pip install . from project_dir/ to install myproject. However, there will still be a No module named 'myproject' exception at component runtime, since KFP doesn't install the package in the component build (see the Dockerfile template) or runtime commands/args that execute the component.

This is of course something that we could change by adding additional parameters to the component build process or doing something "smart" under the hood in the component build logic, but I'm not sure the cost of either (a) exposing those parameters to users or (b) maintaining that smart logic is worth the benefit. It seems reasonable that the module that contains the component definition(s) should be runnable as a script (with relative imports), rather than something that requires pip installation (with absolute imports).

Let me know if you have other thoughts or if my understanding needs refinement.

Thank you, by the way, for the very helpful minimal reproducible example.

@b4sus
Copy link
Contributor

b4sus commented May 16, 2023

Hey @connor-mccarthy ,
I understand the script nature - I also tend to run component as script to try (and I need to run it from project_dir for it to work).
However, as the project grows and let's say you want to have proper tests (under project_dir/test) and run them simply by pytest tests from project_dir, you will run into problem (using relative imports).

I might have another idea for a fix though :). What about adding components_directory (cli) argument from build function to the path? Like this:

def build(components_directory: str, component_filepattern: str, engine: str,
          kfp_package_path: Optional[str], overwrite_dockerfile: bool,
          build_image: bool, platform: str, push_image: bool):
    """Builds containers for KFP v2 Python-based components."""

    sys.path.append(components_directory)

    if build_image and engine != 'docker':

Then I can run kfp build wherever with argument pointing to project_dir and the absolute import should (tested locally) work. This would solve absolute imports, relative should not be affected. I cannot think of case when this would cause an issue, also shouldn't be maintenance heavy.
What do you think?

@connor-mccarthy connor-mccarthy self-assigned this Jun 22, 2023
@Maxl94
Copy link

Maxl94 commented Aug 16, 2023

Hey @b4sus,

I have a similar project structure as you mentioned above. I wonder how you handle the component within the pipelines and how you separate the components. If the pipeline consist of more than one component, lets say preprocessing and training.

project_dir/
    myproject/
        __init__.py
        components/
            preprocssing/
                __init__.py
                component.py
            training/
                __init__.py
                component.py
            __init__.py
        pipelines/
            __init__.py
            my_pipelines.py
        utils.py
    pyproject.toml
    .gitignore

For now I use a CI/CI pipeline to build and push the components and finally compile and submit the pipeline job (using Google Vertex AI). The components build command looks like the following. This might be unorthodox, but this was the only way I could find, which works with absolute imports.

kfp component build . --component-filepattern ${{ matrix.component }}/component.py --push-image
#  The matrix.component variable looks for example like that "myproject/components/preprocessing"

Now the problem is, that the generate Dockerfile always has a COPY . . command, which leads to a rebuild of the component everytime I change anything in my project. This also mean I cant changes component without rerunning the whole pipline. That can be costly.

Bevor the containerized python components I used the docker components and only copied the parts of the project relevant to the component. But that required a lot of manual work and lead to errors because of files missing.

Copy link

github-actions bot commented Dec 2, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Dec 2, 2023
Copy link

github-actions bot commented Mar 2, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/sdk kind/bug lifecycle/stale The issue / pull request is stale, any activities remove this label.
Projects
Status: Closed
Development

No branches or pull requests

6 participants