Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy user files that were imported by workflow in pyflyte run #2663

Merged

Conversation

thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Aug 8, 2024

Tracking issue

Related to flyteorg/flyte#5343

Why are the changes needed?

Currently, pyflyte run requires --copy-all when the workflow imports files from other directories. This is a common gotcha because pyflyte run workflow.py works, but pyflyte run --remote workflow.py does not.

What changes were proposed in this pull request?

With this PR:

  • We look at all the modules in sys.modules and copy over all files that are necessary for the workflow to run.
  • This way pyflyte run --remote workflow.py will work out of the box most of the time.
  • Less copying than --copy-all because it only copies over the required files. If you have a project with tests or docs, it will not be copied.

I think the new coping mechanism does what copy_module_to_destination does, but I did not want to adjust it just in case I break anything. I added a new add_imported_modules_from_source to additionally copy over files.

How was this patch tested?

I added unit tests to make sure the copying works. I also had a project with this structure:

.
└── workflows
    ├── __init__.py
    ├── another_module
    │   └── wow.py
    ├── main.py
    └── util.py

and ran pyflyte run --remote workflows/main.py and everything copies correctly. Simple workflow in the root directory also works:

.
├── main.py
└── util.py

Signed-off-by: Thomas J. Fan <[email protected]>
Signed-off-by: Thomas J. Fan <[email protected]>
@@ -46,6 +50,8 @@ def compress_scripts(source_path: str, destination: str, module_name: str):

visited: typing.List[str] = []
copy_module_to_destination(source_path, destination_path, module_name, visited)
add_imported_modules_from_source(source_path, destination_path, list(sys.modules.values()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove copy_module_to_destination, does it work as well? If so, I think we can remove that.

For more context:
copy_module_to_destination will only copy the module to remote if and only if that module has a flyte task/workflow.

in this case, pyflyte run --remote uploads foo_wf and main

└── workflows
    ├── __init__.py
    ├── another_module
    │   └── foo_wf.py
    ├── main.py  # it import foo_wf

I did this because I didn't know how to do add_imported_modules_from_source at that time.

Now you have added it, I think we should be able to remove copy_module_to_destination

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, remove this one line and merge. tested this against dev but feel free to again. the only difference i saw is that with the copy_module... call, the main script itself is already in the destination folder, and because it exists it's skipped when add_imported... sees it. Without it, the main file itself is copied by add_imported...

@@ -127,6 +133,65 @@ def tar_strip_file_attributes(tar_info: tarfile.TarInfo) -> tarfile.TarInfo:
return tar_info


def add_imported_modules_from_source(source_path: str, destination: str, modules: List[ModuleType]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If utils imports another local module, will we copy that to the remote storage as well?

.
└── workflows
    ├── __init__.py
    ├── another_module
    │   └── wow.py
    ├── main.py  # only import util.py
    └── util.py # import util2.py
     └── util2.py     

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any files that is a part of the import chain from the original workflow will be copied over.

Details: If main.py imports from util.py and util.py imports from util2.py then util2.py is part of sys.modules when main.py is loaded. With this PR, if the module is in sys.modules and in the same working directory as main.py, then it will be uploaded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, perfect

Signed-off-by: Thomas J. Fan <[email protected]>
Copy link

codecov bot commented Aug 8, 2024

Codecov Report

Attention: Patch coverage is 14.81481% with 46 lines in your changes missing coverage. Please review.

Project coverage is 46.23%. Comparing base (d802c7e) to head (cd34d90).
Report is 14 commits behind head on master.

Files Patch % Lines
flytekit/tools/script_mode.py 13.46% 45 Missing ⚠️
flytekit/remote/remote.py 50.00% 1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (d802c7e) and HEAD (cd34d90). Click for more details.

HEAD has 3 uploads less than BASE
Flag BASE (d802c7e) HEAD (cd34d90)
5 2
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2663       +/-   ##
===========================================
- Coverage   92.53%   46.23%   -46.31%     
===========================================
  Files          27      187      +160     
  Lines        1206    19162    +17956     
  Branches        0     2766     +2766     
===========================================
+ Hits         1116     8859     +7743     
- Misses         90     9867     +9777     
- Partials        0      436      +436     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

wild-endeavor
wild-endeavor previously approved these changes Aug 8, 2024
Signed-off-by: Thomas J. Fan <[email protected]>
Signed-off-by: Thomas J. Fan <[email protected]>
pingsutw
pingsutw previously approved these changes Aug 8, 2024
Signed-off-by: Thomas J. Fan <[email protected]>
Signed-off-by: Thomas J. Fan <[email protected]>
@@ -493,7 +493,7 @@ def _update_flyte_context(params: RunLevelParams) -> FlyteContext.Builder:
if output_prefix and ctx.file_access.is_remote(output_prefix):
with tempfile.TemporaryDirectory() as tmp_dir:
archive_fname = pathlib.Path(os.path.join(tmp_dir, "script_mode.tar.gz"))
compress_scripts(params.computed_params.project_root, str(archive_fname), params.computed_params.module)
compress_scripts(params.computed_params.project_root, str(archive_fname), list(sys.modules.values()))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pingsutw I'm concerned with this change. At this point, is the main workflow imported by flytekit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so.

entity = load_naive_entity(module, exe_entity, project_root)

If not, could we params.computed_params.module to the list as well?

Copy link
Member Author

@thomasjpfan thomasjpfan Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new get_all_modules function that adds params.computed_params.module if it is not already imported.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm, thank you!

Signed-off-by: Thomas J. Fan <[email protected]>
Signed-off-by: Thomas J. Fan <[email protected]>
Signed-off-by: Thomas J. Fan <[email protected]>
@wild-endeavor wild-endeavor merged commit 89c8a1b into flyteorg:master Aug 9, 2024
99 checks passed
Mecoli1219 pushed a commit to Mecoli1219/flytekit that referenced this pull request Aug 14, 2024
Mecoli1219 pushed a commit to Mecoli1219/flytekit that referenced this pull request Aug 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants