Run PEXes as normal applications #962

Closed
jsirois opened this issue May 4, 2020 · 5 comments · Fixed by #1153

@jsirois
Member

jsirois commented May 4, 2020

The PEX runtime differs from the typical Python runtime in several ways, and this can lead to problems PEXing various Python programs:

  1. The need to mark a PEX as not zip-safe for application code that uses filesystem APIs to find code and resources in the application.
  2. The merging of the sys.path from individual pre-installed wheel chroots can expose bugs in underlying distributions that are normally masked by being installed in the same chroot as their enclosing dependency set.

Although item 1 has a workaround in --not-zip-safe, that workaround often surprises users and is not easy to discover. Item 2 has no solution, which makes it impossible to PEX certain applications, in particular those that use namespace packages inconsistently (see #331 for examples).
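
To make item 1 concrete, here is a minimal sketch of why filesystem-based resource lookups break inside a zip. The package name `demo_app`, the file `greeting.txt`, and the helper functions are invented for illustration and are not part of Pex; only standard-library calls are used.

```python
import os
import pkgutil
import sys
import tempfile
import zipfile


def build_demo_zip(directory):
    """Create a zip holding a hypothetical package `demo_app` with one data file."""
    zip_path = os.path.join(directory, "demo_app.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr(
            "demo_app/__init__.py",
            "import os\n"
            "# Filesystem-style lookup relative to the package's __file__.\n"
            "RESOURCE_PATH = os.path.join(os.path.dirname(__file__), 'greeting.txt')\n",
        )
        zf.writestr("demo_app/greeting.txt", "hello")
    return zip_path


def probe(zip_path):
    """Import `demo_app` from the zip and try both ways of reaching its resource."""
    sys.path.insert(0, zip_path)
    try:
        import demo_app

        # The path points inside the zip file, so the OS cannot see it.
        found_on_fs = os.path.exists(demo_app.RESOURCE_PATH)
        # The loader-based API asks the zip importer for the bytes instead.
        via_loader = pkgutil.get_data("demo_app", "greeting.txt")
        return found_on_fs, via_loader
    finally:
        sys.path.remove(zip_path)
        sys.modules.pop("demo_app", None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        found_on_fs, via_loader = probe(build_demo_zip(tmp))
        print("os.path.exists on __file__-relative path:", found_on_fs)
        print("pkgutil.get_data:", via_loader)
```

Under these assumptions the `__file__`-relative check reports that the resource does not exist, while the loader-based `pkgutil.get_data` call still returns its bytes. Application code written in the first style is what needs --not-zip-safe.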

Recent work that added the --unzip option showed that first unzipping PEXes led to better cold and warm cache startup latency. If we were to push on this, one solution to both of the problems listed above would be for PEXes to always unzip / re-package themselves in the pex cache with a standard site-packages layout (i.e., in one sys.path chroot). This should improve PEX performance even more, since sys.path would have fewer entries, allowing imports provided by the PEX to be found in a single filesystem search instead of needing to search 1 (the pex zip itself) + N (pex dependencies) locations.
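
The 1 + N search cost can be illustrated with a small sketch that counts how many path entries the standard import machinery consults before it finds a module. The directory names, the `dep_i` module names, and the helpers are assumptions made up for this example; it counts entries probed and does not measure latency.

```python
import importlib.machinery
import os
import tempfile


def write_module(directory, name):
    """Write an empty module `name`.py into `directory`."""
    with open(os.path.join(directory, name + ".py"), "w") as fp:
        fp.write("")


def entries_probed(search_path, module_name):
    """Return how many path entries are consulted before `module_name` is found."""
    for probed, entry in enumerate(search_path, start=1):
        spec = importlib.machinery.PathFinder.find_spec(module_name, [entry])
        if spec is not None:
            return probed
    raise ImportError(module_name)


def build_layouts(root, names):
    """Build a merged layout (one directory per dependency) and a single directory."""
    merged = []
    for name in names:
        chroot = os.path.join(root, "chroots", name)
        os.makedirs(chroot)
        write_module(chroot, name)
        merged.append(chroot)

    site_packages = os.path.join(root, "site-packages")
    os.makedirs(site_packages)
    for name in names:
        write_module(site_packages, name)
    return merged, [site_packages]


if __name__ == "__main__":
    names = ["dep_{}".format(i) for i in range(5)]
    with tempfile.TemporaryDirectory() as tmp:
        merged, single = build_layouts(tmp, names)
        # The last dependency is the worst case for the merged layout.
        print("merged layout:", entries_probed(merged, names[-1]))
        print("single layout:", entries_probed(single, names[-1]))
```

In the merged layout the number of entries probed grows with the position of the dependency on the path, while the single directory is always found on the first probe.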

@cosmicexplorer
Contributor

cosmicexplorer commented May 6, 2020

Would it be feasible to extend this proposal to allow building unzipped PEX files that aren't intended to be run on another host? The result of something like PEXBuilder.build_unzipped() could be just the unzipped chroot path copied into the pex cache, which a Python application depending on pex could use to know how to invoke the unzipped PEX file.

Rationale

The lazy-loading ipex functionality added to pants in pantsbuild/pants#8793 (which just turns your code into a Python application that knows how to build the real pex file) currently makes a single trip, when the ipex is first executed, of downloading wheels => stuffing them in a pex file. If pex files were then unconditionally unzipped on top of that, it would involve a second crossing of that zipped/unzipped bridge. Stuffing 3rdparty dependencies into the resulting pex file appears to take more time than actually resolving dependencies, especially when the resolve is mostly cached (more time than running pants, actually, since the 3rdparty dependencies are not zipped into the ipex file when it is created). The resulting pex will also always be run on the same host, because ipex will exec the built pex immediately after it is produced.

@jsirois
Member Author

jsirois commented May 6, 2020

The Pex API has supported this for a long time and Pants has correspondingly used it for a long time:
https://github.com/pantsbuild/pex/blob/916e61e04634c60d09ee25859c76ba4f27282e51/pex/pex_builder.py#L462-L477

@cosmicexplorer
Contributor

I had forgotten the difference between .build() and .freeze()! Thank you so much!!

jsirois added this to the 3.0 milestone May 11, 2020
@kwlzn
Contributor

kwlzn commented May 19, 2020

I suspect an inverted flag to indicate e.g. "--force-zip" (or "--zip-safe=False" becoming the default) might be nice to leave around for tight-quartered execution of large pex envs where expansive space consumption may not be desirable (like a pex that contains a large but otherwise zip-safe library executing on a PySpark worker, etc) - otherwise, this sounds great to me as a default mode.

pex execution overhead is a major UX issue for us particularly with O(GB) pex envs for DS/ML - and for e.g. local tool use cases.

@jsirois
Member Author

jsirois commented Dec 2, 2020

A speed hack note when implementing this:

If we find sys.argv[0] (the PEX zip file) is writeable, we could re-write its shebang to point to the selected venv interpreter to fully eliminate re-exec overhead on subsequent runs:

Given the mechanism here:

$ cat __main__.py 
#!/usr/bin/env python
from __future__ import print_function

def _maybe_reexec():
    import sys

    _BINARY = sys.argv[0]

    # Here we would extract the app to a venv, abbreviated at extracted_app.py for demonstration.
    _NEW_SHEBANG = b"#!" + sys.executable.encode("utf-8") + b" extracted_app.py\n"

    with open(_BINARY, "rb") as fp:
        shebang = fp.readline()
        if shebang == _NEW_SHEBANG:
            return

        import os
        import shutil

        new_binary = "{}.rewrite".format(_BINARY)
        with open(new_binary, "wb") as new_fp:
            new_fp.write(_NEW_SHEBANG)
            new_fp.write(fp.read())

        shutil.copymode(_BINARY, new_binary)
        os.rename(new_binary, _BINARY)
        os.execv(_BINARY, sys.argv)


_maybe_reexec()
del _maybe_reexec


import sys


print("ERROR: should have never gotten here!")
sys.exit(1)

Which relies on the to-be-written PEX -> venv extraction code represented by a pre-extracted single file for demonstration here:

$ cat extracted_app.py 
import sys

print("Hello. ARGV={}".format(sys.argv))

We get:

$ zip main.zip __main__.py && cat <(echo '#!/usr/bin/env python') main.zip > main.pex && chmod +x main.pex
  adding: __main__.py (deflated 53%)
$ head -1 main.pex
#!/usr/bin/env python
$ time ./main.pex
Hello. ARGV=['extracted_app.py', './main.pex']

real	0m0.051s
user	0m0.039s
sys	0m0.007s
$ head -1 main.pex
#!/usr/bin/python
$ time ./main.pex
#!/usr/bin/python extracted_app.py
Hello. ARGV=['extracted_app.py', './main.pex']

real	0m0.024s
user	0m0.016s
sys	0m0.004s
$ time python -c 'print("Hello")'
Hello

real	0m0.022s
user	0m0.018s
sys	0m0.004s

This self-modifying executable approach could be made robust by recording the values of any PEX_* environment variables that affect interpreter selection. If those values don't match on a subsequent run, re-run interpreter selection and, if a new interpreter is called for, re-run the application install / shebang re-write.
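
One way such a check might look, as a rough sketch only: fingerprint the relevant variables and compare against a recorded value on each run. The helper names, the record file, and the choice of `PEX_PYTHON` and `PEX_PYTHON_PATH` as the tracked set are assumptions for illustration, not Pex's implementation.

```python
import hashlib
import json
import os
import tempfile

# Assumption: the variables that influence interpreter selection. The exact set
# to track is a guess made for this sketch.
SELECTION_VARS = ("PEX_PYTHON", "PEX_PYTHON_PATH")


def selection_fingerprint(environ):
    """Hash the interpreter-selection variables present in `environ`."""
    relevant = {name: environ[name] for name in SELECTION_VARS if name in environ}
    payload = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_reselection(record_path, environ):
    """Return True when no fingerprint is recorded or the recorded one is stale."""
    try:
        with open(record_path) as fp:
            recorded = fp.read().strip()
    except OSError:
        recorded = None
    return recorded != selection_fingerprint(environ)


def record_selection(record_path, environ):
    """Persist the fingerprint after interpreter selection has been (re-)run."""
    with open(record_path, "w") as fp:
        fp.write(selection_fingerprint(environ))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        record = os.path.join(tmp, "selection.sha256")  # hypothetical record file
        env = {"PEX_PYTHON": "python3.8"}
        # First run: nothing has been recorded yet.
        print(needs_reselection(record, env))
        record_selection(record, env)
        # Same environment again, then a changed PEX_PYTHON.
        print(needs_reselection(record, env))
        print(needs_reselection(record, {"PEX_PYTHON": "python3.9"}))
```

Variables outside the tracked set do not affect the fingerprint, so unrelated environment changes would not trigger a re-install in this sketch.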

jsirois self-assigned this Dec 5, 2020
jsirois added a commit to jsirois/pex that referenced this issue Dec 7, 2020
Add a new `--include-tools` option to include any pex.tools in generated
PEX files. These tools are activated by running PEX files with
PEX_TOOLS=1. The `Info` tool seeds the tool set and simply dumps the
effective PEX-INFO for the given PEX.

Work towards pex-tool#962 and pex-tool#1115
jsirois added a commit that referenced this issue Dec 8, 2020
Add a new `--include-tools` option to include any pex.tools in generated
PEX files. These tools are activated by running PEX files with
PEX_TOOLS=1. The `Info` tool seeds the tool set and simply dumps the
effective PEX-INFO for the given PEX.

Work towards #962 and #1115
jsirois added a commit to jsirois/pex that referenced this issue Dec 11, 2020
This fixes binary canonicalization to handle virtual environments
created with virtualenv instead of pyvenv. It also adds support for
resolving the base interpreter used to build a virtual environment.

The ability to resolve a virtual environment interpreter will be used to
fix pex-tool#1031 where virtual environments created with
`--system-site-packages` leak those packages through as regular sys.path
entries otherwise undetectable by PEX.

Work towards pex-tool#962 and pex-tool#1115.
jsirois added a commit that referenced this issue Dec 11, 2020
This fixes binary canonicalization to handle virtual environments
created with virtualenv instead of pyvenv. It also adds support for
resolving the base interpreter used to build a virtual environment.

The ability to resolve a virtual environment interpreter will be used to
fix #1031 where virtual environments created with
`--system-site-packages` leak those packages through as regular sys.path
entries otherwise undetectable by PEX.

Work towards #962 and #1115.
jsirois added a commit that referenced this issue Dec 14, 2020
Add a `venv` tool to create a virtual environment from a PEX file. The
virtual environment is seeded with just the PEX user code and
distributions applicable to the selected interpreter for the local
machine. The virtual environment does not have Pip installed by default
although that can be requested with `--pip`.

The virtual environment comes with a `__main__.py` at the root of the 
venv to emulate a loose pex that can be run with `python venv.dir` just
like a loose pex. This entry point supports all the behavior of the
original PEX file not related to interpreter selection, namely support
for PEX_SCRIPT, PEX_MODULE, PEX_INTERPRETER and PEX_EXTRA_SYS_PATH.

A sibling `pex` script is linked to `__main__.py` to provide the
maximum performance entrypoint that always avoids interpreter
re-execing and thus yields equivalent performance to a pure virtual
environment.

Work towards #962 and #1115.
jsirois added a commit that referenced this issue Dec 24, 2020
The new --venv execution mode builds a PEX file that includes pex.tools
and extracts itself into a venv under PEX_ROOT upon first execution or any
execution that might select a different interpreter than the default.

In order to speed up the local build and execute case, --seed mode is
added to seed the PEX_ROOT caches that will be used at runtime. This is
important for --venv mode since venv seeding depends on the selected
interpreter and one is already selected during the PEX file build
process.

Fixes #962
Fixes #1097
Fixes #1115