Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pex build fails due to existing work-directory #1969

Closed
christopherfrieler opened this issue Oct 31, 2022 · 26 comments
Closed

pex build fails due to existing work-directory #1969

christopherfrieler opened this issue Oct 31, 2022 · 26 comments
Assignees
Labels

Comments

@christopherfrieler
Copy link

Beginning with version 2.1.105 building the pex file in our CI pipeline fails with the following message:

.../python3.8/site-packages/pex/atomic_directory.py:176: PEXWarning: [pid:XX, tid:XXX, cwd:...]: After obtaining an exclusive lock on <PEX_ROOT>/isolated/.2f4fc85fa2be055a2975ce1147100c0d5c7e663a.atomic_directory.lck, failed to establish a work directory at <PEX_ROOT>/isolated/2f4fc85fa2be055a2975ce1147100c0d5c7e663a.workdir due to: [Errno 17] File exists: '<PEX_ROOT>/isolated/2f4fc85fa2be055a2975ce1147100c0d5c7e663a.workdir'
pex_warnings.warn(
.../python3.8/site-packages/pex/atomic_directory.py:187: PEXWarning: [pid:XX, tid:XXX, cwd:...]: Continuing to forcibly re-create the work directory at <PEX_ROOT>/isolated/2f4fc85fa2be055a2975ce1147100c0d5c7e663a.workdir.
pex_warnings.warn(
Failed to spawn a job for .../bin/python: [Errno 17] File exists: '<PEX_ROOT>/isolated/2f4fc85fa2be055a2975ce1147100c0d5c7e663a.workdir/pex/./venv'

It seems to have to do with #1905 introduced in version 2.1.105, but we have no clue, why this is happening in our CI pipeline, while building the .pex file on MacOS developer machines works. It looks like something else is creating that directory, but there is only one pex command in the pipeline job and the PEX_ROOT is not cached.

Our build environment uses:

  • the Red Hat UBI 8.4 Docker image
  • Python 3.8
  • poetry 1.1, which manages pex as a dev-dependency

Then we build the pex with
poetry run pex --inherit-path --python=python3.8 --requirement requirements.txt --find-links dist/ our_module --output-file dist/final.pex

Any idea why this is happening or what else we could check would be helpful.

@jsirois jsirois added the bug label Oct 31, 2022
@jsirois
Copy link
Member

jsirois commented Oct 31, 2022

@christopherfrieler that warning message looks like the one in the 2.1.112 release. I added it in #1961 to help debug a probable race or wrong POSIX assumption that has been hard to track down. I know this is painful for you, but I'm very happy to have a repro case from you! Can you try setting _PEX_FILE_LOCK_STYLE=bsd (added in #1962) in your CI environment and see if that changes anything?

@christopherfrieler
Copy link
Author

@jsirois Thanks for the quick response. Setting _PEX_FILE_LOCK_STYLE=bsd solved the problem. Would you suggest to set it as a workaround until there is a fix for the locking?

@jsirois
Copy link
Member

jsirois commented Nov 2, 2022

@christopherfrieler that is good news. Yes, use that env var for now. I've created an issue to track deprecating that env var and formalizing the solution over in #1971. Even with the deprecation, it will only be a warning and I won't remove the env var until 3.0.0.

@oseiberts11
Copy link

oseiberts11 commented Nov 17, 2022

For the record, we seem to be running into the same issue intermittently (in gitlab CI, in case it matters). I'm trying out the fix with setting the environment variable _PEX_FILE_LOCK_STYLE=bsd. The first run after that went fine, which is a good sign but not proof (unfortunately).

@jsirois
Copy link
Member

jsirois commented Nov 18, 2022

@oseiberts11 thanks for the report. I am definitely interested to hear at some point if you pass a reasonable number of CI runs still without seeing the issues come back.

@oseiberts11
Copy link

@jsirois We had a bunch of CI runs (renovate caused lots of merge requests :) ) and didn't see the issue again. So I think it is safe to say that changing the lock style helps. Thanks for putting in that code! :)

@jsirois
Copy link
Member

jsirois commented Nov 21, 2022

Thanks @oseiberts11 that's great to hear. I think formalizing this option is in order then.

@shalabhc
Copy link
Contributor

shalabhc commented Dec 3, 2022

Just wanted to say we saw this intermittently in our github workflows after which we pinned pex to 2.1.111 since that was before the atomic_directory stuff. Looking forward to the fix landing.

@jsirois
Copy link
Member

jsirois commented Dec 3, 2022

@shalabhc the issue exists well before 2.1.111 I believe and atomic directory is very old. The new thing to try is exporting _PEX_FILE_LOCK_STYLE=bsd in newer versions of Pex. Have you tried this? I may be confused about what you're saying though. Exactly what issue did you encounter that pinning back to 2.1.111 solves?

@shalabhc
Copy link
Contributor

shalabhc commented Dec 3, 2022

After combing old logs in github, I was able to find the error we were seeing (below). This was about 20 days ago. Initially I thought there was something wrong with how I was invoking pex but then I noticed some recent change to atomic directory which looked major (maybe #1961?) and decided to pin to an earlier version. The failures went away after I pinned to 2.1.111.

I'll try _PEX_FILE_LOCK_STYLE=bsd at some point, but am unlikely to get to it in the next week.

/home/runner/.pex/installed_wheels/051b6342a63b3b34d7c078ae15a69b4be072bca358b9f6bcbdd1f6ede373a20c/pex-2.1.113-py2.py3-none-any.whl/pex/atomic_directory.py:217: PEXWarning: [pid:1680, tid:140476798035712, cwd:/home/runner/work/_actions/dagster-io/dagster-cloud-action/pex-testing]: After obtaining an exclusive lock on /home/runner/.pex/isolated/.175945331e12232f56263ca040847a84ca7b851d.atomic_directory.lck, failed to establish a work directory at /home/runner/.pex/isolated/175945331e12232f56263ca040847a84ca7b851d.lck.work due to: [Errno 17] File exists: '/home/runner/.pex/isolated/175945331e12232f56263ca040847a84ca7b851d.lck.work'
  pex_warnings.warn(
/home/runner/.pex/installed_wheels/051b6342a63b3b34d7c078ae15a69b4be072bca358b9f6bcbdd1f6ede373a20c/pex-2.1.113-py2.py3-none-any.whl/pex/atomic_directory.py:228: PEXWarning: [pid:1680, tid:140476798035712, cwd:/home/runner/work/_actions/dagster-io/dagster-cloud-action/pex-testing]: Continuing to forcibly re-create the work directory at /home/runner/.pex/isolated/175945331e12232f56263ca040847a84ca7b851d.lck.work.
  pex_warnings.warn(
Failed to spawn a job for /opt/hostedtoolcache/Python/3.8.14/x64/bin/python3.8: [Errno 2] No such file or directory: '/home/runner/.pex/installed_wheels/051b6342a63b3b34d7c078ae15a69b4be072bca358b9f6bcbdd1f6ede373a20c/pex-2.1.113-py2.py3-none-any.whl/pex/vendor/_vendored/attrs/attrs-21.5.0.dev0.dist-info/LICENSE' -> '/home/runner/.pex/isolated/175945331e12232f56263ca040847a84ca7b851d.lck.work/pex/vendor/_vendored/attrs/attrs-21.5.0.dev0.dist-info/LICENSE'

@jsirois
Copy link
Member

jsirois commented Dec 7, 2022

Ok, thanks for digging up that detail @shalabhc. The naming of the lock file in your output is .lck.work which indicates its using the code introduced here in 2.1.113 changed by #1974:
https://github.com/pantsbuild/pex/pull/1974/files#diff-1bdd426adefbf437ea181e2e6b56c924d945f33776c118c019c310ea840bd3c1R66

I hope to add your case to the victory list once you get around to trying pex>=2.1.113 + _PEX_FILE_LOCK_STYLE=bsd.

@shalabhc
Copy link
Contributor

I can confirm setting _PEX_FILE_LOCK_STYLE=bsd fixed a failure with the above message in our tests with pex 2.1.119

@jsirois
Copy link
Member

jsirois commented Dec 29, 2022

That's good to hear. The trend here would seem to indicate this is a good fix. There is still exactly 1 remaining known customer it does not fix, but they are also using Pants remote caching and it's appearing likely that may mix in its own missing file bug. I'll formalize this fix in the New Year. Thank you everyone who's reported the results of their experiment.

@james-johnston-thumbtack

We are also having this problem. I was able to make a small test case to reliably demonstrate the problem on my computer. To try it, run the following shell script:

#!/bin/sh

echo '#!/usr/bin/env python3' > hello.py

echo 'import setuptools' > setup.py
echo 'setuptools.setup(name="hello", version="10", entry_points={"console_scripts": ["hello = hello:main"]})' >> setup.py

docker run --rm \
        --mount type=bind,source=$(pwd),destination=/mount/workdir \
        --workdir /mount/workdir \
        python:3.11-bullseye \
        /bin/bash -c 'export PATH=$PATH:/tmp/.local/bin && pip install pex==2.1.121 && python setup.py build bdist_wheel && pex hello -v --python=python3 --python-shebang="/usr/bin/env python3" -c hello -o hello.pex'

I am running this on a MacBook Pro with M1 Pro chip, macOS Ventura 13.2, and Docker Desktop 4.16.2. It yields the following error:

pex: Building pex
pex: Building pex :: Resolving distributions (hello)
pex: Building pex :: Resolving distributions (hello) :: Resolving requirements.
pex: Building pex :: Resolving distributions (hello) :: Resolving requirements. :: Resolving for:
  /usr/local/bin/python3.11
pex: Hashing pex
pex: Hashing pex
pex: Hashing pex: 40.6ms
pex: Isolating pex
pex: Isolating pex :: Extracting pex to /root/.pex/isolated/e7af1fe664716d79c12cc7c7a3326d9223711fb7
pex: Hashing pex: 33.1ms
pex: Isolating pex
/usr/local/lib/python3.11/site-packages/pex/atomic_directory.py:217: PEXWarning: [pid:1, tid:281473120707040, cwd:/mount/workdir]: After obtaining an exclusive lock on /root/.pex/isolated/.e7af1fe664716d79c12cc7c7a3326d9223711fb7.atomic_directory.lck, failed to establish a work directory at /root/.pex/isolated/e7af1fe664716d79c12cc7c7a3326d9223711fb7.lck.work due to: [Errno 17] File exists: '/root/.pex/isolated/e7af1fe664716d79c12cc7c7a3326d9223711fb7.lck.work'
  pex_warnings.warn(
/usr/local/lib/python3.11/site-packages/pex/atomic_directory.py:228: PEXWarning: [pid:1, tid:281473120707040, cwd:/mount/workdir]: Continuing to forcibly re-create the work directory at /root/.pex/isolated/e7af1fe664716d79c12cc7c7a3326d9223711fb7.lck.work.
  pex_warnings.warn(
pex: Isolating pex :: Extracting pex to /root/.pex/isolated/e7af1fe664716d79c12cc7c7a3326d9223711fb7
Failed to spawn a job for /usr/local/bin/python3.11: [Errno 2] No such file or directory: '/usr/local/lib/python3.11/site-packages/pex/venv/virtualenv.py' -> '/root/.pex/isolated/e7af1fe664716d79c12cc7c7a3326d9223711fb7.lck.work/pex/venv/virtualenv.py'

I have noticed that if I remove the --python flag, the script runs farther and gives a more expected complaint about No matching distribution found for hello.

I also noticed that pex 2.1.112 is the first version to fail with this error.

The _PEX_FILE_LOCK_STYLE=bsd option also fixes the error. But maybe this test case will help fix the POSIX lock style.

@jsirois
Copy link
Member

jsirois commented Feb 3, 2023

Thanks @james-johnston-thumbtack. I repro with your script, but your script is also odd-seeming in a few ways. This does the same thing and works fine for example:

$ cat test.sh
#!/bin/sh

cat <<EOF > hello.py
def main():
    print("Hello")
EOF

cat <<EOF > setup.py
import setuptools


setuptools.setup(
    name="hello",
    version="10",
    entry_points={
        "console_scripts": ["hello = hello:main"]
    }
)
EOF

docker run --rm \
    --mount type=bind,source=$(pwd),destination=/mount/workdir \
    --workdir /mount/workdir \
    python:3.11-bullseye \
    /bin/bash -c '
        pip install pex==2.1.121 && \
        python setup.py bdist_wheel && \
        pex dist/hello-10-py3-none-any.whl -v -c hello -o hello.pex
    '

So I'll dig into some of the oddities your script sets up, like the PATH bit, the build bdist_wheel instead of just bdist_wheel, and the seeming reliance on some CWD behavior of Pip since hello is not on PyPI but you manage to use it instead of a path like .. Although these are all odd, they are certainly possible and even legitimate; so Pex should work with them.

@jsirois
Copy link
Member

jsirois commented Feb 3, 2023

Ok, you are definitely right @james-johnston-thumbtack the key bit is --python=python3. Using either of --python=/usr/local/bin/python3 or --python=/usr/bin/python3 works fine.

@jsirois
Copy link
Member

jsirois commented Feb 3, 2023

In fact, --python=python and --python=python3.11 work fine too.
In the image I see:

root@b39e34a16db5:/mount/workdir# which -a python
/usr/local/bin/python
/usr/local/bin/python
root@b39e34a16db5:/mount/workdir# which -a python3
/usr/local/bin/python3
/usr/local/bin/python3
/usr/bin/python3
root@b39e34a16db5:/mount/workdir# which -a python3.11
/usr/local/bin/python3.11
/usr/local/bin/python3.11

And:

root@b39e34a16db5:/mount/workdir# /usr/bin/python3 -V
Python 3.9.2
root@b39e34a16db5:/mount/workdir# /usr/local/bin/python3 -V
Python 3.11.1

So the issue would appear to be related to there being 2 different non-specific python3 on the PATH in that container and Pex behaving badly as a result, but only under posix locking, because I re-checked, and _PEX_FILE_LOCK_STYLE=bsd definitely fixes!. This is juicy.

@jsirois
Copy link
Member

jsirois commented Feb 3, 2023

And re the 2 python3 interpreters, the PATH order is in play as well here. If Python 3.9 is on the PATH 1st, it works fine:

PATH=/usr/bin:$PATH _PEX_FILE_LOCK_STYLE=posix pex dist/hello-10-py3-none-any.whl --python=python3 -c hello -o hello.pex

If Python3.11 is 1st on PATH on the other hand, it fails:

PATH=/usr/local/bin:/usr/bin:$PATH _PEX_FILE_LOCK_STYLE=posix pex dist/hello-10-py3-none-any.whl --python=python3 -c hello -o hello.pex

Unless bsd locking is used of course, then fine again:

PATH=/usr/local/bin:/usr/bin:$PATH _PEX_FILE_LOCK_STYLE=bsd pex dist/hello-10-py3-none-any.whl --python=python3 -c hello -o hello.pex -vvvvvvvvv

@jsirois
Copy link
Member

jsirois commented Feb 3, 2023

Ok, one thing I've consistently missed in my analysis of all this is that Pex had threads prior to the introduction of lock files and their use of a parallel downloader. The jobs API uses a single thread to implement a work-stealing queue of job processes to launch:
https://github.com/pantsbuild/pex/blob/3e71bc7896935a8177e9bbe9399769c3522528dc/pex/jobs.py#L512-L514

So, although jobs - the central parallelization mechanism Pex has used for a long time - is process based and not thread based, it does spawn a solitary thread; as such, unless things are just so, posix file locks should be expected to fail.

@jsirois
Copy link
Member

jsirois commented Feb 3, 2023

I'll be damned, this fixes:

$ git diff
diff --git a/pex/jobs.py b/pex/jobs.py
index 734836a..db05fc0 100644
--- a/pex/jobs.py
+++ b/pex/jobs.py
@@ -550,3 +550,4 @@ def execute_parallel(
                         error = e
         finally:
             job_slots.release()
+            spawner.join()

I really don't know how I continually glossed over / missed the jobs.py Thread spawn.

I need to think this through a bit more, but I think this solves the issue. The bsd locks are still needed for the lock file handling of parallel downloads (and the later added parallel downloads of PEP-691 metadata), but the old-school plain old Pex code paths are made safe with the lone thread join ensuring its shut down before serially continuing to the next lines of code.

@jsirois
Copy link
Member

jsirois commented Feb 3, 2023

@james-johnston-thumbtack thank you so much for the repro case. As is always the case, these are absolute gold and make debugging roughly infinitely easier and quicker than it is otherwise.

@jsirois
Copy link
Member

jsirois commented Feb 3, 2023

Ok, so the fact there was >1 python3 in the image was critical here, that got 2 concurrent jobs running under the thread spawner. With just python or python3.11 or a full path, only 1 Python was available and the race was not triggered.

@jsirois
Copy link
Member

jsirois commented Feb 3, 2023

And as to the 2.1.112 observation, I find the same for this test case. It does not trigger before (going back to 2.1.90) but does trigger reliably after. That jives with this PR: #1961 which changed all atomic_directory locks to be exclusive (I.E.: use file locking). One of those that changed from not using file locking to using file locking in that PR was code in the interpreter identification job. That is exactly the problem here where 2 interpreters are being identified.

@jsirois
Copy link
Member

jsirois commented Feb 3, 2023

Importantly though, in the OP @christopherfrieler reported 2.1.105 as the 1st problematic release in his case. The repro case does not trigger for Pex 2.1.105; so more analysis is needed.

@james-johnston-thumbtack

thank you so much for the repro case

Awesome - glad it was helpful & thank you for taking a look at it!

jsirois added a commit to jsirois/pex that referenced this issue Feb 9, 2023
Although the thread was not leaked per-se, it could run after
`execute_parallel` returned which could cause parallel atomic_directory
posix locks to fail.

Fixes one case in pex-tool#1969.
jsirois added a commit to jsirois/pex that referenced this issue Feb 10, 2023
Although the thread was not leaked per-se, it could run after
`execute_parallel` returned which could cause parallel atomic_directory
posix locks to fail.

Fixes one case in pex-tool#1969.
jsirois added a commit that referenced this issue Feb 10, 2023
Although the thread was not leaked per-se, it could run after
`execute_parallel` returned which could cause parallel atomic_directory
posix locks to fail.

Fixes one case in #1969.
jsirois added a commit to jsirois/pex that referenced this issue Feb 27, 2023
This is needed to have independent POSIX fcntl locks in the same process
by multiple threads and also needed whenever BSD flock locks silently
use fcntl emulation (exotic systems or under NFS mounts). Pex is
designed to avoid multi-threading when using POSIX locks; so this just
serves as a design-error backstop for those style locks. For silently
emulated BSD locks, this provides correctness.

Analysis of pex-tool#2066 and pex-tool#1969 do not point to this enhancement solving any
existing problems, but this is an improvement for the cases mentioned
should we hit them.

Work regarding pex-tool#2066.
jsirois added a commit that referenced this issue Feb 27, 2023
This is needed to have independent POSIX fcntl locks in the same process
by multiple threads and also needed whenever BSD flock locks silently
use fcntl emulation (exotic systems or under NFS mounts). Pex is
designed to avoid multi-threading when using POSIX locks; so this just
serves as a design-error backstop for those style locks. For silently
emulated BSD locks, this provides correctness.

Analysis of #2066 and #1969 do not point to this enhancement solving any
existing problems, but this is an improvement for the cases mentioned
should we hit them.

Work regarding #2066.
@jsirois
Copy link
Member

jsirois commented Apr 17, 2023

Since the OP issue was fixed (see: #1969 (comment)) I'll assume the OP report of Pex 2.1.105 was in error and call this closed since all reported issues by @christopherfrieler, @oseiberts11, @shalabhc and @james-johnston-thumbtack are solved now.

@jsirois jsirois closed this as completed Apr 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants