pex build fails due to existing work-directory #1969
@christopherfrieler that warning message looks like the one in the 2.1.112 release. I added it in #1961 to help debug a probable race or wrong POSIX assumption that has been hard to track down. I know this is painful for you, but I'm very happy to have a repro case from you! Can you try setting …
@jsirois Thanks for the quick response. Setting …
@christopherfrieler that is good news. Yes, use that env var for now. I've created an issue to track deprecating that env var and formalizing the solution over in #1971. Even with the deprecation, it will only be a warning and I won't remove the env var until 3.0.0.
For the record, we seem to be running into the same issue intermittently (in GitLab CI, in case it matters). I'm trying out the fix of setting the environment variable …
@oseiberts11 thanks for the report. I am definitely interested to hear, at some point, whether you get through a reasonable number of CI runs without seeing the issue come back.
@jsirois We had a bunch of CI runs (renovate caused lots of merge requests :) ) and didn't see the issue again. So I think it is safe to say that changing the lock style helps. Thanks for putting in that code! :)
Thanks @oseiberts11 that's great to hear. I think formalizing this option is in order then.
Just wanted to say we saw this intermittently in our GitHub workflows, after which we pinned pex to …
@shalabhc the issue exists well before 2.1.111 I believe, and atomic_directory is very old. The new thing to try is exporting …
After combing old logs in GitHub, I was able to find the error we were seeing (below). This was about 20 days ago. Initially I thought there was something wrong with how I was invoking pex, but then I noticed a recent change to atomic_directory that looked major (maybe #1961?) and decided to pin to an earlier version. The failures went away after I pinned to 2.1.111. I'll try …
Ok, thanks for digging up that detail @shalabhc. The naming of the lock file in your output is … I hope to add your case to the victory list once you get around to trying …
I can confirm setting …
That's good to hear. The trend here would seem to indicate this is a good fix. There is still exactly 1 remaining known customer it does not fix, but they are also using Pants remote caching, and it's appearing likely that may mix in its own missing-file bug. I'll formalize this fix in the New Year. Thank you, everyone who's reported the results of their experiment.
We are also having this problem. I was able to make a small test case to reliably demonstrate the problem on my computer. To try it, run the following shell script:
I am running this on a MacBook Pro with M1 Pro chip, macOS Ventura 13.2, and Docker Desktop 4.16.2. It yields the following error:
I have noticed that if I remove the … I also noticed that pex 2.1.112 is the first version to fail with this error. The …
Thanks @james-johnston-thumbtack. I repro with your script, but your script is also odd-seeming in a few ways. This does the same thing and works fine, for example: …
So I'll dig into some of the oddities your script sets up, like the PATH bit, the …
Ok, you are definitely right @james-johnston-thumbtack; the key bit is …
In fact,
And:
So the issue would appear to be related to there being 2 different non-specific python3 interpreters on the PATH …
And re the 2 python3 interpreters, the PATH order is in play as well here. If Python 3.9 is first on the PATH, it works fine:
If Python 3.11 is first on the PATH, on the other hand, it fails:
Unless BSD locking is used, of course; then it's fine again: …
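The PATH-order dependence described above can be sketched with a tiny resolver. This is illustrative only, not Pex's actual interpreter-discovery code; the function name `which_first` is invented for the sketch:

```python
import os

def which_first(name, path_entries):
    """Return the first executable called `name` found on the given
    PATH entries, mimicking shell lookup: earlier entries win."""
    for entry in path_entries:
        candidate = os.path.join(entry, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

# With two non-specific `python3` binaries installed, whichever directory
# comes first on PATH decides which interpreter is identified first.
path_entries = os.environ.get("PATH", "").split(os.pathsep)
print(which_first("python3", path_entries))
```

Reordering the entries flips the result, which is the behavior the two runs above exercise.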
Ok, one thing I've consistently missed in my analysis of all this is that Pex had threads prior to the introduction of lock files and their use of a parallel downloader. The jobs API uses a single thread to implement a work-stealing queue of job processes to launch: … So, although jobs, the central parallelization mechanism Pex has used for a long time, is process-based and not thread-based, it does spawn a solitary thread; as such, unless things are just so, POSIX file locks should be expected to fail.
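The reason a lone extra thread breaks POSIX locks is ownership: POSIX fcntl/lockf locks are owned by the process, while BSD flock locks are owned by the open file description. This can be demonstrated directly (Linux semantics; on systems where flock is silently emulated via fcntl, the BSD half behaves differently):

```python
import fcntl
import os
import tempfile

# POSIX lockf locks are owned by the *process*: a second descriptor
# (e.g. opened by another thread of the same process) "acquires" an
# already-held exclusive lock without conflict.
posix_path = tempfile.NamedTemporaryFile(delete=False).name
fd1 = os.open(posix_path, os.O_RDWR)
fd2 = os.open(posix_path, os.O_RDWR)
fcntl.lockf(fd1, fcntl.LOCK_EX | fcntl.LOCK_NB)
fcntl.lockf(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)  # granted: same process
posix_second_lock_granted = True

# BSD flock locks are owned by the *open file description*: the second
# descriptor's non-blocking attempt fails with EWOULDBLOCK.
bsd_path = tempfile.NamedTemporaryFile(delete=False).name
fd3 = os.open(bsd_path, os.O_RDWR)
fd4 = os.open(bsd_path, os.O_RDWR)
fcntl.flock(fd3, fcntl.LOCK_EX | fcntl.LOCK_NB)
bsd_second_lock_refused = False
try:
    fcntl.flock(fd4, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    bsd_second_lock_refused = True

print(posix_second_lock_granted, bsd_second_lock_refused)  # → True True
```

So if a thread expects an exclusive POSIX lock to keep a sibling thread out of a critical section, it simply won't, which matches the race observed here.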
I'll be damned, this fixes:

$ git diff
diff --git a/pex/jobs.py b/pex/jobs.py
index 734836a..db05fc0 100644
--- a/pex/jobs.py
+++ b/pex/jobs.py
@@ -550,3 +550,4 @@ def execute_parallel(
error = e
finally:
job_slots.release()
+ spawner.join()

I really don't know how I continually glossed over / missed the jobs.py Thread spawn. I need to think this through a bit more, but I think this solves the issue. The BSD locks are still needed for the lock file handling of parallel downloads (and the later-added parallel downloads of PEP 691 metadata), but the old-school plain Pex code paths are made safe by the lone thread join, ensuring it is shut down before serially continuing to the next lines of code.
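The shape of the bug and of the one-line fix can be sketched as follows. This is a simplified illustration, not Pex's actual jobs.py (the name `run_jobs` and the queue-of-results structure are invented for the sketch): a lone spawner thread feeds results to the caller, and returning while that thread may still be running lets it race with whatever the caller does next, such as taking another POSIX file lock.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of results

def run_jobs(inputs, run_job):
    results_q = queue.Queue()

    # A single spawner thread works through the inputs and reports back.
    def spawner():
        for item in inputs:
            results_q.put(run_job(item))
        results_q.put(_DONE)

    spawner_thread = threading.Thread(target=spawner)
    spawner_thread.start()

    results = []
    while (result := results_q.get()) is not _DONE:
        results.append(result)
    spawner_thread.join()  # the one-line fix: do not return while it runs
    return results

print(run_jobs([1, 2, 3], lambda x: x * 2))  # → [2, 4, 6]
```

Without the `join()`, the function can return between the final `put` and the thread actually exiting, leaving a straggler thread alive in a caller that assumes it is single-threaded again.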
@james-johnston-thumbtack thank you so much for the repro case. As is always the case, these are absolute gold and make debugging roughly infinitely easier and quicker than it is otherwise.
Ok, so the fact there was >1 python3 in the image was critical here: that got 2 concurrent jobs running under the thread spawner. With just python, or python3.11, or a full path, only 1 Python was available and the race was not triggered.
And as to the 2.1.112 observation, I find the same for this test case. It does not trigger before (going back to 2.1.90) but does trigger reliably after. That jibes with this PR: #1961, which changed all atomic_directory locks to be exclusive (i.e., to use file locking). One of the locks that changed from not using file locking to using file locking in that PR was in the interpreter identification job. That is exactly the problem here, where 2 interpreters are being identified.
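For context, the atomic_directory pattern at issue can be sketched roughly as follows. This is a simplified illustration under assumed behavior, not Pex's actual implementation; the `.lck` suffix and the choice of a BSD flock here mirror the discussion above:

```python
import fcntl
import os
import tempfile

def atomic_directory(target, fill):
    """Create `target` exactly once: do the work in a private workdir
    under an exclusive lock, then atomically rename it into place."""
    parent = os.path.dirname(target) or "."
    os.makedirs(parent, exist_ok=True)
    lock_fd = os.open(target + ".lck", os.O_CREAT | os.O_RDWR)
    fcntl.flock(lock_fd, fcntl.LOCK_EX)  # BSD-style exclusive lock
    try:
        if os.path.exists(target):
            return  # another process/thread already finished the work
        workdir = tempfile.mkdtemp(dir=parent)
        fill(workdir)
        os.rename(workdir, target)  # atomic within one filesystem
    finally:
        fcntl.flock(lock_fd, fcntl.LOCK_UN)
        os.close(lock_fd)
```

If the lock fails to exclude a racing writer (as with POSIX locks across threads of one process), a stale work directory can survive, which is the "existing work-directory" failure in the issue title.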
Importantly though, in the OP @christopherfrieler reported 2.1.105 as the first problematic release in his case. The repro case does not trigger for Pex 2.1.105, so more analysis is needed.
Awesome - glad it was helpful & thank you for taking a look at it!
Although the thread was not leaked per se, it could run after `execute_parallel` returned, which could cause parallel atomic_directory POSIX locks to fail. Fixes one case in pex-tool#1969.
This is needed to have independent POSIX fcntl locks held by multiple threads in the same process, and is also needed whenever BSD flock locks silently use fcntl emulation (exotic systems, or under NFS mounts). Pex is designed to avoid multi-threading when using POSIX locks, so this just serves as a design-error backstop for those style locks. For silently emulated BSD locks, this provides correctness. Analysis of pex-tool#2066 and pex-tool#1969 does not point to this enhancement solving any existing problems, but it is an improvement for the cases mentioned should we hit them. Work regarding pex-tool#2066.
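One way to get the thread-independence described above is to pair an in-process mutex with the file lock, so that threads of one process exclude each other even where fcntl cannot. A minimal sketch, not Pex's actual code; the single class-level mutex shared by all paths is a simplification (real code would key mutexes by path):

```python
import fcntl
import os
import threading

class ThreadSafeFileLock:
    """Sketch: an fcntl lock guarded by an in-process mutex, since POSIX
    fcntl locks cannot exclude threads of the same process."""

    _mutex = threading.Lock()  # simplification: one mutex for all paths

    def __init__(self, path):
        self._path = path
        self._fd = None

    def __enter__(self):
        ThreadSafeFileLock._mutex.acquire()  # exclude sibling threads
        self._fd = os.open(self._path, os.O_CREAT | os.O_RDWR)
        fcntl.lockf(self._fd, fcntl.LOCK_EX)  # exclude other processes
        return self

    def __exit__(self, *exc):
        fcntl.lockf(self._fd, fcntl.LOCK_UN)
        os.close(self._fd)
        ThreadSafeFileLock._mutex.release()
```

The mutex handles intra-process exclusion; the fcntl lock handles inter-process exclusion, and the same layering keeps emulated BSD locks correct.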
Since the OP issue was fixed (see: #1969 (comment)) I'll assume the OP report of Pex 2.1.105 was in error and call this closed, since all reported issues by @christopherfrieler, @oseiberts11, @shalabhc and @james-johnston-thumbtack are solved now.
Beginning with version 2.1.105, building the pex file in our CI pipeline fails with the following message:
It seems to have to do with #1905, introduced in version 2.1.105, but we have no clue why this is happening in our CI pipeline while building the .pex file on macOS developer machines works. It looks like something else is creating that directory, but there is only one pex command in the pipeline job and the PEX_ROOT is not cached.
Our build environment uses:
Then we build the pex with
poetry run pex --inherit-path --python=python3.8 --requirement requirements.txt --find-links dist/ our_module --output-file dist/final.pex
Any idea why this is happening or what else we could check would be helpful.