
Parallel pex creation fails due to possible race conditions in package installation #1968

Closed
tapetersen opened this issue Oct 31, 2022 · 11 comments · Fixed by #1974

@tapetersen

We're using pex with makefiles and parallel jobs in CI and are noticing spurious errors like the one below:

pex --find-links out/wheels -o out/pexes/pex1.pex ./pex1 -e pex1:main
pex --find-links out/wheels -o out/pexes/pex2.pex -r pex2/requirements.txt --python-script pex2/entrypoint.py
Could not find the installation RECORD for zipp 3.10.0 under /root/.pex/installed_wheels/4fcb6f278987a6605757302a6e40e896257570d11c51628968ccb2a47e80c6c1/zipp-3.10.0-py3-none-any.whl.workdir

The other pex built fine, and if I retry, it works. The problem is probably made worse by having all wheels cached: no time is spent downloading them, which makes it more likely to hit races during installation.

This would probably need some locking around the installed_wheels folder, or more fine-grained locking/atomic operations for the real installation folder.

It's not urgent for us, as I can avoid building those in parallel or ensure separate pex dirs, but it was a nasty issue to debug and fixing it would perhaps save someone time and frustration in the future.
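For context, a hypothetical Makefile fragment approximating this setup (built from the two pex commands shown above, not the reporter's actual Makefile): two pex targets that `make -j2` will run concurrently against the same default PEX_ROOT (~/.pex), with wheels pre-built under out/wheels so installation races are likely.

```make
PEXES := out/pexes/pex1.pex out/pexes/pex2.pex

all: $(PEXES)

out/pexes/pex1.pex: | out/pexes
	pex --find-links out/wheels -o $@ ./pex1 -e pex1:main

out/pexes/pex2.pex: | out/pexes
	pex --find-links out/wheels -o $@ -r pex2/requirements.txt --python-script pex2/entrypoint.py

out/pexes:
	mkdir -p $@
```

With `make -j2 all`, both pex invocations install the same cached wheels into the shared PEX_ROOT at the same time.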

@jsirois
Member

jsirois commented Oct 31, 2022

This would probably need some locking around the installed_wheels folder or more fine grained locking/atomik operations for the real installation folder.

Yeah, Pex has an atomic_directory abstraction it uses for all such things:
https://github.com/pantsbuild/pex/blob/22e651589b3f2da26da6786dcbd1553fd7c8d4a6/pex/atomic_directory.py#L102-L122

Clearly something is broken there though. Pex is used by Pants quite heavily and it runs many Pex processes in parallel as well. That had smoked out these issues ~2 years ago, but there is clearly something wrong (see #1969). I explained a bit over on that issue, but if you're on Pex 2.1.112 can you try exporting _PEX_FILE_LOCK_STYLE=bsd and report back if that helps? You'd need to wipe the PEX_ROOT (default ~/.pex) before trying the parallel make to start from a clean slate with this maybe-fix.
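For context on what that variable changes: POSIX fcntl record locks are owned by the process, and closing *any* descriptor for the locked file silently drops the lock, while BSD flock locks are tied to the open file description and survive unrelated closes. A minimal sketch of the two styles (hypothetical helper names, not Pex's actual API):

```python
import fcntl
import os


def exclusive_lock(path, style="posix"):
    """Take an exclusive advisory lock on `path` and return the fd.

    style="posix" uses fcntl.lockf (record locks, process-owned);
    style="bsd" uses fcntl.flock (whole-file, tied to the open file
    description) -- roughly what _PEX_FILE_LOCK_STYLE=bsd selects.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    if style == "bsd":
        fcntl.flock(fd, fcntl.LOCK_EX)
    else:
        fcntl.lockf(fd, fcntl.LOCK_EX)
    return fd


def unlock(fd, style="posix"):
    """Release the lock taken by exclusive_lock and close the fd."""
    if style == "bsd":
        fcntl.flock(fd, fcntl.LOCK_UN)
    else:
        fcntl.lockf(fd, fcntl.LOCK_UN)
    os.close(fd)
```

Both calls block until the lock is free, so two processes serializing on the same lock file never run their critical sections concurrently.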

@jsirois jsirois added the bug label Oct 31, 2022
@jsirois jsirois self-assigned this Oct 31, 2022
@jsirois
Member

jsirois commented Nov 2, 2022

@tapetersen, @christopherfrieler had success using _PEX_FILE_LOCK_STYLE=bsd over in #1969. If you get a chance to try this and report back, I'd love to have all the data points I can get to close this out in #1971.

@tapetersen
Author

I'm running the builds with pex installed in a .venv and get the following (along with the same errors when trying with the env-var)

Ignoring the following environment variables in Pex venv mode:
_PEX_FILE_LOCK_STYLE=bsd

I'll try installing pex outside the venv and try again to see if it makes any difference.

@tapetersen
Author

tapetersen commented Nov 2, 2022

Hmm, even outside of a venv with _PEX_FILE_LOCK_STYLE=bsd I get failed builds. Sometimes I still see the message about the variable being ignored, but sometimes it also fails without that warning.

Error without the warning:

Could not find the installation RECORD for commonmark 0.9.1 under /root/.pex/installed_wheels/da2f38c92590f83de410ba1a3cbceafbc74fee9def35f9251ba9a971d6d66fd9/commonmark-0.9.1-py2.py3-none-any.whl.workdir

Error with the warning (both runs having the variable set from the command line):

pex --find-links out/wheels -o out/pexes/pex1.pex ./pex1 -e pex1:main
pex --find-links out/wheels -o out/pexes/pex2.pex -r pex2/requirements.txt --python-script pex2entrypoint.py
pid 3227 -> /root/.pex/venvs/3ff4eec5a4c6144123a0484f8822cd11473e74e9/5fd7049af63e03f347278c89401424cd9731df9a/bin/python -sE /root/.pex/venvs/3ff4eec5a4c6144123a0484f8822cd11473e74e9/5fd7049af63e03f347278c89401424cd9731df9a/pex --disable-pip-version-check --no-python-version-warning --exists-action a --no-input --use-deprecated legacy-resolver --isolated -q --cache-dir /root/.pex/pip_cache install --no-deps --no-index --only-binary :all: --no-warn-script-location --force-reinstall --ignore-installed --ignore-requires-python --prefix /root/.pex/installed_wheels/da2f38c92590f83de410ba1a3cbceafbc74fee9def35f9251ba9a971d6d66fd9/commonmark-0.9.1-py2.py3-none-any.whl.workdir --no-compile /root/.pex/downloads/resolver_download.ispj28lb/usr.local.bin.python3.8/commonmark-0.9.1-py2.py3-none-any.whl exited with 1 and STDERR:
Ignoring the following environment variables in Pex venv mode:
_PEX_FILE_LOCK_STYLE=bsd

ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/root/.pex/installed_wheels/da2f38c92590f83de410ba1a3cbceafbc74fee9def35f9251ba9a971d6d66fd9/commonmark-0.9.1-py2.py3-none-any.whl.workdir/lib/python3.8/site-packages/commonmark/tests/unit_tests.py'

pex --version 2.1.112

I'm also running with PIP_NO_INDEX=1 PIP_FIND_LINKS=/root/wheels so all wheels are pre-downloaded and just installed, which, as mentioned, I suspect makes the race more likely.

@jsirois
Member

jsirois commented Nov 2, 2022

Ok, thanks for that data. The warning is expected, the continued failures are not. I need to think more here.

@tapetersen
Author

@jsirois No problem; as I said, there's no rush for us. We can work around it in the makefile without any problems, but I'm happy to provide data if you want to resolve it.

@jsirois
Member

jsirois commented Nov 2, 2022

Actually - one question presents itself. Are your errors confined to missing files in paths that are prefixed by /root/.pex/installed_wheels/? If so, that might be a special case I can look at. I think that is the oldest use of atomic_directory and it may not use the locking at all, just an atomic rename. Your examples indicate this happens when Pex is calling pip install (Pex dogfoods itself and runs Pip as a venv PEX in there).
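For illustration, here is a rough sketch (hypothetical names, not Pex's actual atomic_directory code) of the lock-free rename pattern described above. The key point is that each racer must use a *uniquely named* work directory; the errors in this issue are consistent with racers sharing one predictable work dir (e.g. `<wheel>.whl.workdir`), so one process can rename it away while another is still writing into it, and files like the RECORD go missing.

```python
import os
import shutil
import tempfile


def atomic_install(target, populate):
    """Install into `target` atomically without file locks.

    Populate a uniquely named work dir next to `target`, then rename it
    into place. os.rename is atomic on POSIX within one filesystem, so
    concurrent racers either win the rename or find the winner's
    completed directory.
    """
    if os.path.isdir(target):
        return  # another process already completed the install
    workdir = tempfile.mkdtemp(  # unique per process: no shared workdir
        prefix=os.path.basename(target) + ".workdir.",
        dir=os.path.dirname(target) or ".",
    )
    try:
        populate(workdir)
        os.rename(workdir, target)
    except OSError:
        if not os.path.isdir(target):
            raise  # a real failure, not a lost race
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
```

A second concurrent (or repeated) call simply observes the finished `target` and returns, so the operation is idempotent.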

@tapetersen
Author

Hmm, not sure about all errors, but from a quick glance all the ones I can see do indeed appear under that directory. I also prepared a small example for reproduction that may help, in a gist:
https://gist.github.com/tapetersen/32d564029be43022502e814a8d3e3057

@jsirois
Member

jsirois commented Nov 2, 2022

Excellent - thank you.

@jsirois
Member

jsirois commented Nov 2, 2022

I repro - which is great. Thank you for setting that gist up. I should be able to get to the bottom of this pretty quickly.

jsirois added a commit to jsirois/pex that referenced this issue Nov 3, 2022
This was broken by pex-tool#1961 which did not account for the two unlocked
uses of AtomicDirectory by the pex.resolver for wheel building and
installing. Although those directories deserve a second look as
candidates for exclusive locking, we have seen no known issues with
those directories since introduction of AtomicDirectory up until pex-tool#1961.
As such, this change just restores the prior behavior for those two
cases.

Fixes pex-tool#1968
jsirois added a commit that referenced this issue Nov 3, 2022
@jsirois
Member

jsirois commented Nov 3, 2022

Alright - thanks again @tapetersen for your help tracking this down. The fix in #1974 is now released in Pex 2.1.113.
