Occasional segfaults in event_loop teardown due to dead logger #2681
Comments
Can you reproduce this on something other than an NVIDIA GPU? If so, can you provide detailed reproduction steps for that?
I will give it a go and get back to you (it may take a little while), thanks!
@wence- has been busy with other tasks, but I finally succeeded in coming up with a reproducer that doesn't require GPUs. I'm not entirely sure the bug is the same, as the stack I see is somewhat different from that in the original description:

```
(gdb) bt
#0 aws_mem_release (allocator=0x55c315d103b9, ptr=0x55c649b598d0) at /home/mambauser/aws-sdk-cpp/crt/aws-crt-cpp/crt/aws-c-common/source/allocator.c:210
#1 0x00007fbbf0accc55 in Aws::Crt::StlAllocator<char>::deallocate (this=0x55c649b598a0, this=0x55c649b598a0, p=<optimized out>)
at /home/mambauser/aws-sdk-cpp/crt/aws-crt-cpp/include/aws/crt/StlAllocator.h:56
#2 std::allocator_traits<Aws::Crt::StlAllocator<char> >::deallocate (__n=<optimized out>, __p=<optimized out>, __a=...)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/alloc_traits.h:345
#3 std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::_M_destroy (__size=<optimized out>, this=0x55c649b598a0)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:245
#4 std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::_M_dispose (this=0x55c649b598a0)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:240
#5 std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::_M_dispose (this=0x55c649b598a0)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:237
#6 std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::~basic_string (this=0x55c649b598a0, __in_chrg=<optimized out>)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:672
#7 std::default_delete<std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> > >::operator() (this=<optimized out>,
__ptr=0x55c649b598a0) at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/unique_ptr.h:85
#8 std::unique_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >, std::default_delete<std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> > > >::~unique_ptr (this=<optimized out>, __in_chrg=<optimized out>)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/unique_ptr.h:361
#9 0x00007fbe99af3495 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x00007fbe99af3610 in exit () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x00007fbe99ad7d97 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x00007fbe99ad7e40 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x000055c646edb421 in _start ()
```

To run the reproducer, a system with Docker available is required. First save the snippet below as `aws-segfault.dockerfile`:

```dockerfile
FROM mambaorg/micromamba:jammy
# Install runtime, build and debug dependencies
RUN micromamba create -c conda-forge -n aws-segfault \
python=3.10 \
cuda-version=11.8 \
pyarrow=12.0.1 \
aws-sdk-cpp=1.11.156 \
pytest \
pytest-cov \
pytest-timeout \
cmake \
c-compiler \
cxx-compiler \
gcc_linux-64=11.* \
make \
gdb \
git
SHELL ["micromamba", "run", "-n", "aws-segfault", "/bin/bash", "-c"]
# Install distributed from source in developer mode
RUN cd ${HOME} \
&& git clone https://github.com/dask/distributed \
&& cd distributed \
&& git checkout 2023.9.2 \
&& pip install -e .
# Build and install aws-sdk-cpp from source with debug config
# Note: we must unset SCCACHE_BUCKET that is predefined by the base image
RUN unset SCCACHE_BUCKET \
&& cd ${HOME} \
&& git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp \
&& cd aws-sdk-cpp \
&& git checkout 1.11.181 \
&& mkdir build \
&& cd build \
&& cmake .. \
-DCMAKE_BUILD_TYPE=Debug \
-DCMAKE_PREFIX_PATH=$CONDA_PREFIX \
-DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX \
-DBUILD_ONLY="s3" \
-DAUTORUN_UNIT_TESTS=OFF \
-DENABLE_TESTING=OFF \
&& cmake --build . --config=Debug \
&& cmake --install . --config=Debug
COPY test_aws_segfault.py /home/mambauser/distributed/test_aws_segfault.py
WORKDIR /home/mambauser/distributed
```

And save the following as `test_aws_segfault.py`:

```python
import pytest
pytestmark = pytest.mark.gpu
from distributed import Client
from distributed.deploy.local import LocalCluster
from distributed.utils_test import gen_test
@gen_test()
async def test_aws_segfault():
    from pyarrow.fs import FSSpecHandler, PyFileSystem

    da = pytest.importorskip("dask.array")

    async with LocalCluster(
        protocol="tcp", n_workers=2, threads_per_worker=2, dashboard_address=":0", asynchronous=True
    ) as cluster:
        async with Client(cluster, asynchronous=True):
            assert cluster.scheduler_address.startswith("tcp://")
            x = da.ones((10000, 10000), chunks=(1000, 1000)).persist()
            await x
            y = (x + x.T).sum()
            await y
```

With both files present in the local directory, build the Docker image:

```bash
docker build -t aws-segfault:latest -f aws-segfault.dockerfile .
```

Then start the container:

```bash
docker run --rm -ti aws-segfault:latest /bin/bash
```

Inside the container run:

```bash
micromamba activate aws-segfault
python -m pytest -vs test_aws_segfault.py
# Or run with gdb
# gdb --args python -m pytest -vs test_aws_segfault.py
# run
```

With the instructions above I consistently got a segfault and didn't need to run multiple times to reproduce.
### Rationale for this change

In accordance with #38364, we believe that for various reasons (shortening import time, preventing unnecessary resource consumption and potential bugs with the S3 library) it is appropriate to avoid initialization of S3 resources at import time and move that step to occur at first use.

### What changes are included in this PR?

- Remove calls to `ensure_s3_initialized()` that were up until now executed during `import pyarrow.fs`;
- Move `ensure_s3_initialized()` calls to the `python/pyarrow/_s3fs.pyx` module;
- Add a global flag to mark whether S3 has been previously initialized and `atexit` handlers registered.

### Are these changes tested?

Yes, existing S3 tests check whether it has been initialized, otherwise failing with a C++ exception.

### Are there any user-facing changes?

No. The behavior is now slightly different, with S3 initialization not happening immediately after `pyarrow.fs` is imported, but no changes are expected from a user perspective relying on the public API alone.

**This PR contains a "Critical Fix".** A bug in aws-sdk-cpp reported in aws/aws-sdk-cpp#2681 causes segmentation faults under specific circumstances when Python processes shut down, specifically observed with Dask+GPUs (so far we were unable to pinpoint the exact correlation of Dask+GPUs+S3). While this definitely doesn't seem to affect all users and is not directly sourced in Arrow, it may affect use cases that are completely independent of S3, which is particularly problematic in CI where all tests pass successfully but the process crashes at shutdown.

* Closes: #38364

Lead-authored-by: Peter Andreas Entschev <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
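To make the first-use initialization described above concrete, here is a minimal Python sketch of the pattern: a module-level flag plus `atexit` registration on first use. Apart from `ensure_s3_initialized`, which the PR text mentions, every name in the sketch (`_s3_lock`, `_initialize_s3_backend`, `_finalize_s3_backend`, `open_s3_file`) is an illustrative placeholder rather than actual pyarrow internals.

```python
import atexit
import threading

_s3_lock = threading.Lock()
_s3_initialized = False


def _initialize_s3_backend():
    # Placeholder for the real (comparatively expensive) SDK initialization.
    pass


def _finalize_s3_backend():
    # Placeholder for whatever teardown the backend needs at process exit.
    pass


def ensure_s3_initialized():
    """Initialize the S3 backend on first use instead of at import time."""
    global _s3_initialized
    with _s3_lock:
        if not _s3_initialized:
            _initialize_s3_backend()
            # Register teardown only once, and only after a successful init,
            # so processes that never touch S3 pay no cost at all.
            atexit.register(_finalize_s3_backend)
            _s3_initialized = True


def open_s3_file(uri):
    # Any entry point that actually needs S3 calls the guard first.
    ensure_s3_initialized()
    ...
```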
The aws-sdk pinning is proving to be far too problematic to maintain. It causes conflicts in many environments due to its common usage across many other packages in the conda-forge ecosystem that have since updated their pinning to require newer versions than the 1.10.* that we've pinned to. This reversion will unblock most of RAPIDS CI. We will search for alternative fixes to the dask-cuda/distributed issues that we're observing (in particular, resolution of the underlying issues apache/arrow#38364 and aws/aws-sdk-cpp#2681).

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Ray Douglass (https://github.com/raydouglass)
- Bradley Dice (https://github.com/bdice)
- Lawrence Mitchell (https://github.com/wence-)
- Peter Andreas Entschev (https://github.com/pentschev)

URL: #14319
This is not a resolution, but may help mitigate problems from aws/aws-sdk-cpp#2681.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Peter Andreas Entschev (https://github.com/pentschev)
- Lawrence Mitchell (https://github.com/wence-)
- Bradley Dice (https://github.com/bdice)

URL: #14321
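The change linked above isn't reproduced here, so the following is only a generic illustration of the kind of user-level mitigation being discussed: tear the S3 subsystem down before the interpreter starts destroying globals, so the SDK's cleanup thread does not outlive the logger. It assumes the installed pyarrow exposes `pyarrow.fs.finalize_s3()`; adjust to whatever cleanup hook your version actually provides.

```python
import atexit

import pyarrow.fs


def _shutdown_s3():
    # Finalize the S3 subsystem (and with it the SDK's event-loop threads)
    # while the process is still in an orderly state. finalize_s3() is assumed
    # to exist in the installed pyarrow version; if S3 was never initialized,
    # or the API differs, the broad guard keeps shutdown quiet.
    try:
        pyarrow.fs.finalize_s3()
    except Exception:
        pass


# atexit runs callbacks in LIFO order, so registering this after the rest of
# the application's setup means it runs before earlier-registered handlers.
atexit.register(_shutdown_s3)
```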
This should be fixed with these two PRs:
This issue is now closed. Comments on closed issues are hard for our team to see.
Thanks, we'll keep an eye out.
Describe the bug
I am using distributed and pyarrow, which links to this SDK for S3 file access.

Using version 1.11.156, I see (intermittently, so a race condition of some kind) a segfault in the "EvntLoopCleanup" thread that the AWS SDK sets up. See backtrace below.
Expected Behavior
No segfault
Current Behavior
Backtrace:
So it looks like in attempting to log something, the (static?) CRTLogSystem has already been torn down. (I can attempt to build libaws-c-io with symbols as well to get more information).

Reproduction Steps
This is not that easy (I have not yet managed to minimise it at all, sorry!). It needs an NVIDIA GPU and a CUDA toolkit in the host environment, plus the NVIDIA Container Toolkit.
Possible Solution
No response
Additional Information/Context
SDK version 1.10.x doesn't exhibit this issue.
AWS CPP SDK version used
1.11.156
Compiler and Version used
g++ (conda-forge gcc 12.3.0-2) 12.3.0
Operating System and version
Ubuntu 18.04