Occasional segfaults in event_loop teardown due to dead logger #2681
Comments
Can you reproduce this on something other than an NVIDIA GPU? If so, can you provide detailed reproduction steps for that?
I will give it a go and get back to you (it may take a little while), thanks!
@wence- has been busy with other tasks, but I finally succeeded in coming up with a reproducer that doesn't require GPUs. I'm not entirely sure the bug is the same, as the stack I see is somewhat different from that in the original description:

```
(gdb) bt
#0 aws_mem_release (allocator=0x55c315d103b9, ptr=0x55c649b598d0) at /home/mambauser/aws-sdk-cpp/crt/aws-crt-cpp/crt/aws-c-common/source/allocator.c:210
#1 0x00007fbbf0accc55 in Aws::Crt::StlAllocator<char>::deallocate (this=0x55c649b598a0, this=0x55c649b598a0, p=<optimized out>)
at /home/mambauser/aws-sdk-cpp/crt/aws-crt-cpp/include/aws/crt/StlAllocator.h:56
#2 std::allocator_traits<Aws::Crt::StlAllocator<char> >::deallocate (__n=<optimized out>, __p=<optimized out>, __a=...)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/alloc_traits.h:345
#3 std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::_M_destroy (__size=<optimized out>, this=0x55c649b598a0)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:245
#4 std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::_M_dispose (this=0x55c649b598a0)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:240
#5 std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::_M_dispose (this=0x55c649b598a0)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:237
#6 std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >::~basic_string (this=0x55c649b598a0, __in_chrg=<optimized out>)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/basic_string.h:672
#7 std::default_delete<std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> > >::operator() (this=<optimized out>,
__ptr=0x55c649b598a0) at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/unique_ptr.h:85
#8 std::unique_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> >, std::default_delete<std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Crt::StlAllocator<char> > > >::~unique_ptr (this=<optimized out>, __in_chrg=<optimized out>)
at /opt/conda/envs/aws-segfault/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/unique_ptr.h:361
#9 0x00007fbe99af3495 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x00007fbe99af3610 in exit () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x00007fbe99ad7d97 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x00007fbe99ad7e40 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x000055c646edb421 in _start ()
```

To run the reproducer, a system with Docker available is required. First save the snippet below as `aws-segfault.dockerfile`:

```dockerfile
FROM mambaorg/micromamba:jammy
# Install runtime, build and debug dependencies
RUN micromamba create -c conda-forge -n aws-segfault \
python=3.10 \
cuda-version=11.8 \
pyarrow=12.0.1 \
aws-sdk-cpp=1.11.156 \
pytest \
pytest-cov \
pytest-timeout \
cmake \
c-compiler \
cxx-compiler \
gcc_linux-64=11.* \
make \
gdb \
git
SHELL ["micromamba", "run", "-n", "aws-segfault", "/bin/bash", "-c"]
# Install distributed from source in developer mode
RUN cd ${HOME} \
&& git clone https://github.com/dask/distributed \
&& cd distributed \
&& git checkout 2023.9.2 \
&& pip install -e .
# Build and install aws-sdk-cpp from source with debug config
# Note: we must unset SCCACHE_BUCKET that is predefined by the base image
RUN unset SCCACHE_BUCKET \
&& cd ${HOME} \
&& git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp \
&& cd aws-sdk-cpp \
&& git checkout 1.11.181 \
&& mkdir build \
&& cd build \
&& cmake .. \
-DCMAKE_BUILD_TYPE=Debug \
-DCMAKE_PREFIX_PATH=$CONDA_PREFIX \
-DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX \
-DBUILD_ONLY="s3" \
-DAUTORUN_UNIT_TESTS=OFF \
-DENABLE_TESTING=OFF \
&& cmake --build . --config=Debug \
&& cmake --install . --config=Debug
COPY test_aws_segfault.py /home/mambauser/distributed/test_aws_segfault.py
WORKDIR /home/mambauser/distributed
```

And save the following as `test_aws_segfault.py`:

```python
import pytest
pytestmark = pytest.mark.gpu
from distributed import Client
from distributed.deploy.local import LocalCluster
from distributed.utils_test import gen_test
@gen_test()
async def test_aws_segfault():
    from pyarrow.fs import FSSpecHandler, PyFileSystem

    da = pytest.importorskip("dask.array")

    async with LocalCluster(
        protocol="tcp", n_workers=2, threads_per_worker=2, dashboard_address=":0", asynchronous=True
    ) as cluster:
        async with Client(cluster, asynchronous=True):
            assert cluster.scheduler_address.startswith("tcp://")
            x = da.ones((10000, 10000), chunks=(1000, 1000)).persist()
            await x
            y = (x + x.T).sum()
            await y
```

With both files present in the local directory, build the Docker image:

```bash
docker build -t aws-segfault:latest -f aws-segfault.dockerfile .
```

Then start the container:

```bash
docker run --rm -ti aws-segfault:latest /bin/bash
```

Inside the container run:

```bash
micromamba activate aws-segfault
python -m pytest -vs test_aws_segfault.py
# Or run with gdb
# gdb --args python -m pytest -vs test_aws_segfault.py
# run
```

With the instructions above I consistently got a segfault and didn't need to run multiple times to reproduce.
### Rationale for this change

In accordance with #38364, we believe that for various reasons (shortening import time, preventing unnecessary resource consumption and potential bugs with the S3 library) it is appropriate to avoid initialization of S3 resources at import time and move that step to occur at first use.

### What changes are included in this PR?

- Remove calls to `ensure_s3_initialized()` that were up until now executed during `import pyarrow.fs`;
- Move `ensure_s3_initialized()` calls to the `python/pyarrow/_s3fs.pyx` module;
- Add a global flag to mark whether S3 has been previously initialized and `atexit` handlers registered.

### Are these changes tested?

Yes, existing S3 tests check whether it has been initialized, otherwise failing with a C++ exception.

### Are there any user-facing changes?

No. The behavior is now slightly different, with S3 initialization not happening immediately after `pyarrow.fs` is imported, but no changes are expected from a user perspective relying on the public API alone.

**This PR contains a "Critical Fix".** A bug in aws-sdk-cpp reported in aws/aws-sdk-cpp#2681 causes segmentation faults under specific circumstances when Python processes shut down, specifically observed with Dask+GPUs (so far we were unable to pinpoint the exact correlation of Dask+GPUs+S3). While this definitely doesn't seem to affect all users and is not directly sourced in Arrow, it may affect use cases that are completely independent of S3, which is particularly problematic in CI where all tests pass successfully but the process crashes at shutdown.

* Closes: #38364

Lead-authored-by: Peter Andreas Entschev <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
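To make the first-use initialization described above concrete, here is a minimal Python sketch of the pattern: a module-level flag plus `atexit` registration on first use. Apart from `ensure_s3_initialized`, which the PR text mentions, every name in the sketch (`_s3_lock`, `_initialize_s3_backend`, `_finalize_s3_backend`, `open_s3_file`) is an illustrative placeholder rather than actual pyarrow internals.

```python
import atexit
import threading

_s3_lock = threading.Lock()
_s3_initialized = False


def _initialize_s3_backend():
    # Placeholder for the real (comparatively expensive) SDK initialization.
    pass


def _finalize_s3_backend():
    # Placeholder for whatever teardown the backend needs at process exit.
    pass


def ensure_s3_initialized():
    """Initialize the S3 backend on first use instead of at import time."""
    global _s3_initialized
    with _s3_lock:
        if not _s3_initialized:
            _initialize_s3_backend()
            # Register teardown only once, and only after a successful init,
            # so processes that never touch S3 pay no cost at all.
            atexit.register(_finalize_s3_backend)
            _s3_initialized = True


def open_s3_file(uri):
    # Any entry point that actually needs S3 calls the guard first.
    ensure_s3_initialized()
    ...
```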
The aws-sdk pinning is proving to be far too problematic to maintain. It causes conflicts in many environments due to its common usage across many other packages in the conda-forge ecosystem that have since updated their pinning to require newer versions than the 1.10.* that we've pinned to. This reversion will unblock most of RAPIDS CI. We will search for alternative fixes to the dask-cuda/distributed issues that we're observing (in particular, resolution of the underlying issues apache/arrow#38364 and aws/aws-sdk-cpp#2681).

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Ray Douglass (https://github.com/raydouglass)
- Bradley Dice (https://github.com/bdice)
- Lawrence Mitchell (https://github.com/wence-)
- Peter Andreas Entschev (https://github.com/pentschev)

URL: #14319
This is not a resolution, but may help mitigate problems from aws/aws-sdk-cpp#2681.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Peter Andreas Entschev (https://github.com/pentschev)
- Lawrence Mitchell (https://github.com/wence-)
- Bradley Dice (https://github.com/bdice)

URL: #14321
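The change linked above isn't reproduced here, so the following is only a generic illustration of the kind of user-level mitigation being discussed: tear the S3 subsystem down before the interpreter starts destroying globals, so the SDK's cleanup thread does not outlive the logger. It assumes the installed pyarrow exposes `pyarrow.fs.finalize_s3()`; adjust to whatever cleanup hook your version actually provides.

```python
import atexit

import pyarrow.fs


def _shutdown_s3():
    # Finalize the S3 subsystem (and with it the SDK's event-loop threads)
    # while the process is still in an orderly state. finalize_s3() is assumed
    # to exist in the installed pyarrow version; if S3 was never initialized,
    # or the API differs, the broad guard keeps shutdown quiet.
    try:
        pyarrow.fs.finalize_s3()
    except Exception:
        pass


# atexit runs callbacks in LIFO order, so registering this after the rest of
# the application's setup means it runs before earlier-registered handlers.
atexit.register(_shutdown_s3)
```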
This should be fixed with these two PRs:
This issue is now closed. Comments on closed issues are hard for our team to see.
Thanks, we'll keep an eye out.
Describe the bug
I am using distributed and pyarrow, which links to this SDK for S3 file access.

Using version 1.11.156, I see (intermittently, so a race condition of some kind) a segfault in the "EvntLoopCleanup" thread that the AWS SDK sets up. See backtrace below.
Expected Behavior
No segfault
Current Behavior
Backtrace:
So it looks like in attempting to log something, the (static?) CRTLogSystem has already been torn down. (I can attempt to build libaws-c-io with symbols as well to get more information).

Reproduction Steps
This is not that easy (I have not yet managed to minimise it at all, sorry!). It needs an NVIDIA GPU and a CUDA toolkit in the host environment, plus the NVIDIA Container Toolkit.
Possible Solution
No response
Additional Information/Context
SDK version 1.10.x doesn't exhibit this issue.
AWS CPP SDK version used
1.11.156
Compiler and Version used
g++ (conda-forge gcc 12.3.0-2) 12.3.0
Operating System and version
Ubuntu 18.04