
SGE Tests segfault in CI #653

Open
jacobtomlinson opened this issue Aug 5, 2024 · 5 comments
Labels
bug Something isn't working CI Continuous Integration tools SGE

Comments

@jacobtomlinson
Member

jacobtomlinson commented Aug 5, 2024

Opening an issue to triage the segfault that seems to be happening in the SGE tests.

For some time the SGE tests have been failing. The logs of a recent run on main contain the following error.

 *** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.8': double free or corruption (!prev): 0x00005626c18fa470 ***
/bin/bash: line 1:   588 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge

I also opened #652 to bump the minimum Python version here to 3.9, and I see a similar issue there but with a slightly different error.

*** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.9': corrupted size vs. prev_size: 0x00005560404681a0 ***
/bin/bash: line 1:   592 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge

Strangely, in both cases pytest reports that everything passed.

9 passed, 270 skipped in 26.69s
@jacobtomlinson jacobtomlinson added SGE bug Something isn't working CI Continuous Integration tools labels Aug 5, 2024
@jacobtomlinson
Member Author

jacobtomlinson commented Aug 5, 2024

In #654 I've been playing around with skipping various tests and enabling them again. It seems that enabling any two of the tests results in the segfault. Enabling more than one test still only causes the error to appear once, though.

@jacobtomlinson
Member Author

I have a local reproducer now. Here are the steps I took to get it set up on my machine.

# Build SGE container
cd ci/sge
cp ../environment.yaml .
docker compose build

# Start SGE stack (based on ci/sge.sh)
./start-sge.sh
docker exec sge_master /bin/bash -c "chmod -R 777 /shared_space"

# Install dask-jobqueue as an editable install
docker exec sge_master conda run -n dask-jobqueue /bin/bash -c "cd /dask-jobqueue; pip install -e ."

I also installed anyio and used @pytest.mark.anyio instead of @pytest.mark.asyncio because I find the behaviour a lot more consistent. See #655.

I then created a new test file with a single test that consistently reproduces the segfault.

# dask_jobqueue/tests/test_sge_segfault.py
from dask_jobqueue.sge import SGECluster
from dask.distributed import Client

import pytest


@pytest.mark.anyio
@pytest.mark.env("sge")
async def test_cluster():
    async with SGECluster(1, cores=1, memory="1GB", asynchronous=True) as cluster:
        async with Client(cluster, asynchronous=True):
            pass
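
A possible next step (just a sketch on my part, not something I've verified helps with this particular crash): enable Python's faulthandler at the top of the test file, so that a fatal signal such as SIGABRT dumps the Python tracebacks of all threads before the process dies.

```python
# Dump the Python traceback of every thread if the interpreter receives a
# fatal signal (SIGSEGV, SIGABRT, SIGBUS, ...). This can hint at which
# frame was active when glibc detected the heap corruption.
import faulthandler

faulthandler.enable(all_threads=True)
```

The same effect is available without editing the file by running `python -X faulthandler -m pytest ...` or setting `PYTHONFAULTHANDLER=1` in the environment.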

Then you can run the test via docker exec.

$ docker exec sge_master conda run -n dask-jobqueue /bin/bash -c "cd; pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge"
*** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.8': corrupted size vs. prev_size: 0x0000560d54c76aa0 ***
/bin/bash: line 1: 29477 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge

ERROR conda.cli.main_run:execute(125): `conda run /bin/bash -c cd; pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge` failed. (See above for error)
============================= test session starts ==============================
platform linux -- Python 3.8.19, pytest-8.3.2, pluggy-1.5.0 -- /opt/anaconda/envs/dask-jobqueue/bin/python3.8
cachedir: .pytest_cache
rootdir: /dask-jobqueue
plugins: anyio-4.4.0
collecting ... collected 1 item

../dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py::test_cluster PASSED

============================== 1 passed in 1.07s ===============================

$ echo $?
134
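
For reference (my own aside, not from the logs above): exit status 134 is 128 + 6, i.e. the shell reporting that the child process was killed by SIGABRT, which matches the glibc heap-corruption abort. A minimal Python illustration of the same pattern, unrelated to SGE:

```python
import signal
import subprocess
import sys

# Reproduce the "tests pass but the run is Aborted" pattern: the child
# prints a success summary, then kills itself with SIGABRT. subprocess
# reports signal deaths as a negative returncode; shells report them as
# 128 + signal number, which is where 134 comes from.
proc = subprocess.run(
    [sys.executable, "-c",
     "import os, signal; print('1 passed', flush=True); os.kill(os.getpid(), signal.SIGABRT)"],
    capture_output=True,
    text=True,
)
print(proc.stdout.strip())    # 1 passed
print(proc.returncode)        # -6 (killed by signal 6, SIGABRT)
print(128 + signal.SIGABRT)   # 134, the status the shell reported
```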

@jacobtomlinson
Member Author

Since upgrading to Python 3.9 in CI this issue seems to have gone away. It's strange because I'm still able to reproduce some problems locally, but perhaps there is something cached that I'm not taking into account.

Given that CI is all green and PRs and merges are passing consistently I'm going to close this out.

@jacobtomlinson
Member Author

Looks like a similar error happened when running CI for #660.

 *** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.9': free(): invalid pointer: 0x0000557fe477a210 ***
/bin/bash: line 1:   588 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge

ERROR conda.cli.main_run:execute(125): `conda run /bin/bash -c cd; pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge` failed. (See above for error)

Perhaps it's not as resolved as I had hoped.

@jacobtomlinson
Member Author

Still seeing this after bumping to Python 3.10.

 *** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.10': double free or corruption (!prev): 0x000055e87b68bf90 ***
/bin/bash: line 1:   591 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge

ERROR conda.cli.main_run:execute(125): `conda run /bin/bash -c cd; pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge` failed. (See above for error)

This was referenced Aug 21, 2024