
SGE Tests segfault in CI #653

Open
jacobtomlinson opened this issue Aug 5, 2024 · 5 comments
Labels
bug Something isn't working CI Continuous Integration tools SGE

Comments

@jacobtomlinson
Member

jacobtomlinson commented Aug 5, 2024

Opening an issue to triage the segfault that seems to be happening in the SGE tests.

For some time the SGE tests have been failing. The logs of a recent run on main contain the following error.

 *** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.8': double free or corruption (!prev): 0x00005626c18fa470 ***
/bin/bash: line 1:   588 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge

I also opened #652 to bump the minimum Python version here to 3.9, and I see a similar issue there but with a slightly different error.

*** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.9': corrupted size vs. prev_size: 0x00005560404681a0 ***
/bin/bash: line 1:   592 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge

Strangely, in both cases pytest reports that everything passed.

9 passed, 270 skipped in 26.69s
@jacobtomlinson jacobtomlinson added SGE bug Something isn't working CI Continuous Integration tools labels Aug 5, 2024
@jacobtomlinson
Member Author

jacobtomlinson commented Aug 5, 2024

In #654 I've been playing around with skipping various tests and enabling them again. It seems that enabling any two of the tests results in the segfault. Enabling more than one test still only causes the error to appear once, though.

@jacobtomlinson
Member Author

I have a local reproducer now. Here are the steps I took to get it set up on my machine.

# Build SGE container
cd ci/sge
cp ../environment.yaml .
docker compose build

# Start SGE stack (based on ci/sge.sh)
./start-sge.sh
docker exec sge_master /bin/bash -c "chmod -R 777 /shared_space"

# Install dask-jobqueue as an editable install
docker exec sge_master conda run -n dask-jobqueue /bin/bash -c "cd /dask-jobqueue; pip install -e ."

I also installed anyio and used @pytest.mark.anyio instead of @pytest.mark.asyncio because I find the behaviour a lot more consistent. See #655.

I then created a new test file with a single test that consistently reproduces the segfault.

# dask_jobqueue/tests/test_sge_segfault.py
from dask_jobqueue.sge import SGECluster
from dask.distributed import Client

import pytest


@pytest.mark.anyio
@pytest.mark.env("sge")
async def test_cluster():
    async with SGECluster(1, cores=1, memory="1GB", asynchronous=True) as cluster:
        async with Client(cluster, asynchronous=True):
            pass
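
A possible next step (just a sketch on my part, not something I've verified helps with this particular crash): enable Python's faulthandler at the top of the test file, so that a fatal signal such as SIGABRT dumps the Python tracebacks of all threads before the process dies.

```python
# Dump the Python traceback of every thread if the interpreter receives a
# fatal signal (SIGSEGV, SIGABRT, SIGBUS, ...). This can hint at which
# frame was active when glibc detected the heap corruption.
import faulthandler

faulthandler.enable(all_threads=True)
```

The same effect is available without editing the file by running `python -X faulthandler -m pytest ...` or setting `PYTHONFAULTHANDLER=1` in the environment.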

Then you can run the test via docker exec.

$ docker exec sge_master conda run -n dask-jobqueue /bin/bash -c "cd; pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge"
*** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.8': corrupted size vs. prev_size: 0x0000560d54c76aa0 ***
/bin/bash: line 1: 29477 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge

ERROR conda.cli.main_run:execute(125): `conda run /bin/bash -c cd; pytest /dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py --verbose --full-trace -s -E sge` failed. (See above for error)
============================= test session starts ==============================
platform linux -- Python 3.8.19, pytest-8.3.2, pluggy-1.5.0 -- /opt/anaconda/envs/dask-jobqueue/bin/python3.8
cachedir: .pytest_cache
rootdir: /dask-jobqueue
plugins: anyio-4.4.0
collecting ... collected 1 item

../dask-jobqueue/dask_jobqueue/tests/test_sge_segfault.py::test_cluster PASSED

============================== 1 passed in 1.07s ===============================

$ echo $?
134
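
For reference (my own aside, not from the logs above): exit status 134 is 128 + 6, i.e. the shell reporting that the child process was killed by SIGABRT, which matches the glibc heap-corruption abort. A minimal Python illustration of the same pattern, unrelated to SGE:

```python
import signal
import subprocess
import sys

# Reproduce the "tests pass but the run is Aborted" pattern: the child
# prints a success summary, then kills itself with SIGABRT. subprocess
# reports signal deaths as a negative returncode; shells report them as
# 128 + signal number, which is where 134 comes from.
proc = subprocess.run(
    [sys.executable, "-c",
     "import os, signal; print('1 passed', flush=True); os.kill(os.getpid(), signal.SIGABRT)"],
    capture_output=True,
    text=True,
)
print(proc.stdout.strip())    # 1 passed
print(proc.returncode)        # -6 (killed by signal 6, SIGABRT)
print(128 + signal.SIGABRT)   # 134, the status the shell reported
```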

@jacobtomlinson
Member Author

Since upgrading to Python 3.9 in CI this issue seems to have gone away. It's strange because I'm still able to reproduce some problems locally, but perhaps there is something cached that I'm not taking into account.

Given that CI is all green and PRs and merges are passing consistently I'm going to close this out.

@jacobtomlinson
Member Author

Looks like a similar error happened when running CI for #660.

 *** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.9': free(): invalid pointer: 0x0000557fe477a210 ***
/bin/bash: line 1:   588 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge

ERROR conda.cli.main_run:execute(125): `conda run /bin/bash -c cd; pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge` failed. (See above for error)

Perhaps it's not as resolved as I had hoped.

@jacobtomlinson
Member Author

Still seeing this after bumping to Python 3.10.

 *** Error in `/opt/anaconda/envs/dask-jobqueue/bin/python3.10': double free or corruption (!prev): 0x000055e87b68bf90 ***
/bin/bash: line 1:   591 Aborted                 (core dumped) pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge

ERROR conda.cli.main_run:execute(125): `conda run /bin/bash -c cd; pytest /dask-jobqueue/dask_jobqueue --verbose --full-trace -s -E sge` failed. (See above for error)

This was referenced Aug 21, 2024