Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGSEGV when Python tests use the ctypes module #442

Closed
2 of 5 tasks
P403n1x87 opened this issue Jun 24, 2022 · 28 comments
Closed
2 of 5 tasks

SIGSEGV when Python tests use the ctypes module #442

P403n1x87 opened this issue Jun 24, 2022 · 28 comments
Assignees
Labels
bug Something isn't working

Comments

@P403n1x87
Copy link

P403n1x87 commented Jun 24, 2022

Description:

When using the Python installed with this action I run into SIGSEGV on Linux if my tests use ctypes. This is is the offending workflow

name: Tests
on: push
concurrency: 
  group: ${{ github.head_ref || github.run_id }}
  cancel-in-progress: true
jobs:
  tests-linux:
    runs-on: ubuntu-latest
    
    strategy:
      fail-fast: false
      matrix:
        python-version: ["2.7", "3.5", "3.6", "3.7", "3.8", "3.9", "3.10", "3.11"]
    
    env:
      AUSTIN_TESTS_PYTHON_VERSIONS: ${{ matrix.python-version }}
    
    name: Tests on Linux with Python ${{ matrix.python-version }}
    steps:
      - uses: actions/checkout@v2

      - name: Install build dependencies
        run: |
          sudo apt-get -y install libunwind-dev binutils-dev libiberty-dev

      - name: Install test dependencies
        run: |
          sudo apt-get -y install \
            valgrind \
            gdb

      - name: Install Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install Python 3.10
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Compile Austin
        run: |
          autoreconf --install
          ./configure --enable-debug-symbols true
          make

      - name: Run tests
        run: |
          ulimit -c unlimited
          python3.10 -m venv .venv
          source .venv/bin/activate
          pip install --upgrade pip
          pip install -r test/requirements.txt
          sudo -E .venv/bin/pytest --pastebin=failed --no-flaky-report -sr a -n auto
          .venv/bin/pytest --pastebin=failed --no-flaky-report -sr a -n auto
          deactivate

If Python is installed from, e.g., deadsnakes/ppa, then the tests pass without SIGSEGV. I can also get the tests to pass locally.

This is an example of a happy workflow that pulls Python from the PPA: https://github.com/P403n1x87/austin/runs/7046194671?check_suite_focus=true

This is a run with the action: https://github.com/P403n1x87/austin/runs/7045119660?check_suite_focus=true

This was discovered with this PR: https://github.com/P403n1x87/austin/pull/120/files

Action version:

v4

Platform:

  • Ubuntu
  • macOS
  • Windows

Runner type:

  • Hosted
  • Self-hosted

Tools version:

Repro steps:

See description above

Expected behavior:

No SIGSEGV, like with the Pythons from the PPA.

Actual behavior:

SIGSEGV if ctypes is used

@dsame
Copy link
Contributor

dsame commented Jun 27, 2022

Hello @P403n1x87,

Thanks for your input, but the data you've provided is not enough to discovery the root cause of the issue.

        if result.returncode == -11:
            binary_name = Path(module).stem.replace("test_", "")
>           raise SegmentationFault(bt((SRC / binary_name).with_suffix(".so")))
E           test.cunit.conftest.SegmentationFault: "/var/crash/_opt_hostedtoolcache_Python_3.10.5_x64_bin_python3.10.0.crash" is not a core dump: file format not recognized
E           No stack.

says there's no crash dump and there's nothing to tracebak - i advise you to add the step with `ls -l /var/crash/' to check the exact path of core dump.

Alternatively can you please provide the exact python code that segfaults.

I tried to create a simple repro with typical ctypes use case, but it works as expected.

Ideally it would be good if you changed my minimal repro in order to get the same SEGFAULT error. In this case we will resolve the issue as soon as possible.

@P403n1x87
Copy link
Author

P403n1x87 commented Jun 27, 2022

@dsame here you can see a run with a backtrace https://github.com/P403n1x87/austin/runs/7045867473?check_suite_focus=true

This seems to point to this line of Python test code

https://github.com/P403n1x87/austin/blob/d1e56c1ec4aa3322d7b8824513d2e01db5c792f9/test/cunit/test_cache.py#L14

which performs a call to

https://github.com/P403n1x87/austin/blob/d1e56c1ec4aa3322d7b8824513d2e01db5c792f9/src/cache.c#L46-L54

To run this locally, you could clone the devel branch of https://github.com/P403n1x87/austin/, create a Python 3.10 virtual environment, activate it, install the test dependencies in test/requirements.txt and then run

pytest test/cunit

This single-line command should do the trick once inside the cloned austin folder

python3.10 -m venv /tmp/austin-venv && source /tmp/austin-venv/bin/activate && pip install -r test/requirements.txt && pytest test/cunit

Note that this requires gcc to be available from PATH

I tried to create a simple repro with typical ctypes use case, but it works as expected.

I would try calling libc.free on a memory allocation in your repro and see what happens.

@dsame
Copy link
Contributor

dsame commented Jun 27, 2022

@P403n1x87 thanks a lot for the backtrace, now i see a SEGFAULT is caused by an instruction of from /lib/x86_64-linux-gnu/libffi.so.7

It is known problem: toolcache python is built against libffi.so.6 which does not exist in ubuntu-20.04

While we are providing the solution, please try to add the step installing libffi.so.6 before running tests as workaround

        - run: |
            curl -LO http://archive.ubuntu.com/ubuntu/pool/main/libf/libffi/libffi6_3.2.1-8_amd64.deb
            sudo dpkg -i libffi6_3.2.1-8_amd64.deb
            sudo ln -sf libffi.so.6.0.4 /usr/lib/x86_64-linux-gnu/libffi.so

@P403n1x87
Copy link
Author

@dsame thanks for looking into this. I will give the workaround a try and let you know how it goes!

@dsame
Copy link
Contributor

dsame commented Jul 14, 2022

Hello @P403n1x87,
How it is going with the checking the workaround?
Be aware i am going to close the issue in the next 24 hours due to inactivity, but you still will be able to reopen this issue or create new one in case of a problem still exists.

@P403n1x87
Copy link
Author

@dsame sorry I had moved to other things in the meantime. Just opened a draft PR to test the workaround, but it doesn't seem to solve the issue unfortunately 🙁

https://github.com/P403n1x87/austin/runs/7336236967?check_suite_focus=true

@dsame
Copy link
Contributor

dsame commented Jul 18, 2022

Hello @P403n1x87 ,

now the error messages changed
from "E test.cunit.conftest.SegmentationFault"
to "E OSError: [Errno 39] Directory not empty: '_usr_lib_cnf-update-db.0'"

Although i still see the same point of the error origin, can you please double check the new hint?

@P403n1x87
Copy link
Author

@dsame

A Segmentation Fault still causes this. Looking at the full traceback from the first failing test you can see the attempt to raise SegmentationFault to report the SIGSEGV and get the backtrace with the bt helper. This helper calls rmdir which now fails because the directory it is trying to delete is probably not empty. I'll try to resolve this so we can get back the backtrace information, but this is definitely still a SIGSEGV.

@P403n1x87
Copy link
Author

@dsame This run has the backtrace information

https://github.com/P403n1x87/austin/runs/7385733844?check_suite_focus=true

It looks like this is using /lib/x86_64-linux-gnu/libffi.so.7, which is not what the workaround is replacing I believe?

@P403n1x87
Copy link
Author

@dsame I have tried

          curl -LO http://archive.ubuntu.com/ubuntu/pool/main/libf/libffi/libffi6_3.2.1-8_amd64.deb
          curl -LO http://archive.ubuntu.com/ubuntu/pool/main/libf/libffi/libffi7_3.3-4_amd64.deb
          sudo dpkg -i libffi6_3.2.1-8_amd64.deb
          sudo ln -sf libffi.so.6.0.4 /usr/lib/x86_64-linux-gnu/libffi.so
          sudo dpkg -i libffi7_3.3-4_amd64.deb
          sudo ln -sf libffi.so.7.1.0 /lib/x86_64-linux-gnu/libffi.so.7

but the result is still the same 🙁

https://github.com/P403n1x87/austin/runs/7385900441?check_suite_focus=true

@dsame
Copy link
Contributor

dsame commented Jul 18, 2022

@P403n1x87
Yes, now i am confirm python is compatible both with the generic ffi call and generic ctypes call

Trying to investigate the code i am not able to find the place there the problematic function queue_item__destroy is called from.

I see this test fails - https://github.com/P403n1x87/austin/blob/ci/setup-python-workaround/test/cunit/cache.py
But it goes to innocent code https://github.com/P403n1x87/austin/blob/50d7a25ec73e4616a30acc8e058a471c600ef4bb/test/cunit/__init__.py#L67 which definitely has no problem during the complation cache.c

I suppose the compiled code is executed after that, causing the SEGFLT during the call of queue_item__destroy , but i do not see where does it happen. Can you please point me where the call of queue_item__destroy ? I searched the whole repo and found nothing. I afraid we face some race condition where queue is ether destroyed before it is created or destroyed twice but it is not possible to guess the exact reason.

Also did you ever have this test passed on linux? Can you show the most recent success commit then?

@P403n1x87
Copy link
Author

@dsame the tests are in the devel branch. The call to free happens here

https://github.com/P403n1x87/austin/blob/67f00b2817c179ce5aaea67d89c50fedfc0faac8/test/cunit/test_cache.py#L14

You can see that tests on the devel branch are all passing and I've never seen a SIGSEGV with deadsnakes, neither on GH Actions nor locally. The C unit tests are single-threaded, so I think a race condition is unlikely.

@dsame
Copy link
Contributor

dsame commented Jul 22, 2022

@P403n1x87
Status update.

The problem has been reproduced on simplified code:

def test_queue_item():
    value = C.malloc(16)
    C.free(value)

@dsame
Copy link
Contributor

dsame commented Jul 23, 2022

Since python 3.10 and ubuntu 22.04 the same problem appears even with the native, default python
https://github.com/akv-platform/austin/runs/7479929454?check_suite_focus=true

I suppose there's a conflict of the python memory management and linux memory management, because the result of C.malloc is unknown and treated as Python object which is under Python control.

To fix it, the arguments should be explicitly set

from ctypes import CDLL
import ctypes

C = CDLL("libc.so.6")
C.malloc.restype = ctypes.c_void_p
C.free.argtypes = [ctypes.c_void_p]

value = C.malloc(16)
C.free(value)

https://github.com/akv-platform/austin/actions/runs/2723152503

Alternatively you can use python installed from ppa:deadsnakes, like this

      - name: Install Python 3.10 from repo
        run: |
          sudo apt update && sudo apt upgrade
          sudo apt install software-properties-common -y
          sudo add-apt-repository ppa:deadsnakes/ppa -y
          sudo apt update
          sudo apt install python3.10 -y
          sudo ln -s -f python3.10 /usr/bin/python3
          python3.10 --version
          python3 --version
          python --version

i confirmed it works so far, but i believe it is only temporary workaround solution

Does the answer help to solve the issue?

@P403n1x87
Copy link
Author

Since python 3.10 and ubuntu 22.04 the same problem appears even with the native, default python
https://github.com/akv-platform/austin/runs/7479929454?check_suite_focus=true

Ah this is weird. Currently I'm relying on deadsnakes for the tests on Linux. This is one of the latest runs with ubuntu-latest (which seems to pull 20.04) and everything looks good.

https://github.com/P403n1x87/austin/runs/7425790895?check_suite_focus=true

So using deadsnakes as a workaround on 20.04 works for me. The only downside is that it takes slightly longer to install from the PPA repo than with the action, so it would be cool to have this working on Linux too at some point.

@dsame
Copy link
Contributor

dsame commented Jul 23, 2022

Ah this is weird.

In fact it is not. I have no confirmation but it looks to me that if there's no argument result type explicitly set then some Python object is created which is subject of GC. As a result, at the moment of accessing the value it either maybe dead or instead of passing the reference to the memory block we pass a reference to the object that holds it.

Declaring the types:

C.malloc.restype = ctypes.c_void_p
C.free.argtypes = [ctypes.c_void_p]

we avoid the the problem.

@dsame
Copy link
Contributor

dsame commented Jul 25, 2022

Hello @P403n1x87, i am going to close the issue because of provided workarounds and the root cause is in the area of Python teams, but feel free to reopen this issue or create new one if you feel you need it.

@dsame dsame closed this as completed Jul 25, 2022
@P403n1x87
Copy link
Author

@dsame thanks for your investigation into this issue. What I find weird is that the original coding worked for some Python distros but not for others. Your proposed type declarations make total sense.

@P403n1x87
Copy link
Author

@dsame FYI, I just tried declaring the types on malloc and free, but the tests are still SIGSEGVing when free is called 😭

https://github.com/P403n1x87/austin/runs/7496599298?check_suite_focus=true

This is the change:

https://github.com/P403n1x87/austin/blob/46cff2569be4a8932dee8c9a38bc51755d13b235/test/cunit/__init__.py#L92-L93

@dsame
Copy link
Contributor

dsame commented Jul 27, 2022

I suppose you have to declare the types for queue_item__destroy because it is passed between Python, and shared module while malloc and free are in your case are not exposed at first glance from the shared module, but let me double check.

@dsame dsame reopened this Jul 27, 2022
@dsame
Copy link
Contributor

dsame commented Aug 7, 2022

Status:
It is definitely problem with passing the arguments to and returning result from c-function.

For example:
while C.malloc(16) returns 0x55d08b83e7d0 the queue_item_new receieves 0xffffffff8b83e7d0
it is obvious attempt to free 0xffffffff8b83e7d0 causes the SEGFAULT

There might be 2 reasons:

  • new QueueItem has invalid MetaType received from DeclCollector().preprocess(source.with_suffix(".h")) (see test/cunit/__init__.py:307) and i will try to fix it with some hack as a proof and pycparser team should provide the fix
  • there's 32/64bit issue while compiling and linking cache.so

@P403n1x87 any advises about using pycparser and/or calling gcc to build cache.so very appreciated.

@P403n1x87
Copy link
Author

P403n1x87 commented Aug 8, 2022

new QueueItem has invalid MetaType received from DeclCollector().preprocess(source.with_suffix(".h"))

The code in cunit.__init__ sets at most the restype when the function is known to return char*. In all other cases, we don't explicitly declare any other types for the function signature.

while C.malloc(16) returns 0x55d08b83e7d0 the queue_item_new receieves 0xffffffff8b83e7d0

This is weird. If memalloc is returning a legit value, but queue_item_new is receiving a wrong one, because the top dword is being mangled, I'd expect this to be caused by ctypes/cffi. Maybe these are the modules that actually have 32/64bit compilation issues? Note that cache.so is being compiled the same way whether I use setup-python or deadsnakes, so I can't really see it being the source of the problem here.

ladislas added a commit to leka/LekaOS that referenced this issue Sep 22, 2022
it was causing CI to crash, see:

actions/setup-python#442

With ubuntu-22.04 the action is not needed anymore as the default Python
version is 3.10.4
ladislas added a commit to leka/LekaOS that referenced this issue Sep 22, 2022
it was causing CI to crash, see:

actions/setup-python#442

With ubuntu-22.04 the action is not needed anymore as the default Python
version is 3.10.4
@dsame
Copy link
Contributor

dsame commented Sep 26, 2022

hello @P403n1x87,

I had a chance to investigate the problem more and now i've confirmed the problem is with correct description of the arguments in results.

I.e. the folowing code

cache.queue_item_new.argtypes = [ctypes.c_void_p, ctypes.c_long]
cache.queue_item_new.restype = ctypes.c_void_p
queue_item = cache.queue_item_new(value, 42)

cache.queue_item__destroy.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
cache.queue_item__destroy(0,queue_item)

works without problem (with commenting out free(self) in queue_item__destroy) and 64bit pointers are passed as they should
https://github.com/akv-platform/austin/actions/runs/3128546020/jobs/5076521939 (see "run tests" step)

Now the problem is how to set argtypes for the self-generated constructor QueueItem.new

the quick naive attempts

test.cunit.cache.QueueItem.new.argtypes = [ctypes.c_void_p, ctypes.c_long]
test.cunit.cache.QueueItem.__new__.argtypes = [ctypes.c_void_p, ctypes.c_long]

do not work so far, but may be you can suggest something better than these?

@dsame
Copy link
Contributor

dsame commented Sep 26, 2022

This way i am able to fix QueueItem constructor to assept the correct arguments

test.cunit.cache.QueueItem.new.__cfunc__.argtypes = [ctypes.c_void_p, ctypes.c_long]

also it maybe make sens to set test.cunit.cache.QueueItem.new.__args__ to something meaningful but not sure

This is expected to fix the return values but i am not able to confirm it, neither if can provide working code for destroying QueueItem

test.cunit.cache.QueueItem.new.__cfunc__.restype = ctypes.c_void_p

@P403n1x87
Copy link
Author

@dsame thanks for your further investigation and the reproducer showing that the issue, in this case, is with the argument types. The result of CDLL is cached on the module-like object tests.cunit.cache.__binary__ so the C functions can be accessed from there. But your way of getting hold of it is also valid. A cleaner solution would be to enhance the parsing of C to reduce type definitions to basic types, but this involves quite some effort.

Based on your analysis, I suspect that using c_int instead of c_void_p would reproduce the issue? So the safest option would be to infer the signature of the C function. But I do still wonder about what makes the setup-python build behave differently from deadsnake.

@dsame
Copy link
Contributor

dsame commented Sep 26, 2022

"A cleaner solution would be to enhance the parsing of C to reduce type definitions to basic types, but this involves quite some effort"

Yes, it is assumed to end up with the enhancing the parser, but i wanted to get a proof the problem is with the passing the argument.

"using c_int instead of c_void_p would reproduce the issue?"

Yes, arguments and result are treated by int by default.

"But I do still wonder about what makes the setup-python build behave differently from deadsnake."

Hm, i did not think about it. Let me try to investigate this side as well. Might be it could point out some other idea beside complicating the parser.

@dsame
Copy link
Contributor

dsame commented Oct 5, 2022

@P403n1x87 i confirm the code works with PPA python without problem, but i am not able to find out how it could be. Unfortunately the the further investigation is beyond the supporting the action and i have to close the issue with the conclusion "prebuilt pythons keeps the behavior of the official binaries".

I advise you to ask the support from the python teams which might be most effective with the provided info about required ctypes annotations. Also it might be helpful to compare the ./configure options used to build PPA python binaries with the ones used to build ubuntu& macos official binaries and the options used by python-versions.
Also feel free to reopen this issue or create another one in this repository if you feel we have to continue to investigate the problem.

@dsame dsame closed this as completed Oct 5, 2022
@P403n1x87
Copy link
Author

@dsame many thanks for your investigations and support thus far. I believe the best fix, going forward, would be to enhance the parsing and explicitly assign types to args and return, to be protected against these sorts of build differences. I will try to get to that when I can find the time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants