[Integration] C Data Interface integration segmentation fault with Java and other runtimes #38684

Closed
tustvold opened this issue Nov 13, 2023 · 6 comments · Fixed by #38846

Comments

@tustvold
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

Over the last few weeks I've noticed intermittent segfaults when running the Python integration tests.

It is possible this is specific to the arrow-rs testing setup, but I thought I would flag it in case something else is going on here.

https://github.com/apache/arrow-rs/actions/runs/6850128392/job/18623690451

https://github.com/apache/arrow-rs/actions/runs/6837344006/job/18593220363

https://github.com/apache/arrow-rs/actions/runs/6692138689/job/18180724463

Component(s)

Archery

@tustvold
Contributor Author

This bug is still popping up on a semi-regular basis - https://github.com/apache/arrow-rs/actions/runs/6955964090/attempts/1?pr=5111

@pitrou FYI

@pitrou
Member

pitrou commented Nov 22, 2023

Yes, I've seen it recently as well. For now we're not able to point at a particular problem. It might be an unfortunate effect of several managed runtimes (.NET, Java, Go) competing for resources.

I managed to reproduce it locally once and got the following traceback, but I don't really know what to make of it except stare at the ghastly nesting of signal handlers :-)
https://gist.github.com/pitrou/4b8fd894f6d8a9f112d88b360a3f95ac

pitrou changed the title from "PyArrow / Java Integration Test Segmentation fault" to "[Integration] C Data Interface integration segmentation fault with Java and other runtimes" on Nov 22, 2023
@tustvold
Contributor Author

Oh my, that is spicy 😅

@pitrou
Member

pitrou commented Nov 22, 2023

A potential culprit here:
https://stackoverflow.com/questions/34951812/why-does-xrs-reduce-performance

Edit: no, it seems to be a red herring, though it's probably a good idea to enable this option anyway.

@pitrou
Member

pitrou commented Nov 22, 2023

By disabling Java signals and the Go backend I get a more palatable stack trace:

#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=140057680635456) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=11, threadid=140057680635456) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140057680635456, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3  0x00007f62d0ada476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4  0x00007f6205763ace in os::Linux::chained_handler(int, siginfo*, void*) () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
#5  0x00007f62057687e0 in JVM_handle_linux_signal () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
#6  0x00007f620575ace8 in signalHandler(int, siginfo*, void*) () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
#7  0x00007f61b1e5463d in ?? () from /opt/dotnet/shared/Microsoft.NETCore.App/7.0.12/libcoreclr.so
#8  0x00007f61b1e53d17 in ?? () from /opt/dotnet/shared/Microsoft.NETCore.App/7.0.12/libcoreclr.so
#9  <signal handler called>
#10 0x00007f61b1e3d00c in ?? () from /opt/dotnet/shared/Microsoft.NETCore.App/7.0.12/libcoreclr.so
#11 0x00007f61b1e54650 in ?? () from /opt/dotnet/shared/Microsoft.NETCore.App/7.0.12/libcoreclr.so
#12 0x00007f61b1e53d17 in ?? () from /opt/dotnet/shared/Microsoft.NETCore.App/7.0.12/libcoreclr.so
#13 <signal handler called>
#14 0x00007f62051f66eb in ciMethod::has_option(char const*) () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
#15 0x00007f6205272f62 in Compile::Compile(ciEnv*, C2Compiler*, ciMethod*, int, bool, bool, bool) () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
#16 0x00007f62051b4753 in C2Compiler::compile_method(ciEnv*, ciMethod*, int) () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
#17 0x00007f620527e8a8 in CompileBroker::invoke_compiler_on_method(CompileTask*) () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
#18 0x00007f6205281318 in CompileBroker::compiler_thread_loop() () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
#19 0x00007f62058d3de3 in JavaThread::thread_main_inner() () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
#20 0x00007f62058d417d in JavaThread::run() () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
#21 0x00007f620575cdb2 in java_start(Thread*) () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
#22 0x00007f62d0b2cac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#23 0x00007f62d0bbdbf4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
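
The nesting in this trace is easier to see when frames are grouped by the shared library that owns them. The following is only a reading aid, not part of the thread: a minimal Python sketch that takes frames #4 to #14 of the trace above, counts the `<signal handler called>` markers, and tallies frames per library. The helper name `summarize` and the regular expressions are assumptions made for illustration.

```python
import re
from collections import Counter

# Frames #4-#14 of the backtrace above, copied verbatim.
TRACE = """
#4  0x00007f6205763ace in os::Linux::chained_handler(int, siginfo*, void*) () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
#5  0x00007f62057687e0 in JVM_handle_linux_signal () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
#6  0x00007f620575ace8 in signalHandler(int, siginfo*, void*) () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
#7  0x00007f61b1e5463d in ?? () from /opt/dotnet/shared/Microsoft.NETCore.App/7.0.12/libcoreclr.so
#8  0x00007f61b1e53d17 in ?? () from /opt/dotnet/shared/Microsoft.NETCore.App/7.0.12/libcoreclr.so
#9  <signal handler called>
#10 0x00007f61b1e3d00c in ?? () from /opt/dotnet/shared/Microsoft.NETCore.App/7.0.12/libcoreclr.so
#11 0x00007f61b1e54650 in ?? () from /opt/dotnet/shared/Microsoft.NETCore.App/7.0.12/libcoreclr.so
#12 0x00007f61b1e53d17 in ?? () from /opt/dotnet/shared/Microsoft.NETCore.App/7.0.12/libcoreclr.so
#13 <signal handler called>
#14 0x00007f62051f66eb in ciMethod::has_option(char const*) () from /opt/conda/envs/arrow/jre/lib/amd64/server/libjvm.so
"""

FRAME_RE = re.compile(r"^#(\d+)\s+(.*)$")   # "#<n>  <rest of frame>"
LIB_RE = re.compile(r" from (\S+)$")        # trailing " from /path/to/lib.so"


def summarize(trace):
    """Return (number of signal-handler entries, frame count per library)."""
    handler_entries = 0
    per_library = Counter()
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # not a frame line
        body = match.group(2)
        if body == "<signal handler called>":
            handler_entries += 1
            continue
        lib = LIB_RE.search(body)
        if lib is not None:
            # Keep only the file name of the shared library.
            per_library[lib.group(1).rsplit("/", 1)[-1]] += 1
    return handler_entries, per_library


if __name__ == "__main__":
    entries, libraries = summarize(TRACE)
    print("signal handler entries:", entries)
    for name, count in libraries.most_common():
        print(f"{name}: {count} frame(s)")
```

For this excerpt the script reports two signal-handler entries, five frames in `libcoreclr.so` and four in `libjvm.so`. One possible reading is that a signal raised in a JVM compiler thread (frame #14) is first handled inside `libcoreclr.so`, which takes a second signal at frame #10 before control is handed back to the JVM's chained handler.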

@pitrou
Member

pitrou commented Nov 22, 2023

This seems generally related to C#.

pitrou added a commit that referenced this issue Nov 22, 2023
…38846)

### Rationale for this change

C Data Interface integration tests sometimes crash with a complicated traceback involving Java, Go and C# signal handlers.

### What changes are included in this PR?

* Update JDK version in integration build
* Disable signal-based memory management in the JVM so as to improve cooperation with other runtimes
* Set memory limits for the various managed runtimes (Java, .NET, Go)
* Enable some checks in the Go runtime
* Enable debug allocator in Arrow Java

### Are these changes tested?

Yes. I am not able to reproduce any sporadic crash using these changes.

### Are there any user-facing changes?

No.

* Closes: #38684

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
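
The commit message above lists the kinds of settings that were changed but not the exact flags. As a rough illustration only, the Python sketch below builds an environment for a test process that caps memory for each managed runtime and reduces the JVM's use of OS signals. Everything specific here is an assumption rather than the content of the PR: the helper name `runtime_env`, the 200 MiB figure, and the choice of `GOMEMLIMIT`, `GODEBUG=clobberfree=1`, `DOTNET_GCHeapHardLimit`, `JAVA_TOOL_OPTIONS`, `-Xmx`, `-Xrs` (the option discussed earlier in this thread) and the Arrow Java property `arrow.memory.debug.allocator`.

```python
import os

MIB = 1024 * 1024


def runtime_env(limit_mib=200, base=None):
    """Return a copy of the environment with per-runtime limits added.

    All values are illustrative; the actual PR may use different settings.
    """
    env = dict(os.environ if base is None else base)
    limit_bytes = limit_mib * MIB

    # Go: soft memory limit, plus a debug check that overwrites freed objects.
    env["GOMEMLIMIT"] = f"{limit_mib}MiB"
    env["GODEBUG"] = "clobberfree=1"

    # .NET: hard limit on the GC heap. The value is written in hexadecimal
    # because the runtime reads this variable as a hex number.
    env["DOTNET_GCHeapHardLimit"] = format(limit_bytes, "x")

    # JVM: cap the heap, reduce the JVM's use of OS signals (-Xrs), and
    # turn on the Arrow Java debug allocator.
    env["JAVA_TOOL_OPTIONS"] = " ".join([
        f"-Xmx{limit_mib}m",
        "-Xrs",
        "-Darrow.memory.debug.allocator=true",
    ])
    return env


if __name__ == "__main__":
    env = runtime_env()
    for key in ("GOMEMLIMIT", "GODEBUG", "DOTNET_GCHeapHardLimit", "JAVA_TOOL_OPTIONS"):
        print(f"{key}={env[key]}")
```

The resulting dictionary could be passed as `env=` to `subprocess.run` when launching the test process. Setting the variables before the process starts matters in a setup like this one, where several runtimes are loaded into the same process and each reads its configuration at initialization.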
pitrou added this to the 15.0.0 milestone on Nov 22, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…ing (apache#38846)
