
Memory corruption with Thread on musl #8738

Closed
straight-shoota opened this issue Feb 4, 2020 · 8 comments · Fixed by #15043
Labels: kind:bug, platform:linux-musl, topic:stdlib:concurrency

Comments

@straight-shoota (Member)

The test_alpine job fails in some workflows on master (example: https://circleci.com/gh/crystal-lang/crystal/37573), but not every time (example: https://circleci.com/gh/crystal-lang/crystal/37643).
The error happens while running the stdlib specs with a newly built compiler. The error message is Failed to raise an exception: END_OF_STACK, so it's not clear what the issue is. I'd have assumed #8735 would improve this, but on that branch the error message is identical: https://circleci.com/gh/crystal-lang/crystal/37656

@straight-shoota added the kind:bug and platform:linux-musl labels on Feb 4, 2020
@Blacksmoke16 (Member) commented Feb 4, 2020

Maybe #8743?

EDIT: Never mind.

@bcardiff (Member) commented Feb 7, 2020

It fails often. Either we find a fix soon or we should turn off the CI for alpine. It's getting noisy.

@straight-shoota (Member, Author) commented Feb 7, 2020

Running against master with #8743, the error message is now: Error while trying to determine if a stack overflow has occurred. Probable memory corrpution.

The error seems to occur in different thread specs, I've found these locations:

  • spec/std/thread/condition_variable_spec.cr:17
  • spec/std/thread/mutex_spec.cr:5
  • spec/std/thread/thread_spec.cr:30
  • spec/std/thread/thread_spec.cr:4
  • spec/std/thread/thread_spec.cr:11

The failing location isn't consistent, and neither is whether the error occurs at all. When running only the thread specs (--example Thread), the failure rate is about 50%. I didn't tally runs of the entire spec suite, but it feels like it fails more often there.

When running under strace (and --example Thread), the error occurs every single time, but always in one of the first specs of thread_spec.cr (most frequently the first one), which never failed when running the entire spec suite without strace.

Here are the traces:

write(4, "allows passing an argumentless f"..., 45allows passing an argumentless fun to execute) = 45
mmap(NULL, 143360, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f91579cf000
mprotect(0x7f91579d1000, 135168, PROT_READ|PROT_WRITE) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1 RT_2], [], 8) = 0
clone(child_stack=0x7f91579f1a88, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|0x400000, parent_tid=[13], tls=0x7f91579f1b20, child_tidptr=0x7f915ce7c31c) = 13
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
Error while trying to determine if a stack overflow has occurred. Probable memory corrpution
futex(0x7f91579f1b60, FUTEX_WAIT_PRIVATE, 1, NULLInvalid memory access (signal 11) at address 0x40
[0x55735e770d06] *CallStack::print_backtrace:Int32 +118
[0x55735e19c9dc] __crystal_sigfault_handler +348
[0x55735feb7db4] sigfault_handler +40
[0x7f915ce2d0d4] ???
) = ?
write(4, "raises inside thread and gets it"..., 40raises inside thread and gets it on join) = 40
mmap(NULL, 143360, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fca6bc6a000
mprotect(0x7fca6bc6c000, 135168, PROT_READ|PROT_WRITE) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1 RT_2], [], 8) = 0
clone(child_stack=0x7fca6bc8ca88, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|0x400000, parent_tid=[14], tls=0x7fca6bc8cb20, child_tidptr=0x7fca7111731c) = 14
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
futex(0x7fca6bc8cb60, FUTEX_WAIT_PRIVATE, 1, NULLError while trying to determine if a stack overflow has occurred. Probable memory corrpution
Invalid memory access (signal 11) at address 0x40
[0x55b6cf4c7d06] *CallStack::print_backtrace:Int32 +118
[0x55b6ceef39dc] __crystal_sigfault_handler +348
[0x55b6d0c0edb4] sigfault_handler +40
[0x7fca710c80d4] ???
) = ?

@straight-shoota (Member, Author)

To reduce noise we could temporarily disable the specs in question for musl/alpine.
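
One way to gate them would be the musl target flag at compile time. A minimal sketch (hypothetical; the exact mechanism would be up to whoever does it):

require "spec"

# Hypothetical gate: compile the affected specs out entirely when
# targeting musl-libc, using the `musl` compile-time flag.
{% unless flag?(:musl) %}
  describe Thread do
    # ... the thread specs that currently crash on musl ...
  end
{% end %}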

@RX14 (Contributor) commented Feb 9, 2020

Can we at least disable the emails? Please?

@straight-shoota (Member, Author)

The specs have been disabled to remove noise on CI jobs (#8801). But the original issue still needs to be fixed.

@straight-shoota changed the title from "[CI] test_alpine occasionally fails on master" to "Memory corruption with Thread on musl" on Feb 17, 2020
@HertzDevil (Contributor) commented Sep 27, 2024

This is the stack trace I got on a Docker image (you will need to re-enable ASLR in the debugger first, e.g. settings set target.disable-aslr false in LLDB, or you could use a VM instead):

* thread #17, name = 'test', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x38)
  * frame #0: 0x00007fe2aaccda89 ld-musl-x86_64.so.1`pthread_getattr_np(t=0x0000000000000000, a=0x00007fe2a95288f8) at pthread_getattr_np.c:9:18
    frame #1: 0x0000557b13b823e3 test`stack_address(self=0x00007fe2aaaa39a0) at pthread.cr:103:5
    frame #2: 0x0000557b13b82a0a test`start(self=0x00007fe2aaaa39a0) at thread.cr:164:54
    frame #3: 0x0000557b13aee6ad test`->(data=0x00007fe2aaaa39a0) at pthread.cr:19:33
    frame #4: 0x00007fe2aab67faa libgc.so.1`GC_inner_start_routine + 84
    frame #5: 0x00007fe2aab6354f libgc.so.1`GC_call_with_stack_base + 37
    frame #6: 0x00007fe2aaccd9d2 ld-musl-x86_64.so.1`start(p=0x00007fe2a9528ad0) at pthread_create.c:207:2
    frame #7: 0x00007fe2aaccf314 ld-musl-x86_64.so.1`__clone at clone.s:22

The thread routine may start before GC.pthread_create returns. Thread#start then calls Crystal::System::Thread#stack_address, which may call LibC.pthread_getattr_np before GC.pthread_create has initialized the @system_handle instance variable, hence the segmentation fault. (On musl-libc, the underlying non-atomic store occurs right before LibC.pthread_create returns.) Thus there needs to be some kind of synchronization:

module Crystal::System::Thread
  private def init_handle
    ret = GC.pthread_create(
      thread: pointerof(@system_handle),
      attr: Pointer(LibC::PthreadAttrT).null,
      start: ->(data : Void*) {
        # it looks like `LibC.pthread_self` isn't valid here yet?
        # otherwise we could assign `th.@system_handle` at this point
        th = data.as(::Thread)
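        # spin until the parent's `pthread_create` has stored `@system_handle`
        # (`#to_unsafe` returns it; a null handle is falsey)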
        until th.to_unsafe
          Thread.yield_current # ???
        end
        th.start
        Pointer(Void).null
      },
      arg: self.as(Void*),
    )

    raise RuntimeError.from_os_error("pthread_create", Errno.new(ret)) unless ret == 0
  end
end

With this, all three thread spec files seem to never fail again. I don't know why this failure doesn't happen on other POSIX systems.
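
For illustration, the race distills to raw LibC calls. This is a hypothetical sketch, not the stdlib code; it assumes the Linux pthread bindings (including pthread_getattr_np) are reachable via require "c/pthread":

require "c/pthread"

# The child thread reads the `pthread_t` out-parameter that the parent's
# `pthread_create` may not have stored yet.
handle = uninitialized LibC::PthreadT
pointerof(handle).clear # zero it so the child can detect "not written yet"

routine = ->(arg : Void*) do
  parent_view = arg.as(LibC::PthreadT*)
  # On musl the store to `*thread` happens right before `pthread_create`
  # returns in the parent, so this read can still observe the zeroed value;
  # `pthread_getattr_np` then dereferences a null thread descriptor
  # (compare the fault at offset 0x38 in frame #0 above).
  attr = uninitialized LibC::PthreadAttrT
  LibC.pthread_getattr_np(parent_view.value, pointerof(attr))
  Pointer(Void).null
end

ret = LibC.pthread_create(pointerof(handle), Pointer(LibC::PthreadAttrT).null, routine, pointerof(handle).as(Void*))
raise RuntimeError.from_os_error("pthread_create", Errno.new(ret)) unless ret == 0
LibC.pthread_join(handle, Pointer(Pointer(Void)).null)

Under strace the parent gets stopped around each syscall, which may be what widens the window enough to make the crash deterministic, as observed earlier in this thread.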

The Windows Crystal::System::Thread most likely has a similar race condition, except that we assign the thread handle explicitly with @system_handle = GC.beginthreadex(...), and that #stack_address doesn't depend on it.

@ysbaddaden (Contributor)

Oh nice, great catch @HertzDevil!

Maybe we can just set the handle from inside the started thread too: both threads will then set it, but at least we're sure the value is visible (an atomic wouldn't solve the race).

thread = data.as(::Thread)
thread.@system_handle ||= LibC.pthread_self
thread.start

Or maybe Thread#start should always do it first so it would be correct for every target?

protected def start
  @system_handle ||= Crystal::System::Thread.current_handle
  # ...
end
