Crash on building ECL on macOS 10.5 i386 with modern Clang #569

catap · 2023-08-29T00:15:48Z

I not sure that it is BoehmGC issues, but ECL's maintainer guess that it is. See: https://gitlab.com/embeddable-common-lisp/ecl/-/issues/705

Long story short: an attempt to build ECL on macOS 10.5 i386 fails as:

;;; ECL C Backtrace
;;; 0   ecl_min                             0x00038c9b _ecl_dump_c_backtrace + 47
;;; 1   ecl_min                             0x0002cfc8 ecl_internal_error + 110
;;; 2   ecl_min                             0x0004fa0f deferred_signal_handler + 0
;;; 3   ecl_min                             0x0004f79d sigsegv_handler + 262
;;; 4   libSystem.B.dylib                   0x0025d05b _sigtramp + 43
;;; 5   ???                                 0xffffffff 0x0 + 4294967295
;;; 6   ecl_min                             0x00051db7 _ecl_sethash_pack + 66
;;; 7   ecl_min                             0x00007a97 cl_export2 + 579
;;; 8   ecl_min                             0x000042ef init_all_symbols + 431
;;; 9   ecl_min                             0x00004ba0 cl_boot + 1065
;;; 10  ecl_min                             0x00003a0e main + 33
;;; 11  ecl_min                             0x00003575 start + 53
/bin/sh: line 1: 76282 Abort trap              ECLDIR=`pwd`/ ./ecl_min compile

If I change compiler from Clang-7 (default at MacPorts) to GCC-7 => it works.

I've run some tests and discovers that Clang-5+ leads to this result. But clang-3.7 works fine.

The text was updated successfully, but these errors were encountered:

ivmai · 2023-08-29T03:51:41Z

I've checked the link above: I don't have an immediate advice how to find the root case, it is strange to see a stack trace with tons of GC_inner_start_routine and GC_wait_marker, like it's an infinite recursion.

catap · 2023-08-29T09:35:57Z

@ivmai I can't reproduce Sergey's behaviour but one which I shown before is easy to reproduce by switching compilers.

And it seems like a strong lead to the GC because I can find compiler specified code only at GC.

ivmai · 2023-08-29T11:15:23Z

I can find compiler specified code only at GC.

I don't understand what does it mean

;;; ECL C Backtrace

I don't anything related to bdwgc in the stack trace

catap · 2023-08-30T20:55:36Z

Ok, I do have some solid evidence that GC might be the cause. With help of git bisect I was able to find the exact commit which introduced this crash: https://gitlab.com/embeddable-common-lisp/ecl/-/commit/6e5016dcb869bb02ab6ee089d4dc9f266b042bf1

We're interesting in chunk:

@@ -398,6 +399,8 @@ struct ecl_hashtable {          /*  hash table header  */
         _ECL_HDR2(test,weak);
         struct ecl_hashtable_entry *data; /*  pointer to the hash table  */
         cl_object sync_lock;              /*  synchronization lock  */
+        cl_object generic_test;           /*  generic test function */
+        cl_object generic_hash;           /*  generic hashing function */
         cl_index entries;       /*  number of entries  */
         cl_index size;          /*  hash table size  */
         cl_index limit;         /*  hash table threshold (integer value)  */

which patching a structure:

struct ecl_hashtable {          /*  hash table header  */
        _ECL_HDR2(test,weak);
        struct ecl_hashtable_entry *data; /*  pointer to the hash table  */
        cl_object sync_lock;              /*  synchronization lock  */
        cl_index entries;       /*  number of entries  */
        cl_index size;          /*  hash table size  */
        cl_index limit;         /*  hash table threshold (integer value)  */
        cl_object rehash_size;  /*  rehash size  */
        cl_object threshold;    /*  rehash threshold  */
        double factor;          /*  cached value of threshold  */
        cl_object (*get)(cl_object, cl_object, cl_object);
        cl_object (*set)(cl_object, cl_object, cl_object);
        bool (*rem)(cl_object, cl_object);
        /* Unsafe variants are used to store the real accessors when
           the synchronized variant is bound to get/set/rem. */
        cl_object (*get_unsafe)(cl_object, cl_object, cl_object);
        cl_object (*set_unsafe)(cl_object, cl_object, cl_object);
        bool (*rem_unsafe)(cl_object, cl_object);
};

Simple adds two more fields.

This structure is allocated as

    cl_object obj;
    ecl_disable_interrupts_env(the_env);
    obj = (cl_object)GC_MALLOC(type_info[t].size);
    ecl_enable_interrupts_env(the_env);
    obj->d.t = t;
    return obj;

And size is defined that that structure as sizeof(struct ecl_hashtable).

If I hacked code and put 2 * sizeof(struct ecl_hashtable) => it works.

So, let me summarizy. If I use https://gitlab.com/embeddable-common-lisp/ecl/-/commit/aa985f566fdedd45e2c74774d6e81f2442dd3802 as local root it works.

When I apply patch:

--- a/src/h/object.h
+++ b/src/h/object.h
@@ -398,6 +398,8 @@ struct ecl_hashtable {          /*  hash table header  */
         _ECL_HDR2(test,weak);
         struct ecl_hashtable_entry *data; /*  pointer to the hash table  */
         cl_object sync_lock;              /*  synchronization lock  */
+        cl_object generic_test;           /*  generic test function */
+        cl_object generic_hash;           /*  generic hashing function */
         cl_index entries;       /*  number of entries  */
         cl_index size;          /*  hash table size  */
         cl_index limit;         /*  hash table threshold (integer value)  */

=> it crashes, but when I reserve twi time more space via hack:

--- a/src/c/alloc_2.d
+++ b/src/c/alloc_2.d
@@ -860,9 +860,9 @@ init_alloc(void)
   init_tm(t_symbol, "SYMBOL", sizeof(struct ecl_symbol), 5);
   init_tm(t_package, "PACKAGE", sizeof(struct ecl_package), -1); /* 36 */
 #ifdef ECL_THREADS
-  init_tm(t_hashtable, "HASH-TABLE", sizeof(struct ecl_hashtable), 3);
+  init_tm(t_hashtable, "HASH-TABLE", 2 * sizeof(struct ecl_hashtable), 3);
 #else
-  init_tm(t_hashtable, "HASH-TABLE", sizeof(struct ecl_hashtable), 4);
+  init_tm(t_hashtable, "HASH-TABLE", 2 * sizeof(struct ecl_hashtable), 4);
 #endif
   init_tm(t_array, "ARRAY", sizeof(struct ecl_array), 3);
   init_tm(t_vector, "VECTOR", sizeof(struct ecl_vector), 2);

well.. it works again.

Have I missed something? Thus, when I change compiler, it masks the issue.

catap · 2023-09-03T11:13:50Z

The issue is reproduced when I build GC from 8.2.4 release, 75e522c (release-8.2 branch) and 07a6d0e (master branch).

catap · 2023-09-03T11:41:43Z

And libatomic-7.6.6 with boehmgc-7.6.8 reproduces the issue.

catap · 2023-09-03T11:47:56Z

And libatomic-7.4.4 with boehmgc-7.6.0 reproduces the issue.

catap · 2023-09-03T11:52:50Z

From another hand this issue can't be reproduced by GCC-7.5.0 that means that it might be clang-only things.

catap · 2023-09-03T15:57:02Z

From another hand sizeof(struct ecl_hashtable) is 72 on both compilers: GCC-7.5.0 and Clang-7.1.0.

@ivmai do you agree that it seems like GC issue?

Do you have any suggestion to dig?

catap · 2023-09-23T07:47:43Z

Probably another related issue on macOS: https://gitlab.com/embeddable-common-lisp/ecl/-/issues/718

catap · 2023-09-23T11:05:06Z

I was able to reproduce it with some probability on my laptop with macOS 12.

I've applied a patch on 8.2.4:

--- darwin_stop_world.c
+++ darwin_stop_world.c
@@ -639,6 +639,11 @@ GC_INNER void GC_stop_world(void)
             kern_result = thread_suspend(p -> mach_thread);
           } while (kern_result == KERN_ABORTED);
           GC_release_dirty_lock();
+          if ((((p) -> flags & FINISHED) != FINISHED)
+              && kern_result == KERN_TERMINATED)
+            continue;
+          if (kern_result == KERN_TERMINATED)
+            ABORT("thread_suspend failed: it was already termindated");
           if (kern_result != KERN_SUCCESS)
             ABORT("thread_suspend failed");
           if (GC_on_thread_event)

and run tests of cl-njson multiple times via ECL with that boehmgc. Let say 1-out-of-15 attempts fails as:

thread_suspend failed: it was already termindated

catap · 2023-09-23T11:14:15Z

Well.. seems that it broke stack or some memory somehow.

A function thread_suspend is looks like:

kern_return_t
thread_suspend(thread_t thread)
{
	kern_return_t result = KERN_SUCCESS;

	if (thread == THREAD_NULL || get_threadtask(thread) == kernel_task) {
		return KERN_INVALID_ARGUMENT;
	}

	thread_mtx_lock(thread);

	if (thread->active) {
		if (thread->user_stop_count++ == 0) {
			thread_hold(thread);
		}
	} else {
		result = KERN_TERMINATED;
	}

	thread_mtx_unlock(thread);

	if (thread != current_thread() && result == KERN_SUCCESS) {
		thread_wait(thread, FALSE);
	}

	return result;
}

from here: https://github.com/apple-open-source/macos/blob/master/xnu/osfmk/kern/thread_act.c#L382-L408

and if I modify mine patch to:

@@ -639,8 +639,13 @@ GC_INNER void GC_stop_world(void)
             kern_result = thread_suspend(p -> mach_thread);
           } while (kern_result == KERN_ABORTED);
           GC_release_dirty_lock();
-          if (kern_result != KERN_SUCCESS)
+          if (kern_result != KERN_SUCCESS) {
+            fprintf(stderr, "KERN_SUCCESS: %d\n", KERN_SUCCESS);
+            fprintf(stderr, "KERN_TERMINATED: %d\n", KERN_TERMINATED);
+            fprintf(stderr, "KERN_INVALID_ARGUMENT: %d\n", KERN_INVALID_ARGUMENT);
+            fprintf(stderr, "kern_result: %d\n", kern_result);
             ABORT("thread_suspend failed");
+          }
           if (GC_on_thread_event)
             GC_on_thread_event(GC_EVENT_THREAD_SUSPENDED,
                                (void *)(word)(p -> mach_thread));

and run it a lot of times (more than hundred) in the loop, it might fails as:

:info:build KERN_SUCCESS: 0
:info:build KERN_TERMINATED: 37
:info:build KERN_INVALID_ARGUMENT: 4
:info:build kern_result: 268435459
:info:build thread_suspend failed

ivmai · 2023-09-23T12:36:56Z

It looks like the thread is terminated in parallel to GC_stop_world(). This is strange becase the GC lock is held during GC_stop_world and the terminating threads should call GC_thread_exit_proc which also should acquire the lock.

Please try to figure out the death scenario of the the thread for which thread_suspend returns anything other than KERN_SUCCESS.

catap · 2023-09-23T13:23:02Z

@ivmai i have no idea how.

I run tests via ECL and it fails. Rarely.

catap · 2023-09-23T15:28:13Z

@ivmai I've added one clean debug log line:

          if (kern_result != KERN_SUCCESS) {
# ifdef DEBUG_THREADS
            GC_log_printf("thread_suspend(%d) returns %d\n", p->stop_info.mach_thread, kern_result);
# endif
            ABORT("thread_suspend failed");
          }

and rebuild GC with -DDEBUG_THREADS which leads to log:

:info:build Stopping the world from thread 0x303
:info:build thread_suspend(259) returns 4
:info:build thread_suspend failed
:info:build An error occurred during initialization:
:info:build Error code 6 when executing
:info:build (EXT:RUN-PROGRAM "/usr/bin/clang" Stopping the world from thread 0x103
:info:build World stopped from 0x103
:info:build Darwin: Stack for thread 0x103 is [0x7ff7b5a81e80,0x7ff7b5a8c000)
:info:build thread_get_state returns 0
:info:build Darwin: Stack for thread 0x2803 is [0x700003188d08,0x700003188f98)

259 is 0x103 and 4 is KERN_INVALID_ARGUMENT which may occures only on condition:

	if (thread == THREAD_NULL || get_threadtask(thread) == kernel_task) {
		return KERN_INVALID_ARGUMENT;
	}

It defently not a kernel task, and can't be THREAD_NULL because it is used to stop the world after abort.

Something very wrong here.

catap · 2023-09-24T11:00:11Z

It looks like the thread is terminated in parallel to GC_stop_world(). This is strange becase the GC lock is held during GC_stop_world and the terminating threads should call GC_thread_exit_proc which also should acquire the lock.

Please try to figure out the death scenario of the the thread for which thread_suspend returns anything other than KERN_SUCCESS.

And I've figured out it.

The root cause of issue that mach_thread_self() may return different value after fork(). Different means different that is stored inside GC_threads.

GC_remove_all_threads_but_me() updated that value but seems that it's doing so too late.

As result thread_suspend returns KERN_INVALID_ARGUMENT or 268435459 which is (ipc/send) invalid destination port by mach_error_string.

As a naive approach I've tried to move GC_remove_all_threads_but_me() inside fork_child_proc() to the first line, but it doesn't help.

catap · 2023-09-24T11:24:45Z

The proof. I've used one commit ahead of 8.2.4: catap@a7e17f9 which adds getpid() to the log.

If may fail on test case with output like:

[14298] Stopping the world from thread 0x103
[14298] thread_suspend(0xb03) returns (os/kern) successful
World stopped from 0x103
thread_get_state returns 0
Darwin: Stack for thread 0xb03 is [0x70000aebb068,0x70000aebdf78)
Darwin: Stack for thread 0x103 is [0x7ff7b6401280,0x7ff7b640c000)
Starting mark helper for mark number 566
Starting mark helper for mark number 566
World starting
Resuming thread 0xb03 with state 3
World started
Total stacks size: 56464
;;; Note:
;;;   Invoking external command:
;;;   clang -I. -I/Users/runner/work/cl-duckdb/cl-duckdb/ecl/build/ -DECL_API -I/Users/runner/work/cl-duckdb/cl-duckdb/ecl/build/c -DDEBUG_THREADS -fPIC -fno-common -D_THREAD_SAFE -Ddarwin -I/Users/runner/work/cl-duckdb/cl-duckdb/ecl/src/c -c eclinit3Cg6rN.c -o eclinit3Cg6rN.o 
;;; Note:
;;;   Invoking external command:
[14430] Stopping the world from thread 0x303
[14430] thread_suspend(0xb03) returns (ipc/send) invalid destination port
thread_suspend failed
[14298] Stopping the world from thread 0x103
[14298] thread_suspend(0xb03) returns (os/kern) successful
World stopped from 0x103
thread_get_state returns 0
Darwin: Stack for thread 0xb03 is [0x70000aebb068,0x70000aebdf78)
Darwin: Stack for thread 0x103 is [0x7ff7b63f0a00,0x7ff7b640c000)
Starting mark helper for mark number 567
Starting mark helper for mark number 567
World starting
Resuming thread 0xb03 with state 3
World started

and 0x303 it is the first and the only appears of 0x303; thus no update thread id which means GC_remove_all_threads_but_me() hasn't been called or hasn't updated the reference (yet?)

ivmai · 2023-09-24T15:17:32Z

Was GC_remove_all_threads_but_me called?

catap · 2023-09-24T15:45:21Z

@ivmai I've added a log line into GC_remove_all_threads_but_me and can't find it in output => seems that it hadn't been called.

Which is very strange.

catap · 2023-09-24T21:43:18Z

Yes, it is happened after fork. I was able to recover a stack trace:

0   libsystem_kernel.dylib        	    0x7ff81e0a7ffe __pthread_kill + 10
1   libsystem_pthread.dylib       	    0x7ff81e0de1ff pthread_kill + 263
2   libsystem_c.dylib             	    0x7ff81e029d24 abort + 123
3   libgc.1.dylib                 	       0x10a5c4bc3 GC_stop_world.cold.1 + 43
4   libgc.1.dylib                 	       0x10a5c340f GC_stop_world + 276
5   libgc.1.dylib                 	       0x10a5b3010 GC_stopped_mark + 70
6   libgc.1.dylib                 	       0x10a5b2ecd GC_try_to_collect_inner + 311
7   libgc.1.dylib                 	       0x10a5b40fe GC_collect_or_expand + 189
8   libgc.1.dylib                 	       0x10a5b4380 GC_allocobj + 265
9   libgc.1.dylib                 	       0x10a5b8baa GC_generic_malloc_inner + 253
10  libgc.1.dylib                 	       0x10a5b8e3a GC_generic_malloc + 83
11  libecl.23.9.9.dylib           	       0x10a9eb821 ecl_alloc_atomic + 32
12  libecl.23.9.9.dylib           	       0x10a99db75 si_make_sequence_output_stream + 333
13  libecl.23.9.9.dylib           	       0x10a9aeb23 si_string_to_octets + 499
14  libecl.23.9.9.dylib           	       0x10a9e35dd si_spawn_subprocess + 577
15  libecl.23.9.9.dylib           	       0x10a96928e si_run_program + 1744

inside ECL the code which leads to an issue looks like:

    child_pid = fork();
    if (child_pid == 0) {
      /* Child */
      int j;
      cl_object p;
      char **argv_ptr = ecl_alloc((ecl_length(argv) + 1) * sizeof(char*));
      for (p = argv, j = 0; p != ECL_NIL; p = ECL_CONS_CDR(p)) {
        cl_object arg = si_string_to_octets(3, ECL_CONS_CAR(p), @':null-terminate', ECL_T);
        argv_ptr[j++] = (char*)arg->base_string.self;
      }
      argv_ptr[j] = NULL;

catap · 2023-09-25T00:19:27Z

Status update:

this isn't related to original issue which can be covered by GCC;
the root cause of issue was... nobody setup handle_fork, see: https://gitlab.com/embeddable-common-lisp/ecl/-/merge_requests/303

ivmai · 2023-09-25T04:32:28Z

Okay, more investigation is needed for the original issue.

catap · 2023-09-25T07:23:23Z

@ivmai yes, but I'm out of idea how to proceed

catap · 2023-09-25T10:10:32Z

@ivmai what do you think to enable "handle fork" by default for macOS? Base on behaviour of mach_thread_self() I feel that it is common and very difficult to track issue.

For example scala-native uses that code and instead of replicate ECL's changes in scala-native and many other software I feel that it simple should be enabled by default.

ivmai · 2023-09-26T11:22:09Z

what do you think to enable "handle fork" by default for macOS? Base on behavior of mach_thread_self() I feel that it is common and very difficult to track issue.

Yes, agree, but after implementing #103

As of now: # The incremental mode conflicts with fork handling on Darwin.

ivmai · 2023-11-30T16:55:31Z

what do you think to enable "handle fork" by default for macOS? Base on behaviour of mach_thread_self() I feel that it is common and very difficult to track issue.

Implemented by commit ba2861e
@catap, please test it

ivmai · 2023-11-30T16:56:43Z

As I understand the original issue is not resolve, so I'm leaving it open.

cooljeanius · 2024-03-18T12:27:14Z

As I understand the original issue is not resolve, so I'm leaving it open.

What remains to be done to fix it?

catap · 2024-03-18T12:57:07Z

As I understand the original issue is not resolve, so I'm leaving it open.

What remains to be done to fix it?

It still fails as it was described here: #569 (comment)

ivmai · 2024-03-18T13:03:26Z

Is there any fix proposed?

ivmai · 2024-03-18T13:04:16Z

The root cause of issue that mach_thread_self() may return different value after fork(). Different means different that is stored inside GC_threads.
GC_remove_all_threads_but_me() updated that value but seems that it's doing so too late.

This is not fixed, right?

catap · 2024-03-18T15:15:01Z

@ivmai let me summarize everything.

Here I've mixed up two issues.

thread related which was fixed as far as I recall.
Something brokes stack -- not fixed.

(2) can be reproduced only by clang on i386 on macOS. I can't reproduce it by gcc.

This was referenced Dec 8, 2023

Some tests are failed on ECL sharplispers/ironclad#63

Closed

Enable GC_set_handle_fork(1) for Darwin with GC_incremental on #103

Closed

catap mentioned this issue Jan 9, 2024

gcc6: restore gcj macports/macports-ports#18486

Closed

12 tasks

collares mentioned this issue Sep 7, 2024

Sage: Fix on Darwin NixOS/nixpkgs#264126

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crash on building ECL on macOS 10.5 i386 with modern Clang #569

Crash on building ECL on macOS 10.5 i386 with modern Clang #569

catap commented Aug 29, 2023

ivmai commented Aug 29, 2023

catap commented Aug 29, 2023

ivmai commented Aug 29, 2023

catap commented Aug 30, 2023

catap commented Sep 3, 2023

catap commented Sep 3, 2023

catap commented Sep 3, 2023

catap commented Sep 3, 2023

catap commented Sep 3, 2023

catap commented Sep 23, 2023

catap commented Sep 23, 2023

catap commented Sep 23, 2023

ivmai commented Sep 23, 2023

catap commented Sep 23, 2023

catap commented Sep 23, 2023

catap commented Sep 24, 2023

catap commented Sep 24, 2023 •

edited

Loading

ivmai commented Sep 24, 2023

catap commented Sep 24, 2023

catap commented Sep 24, 2023

catap commented Sep 25, 2023

ivmai commented Sep 25, 2023

catap commented Sep 25, 2023

catap commented Sep 25, 2023

ivmai commented Sep 26, 2023 •

edited

Loading

ivmai commented Nov 30, 2023

ivmai commented Nov 30, 2023

cooljeanius commented Mar 18, 2024

catap commented Mar 18, 2024

ivmai commented Mar 18, 2024

ivmai commented Mar 18, 2024

catap commented Mar 18, 2024

Crash on building ECL on macOS 10.5 i386 with modern Clang #569

Crash on building ECL on macOS 10.5 i386 with modern Clang #569

Comments

catap commented Aug 29, 2023

ivmai commented Aug 29, 2023

catap commented Aug 29, 2023

ivmai commented Aug 29, 2023

catap commented Aug 30, 2023

catap commented Sep 3, 2023

catap commented Sep 3, 2023

catap commented Sep 3, 2023

catap commented Sep 3, 2023

catap commented Sep 3, 2023

catap commented Sep 23, 2023

catap commented Sep 23, 2023

catap commented Sep 23, 2023

ivmai commented Sep 23, 2023

catap commented Sep 23, 2023

catap commented Sep 23, 2023

catap commented Sep 24, 2023

catap commented Sep 24, 2023 • edited Loading

ivmai commented Sep 24, 2023

catap commented Sep 24, 2023

catap commented Sep 24, 2023

catap commented Sep 25, 2023

ivmai commented Sep 25, 2023

catap commented Sep 25, 2023

catap commented Sep 25, 2023

ivmai commented Sep 26, 2023 • edited Loading

ivmai commented Nov 30, 2023

ivmai commented Nov 30, 2023

cooljeanius commented Mar 18, 2024

catap commented Mar 18, 2024

ivmai commented Mar 18, 2024

ivmai commented Mar 18, 2024

catap commented Mar 18, 2024

catap commented Sep 24, 2023 •

edited

Loading

ivmai commented Sep 26, 2023 •

edited

Loading