gctest fail on ppc64le with SOFT_VDB #376

Closed
ivmai opened this issue Oct 2, 2021 · 20 comments

ivmai commented Oct 2, 2021

Build: https://app.travis-ci.com/github/ivmai/bdwgc/jobs/540924812
Source master (commit 2e7c81e)
Host: Ubuntu/ppc64le
Compiler: gcc
Config: configure default
Occurrence: < 1/60th

gctest output:
Switched to incremental mode
Reading dirty bits from /proc
Lost a node at level 4 - collector is broken
Test failed

ivmai commented Oct 2, 2021

Probably caused by SOFT_VDB

ivmai commented Oct 2, 2021

Hello @sharkcz
If you ever run into this issue, please let me know and I will think about how to localize it.

ivmai changed the title from "gctest fail on ppc64 (v8.2.0)" to "gctest fail on ppc64le with SOFT_VDB" on Mar 12, 2022

ivmai commented May 20, 2022

Latest build: https://app.travis-ci.com/github/ivmai/bdwgc/jobs/570835934
Source: master (107cfe0)

ivmai commented May 25, 2022

ivmai commented Jul 15, 2022

Build: https://app.travis-ci.com/github/ivmai/bdwgc/jobs/576835649
Source: release-8_2 (4919305)

ivmai commented Oct 22, 2022

Build: https://app.travis-ci.com/github/ivmai/bdwgc/jobs/586400606
Source: master (9338177)
Host: Linux/ppc64le
Compiler: clang-12
Config: CFLAGS_EXTRA="-fsanitize=memory,undefined -fno-omit-frame-pointer" CONF_OPTIONS="--disable-shared"

Seems to be the same issue.
Output:

./gctest
Switched to incremental mode
Reading dirty bits from /proc
List reversal produced incorrect list - collector is broken
Test failed

jiegec commented Oct 29, 2022

I got this error when building boehm-gc in nix on ppc64le:

# TOTAL: 17
# PASS:  16
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: gctest
============

Switched to incremental mode
Reading dirty bits from /proc
FAIL gctest (exit status: 139)

============================================================================
Testsuite summary for gc 8.2.2
============================================================================
# TOTAL: 17
# PASS:  16
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See ./test-suite.log
Please report to https://github.com/ivmai/bdwgc/issues
============================================================================
make[3]: *** [Makefile:2048: test-suite.log] Error 1
make[3]: Leaving directory '/build/gc-8.2.2'
make[2]: *** [Makefile:2156: check-TESTS] Error 2

jiegec commented Oct 29, 2022

Backtrace on v8.2.2:

Lost a node at level 1 - collector is broken
Test failed

Thread 19 "gctest" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffef2cf170 (LWP 2171270)]
0x00007ffff7c7d168 in __libc_signal_restore_set (set=0x7fffef2cda18) at ../sysdeps/unix/sysv/linux/internal-signals.h:86
86      ../sysdeps/unix/sysv/linux/internal-signals.h: No such file or directory.
(gdb) bt
#0  0x00007ffff7c7d168 in __libc_signal_restore_set (set=0x7fffef2cda18) at ../sysdeps/unix/sysv/linux/internal-signals.h:86
#1  __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:48
#2  0x00007ffff7c54850 in __GI_abort () at abort.c:79
#3  0x0000000100007b84 in chktree (t=<optimized out>, n=<optimized out>) at tests/test.c:1050
#4  0x0000000100007a04 in chktree (t=<optimized out>, n=<optimized out>) at tests/test.c:1056
#5  0x0000000100007a04 in chktree (t=<optimized out>, n=<optimized out>) at tests/test.c:1056
#6  0x0000000100007a04 in chktree (t=<optimized out>, n=<optimized out>) at tests/test.c:1056
#7  0x0000000100007a04 in chktree (t=<optimized out>, n=<optimized out>) at tests/test.c:1056
#8  0x0000000100007a04 in chktree (t=<optimized out>, n=<optimized out>) at tests/test.c:1056
#9  0x0000000100007a04 in chktree (t=<optimized out>, n=<optimized out>) at tests/test.c:1056
#10 0x0000000100007a04 in chktree (t=<optimized out>, n=<optimized out>) at tests/test.c:1056
#11 0x0000000100007f54 in tree_test () at tests/test.c:1171
#12 tree_test () at tests/test.c:1148
#13 0x0000000100008cdc in run_one_test () at tests/test.c:1626
#14 0x0000000100009148 in thr_run_one_test (arg=<optimized out>) at tests/test.c:2344
#15 0x00007ffff7f1dd6c in GC_inner_start_routine (sb=<optimized out>, arg=<optimized out>) at pthread_start.c:57
#16 0x00007ffff7f0bbf0 in GC_call_with_stack_base (fn=<optimized out>, arg=<optimized out>) at extra/../misc.c:2173
#17 0x00007ffff7f0bc74 in GC_start_routine (arg=<optimized out>) at extra/../pthread_support.c:2183
#18 0x00007ffff7e78838 in start_thread (arg=0x7fffef2cf170) at pthread_create.c:477
#19 0x00007ffff7d7b884 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

jiegec commented Oct 29, 2022

Oddly, if I enable ASan or valgrind, the error goes away:

Completed 6 tests
Allocated 11602730 collectable objects
Allocated 1224 uncollectable objects
Allocated 8220420 atomic objects
Reallocated 36 objects
Garbage collection after fork is tested too
Finalized 13223/13223 objects - finalization is probably OK
Total number of bytes allocated is 660239708
Total memory use by allocated blocks is 3870720 bytes
Final heap size is 10747904 bytes
Obtained 45285376 bytes from OS (of which 23134208 bytes unmapped)
Final number of reachable objects is 3976
Completed 444 collections in 25451 ms (using 16 marker threads)
Collector appears to work

jiegec commented Oct 29, 2022

I finally captured an ASan error:

==2726945==Running thread 2726833 was not suspended. False leaks are possible.
==2726945==Running thread 2726834 was not suspended. False leaks are possible.
==2726945==Running thread 2726835 was not suspended. False leaks are possible.
==2726945==Running thread 2726836 was not suspended. False leaks are possible.
tests/test.c:518:9: runtime error: member access within misaligned address 0x000000000001 for type 'struct SEXPR', which requires 8 byte alignment
0x000000000001: note: pointer points here
<memory cannot be printed>
AddressSanitizer:DEADLYSIGNAL
=================================================================
==2726803==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000001 (pc 0x00010001339c bp 0x7dffe60b9d40 sp 0x7dffeb40cf80 T16)
==2726803==The signal is caused by a UNKNOWN memory access.
==2726803==Hint: address points to the zero page.
    #0 0x100013398 in check_ints tests/test.c:518
    #1 0x100015528 in reverse_test_inner tests/test.c:812
    #2 0x7ffff736c578 in GC_call_with_gc_active extra/../pthread_support.c:1788
    #3 0x1000159f0 in reverse_test_inner tests/test.c:727
    #4 0x7ffff73026e8 in GC_do_blocking_inner extra/../pthread_support.c:1597
    #5 0x7ffff7309af0 in GC_with_callee_saves_pushed extra/../mach_dep.c:421
    #6 0x7ffff733ede0 in GC_do_blocking extra/../misc.c:2305
    #7 0x10001aa70 in reverse_test tests/test.c:859
    #8 0x10001aa70 in run_one_test tests/test.c:1642
    #9 0x10001b124 in thr_run_one_test tests/test.c:2344
    #10 0x7ffff736f528 in GC_inner_start_routine /home/jiegec/bdwgc/pthread_start.c:57
    #11 0x7ffff733eb80 in GC_call_with_stack_base extra/../misc.c:2173
    #12 0x7ffff733eca0 in GC_start_routine extra/../pthread_support.c:2183
    #13 0x7ffff75ed2c4 in __asan::AsanThread::ThreadStart(unsigned long long, __sanitizer::atomic_uintptr_t*) ../../../../src/libsanitizer/asan/asan_thread.cc:260
    #14 0x7ffff74e7468 in asan_thread_start ../../../../src/libsanitizer/asan/asan_interceptors.cc:199
    #15 0x7ffff7218834 in start_thread /build/glibc-p3rpmK/glibc-2.31/nptl/pthread_create.c:477
    #16 0x7ffff674b880 in clone (/lib/powerpc64le-linux-gnu/libc.so.6+0x14b880)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV tests/test.c:518 in check_ints
Thread T16 created by T0 here:
    #0 0x7ffff74e752c in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cc:208
    #1 0x7ffff736f040 in GC_pthread_create extra/../pthread_support.c:2261
    #2 0x100010250 in main tests/test.c:2414
    #3 0x7ffff6624cc8 in generic_start_main ../csu/libc-start.c:308
    #4 0x7ffff6624ea0 in __libc_start_main ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:98

==2726803==ABORTING

jiegec commented Oct 29, 2022

After some repetitions, it fails only in the following places (line numbers are off by one because I removed the fork tests to make it easier to reproduce):

Lost a node at level 3 - collector is broken
Test failed
--Type <RET> for more, q to quit, c to continue without paging--

Thread 19 "gctest" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffef2cf170 (LWP 3618563)]
0x00007ffff7c7d168 in __libc_signal_restore_set (set=0x7fffef2cd7f8) at ../sysdeps/unix/sysv/linux/internal-signals.h:86
86      ../sysdeps/unix/sysv/linux/internal-signals.h: No such file or directory.
(gdb) bt
#0  0x00007ffff7c7d168 in __libc_signal_restore_set (set=0x7fffef2cd7f8) at ../sysdeps/unix/sysv/linux/internal-signals.h:86
#1  __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:48
#2  0x00007ffff7c54850 in __GI_abort () at abort.c:79
#3  0x0000000100007198 in chktree (t=0x7fffed8b6780, n=3) at tests/test.c:1051
#4  0x000000010000731c in chktree (t=0x7fffed8ab0e0, n=4) at tests/test.c:1062
#5  0x000000010000731c in chktree (t=0x7fffed8a6060, n=5) at tests/test.c:1062
#6  0x000000010000725c in chktree (t=0x7fffed8a6080, n=6) at tests/test.c:1057
#7  0x000000010000731c in chktree (t=0x7fffed8110e0, n=7) at tests/test.c:1062
#8  0x000000010000725c in chktree (t=0x7fffed811100, n=8) at tests/test.c:1057
#9  0x000000010000731c in chktree (t=0x7fffed20b120, n=9) at tests/test.c:1062
#10 0x000000010000731c in chktree (t=0x7fffeda172c0, n=10) at tests/test.c:1062
#11 0x000000010000731c in chktree (t=0x7fffe7c9f4e0, n=11) at tests/test.c:1062
#12 0x000000010000731c in chktree (t=0x7fffed4feb20, n=12) at tests/test.c:1062
#13 0x000000010000725c in chktree (t=0x7fffed4feb40, n=13) at tests/test.c:1057
#14 0x000000010000725c in chktree (t=0x7fffed4feb60, n=14) at tests/test.c:1057
#15 0x000000010000731c in chktree (t=0x7fffed662a80, n=15) at tests/test.c:1062
#16 0x000000010000731c in chktree (t=0x7fffecf43360, n=16) at tests/test.c:1062
#17 0x0000000100007c5c in tree_test () at tests/test.c:1169
#18 0x00000001000095f8 in run_one_test () at tests/test.c:1627
#19 0x000000010000a308 in thr_run_one_test (arg=0x0) at tests/test.c:2345
#20 0x00007ffff7f25384 in GC_inner_start_routine (sb=0x7fffef2ce728, arg=0x7fffffffed10) at pthread_start.c:57
#21 0x00007ffff7f182b8 in GC_call_with_stack_base (fn=0x7ffff7f2529c <GC_inner_start_routine>, arg=0x7fffffffed10)
    at extra/../misc.c:2173
#22 0x00007ffff7f249d8 in GC_start_routine (arg=0x7fffffffed10) at extra/../pthread_support.c:2183
#23 0x00007ffff7e78838 in start_thread (arg=0x7fffef2cf170) at pthread_create.c:477
#24 0x00007ffff7d7b884 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82
[Thread 0x7fffebf7f170 (LWP 3618757) exited]
List reversal produced incorrect list - collector is broken
Test failed
--Type <RET> for more, q to quit, c to continue without paging--

Thread 19 "gctest" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffef2cf170 (LWP 3618684)]
0x00007ffff7c7d168 in __libc_signal_restore_set (set=0x7fffef2cd428) at ../sysdeps/unix/sysv/linux/internal-signals.h:86
86      ../sysdeps/unix/sysv/linux/internal-signals.h: No such file or directory.
(gdb) bt
#0  0x00007ffff7c7d168 in __libc_signal_restore_set (set=0x7fffef2cd428) at ../sysdeps/unix/sysv/linux/internal-signals.h:86
#1  __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:48
#2  0x00007ffff7c54850 in __GI_abort () at abort.c:79
#3  0x000000010000576c in check_ints (list=0x7fffed8d0180, low=1, up=17) at tests/test.c:522
#4  0x00000001000065d0 in reverse_test_inner (data=0x1) at tests/test.c:843
#5  0x00007ffff7f23c14 in GC_call_with_gc_active (fn=0x10000604c <reverse_test_inner>, client_data=0x1)
    at extra/../pthread_support.c:1788
#6  0x0000000100006098 in reverse_test_inner (data=0x0) at tests/test.c:728
#7  0x00007ffff7f235d0 in GC_do_blocking_inner (data=0x7fffef2ce198 "L`", context=0x7fffef2cd9d8)
    at extra/../pthread_support.c:1597
#8  0x00007ffff7f1f0a0 in GC_with_callee_saves_pushed (fn=0x7ffff7f234fc <GC_do_blocking_inner>, arg=0x7fffef2ce198 "L`")
    at extra/../mach_dep.c:421
#9  0x00007ffff7f1836c in GC_do_blocking (fn=0x10000604c <reverse_test_inner>, client_data=0x0) at extra/../misc.c:2305
#10 0x00000001000066c4 in reverse_test () at tests/test.c:860
#11 0x00000001000096d0 in run_one_test () at tests/test.c:1643
#12 0x000000010000a308 in thr_run_one_test (arg=0x0) at tests/test.c:2345
#13 0x00007ffff7f25384 in GC_inner_start_routine (sb=0x7fffef2ce728, arg=0x7fffffffed10) at pthread_start.c:57
#14 0x00007ffff7f182b8 in GC_call_with_stack_base (fn=0x7ffff7f2529c <GC_inner_start_routine>, arg=0x7fffffffed10)
    at extra/../misc.c:2173
#15 0x00007ffff7f249d8 in GC_start_routine (arg=0x7fffffffed10) at extra/../pthread_support.c:2183
#16 0x00007ffff7e78838 in start_thread (arg=0x7fffef2cf170) at pthread_create.c:477
#17 0x00007ffff7d7b884 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82
[Thread 0x7fffecbcf170 (LWP 2452) exited]
[Thread 0x7fffe737f170 (LWP 2453) exited]
--Type <RET> for more, q to quit, c to continue without paging--

Thread 20 "gctest" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffeeabf170 (LWP 2411)]
0x00007ffff7f0c968 in GC_is_marked (p=0xffffffffffffffff) at extra/../mark.c:209
209         return (int)mark_bit_from_hdr(hhdr, bit_no); /* 0 or 1 */
(gdb) bt
#0  0x00007ffff7f0c968 in GC_is_marked (p=0xffffffffffffffff) at extra/../mark.c:209
#1  0x00007ffff7f07d8c in GC_make_disappearing_links_disappear (dl_hashtbl=0x7ffff7f524e8 <GC_arrays+280>, is_remove_dangling=0)
    at extra/../finalize.c:938
#2  0x00007ffff7f086c8 in GC_finalize () at extra/../finalize.c:1118
#3  0x00007ffff7f00b70 in GC_finish_collection () at extra/../alloc.c:1178
#4  0x00007ffff7eff2e0 in GC_maybe_gc () at extra/../alloc.c:534
#5  0x00007ffff7effd38 in GC_collect_a_little_inner (n=1) at extra/../alloc.c:769
#6  0x00007ffff7f0b7b4 in GC_generic_malloc_many (lb=16, k=1, result=0x7fffeeabe030) at extra/../mallocx.c:343
#7  0x00007ffff7f0bec8 in GC_malloc_many (lb=16) at extra/../mallocx.c:495
#8  0x0000000100007464 in alloc8bytes () at tests/test.c:1091
#9  0x0000000100007ac0 in alloc_small (n=5000000) at tests/test.c:1128
#10 0x0000000100007ba0 in tree_test () at tests/test.c:1156
#11 0x00000001000095f8 in run_one_test () at tests/test.c:1627
#12 0x000000010000a308 in thr_run_one_test (arg=0x0) at tests/test.c:2345
#13 0x00007ffff7f25384 in GC_inner_start_routine (sb=0x7fffeeabe728, arg=0x7fffffffed10) at pthread_start.c:57
#14 0x00007ffff7f182b8 in GC_call_with_stack_base (fn=0x7ffff7f2529c <GC_inner_start_routine>, arg=0x7fffffffed10)
    at extra/../misc.c:2173
#15 0x00007ffff7f249d8 in GC_start_routine (arg=0x7fffffffed10) at extra/../pthread_support.c:2183
#16 0x00007ffff7e78838 in start_thread (arg=0x7fffeeabf170) at pthread_create.c:477
#17 0x00007ffff7d7b884 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

ivmai commented Oct 30, 2022

This means that the collector reclaimed an object that was still live.
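
For readers unfamiliar with incremental collection, here is a toy model of one way a live object can be lost when dirty-page information is missing. It is only a sketch of the general failure mode, not bdwgc code and not a confirmed diagnosis of this issue. All names (`toy_heap`, `toy_store`, `object_c_survives`) are hypothetical, and each object is assumed to sit on its own page.

```c
#include <assert.h>

/* Three objects, each with a single pointer field. */
enum { OBJ_A, OBJ_B, OBJ_C, NOBJ, NONE = -1 };

struct toy_heap {
    int next[NOBJ];   /* the one pointer field of each object */
    int marked[NOBJ]; /* mark bits */
    int dirty[NOBJ];  /* one "page" per object, for simplicity */
};

/* Mark obj and everything reachable from it through next[]. */
void toy_mark_from(struct toy_heap *h, int obj)
{
    while (obj != NONE && !h->marked[obj]) {
        h->marked[obj] = 1;
        obj = h->next[obj];
    }
}

/* Mutator store; the page is recorded as dirty only if tracking works. */
void toy_store(struct toy_heap *h, int obj, int value, int dirty_bits_work)
{
    h->next[obj] = value;
    if (dirty_bits_work)
        h->dirty[obj] = 1;
}

/* Roots are A and B; initially only B points to C.
 * Returns 1 if C is still marked at the end of the cycle, 0 if it is lost. */
int object_c_survives(int dirty_bits_work)
{
    struct toy_heap h = { { NONE, OBJ_C, NONE }, { 0, 0, 0 }, { 0, 0, 0 } };
    int i;

    toy_mark_from(&h, OBJ_A);                    /* increment 1: A is scanned */
    toy_store(&h, OBJ_A, OBJ_C, dirty_bits_work); /* mutator moves the only   */
    toy_store(&h, OBJ_B, NONE, dirty_bits_work);  /* pointer to C from B to A */
    toy_mark_from(&h, OBJ_B);                    /* increment 2: B has no C   */

    /* Final phase: rescan marked objects that live on dirty pages. */
    for (i = 0; i < NOBJ; i++)
        if (h.dirty[i] && h.marked[i])
            toy_mark_from(&h, h.next[i]);

    return h.marked[OBJ_C];
}
```

With working dirty bits, `object_c_survives(1)` rescans A and finds C. With the dirty bit missing, `object_c_survives(0)` leaves C unmarked even though A still points to it, so a sweep would free a live object.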

ivmai commented Oct 30, 2022

The ASan error is just a consequence (of reusing a live object).
Sorry, I don't have time to investigate the issue now. But if you figure out the root cause, I don't think it would be difficult to prepare a patch.

jiegec commented Oct 30, 2022

Adding CFLAGS_EXTRA="-DNO_SOFT_VDB" does work for me on Power8, as mentioned in #479

ivmai commented Nov 2, 2022

Note to self:
In https://app.travis-ci.com/github/ivmai/bdwgc/builds/257276745 (and later builds), all ppc64le builds failed (gctest failure).

ivmai commented Nov 16, 2022

Note to self:
Recent failed build: https://app.travis-ci.com/github/ivmai/bdwgc/jobs/588637489
Source: release-8_2 (2b342c4)

peterhoeg pushed a commit to peterhoeg/nixpkgs that referenced this issue Nov 18, 2022
Upstream has not yet fixed the bug:

  ivmai/bdwgc#376
  ivmai/bdwgc#479

However there is a recommended workaround:

  ivmai/bdwgc#479 (comment)

This adds `CFLAGS_EXTRA=-DNO_SOFT_VDB` to the `makeFlags`, which
prevents direct accesses to `/proc` being used for tracking dirtied
pages (which must be rescanned):

  https://github.com/ivmai/bdwgc/blob/54522af853de28f45195044dadfd795c4e5942aa/include/private/gcconfig.h#L741
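
As background on the `/proc` mechanism, the following is a minimal sketch of how Linux soft-dirty bits can be read from user space. It assumes the interface described in the Linux kernel documentation (bit 55 of each 64-bit `/proc/self/pagemap` entry is the soft-dirty flag, and writing `4` to `/proc/self/clear_refs` resets it). The function names are hypothetical and this is not the collector's actual code.

```c
#define _XOPEN_SOURCE 700 /* for pread() */
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit 55 of a pagemap entry is the soft-dirty flag (per the kernel docs). */
#define SOFT_DIRTY_BIT 55

/* Pure helper: 1 if the pagemap entry has soft-dirty set, else 0. */
int entry_is_soft_dirty(uint64_t entry)
{
    return (int)((entry >> SOFT_DIRTY_BIT) & 1);
}

/* Byte offset, within the pagemap file, of the entry describing addr. */
uint64_t pagemap_offset(uintptr_t addr, uintptr_t page_size)
{
    assert(page_size != 0);
    return (uint64_t)(addr / page_size) * sizeof(uint64_t);
}

/* Reset the soft-dirty bits of the whole process.
 * Returns 0 on success, -1 if the kernel interface is unavailable. */
int clear_soft_dirty(void)
{
    int fd = open("/proc/self/clear_refs", O_WRONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = write(fd, "4\n", 2);
    close(fd);
    return n == 2 ? 0 : -1;
}

/* Returns 1 if the page containing addr is soft-dirty, 0 if it is clean,
 * and -1 if the information cannot be read. */
int page_is_soft_dirty(const void *addr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t entry;
    ssize_t n;
    int fd;

    if (page_size <= 0)
        return -1;
    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;
    n = pread(fd, &entry, sizeof(entry),
              (off_t)pagemap_offset((uintptr_t)addr, (uintptr_t)page_size));
    close(fd);
    return n == (ssize_t)sizeof(entry) ? entry_is_soft_dirty(entry) : -1;
}
```

A tracker built this way would call `clear_soft_dirty()` at the start of a cycle and later query `page_is_soft_dirty()` for each heap page. Both calls can fail on kernels or sandboxes without soft-dirty support, hence the `-1` results.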

The collector will fall back to using mprotect() to trigger page
faults on writes to clean pages and maintain its own dirty bits,
which is slightly less efficient but (in this case) more reliable.
Unreliable page-dirtiness bits can lead to use-after-free()
corruption; this is not a situation where disabling the tests is a
good idea.
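
For comparison, the mprotect()-based approach mentioned above can be sketched as follows. This is a simplified illustration for Linux under stated assumptions, not the collector's implementation: the names (`start_tracking`, `reset_and_protect`, `is_page_dirty`, `tracked_write`, `tracked_read`) are hypothetical, the region is limited to 16 pages, and the handler calls `mprotect()` from a signal context, which works on Linux but is not guaranteed by POSIX.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS and sigaction on glibc */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_TRACKED_PAGES 16

static char *tracked_base;       /* start of the tracked region */
static size_t tracked_page_size;
static size_t tracked_pages;
static volatile sig_atomic_t page_dirty[MAX_TRACKED_PAGES];

/* Fault handler: record the page as dirty, then unprotect it so that the
 * faulting store is retried and succeeds. */
static void on_write_fault(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)tracked_base;
    size_t idx;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + tracked_pages * tracked_page_size)
        _exit(2); /* fault outside the tracked region: a genuine crash */
    idx = (size_t)(addr - base) / tracked_page_size;
    page_dirty[idx] = 1;
    mprotect(tracked_base + idx * tracked_page_size, tracked_page_size,
             PROT_READ | PROT_WRITE);
}

/* Clear all dirty flags and write-protect the region again. */
int reset_and_protect(void)
{
    size_t i;

    for (i = 0; i < tracked_pages; i++)
        page_dirty[i] = 0;
    return mprotect(tracked_base, tracked_pages * tracked_page_size, PROT_READ);
}

/* Map `pages` pages, install the handler, and start with all pages clean. */
int start_tracking(size_t pages)
{
    struct sigaction sa;
    long ps = sysconf(_SC_PAGESIZE);
    void *mem;

    if (ps <= 0 || pages == 0 || pages > MAX_TRACKED_PAGES)
        return -1;
    mem = mmap(NULL, pages * (size_t)ps, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    tracked_base = mem;
    tracked_page_size = (size_t)ps;
    tracked_pages = pages;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0 || sigaction(SIGBUS, &sa, NULL) != 0)
        return -1;
    return reset_and_protect();
}

int is_page_dirty(size_t page)
{
    assert(page < tracked_pages);
    return page_dirty[page] != 0;
}

/* Volatile accesses keep the compiler from reordering them around the
 * dirty-flag reads. */
void tracked_write(size_t page, size_t off, char value)
{
    assert(page < tracked_pages && off < tracked_page_size);
    *(volatile char *)(tracked_base + page * tracked_page_size + off) = value;
}

char tracked_read(size_t page, size_t off)
{
    assert(page < tracked_pages && off < tracked_page_size);
    return *(volatile char *)(tracked_base + page * tracked_page_size + off);
}
```

After `start_tracking(4)`, a call such as `tracked_write(2, 5, 'x')` takes one fault and sets the flag for page 2 only; reads never fault. The cost is one signal per first write to each clean page, which is the efficiency trade-off the commit message refers to.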

ivmai commented Jun 28, 2023

Build: https://app.travis-ci.com/github/ivmai/bdwgc/jobs/605030680
Source: release-8_2 (8f6d39d)
Host: Ubuntu 16.04.7 LTS / ppc64le
Compiler: gcc (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
How to build: ./configure && make -j check

rtimush pushed a commit to rtimush/nixpkgs that referenced this issue Sep 21, 2023

ivmai commented Dec 8, 2023

Probably a related gctest failure.
Source: master (f369491)
Build: https://app.travis-ci.com/github/ivmai/bdwgc/jobs/614783228
Compiler: clang
Cmake options: -DCMAKE_BUILD_TYPE=Release -Dbuild_tests=ON -Denable_cplusplus=ON -Denable_gc_assertions=ON
Output:
gctest ...........................Subprocess aborted***Exception: 7.84 sec

ivmai commented Feb 13, 2024

Source: master (d934e7d)
Build: https://app.travis-ci.com/github/ivmai/bdwgc/jobs/617749260
Config: CFLAGS_EXTRA="-fsanitize=memory,undefined -fno-omit-frame-pointer" CONF_OPTIONS="--disable-shared"
Output (gctest.log):

Supported VDBs: manual soft mprotect
Switched to incremental mode
Reading dirty bits from /proc
Lost a node at level 1 - collector is broken

ivmai commented Feb 26, 2024

Should be fixed by 6601eec.
I'll backport it to the release-8_2 branch later.

ivmai closed this as completed on Feb 26, 2024