Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental GC collecting referenced objects (gctest fails with soft vdb if shared build by zig) #544

Closed
plajjan opened this issue Mar 31, 2023 · 13 comments

Comments

@plajjan
Copy link
Contributor

plajjan commented Mar 31, 2023

I am successfully using the GC in the Acton language run time system. In order to improve support for large heaps, I have recently started looking at enabling the incremental & generational GC. When I do, by calling GC_enable_incremental(), I get errors, mostly SIGILL but also some SIGSEGV, based on the 100+ apps in the acton test suite.

I have taken one of those programs that consistently crashes with SIGILL. We have a struct pointer reQ_W_125 with a $class field that points to a malloced object on the heap. I have a debug print when we initialize this struct and another debug print when we try to access it later on. For a working program, the pointers will be the same:

kll@Boxy:~/terastream/acton/foobar$ out/rel/bin/test_re --rts-wthreads=1
incremental: 0
reQ_W_125: 0x7f20be949ff0  $class: 0x4642b0
reQ_W_125: 0x7f20be949ff0  $class: 0x4642b0
kll@Boxy:~/terastream/acton/foobar$ 

That is with the GC enabled but incremental mode turned off. If I enable incremental mode the program crashes and we can see that the $class address has changed. There is nothing in the program that modifies $class, it is set once on startup and then left unchanged.

kll@Boxy:~/terastream/acton/foobar$ GC_ENABLE_INCREMENTAL=1 out/rel/bin/test_re --rts-wthreads=1
incremental: 1
reQ_W_125: 0x7f8a6ffc4ff0  $class: 0x4642d0
reQ_W_125: 0x7f8a6ffc4ff0  $class: 0x7261626f6f66
Illegal instruction (core dumped)
kll@Boxy:~/terastream/acton/foobar$ 

I ran the program with gdb, put a breakpoint at initialization of reQ_W_125, grabbed the memory address and installed a watch point on that address. We can see how the memory is next modified by reclaim, so I suppose bdwgc has determined that this address it not used and thus freed. GC_PRINT_STATS is also enabled in this output...

Breakpoint 1, reQ___init__ () at foobar/out/types/re.c:95
95	    printf("reQ_W_125: %p  $class: %p\n", reQ_W_125, reQ_W_125->$class);
(gdb) next
Adding block map for size of 84 granules (1344 bytes)
reQ_W_125: 0x7ffff7cb4ff0  $class: 0x4642d0
96	    B_Sequence W_95 = ((B_Sequence)B_SequenceD_listG_new());
(gdb) p reQ_W_125
$3 = (B_Ord) 0x7ffff7cb4ff0
(gdb) watch *0x7ffff7cb4ff0
Hardware watchpoint 2: *0x7ffff7cb4ff0
(gdb) continue
Continuing.
Adding block map for size of 0 granules (0 bytes)
Adding block map for size of 4 granules (64 bytes)
Adding block map for size of 7 granules (112 bytes)
Adding block map for size of 18 granules (288 bytes)
[New Thread 0x7ffff7c9a6c0 (LWP 1364660)]
In-use heap: 26% (0 KiB pointers + 17 KiB other)
Grow heap to 128 KiB after 17152 bytes allocated

Thread 1 "test_re" received signal SIGPWR, Power fail/restart.

--> Marking for collection #2 after 17152 allocated bytes
Marked from 1 dirty pages

Thread 1 "test_re" received signal SIGXCPU, CPU time limit exceeded.
World-stopped marking took 5 ms 584830 ns (2 ms in average)
GC #2 freed -42848 bytes, heap 128 KiB (+ 0 KiB unmapped + 434 KiB internal)
In-use heap: 23% (13 KiB pointers + 17 KiB other)
0 finalization entries; 0/0 short/long disappearing links alive
0 finalization-ready objects; 0/0 short/long links cleared
Finalize and initiate sweep took 0 ms 2274 ns + 0 ms 20188 ns
[Switching to Thread 0x7ffff7c9a6c0 (LWP 1364660)]

Thread 2 "Worker 1" hit Hardware watchpoint 2: *0x7ffff7cb4ff0

Old value = 4604624
New value = -137670848
GC_reclaim_clear (hbp=hbp@entry=0x7ffff7cb4000, hhdr=<optimized out>, sz=16, list=0x7ffff7cb4ff0 "@O\313\367\377\177", list@entry=0x0, count=count@entry=0x4956b0 <GC_bytes_found>) at reclaim.c:204
204	                p = (ptr_t)GC_clear_block((word *)p, sz, count);
(gdb) 

I added in GC_dump_named() and can see that the address of reQ_W_125 is in the root set

...
***Static roots:
From 0x7f9412cca000 to 0x7f9412cfc000
...
reQ_W_125: 0x7f9412cd3ff0  $class: 0x4642c0

How come the GC, when run in incremental mode, thinks this memory is not referenced? How do I go about debugging this?

I am running this on Linux x86_64.

This is trivial to reproduce, if interesting, I can provide simple instructions for how to run this using the acton repo.

@ivmai
Copy link
Owner

ivmai commented Apr 1, 2023

Looks strange, let's figure out the root cause.

Which libgc version? Reproducible on fresh release-8_2 (or master)?

Which VDB mode? SOFT_VDB? The issue is not reproducible if compiled with NO_SOFT_VDB, right?

Also, try with NO_VDB_FOR_STATIC_ROOTS (instead of NO_SOFT_VDB) - based on your report, the issue is here.

(We need to understand why an object referenced from a static root directly is not marked.)

@ivmai
Copy link
Owner

ivmai commented Apr 1, 2023

If you confirm that SOFT_VDB is suspected then this might have the same root cause as issue #479

@plajjan
Copy link
Contributor Author

plajjan commented Apr 1, 2023

I'm compiling using zig / clang and targetting glibc 2.28.

Using git commit daea2f1 so that's relatively close to HEAD of master.

I haven't explicitly picked a VDB mode. I think it defaults to SOFT_VDB on Linux, right? So yes, SOFT_VDB!?

And yes indeed, compiling with either -DNO_SOFT_VDB or -DNO_VDB_FOR_STATIC_ROOTS appears to fix the problem, I don't see any more crashes while running the acton test suite.

What does NO_VDB_FOR_STATIC_ROOTS actually mean? That we scan the static root ranges on every GC run, right? But for the rest we rely on VDB?

@ivmai
Copy link
Owner

ivmai commented Apr 1, 2023

I think it defaults to SOFT_VDB on Linux

Yes

And yes indeed, compiling with either -DNO_SOFT_VDB or -DNO_VDB_FOR_STATIC_ROOTS appears to fix the problem, I don't see any more crashes while running the acton test suite.

Got it. The issue relates to code within ifndef NO_VDB_FOR_STATIC_ROOTS block.

What does NO_VDB_FOR_STATIC_ROOTS actually mean?

With SOFT VDB (and PROC_VDB) facility the scanning of static roots could be optimized too (unless NO_VDB_FOR_STATIC_ROOTS is defined) to skip unmodified pages during collections other than full ones.

The algorithm of full collections should be re-inspected.

@ivmai
Copy link
Owner

ivmai commented May 1, 2023

I need your help with figuring out the root cause.

I ran the program with gdb, put a breakpoint at initialization of reQ_W_125

Could you please check that soft_set_grungy_pages(0x7f9412cca000, 0x7f9412cfc000, ...) is called? (without -D NO_VDB_FOR_STATIC_ROOTS) And when is it called among output lines above?

@ivmai
Copy link
Owner

ivmai commented May 1, 2023

It seems the information that reQ_W_125 has been written (i.e. page is dirty) is lost somehow.
-D DEBUG_DIRTY_BITS enables the relevant logging.

@ivmai
Copy link
Owner

ivmai commented May 4, 2023

I could debug this but I need a a guide how to setup the environment. Which host OS/arch?

@plajjan
Copy link
Contributor Author

plajjan commented May 4, 2023

@ivmai I'm super busy at the moment preparing for a conference next week, I'll get back as soon as I can after that with more info! :)

Repository owner deleted a comment from michaellilltokiwa May 24, 2023
Repository owner deleted a comment from michaellilltokiwa May 24, 2023
Repository owner deleted a comment from michaellilltokiwa May 24, 2023
Repository owner deleted a comment from michaellilltokiwa May 24, 2023
Repository owner deleted a comment from michaellilltokiwa May 24, 2023
@ivmai
Copy link
Owner

ivmai commented May 24, 2023

@michaellilltokiwa, I moved the issue reported by you in a separate one - #552

@ivmai
Copy link
Owner

ivmai commented May 24, 2023

@plajjan, I've implemented a checker (CHECK_SOFT_VDB) that if some page was identified as dirty by mprotect-based VDB then it should be also identified as dirty by SOFT_VDB implementation, commit b4e1ce5.
Please try to build bdwgc (master) passing -D CHECK_SOFT_VDB to CFLAGS_EXTRA. Just in case it would be useful in figuring out the root case.
PS. On my Ubuntu 22.04 x64 machine I see some rare inconsistency on gctest (i.e. some page is reported dirty by mprotect-based but pagemap file says it is clean) - I have not investigated it yet by looks like an issue in the kernel.

@ivmai
Copy link
Owner

ivmai commented Jan 3, 2024

Is the issue not observed if -D NO_SFT_VDB passed to CFLAGS?

@ivmai ivmai changed the title Incremental GC collecting referenced objects Incremental GC collecting referenced objects (gctest fails with soft vdb if shared build by zig) Jan 3, 2024
@ivmai
Copy link
Owner

ivmai commented Jan 10, 2024

Reproduced with the following build command:
../zig/zig build -DBUILD_SHARED_LIBS -Dbuild_tests -Denable_gc_assertions -Denable_threads=false -Denable_gcj_support=false -Denable_disclaim=false -Denable_munmap=false -Ddisable_handle_fork test
Also reproduced if -Ddisable_single_obj_compilation option added.

Not reproduced with any of:

  • -DBUILD_SHARED_LIBS=false option
  • -D NO_VDB_FOR_STATIC_ROOTS added to C flags
  • without -fvisibility=hidden in C flags (but only if -Denable_gc_assertions option is given)
  • build with cmake
  • build with configure

@ivmai
Copy link
Owner

ivmai commented Feb 26, 2024

The root cause seems to be the same as in #376 .
soft_set_grungy_pages is to be fixed.

ivmai added a commit that referenced this issue Feb 26, 2024
(fix of commits 5b75fae, 4875114, ee00eb1)

Issue #544 (bdwgc).

* build.zig [build_shared_libs && t.os.tag==.linux] (build): Do not
define NO_VDB_FOR_STATIC_ROOTS macro; remove FIXME item.
* os_dep.c [SOFT_VDB] (soft_set_grungy_pages): Rename vaddr argument to
start; define vaddr and next_fpos_hint local variables; change type of
vaddr and next_vaddr from ptr_t to word; initialize vaddr to start
rounded down to page granularity; add assertion that start is
hblk-aligned; enforce next_vaddr is not greater than limit; enforce
h is not smaller than start.
* os_dep.c [SOFT_VDB] (GC_soft_read_dirty): Rename vaddr local variable
to start.
* os_dep.c [SOFT_VDB && !NO_VDB_FOR_STATIC_ROOTS] (GC_soft_read_dirty):
Round start argument of soft_set_grungy_pages() down to block
granularity.
@ivmai ivmai closed this as completed Feb 26, 2024
ivmai added a commit that referenced this issue Mar 4, 2024
(a cherry-pick of commit 6601eec from 'master')

Issue #544 (bdwgc).

* os_dep.c [SOFT_VDB] (soft_set_grungy_pages): Rename vaddr argument to
start; define vaddr and next_fpos_hint local variables; change type of
vaddr and next_vaddr from ptr_t to word; initialize vaddr to start
rounded down to page granularity; add assertion that start is
hblk-aligned; enforce next_vaddr is not greater than limit; enforce
h is not smaller than start.
* os_dep.c [SOFT_VDB] (GC_soft_read_dirty): Rename vaddr local variable
to start.
* os_dep.c [SOFT_VDB && !NO_VDB_FOR_STATIC_ROOTS] (GC_soft_read_dirty):
Round start argument of soft_set_grungy_pages() down to block
granularity.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants