Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zipos threading stress test #1011

Merged
merged 5 commits into from
Dec 15, 2023
Merged

zipos threading stress test #1011

merged 5 commits into from
Dec 15, 2023

Conversation

mrdomino
Copy link
Collaborator

No description provided.

@mrdomino
Copy link
Collaborator Author

This test usually passes, sometimes deadlocks, and rarely raises SIGSEGV. Before this is merged, that situation should be rectified.

@mrdomino
Copy link
Collaborator Author

Current guess is that it's a duplicate __fds_lock() somewhere, fwiw. I doubt that there is a deadlock in the atomic read/seek code, but that's not out of the question either.

@mrdomino
Copy link
Collaborator Author

As for the segfault, I haven't managed to get good tracing on it, but from what I've seen it's happening in the __isfdkind(newfd, kFdZip) branch of __zipos_postdup, which strongly suggests a double-free and failure of the cleanup code. But changing close() to block signals and __fds_lock() seemed to exacerbate the deadlock.

@jart
Copy link
Owner

jart commented Dec 14, 2023

Synchronization issues and crashes in ZipOS are normally tricky to debug, due to the need to block signals when ZipOS does heavy lifting inside functions like open(), which prevents crash reports / backtraces from being shown. You might have more luck troubleshooting things if you comment out the BLOCK_SIGNALS; lines temporarily.

@mrdomino
Copy link
Collaborator Author

The only BLOCK_SIGNALS in libc/runtime/zipos* is in __zipos_open, which has definitely ended by the time we encounter either the deadlock or the segfault. I do see a backtrace in the segfault case fwiw. I'll copy one in here next time I encounter it.

@mrdomino
Copy link
Collaborator Author

Sample SIGSEGV output (they are very rare it turns out):

10008004-10008007 rw-pa- 4x automap 256kB w/ 256kB hole
1000800c-1000800d rw-pa- 2x automap 128kB
1000800e-1000800f r--s-- 2x automap 80kB w/ 3840kB hole
1000804c-1000804d rw-pa- 2x automap 128kB w/ 96tB hole
6fc00004-6fc00004 rw-paF 1x nsync 64kB w/ 64gB hole
6fd00004-6fd00007 rw-paF 4x zipos 256kB w/ 64gB hole
6fe00004-6fe00004 rw-paF 1x g_fds 64kB
# 1024kB total mapped memory
pthread_main depth:0 fd:4
pthread_main depth:1 fd:6
pthread_main depth:1 fd:5
pthread_main depth:2 fd:7
pthread_main depth:3 fd:5
pthread_main depth:2 fd:8
pthread_main depth:3 fd:6
pthread_main depth:2 fd:4
pthread_main depth:2 fd:9
pthread_main depth:3 fd:5
pthread_main depth:3 fd:10
pthread_main depth:3 fd:6
pthread_main depth:3 fd:5
pthread_main depth:3 fd:7
pthread_main depth:3 fd:6

error: Uncaught SIGSEGV (SEGV_ACCERR) on MacBook-Air.local pid 77598 tid 262169
 o/aarch64/test/libc/runtime/zipos_test.com
 Darwin Cosmopolitan 3.1.3 MODE=aarch64; Darwin Kernel Version 23.1.0: Mon Oct  9 21:28:12 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T8103 MacBook-Air.local 23.1.0
 cosmoaddr2line /Users/joshin/src/cosmo/o/aarch64/test/libc/runtime/zipos_test.com.dbg 10000025650 100000247f4 10000028e8c 10000005610 1000000c7b8 1000002f330
 faulting address is 0000000000000028
 0000000000000028 x0 0000000000000039 x8  0000000000000006 x16 000010008027ff68 x24
 00006fe000040000 x1 0000000000000000 x9  0000000000000000 x17 0000000000000000 x25
 000000000000000b x2 0000000000000000 x10 0000000000000000 x18 0000000000000000 x26
 000001000006b000 x3 0000000000000000 x11 0000000000000000 x19 000000016d4eaf90 x27
 0000000000000000 x4 0000000000000000 x12 0000000000000006 x20 0000100080041cc0 x28
 000010008027ff70 x5 000001000004d62c x13 000010008017ff70 x21 000010008027feb0 x29
 000001000006b000 x6 000001000003e6b0 x14 0000000000000000 x22 00000100000247f4 x30
 0000000000000008 x7 0000000000000006 x15 0000000000000000 x23 000010008027feb0 x31
 000010008027feb0 sp 10000025650 pc __zipos_free+24
 000010008027feb0 sp 100000247f4 lr __zipos_close+76
 000010008027fed0 fp 10000028e8c lr close+180
 000010008027fef0 fp 10000005610 lr pthread_main+260
 000010008027ff10 fp 1000000c7b8 lr PosixThread+80
 000010008027ff90 fp 1000002f330 lr __stack_call+24

And another:

10008004-10008007 rw-pa- 4x automap 256kB w/ 256kB hole
1000800c-1000800d rw-pa- 2x automap 128kB
1000800e-1000800f r--s-- 2x automap 80kB w/ 3840kB hole
1000804c-1000804c rw-pa- 1x automap 64kB w/ 96tB hole
6fc00004-6fc00004 rw-paF 1x nsync 64kB w/ 64gB hole
6fd00004-6fd00007 rw-paF 4x zipos 256kB w/ 64gB hole
6fe00004-6fe00004 rw-paF 1x g_fds 64kB
# 960kB total mapped memory
pthread_main depth:0 fd:4
pthread_main depth:1 fd:5
pthread_main depth:1 fd:6
pthread_main depth:2 fd:8
pthread_main depth:2 fd:7
pthread_main depth:3 fd:5
pthread_main depth:2 fd:4
pthread_main depth:3 fd:6
pthread_main depth:3 fd:10
pthread_main depth:2 fd:9
pthread_main depth:3 fd:8
pthread_main depth:3 fd:11
pthread_main depth:3 fd:6
pthread_main depth:3 fd:8

error: Uncaught SIGSEGV (SEGV_ACCERR) on MacBook-Air.local pid 80350 tid 262162
 o/aarch64/test/libc/runtime/zipos_test.com
 Darwin Cosmopolitan 3.1.3 MODE=aarch64; Darwin Kernel Version 23.1.0: Mon Oct  9 21:28:12 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T8103 MacBook-Air.local 23.1.0
 cosmoaddr2line /Users/joshin/src/cosmo/o/aarch64/test/libc/runtime/zipos_test.com.dbg 10000025650 10000025b00 10000029214 100000056e4 1000000c7b8 1000002f330
 faulting address is 0000000000000028
 0000000000000028 x0 0000000000000017 x8  0000000000000029 x16 000001000000550c x24
 0000000000000006 x1 0000000000000000 x9  00000001e8c87c58 x17 000010008017ff90 x25
 000000000000000b x2 0000000054484441 x10 0000000000000000 x18 0000000000000000 x26
 000001000006b000 x3 0000000000000000 x11 0000000000000000 x19 000000016bbb6f90 x27
 0000000000000000 x4 00000001dfca4598 x12 0000000000000007 x20 0000100080040b40 x28
 0000000000000003 x5 000001000004d62c x13 000001000006b2e8 x21 000010008017fe90 x29
 000001000006b000 x6 000001000003e6b0 x14 000010008017ff80 x22 0000010000025b00 x30
 0000000000000008 x7 0000000000000007 x15 0000000000000140 x23 000010008017fe90 x31
 000010008017fe90 sp 10000025650 pc __zipos_free+24
 000010008017fe90 sp 10000025b00 lr __zipos_postdup+244
 000010008017feb0 fp 10000029214 lr dup+156
 000010008017fef0 fp 100000056e4 lr pthread_main+472
 000010008017ff10 fp 1000000c7b8 lr PosixThread+80
 000010008017ff90 fp 1000002f330 lr __stack_call+24

@mrdomino
Copy link
Collaborator Author

So, the segfault looks like a straightforward synchronization failure on g_fds; __releasefd mutates it without locking it, and this is not thread-safe.

To avoid a lock here, maybe instead of mutating g_fds, we can store an atomic bitmap of used fd positions? Just thinking out loud.

@mrdomino
Copy link
Collaborator Author

Strange, adding __fds_lock() in __releasefd does not make the segfault go away, but it seems to now only be happening on one of the two branches — close -> __zipos_close -> __zipos_free, not dup -> __zipos_postdup -> __zipos_free.

@mrdomino
Copy link
Collaborator Author

__fds_lock() in __releasefd is not good enough; it has to be at the top level in close, because __zipos_close reads state out of g_fds.

Now the segfault is much rarer (and the deadlock is totally gone afaict), but it is only happening in the dup -> __zipos_postdup branch.

Locks `g_fds` for /zip `close()` calls. Moves lock to the top level in
`__zipos_postdup`. The latter needs review for signal blocking
correctness; it is no more wrong than it was before on that front, but
I'm not at all confident that it was ever right.

Added a defensive assert in `__zipos_read`.
@mrdomino
Copy link
Collaborator Author

Cracked it. Test passes all day now. Needs review, but the deadlock and segfault is gone.

@mrdomino mrdomino marked this pull request as ready for review December 14, 2023 21:56
@mrdomino
Copy link
Collaborator Author

(N.B. this may be the wrong solution; it blocks signals for close() calls on /zip files. If that's bad, maybe a different approach can be found.)

This was wrong before. Now it mirrors `sys_dup_nt`. As far as I can tell
this is okay at all call sites; most of the time, there are parallel
code paths for `sys_dup_nt` anyway.
Copy link
Owner

@jart jart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks!

@jart jart merged commit 8a10ccf into jart:master Dec 15, 2023
5 checks passed
@mrdomino mrdomino deleted the zipos-threads branch December 15, 2023 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants