-
Notifications
You must be signed in to change notification settings - Fork 29.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
src: refactor thread stopping mechanism #26757
Conversation
- Follow style guide for naming, e.g. use lower_snake_case for simple setters/getters. - For performance, use atomics instead of a mutex, and inline the corresponding getter/setter pair.
src/env-inl.h
Outdated
} | ||
|
||
void AsyncRequest::set_stopped(bool flag) { | ||
stopped_.store(flag, std::memory_order_relaxed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so: the heavy usage of muexes around the worker code where multi-thread data access was expected, was almost always to ensure data consistency by flushing cache lines? (in other words, writes in one thread is made visible to other threads instantly). If so, std::memory_order_relaxed
constraint is insufficient to ensure that? we might need at least memory_order_acquire
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I retract above question - when I tested with a small code, I see mfence
or sync
instruction being added with memory_order_relaxed
itself; so pls ignore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So … my line of thinking was that the syscalls behind uv_async_send()
would themselves present full memory barriers. I’ll check later and verify that that’s indeed correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@addaleax - that may be true; but we do have set_stopped
is_stopped
calls that the threads can call but do not involve syscalls? however:
int foo(bool flag) {
stopped_.store(flag, std::memory_order_relaxed);
}
this is what I tested and this is what I see in the generated code:
(gdb) x/30i 0x40053e
0x40053e <_ZNSt11atomic_bool5storeEbSt12memory_order>: push rbp
0x40053f <_ZNSt11atomic_bool5storeEbSt12memory_order+1>: mov rbp,rsp
0x400542 <_ZNSt11atomic_bool5storeEbSt12memory_order+4>: sub rsp,0x30
0x400546 <_ZNSt11atomic_bool5storeEbSt12memory_order+8>: mov QWORD PTR [rbp-0x28],rdi
0x40054a <_ZNSt11atomic_bool5storeEbSt12memory_order+12>: mov eax,esi
0x40054c <_ZNSt11atomic_bool5storeEbSt12memory_order+14>: mov DWORD PTR [rbp-0x30],edx
0x40054f <_ZNSt11atomic_bool5storeEbSt12memory_order+17>: mov BYTE PTR [rbp-0x2c],al
0x400552 <_ZNSt11atomic_bool5storeEbSt12memory_order+20>: movzx eax,BYTE PTR [rbp-0x2c]
0x400556 <_ZNSt11atomic_bool5storeEbSt12memory_order+24>: mov rdx,QWORD PTR [rbp-0x28]
0x40055a <_ZNSt11atomic_bool5storeEbSt12memory_order+28>: mov QWORD PTR [rbp-0x8],rdx
0x40055e <_ZNSt11atomic_bool5storeEbSt12memory_order+32>: mov BYTE PTR [rbp-0x9],al
0x400561 <_ZNSt11atomic_bool5storeEbSt12memory_order+35>: and BYTE PTR [rbp-0x9],0x1
0x400565 <_ZNSt11atomic_bool5storeEbSt12memory_order+39>: mov eax,DWORD PTR [rbp-0x30]
0x400568 <_ZNSt11atomic_bool5storeEbSt12memory_order+42>: mov DWORD PTR [rbp-0x10],eax
0x40056b <_ZNSt11atomic_bool5storeEbSt12memory_order+45>: mov eax,DWORD PTR [rbp-0x10]
0x40056e <_ZNSt11atomic_bool5storeEbSt12memory_order+48>: mov esi,0xffff
0x400573 <_ZNSt11atomic_bool5storeEbSt12memory_order+53>: mov edi,eax
0x400575 <_ZNSt11atomic_bool5storeEbSt12memory_order+55>:
call 0x40052a <_ZStanSt12memory_orderSt23__memory_order_modifier>
0x40057a <_ZNSt11atomic_bool5storeEbSt12memory_order+60>: mov DWORD PTR [rbp-0x14],eax
0x40057d <_ZNSt11atomic_bool5storeEbSt12memory_order+63>: movzx edx,BYTE PTR [rbp-0x9]
0x400581 <_ZNSt11atomic_bool5storeEbSt12memory_order+67>: mov rax,QWORD PTR [rbp-0x8]
0x400585 <_ZNSt11atomic_bool5storeEbSt12memory_order+71>: mov BYTE PTR [rax],dl
0x400587 <_ZNSt11atomic_bool5storeEbSt12memory_order+73>: mfence
0x40058a <_ZNSt11atomic_bool5storeEbSt12memory_order+76>: leave
0x40058b <_ZNSt11atomic_bool5storeEbSt12memory_order+77>: ret
please note that mfence
at 0x400587
that settles the matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please note that
mfence
at0x400587
that settles the matter?
On x64 it does, yes – to be honest, I don’t know how the different memory order modes are implemented on different platforms? It seems like this disassembled implementation simply ignores the order argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0x400529 <_Z3foob+35>: ret
0x40052a <_ZStanSt12memory_orderSt23__memory_order_modifier>: push rbp
@addaleax - if you look at the continuity of the instructions, looks like these (the atomic* helpers) are not static APIs, but compiler-generated code, on the fly; so it is possible that only necessary code was generated, on a per compilation unit basis?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my line of thinking was that the syscalls behind uv_async_send() would themselves present full memory barriers.
Libuv doesn't promise that. uv_async_send()
can (at least in theory) elide the system call.
I'm kind of surprised the compiler emits an mfence. It's not needed on x64 (nor any other architecture, I think?) because aligned loads and stores are always atomic. It might just be a compiler bug; I wouldn't depend on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gireeshpunathil I wouldn’t think so, the linker should be able to elide multiple copies of that variable into a single one.
@bnoordhuis Yeah, thanks. I’ve removed the memory_order_relaxed
bit.
@@ -381,7 +381,8 @@ void Worker::OnThreadStopped() { | |||
Worker::~Worker() { | |||
Mutex::ScopedLock lock(mutex_); | |||
|
|||
CHECK(stopped_ || env_ == nullptr || env_->GetAsyncRequest()->IsStopped()); | |||
CHECK(stopped_); | |||
CHECK_NULL(env_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, there was a control flow that takes to Worker destructor without nullifying env_
, not able to figure that out now; do you know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gireeshpunathil I think that would be a bug – the child thread is not allowed to exist at this point (and the next CHECK
verifies that the thread has been joined), and the child thread in turn owns the Environment
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious, does the overhead of locking/unlocking the mutex show up in profiles anywhere?
(I mean, I could imagine it does but I can also imagine it doesn't. I'd like to be convinced by numbers. :-))
src/env-inl.h
Outdated
} | ||
|
||
void AsyncRequest::set_stopped(bool flag) { | ||
stopped_.store(flag, std::memory_order_relaxed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my line of thinking was that the syscalls behind uv_async_send() would themselves present full memory barriers.
Libuv doesn't promise that. uv_async_send()
can (at least in theory) elide the system call.
I'm kind of surprised the compiler emits an mfence. It's not needed on x64 (nor any other architecture, I think?) because aligned loads and stores are always atomic. It might just be a compiler bug; I wouldn't depend on it.
This PR was created because it does :) It doesn’t have a huge impact, but this code is run a lot, and it might make up 1 % or so of the runtime for some processes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Landed in d812dbb 🎉 |
- Follow style guide for naming, e.g. use lower_snake_case for simple setters/getters. - For performance, use atomics instead of a mutex, and inline the corresponding getter/setter pair. PR-URL: nodejs#26757 Reviewed-By: James M Snell <[email protected]> Reviewed-By: Gireesh Punathil <[email protected]> Reviewed-By: Franziska Hinkelmann <[email protected]> Reviewed-By: Michael Dawson <[email protected]>
- Follow style guide for naming, e.g. use lower_snake_case for simple setters/getters. - For performance, use atomics instead of a mutex, and inline the corresponding getter/setter pair. PR-URL: nodejs#26757 Reviewed-By: James M Snell <[email protected]> Reviewed-By: Gireesh Punathil <[email protected]> Reviewed-By: Franziska Hinkelmann <[email protected]> Reviewed-By: Michael Dawson <[email protected]>
- Follow style guide for naming, e.g. use lower_snake_case for simple setters/getters. - For performance, use atomics instead of a mutex, and inline the corresponding getter/setter pair. PR-URL: #26757 Reviewed-By: James M Snell <[email protected]> Reviewed-By: Gireesh Punathil <[email protected]> Reviewed-By: Franziska Hinkelmann <[email protected]> Reviewed-By: Michael Dawson <[email protected]>
for simple setters/getters.
the corresponding getter/setter pair.
Checklist
make -j4 test
(UNIX), orvcbuild test
(Windows) passes