
WaitForEvent( 0 ) can return WAIT_TIMEOUT under lock contention when actually in the signaled state #18

Closed
qwertymaster617 opened this issue Jun 26, 2021 · 4 comments · Fixed by #19

@qwertymaster617

The pthread_mutex_trylock implementation of WaitForEvent (with a zero timeout) is flawed, as it can easily return WAIT_TIMEOUT when an event, either auto-reset or manual-reset, is clearly in the signaled state. This issue only affects WaitForEvent calls where timeout == 0.

First, consider an already-signaled manual-reset event with 2 threads, A and B, each calling WaitForEvent( 0 ) concurrently in a tight loop. If A is holding the lock while B attempts to grab the lock, B will erroneously return WAIT_TIMEOUT, even though the event is always in the signaled state.

Since we know that:

  1. The event is already signaled,
  2. Manual-reset events can only change from Signaled -> Nonsignaled with a call to ResetEvent, and
  3. Neither A nor B can change the signaled state of a manual-reset event solely by calling WaitForEvent

We know the pthread_mutex_trylock implementation is incorrect here.
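
A minimal sketch of this first scenario, assuming the pevents API from this repo (the loop count, the std::thread framing, and the reporting branch are mine, purely for illustration):

#include <thread>
#include "pevents.h"

using namespace neosmart;

int main()
{
    auto event = CreateEvent( true, true ); // manual-reset, already signaled

    auto spin = [&] {
        for ( int i = 0; i < 1000000; i++ )
        {
            // The event is never reset, so a zero-timeout wait should always
            // succeed, yet the trylock-based fast path can report WAIT_TIMEOUT
            // whenever the other thread happens to hold the internal lock.
            if ( WaitForEvent( event, 0 ) != 0 )
            {
                // erroneous WAIT_TIMEOUT observed here
            }
        }
    };

    std::thread a( spin ), b( spin );
    a.join();
    b.join();
    DestroyEvent( event );
}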

Second, consider an already-signaled auto-reset event with several threads each calling SetEvent concurrently.
If some other thread, C, continuously loops on calls to WaitForEvent as follows:

#include "pevents.h"

using namespace neosmart;

// ...

auto event = CreateEvent( false, true ); // already-set auto-reset event

// Spawn some SetEvent threads...

bool signaled = true;
while ( signaled )
{
  // WaitForEvent returns 0 on success and WAIT_TIMEOUT otherwise
  signaled = WaitForEvent( event, 0 ) == 0;
  SetEvent( event );
}
// destroy, etc.

At some point, if one of the threads is calling SetEvent while C just begins a call to WaitForEvent, C's call to WaitForEvent will incorrectly return WAIT_TIMEOUT simply because C may fail to grab the lock.

Since we know that:

  1. C is the only waiter,
  2. There are no calls to ResetEvent by any threads, and
  3. C ensures that upon every iteration of the loop, event is in the signaled state

C's call to WaitForEvent should never return WAIT_TIMEOUT in this case.

Test cases exhibit this behavior perfectly. I'd put some up, but they're easy enough to reason about that it's not really necessary.

The only fix I see is to replace the pthread_mutex_trylock logic with a simple call to pthread_mutex_lock only.

I almost recommended additionally making event->State atomic and returning WAIT_TIMEOUT or 0 based on an atomic load of event->State, except that auto-reset events must be reset before returning. That might require extra logic just to figure out whether you're allowed to reset the state. Since it's unclear what the ramifications are in that case, I think the only simple, correct solution is to grab the lock and block if it's already taken.
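
To make the suggestion concrete, here's a rough sketch of the zero-timeout path with pthread_mutex_lock only. The struct layout, the WAIT_TIMEOUT value, and the function name are assumptions for illustration, not the library's actual internals:

#include <pthread.h>

#define WAIT_TIMEOUT 0x102 // assumed value, for illustration only

// Hypothetical internal layout; the real neosmart_event_t may differ.
struct event_t
{
    pthread_cond_t CVariable;
    pthread_mutex_t Mutex;
    bool State;
    bool AutoReset;
};

// Sketch of WaitForEvent( event, 0 ) with the trylock logic removed:
int WaitForEventZero( event_t *event )
{
    pthread_mutex_lock( &event->Mutex ); // block if the lock is already taken

    int result = WAIT_TIMEOUT;
    if ( event->State )
    {
        result = 0; // signaled
        if ( event->AutoReset )
        {
            event->State = false; // consume the signal for auto-reset events
        }
    }

    pthread_mutex_unlock( &event->Mutex );
    return result;
}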


mqudsi commented Jun 27, 2021

Nice catch. The only reason a separate _trylock path exists for timeout == 0 is to avoid blocking on the internal lock just to determine the current status, but that definitely shouldn't end up leaking side effects of internal state as the actual result.

The original implementation used a spinlock for gaining access to the event object's internals, but still had the mutex because it's required for the condition variables to marshal correctly. Given the increased surface area for synchronization issues, combined with the fact that the mutex was still used to wake one waiting thread at a time (via the condition variable), I ended up just folding everything into the mutex.

I think (but I'll need to study the code again to be sure) an optimization that should safely mitigate at least some of the performance drawbacks of dropping _trylock in case of a 0 timeout would be to use the current state to return EBUSY immediately if the event is unavailable (for both auto and manual reset events) but still call pthread_mutex_lock() otherwise.

(One minor issue is that I would want to use std::atomic<bool> to guarantee the atomic access semantics without using compiler-specific builtins, but I know some platforms end up requiring an additional build step to link against libatomic, which I'm loath to have to deal with.)
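
Something along these lines, perhaps (a sketch only; the struct, the field names, and the WAIT_TIMEOUT value are assumptions rather than the actual pevents internals, and the internal EBUSY from the old trylock path is expressed directly as the caller-visible WAIT_TIMEOUT):

#include <atomic>
#include <pthread.h>

#define WAIT_TIMEOUT 0x102 // assumed value, for illustration only

struct event_t
{
    pthread_cond_t CVariable;
    pthread_mutex_t Mutex;
    std::atomic<bool> State;
    bool AutoReset;
};

int WaitForEventZero(event_t *event)
{
    // Fast path: if the last state we can see is "unavailable", report a
    // timeout immediately without touching the mutex at all.
    if (!event->State.load())
    {
        return WAIT_TIMEOUT;
    }

    // Otherwise take the lock (blocking if necessary) and re-check under
    // mutual exclusion, since an auto-reset event has to be consumed
    // atomically with the check.
    pthread_mutex_lock(&event->Mutex);
    int result = WAIT_TIMEOUT;
    if (event->State.load())
    {
        result = 0;
        if (event->AutoReset)
        {
            event->State.store(false);
        }
    }
    pthread_mutex_unlock(&event->Mutex);
    return result;
}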

mqudsi added a commit that referenced this issue Jun 27, 2021
Test case based off the pseudocode provided by @qwertymaster617 in #18.
mqudsi closed this as completed in 0bc424c Jun 28, 2021
@qwertymaster617

I've had a lot to think about regarding your patches last week.

I have a few issues with this solution.

TL;DR:

  1. I think it's safe to use memory_order_relaxed everywhere.
  2. Consider returning wait-success in WaitForEvent if ( timeout == 0 && load() && manual_reset ).

I've had to brush up significantly on my memory ordering rules this week so I may be incorrect. But here's what I think.

First:

I've looked into the issue presented in the URL cited in the source, and IMHO the concern you bring up doesn't apply here, for two reasons:

  1. The issue cited involves two or more distinct atomic objects being waited upon. If only one thing is ever being waited upon, there can be no deadlock, and we only ever wait/synchronize on the mutex, so using memory_order_release specifically on state is unnecessary.
  2. We don't actually establish a synchronization point with state. The standard only guarantees release-acquire semantics when one thread releases and another acquires the same atomic object; a release with no matching acquire establishes no synchronization at all. So in this case, memory_order_release is not necessarily correct.

So with memory_order_relaxed everywhere, which I believe is the more correct choice here, the atomic bool state can be ordered via the as-if rule just like a non-atomic bool state already would be. The memory fences implied by mutex lock (acquire)/unlock (release) pairs should guarantee a correct view of memory upon synchronization, and the optimizer should generate essentially the same machine code, with the atomic read in exactly the same program order as the non-atomic read; only the exact instruction used should differ.

We also don't care what the "still-in-progress" values of state are while we hold the lock. The mutex unlock/lock pairs already give us correct release-acquire synchronization plus mutual exclusion, so no other thread is going to observe those in-progress values anyway. "The next thread" only sees the most recently stored state after the mutex is released on one thread and acquired by the next.

Introducing memory_order_release on state.store not only noticeably slows the implementation (WaitFors can now do up to 3x the memory-write synchronization and SetEvent up to 2x, unfortunately with no benefit because there is no properly matching memory_order_acquire), but the whole endeavor also seems dubious because the mutex already guarantees mutual exclusion. No one can check the state without the lock. Even if they do, in the case of WaitForEvent( 0 ) we don't care if they see the "wrong" value, because we only ever want to check the value already present in the cache, no matter how "stale" it may be. The worst this can mean is that the check in WFE( 0 ) "seemingly" happened too early or too late. Since it is unspecified which of two or more concurrently running threads will set/reset/check the state of an event at which time, this is absolutely one of many correct behaviors.

From cppreference, memory-order on Release-Acquire ordering:

If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable is tagged memory_order_acquire, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory.

This is exactly what mutex unlock-lock does. And since the call to state.store is sequenced-before the mutex unlock (the release) in thread A, the release-acquire rule applies to state through the mutex.

Therefore, I say use memory_order_relaxed everywhere.
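
A minimal illustration of the argument that the mutex itself supplies the release/acquire pairing (std::mutex is used here only to keep the sketch short; the names are placeholders, not pevents internals):

#include <atomic>
#include <mutex>

std::mutex m;
std::atomic<bool> state{false};
int payload = 0; // ordinary, non-atomic data

void producer()
{
    std::lock_guard<std::mutex> lock(m);
    payload = 42;
    // Relaxed is enough: the mutex unlock at the end of this scope is the
    // release operation that publishes both payload and state to whichever
    // thread locks m next.
    state.store(true, std::memory_order_relaxed);
}

void consumer()
{
    std::lock_guard<std::mutex> lock(m); // lock() is the matching acquire
    if (state.load(std::memory_order_relaxed))
    {
        // Guaranteed to see payload == 42, courtesy of the mutex's
        // release/acquire pairing, not the atomic's own ordering.
        int value = payload;
        (void)value;
    }
}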

Finally, the only remaining overhead is the interlocked instruction used while holding the lock, which is unfortunate but necessary because of the naked load()s at the beginning of WFE( 0 ). Interlocked instructions around an otherwise-similar memory structure should only cost low tens of cycles instead of single cycles for a raw read/write of a bool, which is probably(?) an acceptable tradeoff. My benchmarks seem to indicate mostly negligible performance decreases this way, compared to up to 3x slowdowns when using memory_order_release as-is.

Second:

The optimization that replaced the bug seems nice, but why not go all the way? I'd argue that for the same exact reasons you can do
if ( timeout == 0 && !load() ) return timeout;

why not also do
if ( timeout == 0 && load() && manual_reset ) return success;

I believe the same reasoning applies to both cases.
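
In code, the symmetric fast path I'm proposing would look roughly like this (the struct and the helper are hypothetical, invented for illustration; only the two conditions come from above):

#include <atomic>

#define WAIT_TIMEOUT 0x102 // assumed value, for illustration only

struct event_t
{
    std::atomic<bool> State;
    bool AutoReset;
    // mutex / condition variable omitted for brevity
};

// Returns true and fills in *result if a zero-timeout wait can be answered
// from the cached state alone; otherwise the caller falls through to the
// normal locked path.
bool TryAnswerFromCache( event_t *event, int *result )
{
    bool state = event->State.load( std::memory_order_relaxed );

    if ( !state )
    {
        *result = WAIT_TIMEOUT; // the existing optimization
        return true;
    }

    if ( !event->AutoReset )
    {
        *result = 0; // proposed: manual-reset + signaled => success
        return true;
    }

    // Auto-reset and signaled: we must take the lock so the signal can be
    // consumed atomically, so no early answer here.
    return false;
}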

Just as with checking the "possibly-signaled-or-not" state and assuming failure if the data in the cache says unsignaled, we can absolutely assume the same thing when load() == true if the event is manual-reset. The call to WaitForEvent( 0 ) either happens-before the call to SetEvent or after it (and notably never "during", because the atomic load of state can never observe a partially written value). When any given manual-reset event is under WaitFor/SetEvent/ResetEvent contention by two or more threads, all memory_order_relaxed can do here in the worst case is make the call to WFE( 0 ) appear to happen even earlier or even later relative to a SetEvent and/or ResetEvent call. And just like the load() == false case:

  • the state of load() will always be 'valid', i.e. never a garbage value, only a true or false that was the current value at some point in time
  • there are no data dependencies
  • being ordered any time before or after the memory_order_acquire of mutex lock() is acceptable
  • and most importantly, there are no observable side-effects by assuming success when load() == true when talking about manual-reset events because a waitfor is guaranteed to never change the state of a manual-reset event.

Of course, we won't have the opportunity to clean up expired shared waits under another code path, but I'd argue that this may be a better tradeoff when checking for the instantaneous state.

I'm not entirely convinced this is valid, but I can't think of any sure-fire reasons why it is less correct in reasoning or just plain wrong compared to the current optimization you present with this patch, and I feel like we may be the only people worth asking specifics about this particular subject.

What do you think? The WaitForMultiple problem just might be my favorite problem in the world. This problem never disappoints.

@qwertymaster617

So I keep thinking about my second point in the post above. Even if everything I said there is true, I think the reason it could still be incorrect to return success when manual-reset, load(), and timeout == 0 is, philosophically speaking, that when waiting for or waiting on a given event we must assume at all times that the wait can fail. That's why we use it. And importantly, the converse: if we were ever to assume success when checking instantaneously, then why would we ever need to use an event? Why not just comment out that WaitForEvent call?

By simply placing the code to waitfor, we carefully approach that spooky-looking, I-think-it's-only-a-2-way-stop intersection (event.waitfor) at 2am when it's raining... and snowing. It's never safe to roll through it. You can do it, but it's risky. Even if someone in an orange vest is waving you on, you still instinctively slow down and look as much as possible. Someone may appear right on top of you (the boogey man?).

So instead, you assume it's not safe until you stop (acquire the lock), make sure there is no boogey man (confirm state), and then cross (release).

If you ever assume, when waiting on an event, that you can just go, then just go, and ask yourself why you ever need to call an event waitfor. There would be no reason to use it. Just execute. And my guess is one's code would subsequently look dramatically different to handle such a case, in comparison to code that uses an event waitfor.

Thus, I don't think we can assume a successful wait just because the cached state says it's OK when we haven't obtained the lock.

Is this correct reasoning? Is it valid in this context? Is it valid in any context? That metaphor was somewhat helpful to me but I'm not sure it's quite right, either.

In this context I guess we can say that the default state of an event is non-signaled. Otherwise, why have the functionality at all? So, if we check the cache and load() == false and we need an answer now, we return failure to wait, and we continue to observe that the event is non-signaled. It does not change the observable behavior of the program. The value will propagate to us in some finite amount of time, as guaranteed by the standard, and we can be programmed to check again later because we did not block execution of the program. Eventually we'll get the updated value and we can be released. But we could say that the waitfor happened sooner than we expected based on the "written" program flow. The post above already covers this point.

However, if we were ever to assume the opposite, i.e. a successful wait, as I described in my previous post, then reporting success without synchronizing breaks the contract of no observable side effects, because now the program may think the non-default case has occurred. That, I believe, violates the no-observable-side-effects rule that makes these atomic loads and stores valid, and it is why the current implementation is (mostly? technically?) correct.

I'm a noob to memory order. I have gone down the rabbit hole; please someone point out anything I got wrong.

If this is all obvious to others, I can assure you that it certainly was not obvious to me a week ago, and several weeks from now (C++ is no longer my profession) I will most certainly have forgotten most of this context. So I'm capturing this for my benefit as well as for any poor sap who can follow this huge, dry train of thought.

All that being said, does this seem to be a sound explanation as to why my second point is not correct, and why the current implementation is most likely the best we'll get for instantaneous state checking?


mqudsi commented Jul 10, 2021

I confused myself a bit in an earlier reply. You might want to look at 9c22e4e which changes the memory access semantics based off of an article I found that addressed my concerns regarding resetting event->State to false (and only false) with relaxed memory semantics (instead of release). That website is extremely useful for understanding acquire/release if you go back to the posts from 2012.

But specifically regarding an early return for manual reset events, I don't think that's actually a matter of memory access semantics. It would definitely be valid to do an acquire load and return success if the state is set, because manual reset events are racy by nature, so any concurrent calls to ResetEvent() are not guaranteed to preempt the WFMO (regardless of a 0 timeout or not). But that's not your question: you are actually asking about using a relaxed load here, which is a much more difficult question to answer.

A relaxed load will (by definition) potentially return a stale value. It can be argued that the semantics of WFMO(0) mean false negatives are OK, but if you remember your other bug, that's not strictly the case. The real question is whether a calling thread can distinguish between a potential race condition with another thread (calling ResetEvent for either manual or auto events, or even calling WFMO in the case of auto events), a stale value, or an internal optimization (such as the pthread_mutex_trylock path that we ended up removing).

If it's impossible to devise a test where one can explicitly rule out race conditions while still possibly getting a stale result, then it's OK to conflate the results and return a stale false as a WAIT_TIMEOUT even if the real state is success. But if it's possible to imagine some sort of test that guarantees a race-free WAIT_SUCCESS result, then we cannot return a stale false (or vice versa with a stale true).

Once you put it like that, it becomes easy to see why we can't serve a stale WAIT_SUCCESS here, even for a manual reset event:

#include <cassert>
#include <thread>
#include "pevents.h"

using namespace neosmart;

int main()
{
    auto mreset = CreateEvent(true, true);    // manual reset, initially set
    auto areset1 = CreateEvent(false, false); // auto reset, not set
    auto areset2 = CreateEvent(false, false); // auto reset, not set
    auto areset3 = CreateEvent(false, false); // auto reset, not set

    std::thread thread1([&] {
        WaitForEvent(areset1);
        ResetEvent(mreset);
        SetEvent(areset2);
    });

    std::thread thread2([&] {
        // Force caching of currently available state
        assert(WaitForEvent(mreset, 0) == 0);
        SetEvent(areset1);
        WaitForEvent(areset3);
        assert(WaitForEvent(mreset, 0) != 0);
    });

    WaitForEvent(areset2);
    SetEvent(areset3);

    thread1.join();
    thread2.join();
}

The main thread starts thread 1 and then thread 2; thread 2 caches the current state of mreset (available), triggers thread 1 to reset the manual event, and waits for areset3, at which point our design guarantees that mreset is not available. But there is no forced synchronization between thread 1 and thread 2 after mreset is reset, because thread 1 tells thread 2 to check mreset indirectly via areset3 (handled by the main thread).

Via static analysis, we know the second call to WaitForEvent(mreset, 0) must fail. But there's no guarantee that the stale state cached at the time of the first call has been updated.

What this tells us is that even the current optimization to return false in case of a relaxed load of a false value is technically incorrect, since you can invert the test (start with mreset unset, and instead of calling ResetEvent(mreset) just call SetEvent(mreset)) and you'll risk getting a stale result that contradicts the static analysis of the situation, as sketched below. It's at this point that API design kicks in and you have to ask whether it makes sense to sacrifice that performance. Like you said, the default for an event is to be unavailable (I don't mind going so far as to say that while you can loop over WFMO(e, 0) and wait for it to return WAIT_TIMEOUT to kick off a job, that is 100% a wrong usage of the API, whereas looping over a zero timeout to wait for WAIT_SUCCESS is inefficient but 100% correct).
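
For concreteness, here is roughly what that inverted test looks like, using the same std::thread framing as the example above (the assertions mark where a stale relaxed load could contradict the static analysis):

#include <cassert>
#include <thread>
#include "pevents.h"

using namespace neosmart;

int main()
{
    auto mreset = CreateEvent(true, false);   // manual reset, initially NOT set
    auto areset1 = CreateEvent(false, false); // auto reset, not set
    auto areset2 = CreateEvent(false, false); // auto reset, not set
    auto areset3 = CreateEvent(false, false); // auto reset, not set

    std::thread thread1([&] {
        WaitForEvent(areset1);
        SetEvent(mreset); // the only change: set instead of reset
        SetEvent(areset2);
    });

    std::thread thread2([&] {
        // Force caching of the currently unavailable state
        assert(WaitForEvent(mreset, 0) != 0);
        SetEvent(areset1);
        WaitForEvent(areset3);
        // Static analysis says this must now succeed, but a relaxed load of a
        // stale "false" would report WAIT_TIMEOUT instead.
        assert(WaitForEvent(mreset, 0) == 0);
    });

    WaitForEvent(areset2);
    SetEvent(areset3);

    thread1.join();
    thread2.join();
}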

I have half a mind to reintroduce the timed wait optimization and just document that a spurious WAIT_TIMEOUT is a valid result of a WaitForEvent(e, 0) call, but the only concern there is that repeated calls to WFE(e, 0) may flip-flop between success and failure even if the event remains set at all times (due to contention), whereas with a stale read the result will settle fairly quickly on the correct answer.

mqudsi added a commit that referenced this issue Jul 11, 2021
This requires a memory barrier to ensure correctness (avoiding false
positive caused by stale read) but should still be cheaper than a
syscall and only happens if we expect it to be available (based off the
initial relaxed read).

See #18
mqudsi added a commit that referenced this issue Jul 11, 2021
Extending 8db9ebf, try to bypass
obtaining the mutex for manual reset events even in calls to
`WaitForMultipleEvents()`. See #18.