Windows MT layer bug fixes #3364
@eli-schwartz - can you verify that this PR solves the issues observed in #3120?
Unfortunately I still get the issue. I'm using this branch to test with, since it has a couple other fixes (particularly the valgrind issue, but I discovered another bug that was merged today): https://github.com/eli-schwartz/zstd/commits/pr3121
16ebc6a
to
b2f37ee
Compare
I've tested again and verified that locally I'm not seeing crashes after the patch is applied.
Hmm okay, but how should we do that? Add a CI job that reports failure and causes
If we just add a PR with the CI and don't merge it, it will run just for that CI. EDIT: Perhaps we can use #3120 as a base, I'll give it a try.
Try basing on the first 6 of the 7 commits now there. The last one is the TODO marker. With the issue still present but the 7th commit also there, the
b2f37ee
to
dde58cd
Compare
So I merged the two PRs into one (#3370) and I'm seeing the same thing as you: it's passing although I expected it to fail with UNEXPECTEDPASS. EDIT: I have managed to generate more crashes in my local environment. They seem to trigger more often when pinning the process to one core, which suggests a race condition. EDIT2: I think I know where the race condition is, I'll test it later today.
Ok, it took me a second to understand the problem, because I expected pthread_t to be un-movable. But it looks like pthreads doesn't require that the pthread_t* is valid for the lifetime of the worker.
Which means that this fix is insufficient. By the time worker(void *arg) is called, the pthread_t may have already been moved.
We need to do something like:
typedef struct {
    void* (*start_routine)(void*);
    void* arg;
} ZSTD_pthread_state;

typedef struct {
    HANDLE handle;
    ZSTD_pthread_state* state;
} ZSTD_pthread_t;

static unsigned __stdcall worker(void *arg)
{
    /* arg is the heap-allocated state, whose address stays stable */
    ZSTD_pthread_state* const state = (ZSTD_pthread_state*) arg;
    state->start_routine(state->arg);
    return 0;
}

int ZSTD_pthread_create(ZSTD_pthread_t* thread, const void* unused,
                        void* (*start_routine) (void*), void* arg)
{
    (void)unused;
    thread->state = malloc(sizeof(ZSTD_pthread_state));
    if (!thread->state) {
        thread->handle = NULL;
        return ENOMEM;
    }
    thread->state->arg = arg;
    thread->state->start_routine = start_routine;
    thread->handle = (HANDLE) _beginthreadex(NULL, 0, worker, thread->state, 0, NULL);
    if (!thread->handle) {
        free(thread->state);
        thread->state = NULL;
        return errno;
    }
    return 0;
}

int ZSTD_pthread_join(ZSTD_pthread_t thread)
{
    DWORD result;
    if (!thread.handle) {
        assert(!thread.state);
        return 0;
    }
    result = WaitForSingleObject(thread.handle, INFINITE);
    CloseHandle(thread.handle);
    free(thread.state);
    // ...
}
f5afaf1
to
f592ae4
Compare
See the two latest commits - one fixes the bug and the other updates the meson tests to expect to pass on Windows.
1. If threads are resized the threads' `ZSTD_pthread_t` might move while the worker still holds a pointer into it (see more details in facebook#3120).
2. The join operation was waiting for a thread and then returning its `thread.arg` as a return value, but since the `ZSTD_pthread_t thread` was passed by value it would have a stale `arg` that wouldn't match the thread's actual return value.
This fix changes the `ZSTD_pthread_join` API and removes support for returning a value. This means that we are diverging from the `pthread_join` API and this is no longer just an alias. In the future, if needed, we could return a Windows thread's return value using `GetExitCodeThread`, but as this path wouldn't be exercised in any case, it's preferable to not add it right now.
When spawning a Windows thread we have a small worker wrapper function that translates between the interfaces of Windows and POSIX threads. This wrapper is given a pointer that might go stale before the worker starts running, resulting in UB and crashes. This commit adds synchronization so that we know the wrapper has finished reading the data it needs before we allow the main thread to resume execution.
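The handshake described in this commit message can be sketched roughly as follows. This is a portable analogue written against POSIX threads, not the zstd code: `start_params`, `wrapper`, `create_with_handshake` and `store_answer` are hypothetical names, and the Windows layer would use `_beginthreadex` and the `ZSTD_pthread_*` wrappers instead.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical parameter block; it lives on the creator's stack. */
typedef struct {
    void* (*start_routine)(void*);
    void* arg;
    int initialized;            /* set once the wrapper has copied the fields */
    pthread_mutex_t mutex;
    pthread_cond_t cond;
} start_params;

static void* wrapper(void* p)
{
    start_params* const params = (start_params*)p;
    /* Copy everything needed while the creator is still blocked. */
    void* (*const start_routine)(void*) = params->start_routine;
    void* const arg = params->arg;

    pthread_mutex_lock(&params->mutex);
    params->initialized = 1;
    pthread_cond_signal(&params->cond);
    pthread_mutex_unlock(&params->mutex);
    /* params may be gone from here on; only the local copies are used. */

    return start_routine(arg);
}

int create_with_handshake(pthread_t* thread,
                          void* (*start_routine)(void*), void* arg)
{
    start_params params;
    int err;

    assert(thread != NULL);
    params.start_routine = start_routine;
    params.arg = arg;
    params.initialized = 0;
    err = pthread_mutex_init(&params.mutex, NULL);
    if (err) return err;
    err = pthread_cond_init(&params.cond, NULL);
    if (err) { pthread_mutex_destroy(&params.mutex); return err; }

    err = pthread_create(thread, NULL, wrapper, &params);
    if (!err) {
        /* Block until the wrapper has finished reading params. */
        pthread_mutex_lock(&params.mutex);
        while (!params.initialized)
            pthread_cond_wait(&params.cond, &params.mutex);
        pthread_mutex_unlock(&params.mutex);
    }
    pthread_cond_destroy(&params.cond);
    pthread_mutex_destroy(&params.mutex);
    return err;
}

/* Example start routine: writes 42 through the pointer it receives. */
void* store_answer(void* out)
{
    *(int*)out = 42;
    return NULL;
}
```

A caller would invoke `create_with_handshake(&t, store_answer, &value)` and later join `t`; by the time the create call returns, the stack-allocated `params` is no longer referenced by the new thread.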
f592ae4
to
aaa38b2
Compare
Overall strategy looks good, just needs some fixes in error conditions.
lib/common/threading.c
Outdated
error |= ZSTD_pthread_mutex_init(&thread_param.initialized_mutex, NULL);
if(error)
    return -1;
ZSTD_pthread_mutex_lock(&thread_param.initialized_mutex);
Why lock this mutex here? I don't see any reason to hold it while creating the thread. So let's minimize the scope and move it to line 84.
You are correct, don't know why it's there, may have moved stuff and missed it.
/* Spawn thread */
*thread = (HANDLE)_beginthreadex(NULL, 0, worker, &thread_param, 0, NULL);
if (!thread)
    return errno;
We need to unlock (until it is moved below the thread creation), and destroy the mutex/cond in this case.
lib/common/threading.c
Outdated
if(error)
    return -1;
I'd recommend against this pattern here. We need to destroy whichever one we initialized correctly.
These two functions can't really fail, so we will never hit this error condition, it's just there because we have a return value (to align with pthreads) and we have to handle the return value to not get compilation errors.
Still, I'll add cleanup for compatibility.
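For illustration, the cleanup requested above (destroy whichever primitive was initialized successfully) might look like the following. It is a minimal sketch using plain POSIX calls; `sync_pair`, `sync_pair_init` and `sync_pair_destroy` are hypothetical names, not part of the zstd threading layer.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical pair of primitives that must be set up together. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
} sync_pair;

/* Returns 0 on success; on failure nothing is left initialized. */
int sync_pair_init(sync_pair* s)
{
    int err = pthread_mutex_init(&s->mutex, NULL);
    if (err)
        return err;                       /* nothing to undo yet */
    err = pthread_cond_init(&s->cond, NULL);
    if (err) {
        pthread_mutex_destroy(&s->mutex); /* undo the step that succeeded */
        return err;
    }
    return 0;
}

/* Tears down in reverse order of initialization. */
void sync_pair_destroy(sync_pair* s)
{
    pthread_cond_destroy(&s->cond);
    pthread_mutex_destroy(&s->mutex);
}
```

Checking each call separately, rather than OR-ing the error codes together, is what makes it possible to know which object needs destroying.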
Could an assert() be used, if only to document that this part of the code path, or this condition, can never be reached?
I've added a comment and handled the errors in any case. If you think an assert would be better let me know.
LGTM, just need to destroy the condition variable if creating the thread fails.
*thread = (HANDLE)_beginthreadex(NULL, 0, worker, &thread_param, 0, NULL);
if (!thread) {
    ZSTD_pthread_mutex_unlock(&thread_param.initialized_mutex);
    ZSTD_pthread_mutex_destroy(&thread_param.initialized_mutex);
Also need to destroy the condition variable.
Edit: Wrong PR
c4868cf
to
60cd3d6
Compare
lib/common/threading.c
Outdated
ZSTD_pthread_mutex_unlock(&thread_param->initialized_mutex);
ZSTD_pthread_cond_signal(&thread_param->initialized_cond);
Sorry, I just realized this issue.
It is incorrect to signal the condition variable after we release the lock. ZSTD_pthread_cond_wait() is allowed to spuriously wake up. So we could get this execution:
1. Main thread: Launches the worker thread & waits on the condition variable.
2. Worker thread: Sets initialized to 1 & unlocks the mutex.
3. Main thread: Spuriously wakes, sees that it is initialized, and destroys the condition variable.
4. Worker thread: Signals the already destroyed condition variable.
So we just need to signal the CV before we unlock the mutex. This is allowed, and can actually generally be more efficient, because threading libraries optimize this use case.
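The ordering issue can be shown in isolation with a one-shot event whose waiter destroys the primitives as soon as it wakes. This is a hedged sketch on plain POSIX threads; `one_shot`, `notifier` and `wait_then_destroy` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical one-shot event living on the waiter's stack. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int ready;
} one_shot;

static void* notifier(void* p)
{
    one_shot* const ev = (one_shot*)p;
    pthread_mutex_lock(&ev->mutex);
    ev->ready = 1;
    /* Signal while still holding the mutex: the waiter cannot observe
     * ready == 1 (and destroy the condition variable) until we unlock,
     * by which time the signal call has already completed. */
    pthread_cond_signal(&ev->cond);
    pthread_mutex_unlock(&ev->mutex);
    return NULL;                 /* ev is not touched after the unlock */
}

/* Returns 1 once the notification was observed, -1 on setup failure. */
int wait_then_destroy(void)
{
    one_shot ev;
    pthread_t t;
    int seen;

    ev.ready = 0;
    if (pthread_mutex_init(&ev.mutex, NULL) != 0) return -1;
    if (pthread_cond_init(&ev.cond, NULL) != 0) {
        pthread_mutex_destroy(&ev.mutex);
        return -1;
    }
    if (pthread_create(&t, NULL, notifier, &ev) != 0) {
        pthread_cond_destroy(&ev.cond);
        pthread_mutex_destroy(&ev.mutex);
        return -1;
    }

    pthread_mutex_lock(&ev.mutex);
    while (!ev.ready)            /* the loop tolerates spurious wakeups */
        pthread_cond_wait(&ev.cond, &ev.mutex);
    seen = ev.ready;
    pthread_mutex_unlock(&ev.mutex);

    /* Safe only because the notifier signals before it unlocks. */
    pthread_cond_destroy(&ev.cond);
    pthread_mutex_destroy(&ev.mutex);
    pthread_join(t, NULL);
    return seen;
}
```

If the two calls in `notifier` were swapped, the waiter could pass the `while` check and destroy `ev.cond` between the unlock and the signal.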
Good point, will make the change.
60cd3d6
to
26f1bf7
Compare
LGTM
Fixes two bugs in the Windows thread / pthread translation layer:
1. If threads are resized the threads' `ZSTD_pthread_t` might move while the worker still holds a pointer into it (see more details in Meson test fixups #3120).
2. The join operation was waiting for a thread and then returning its `thread.arg` as a return value, but since the `ZSTD_pthread_t thread` was passed by value it would have a stale `arg` that wouldn't match the thread's actual return value.
This fix changes the `ZSTD_pthread_join` API and removes support for returning a value. This means that we are diverging from the `pthread_join` API and this is no longer just an alias. In the future, if needed, we could return a Windows thread's return value using `GetExitCodeThread`, but as this path wouldn't be exercised in any case, it's preferable to not add it right now.
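The stale-value problem with a by-value thread handle can be reproduced without any threads at all. The sketch below assumes, for illustration only, that the worker publishes its result by writing into the `arg` field of the caller's handle; `fake_thread` and `stale_copy_demo` are hypothetical stand-ins, not zstd types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a thread handle carrying an arg slot. */
typedef struct {
    void* arg;
} fake_thread;

/* Returns 1 if a by-value copy misses a later update to the original. */
int stale_copy_demo(void)
{
    static int input = 1;
    static int result = 2;

    fake_thread original = { &input };
    /* A by-value parameter is a snapshot taken at the call site. */
    fake_thread copy = original;

    /* Later, the worker stores its return value in the original. */
    original.arg = &result;

    /* The snapshot still points at the input, not at the result. */
    return copy.arg == &input && original.arg == &result;
}
```

Any value read from the copy after the wait therefore reflects the state at call time, which is why dropping the return value from the join call sidesteps the problem.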