Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows MT layer bug fixes #3364

Merged
merged 4 commits into from
Dec 19, 2022

Conversation

yoniko
Copy link
Contributor

@yoniko yoniko commented Dec 16, 2022

Fixes two bugs in the Windows thread / pthread translation layer:

  1. If threads are resized the threads' ZSTD_pthread_t might move while the worker still holds a pointer into it (see more details in Meson test fixups #3120).
  2. The join operation was waiting for a thread and then return its thread.arg as a return value, but since the ZSTD_pthread_t thread was passed by value it would have a stale arg that wouldn't match the thread's actual return value.

This fix changes the ZSTD_pthread_join API and removes support for returning a value. This means that we are diverging from the pthread_join API and this is no longer just an alias. In the future, if needed, we could return a Windows thread's return value using GetExitCodeThread, but as this path wouldn't be excised in any case, it's preferable to not add it right now.

@yoniko
Copy link
Contributor Author

yoniko commented Dec 16, 2022

@eli-schwartz - can you verify that this PR solves the issues observed in #3120?

@eli-schwartz
Copy link
Contributor

Unfortunately I still get the issue. I'm using this branch to test with, since it has a couple other fixes (particularly the valgrind issue, but I discovered another bug that was merged today): https://github.com/eli-schwartz/zstd/commits/pr3121

3/8 test-zstream-3           FAIL            14.01s   (exit status 3221225477 or signal 3221225349 SIGinvalid)
>>> MALLOC_PERTURB_=136 D:\a\wrapdb\wrapdb\_build\subprojects\zstd-pr3121\build/meson/tests\zstreamtest.exe --newapi -t1 -T90s --no-big-tests
------------------------------------- 8< -------------------------------------
stderr:
Starting zstream tester (64-bits, 1.5.3)

Seed = 3089


     1/     1    
     2          
    16          
    32          
    49          
    59          
    66          
    74          
    78          
    87          
   114          
   117          
   123          
   129          
   132          
   150          
   151          
   161          
   168          
   173          
   182          
   191          
   197          
   203          
   204          
   214          
   221          
   223          
   231          
   234          
   237          
   252          
   263          
   273          
   274          
   310          
------------------------------------------------------------------------------

@eli-schwartz
Copy link
Contributor

Aside: the git commit message seems to be a bit off. It doesn't have a one-line subject separated by a blank line from the extended body, so git's own paragraph detection treats that quite oddly when e.g. rebasing that commit into a selected list of commits to test. (Also, lines don't wrap.)
screenshot

@yoniko yoniko force-pushed the fix-windows-mt-thread-resize-bug branch from 16ebc6a to b2f37ee Compare December 16, 2022 01:10
@yoniko
Copy link
Contributor Author

yoniko commented Dec 16, 2022

Unfortunately I still get the issue. I'm using this branch to test with, since it has a couple other fixes (particularly the valgrind issue, but I discovered another bug that was merged today): https://github.com/eli-schwartz/zstd/commits/pr3121

I've tested again and verified that locally I'm not seeing crashes after the patch is applied.
I'm going to try and compile with meson on Windows, but it will looks like the correct way forward is to produce a CI that crashes and that I can apply patches to so we can have a source of truth for the state of the tests.

@eli-schwartz
Copy link
Contributor

Hmm okay, but how should we do that? Add a CI job that reports failure and causes dev to go red? #3120 did the reverse and marked the failure as a TODO, so that plus my CI modifications would allow testing this in CI because any change that fixes the crash would make CI fail with a meson "test unexpectedly started passing" type of message. Or by reverting the TODO mark, the CI would fail if the fix is incomplete.

@yoniko
Copy link
Contributor Author

yoniko commented Dec 16, 2022

If we just add a PR with the CI and not merge it it will run just for that CI.
I can follow up by rebasing this PR on top of the PR with CI - if this PR really solves the issue we will have one red PR and one green one.

EDIT: Perhaps we can use #3120 as a base, I'll give it a try.

@eli-schwartz
Copy link
Contributor

eli-schwartz commented Dec 16, 2022

EDIT: Perhaps we can use #3120 as a base, I'll give it a try.

Try basing on the first 6 of the 7 commits now there. The last one is the TODO marker. With the issue still present but the 7th commit also there, the dev-short-tests / meson-windows job passes. The tests log:

3/8 test-zstream-3           EXPECTEDFAIL     2.46s   (exit status 3221225477 or signal 3221225349 SIGinvalid)

[...]

Ok:                 7   
Expected Fail:      1   

@yoniko yoniko force-pushed the fix-windows-mt-thread-resize-bug branch from b2f37ee to dde58cd Compare December 16, 2022 17:48
@yoniko
Copy link
Contributor Author

yoniko commented Dec 16, 2022

So I merged the two PRs into one (#3370) and I'm seeing the same thing as you -- it's passing although I expected it to fail with UNEXPECTEDPASS.
When replicating the exact same commands the action runs on my Windows VM I'm getting a different result.

EDIT: I have managed to generate more crashes in local environment, seems to trigger more when pinning the process to one core -- suggests a race condition.

EDIT2: I think I know where the race condition is, I'll test it later today.

Copy link
Contributor

@terrelln terrelln left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it took me a second to understand the problem, because I expected pthread_t to be un-movable. But it looks like pthreads doesn't require that the pthread_t* is valid for the lifetime of the worker.

Which means that this fix is insufficient. By the time worker(void *arg) is called, the pthread_t may have already been moved.

We need to do something like:

typedef struct {
    void* (*start_routine)(void*);
    void* arg;
} ZSTD_pthread_state;

typedef struct {
    HANDLE handle;
    ZSTD_pthread_state* state;
} ZSTD_pthread_t;

static unsigned __stdcall worker(void *arg)
{
    ZSTD_pthread_state* const state = (ZSTD_pthread_state*) state;
    state->start_routine(state->arg);
    return 0;
}

int ZSTD_pthread_create(ZSTD_pthread_t* thread, const void* unused,
            void* (*start_routine) (void*), void* arg)
{
    (void)unused;
    thread->state = malloc(sizeof(ZSTD_pthread_state));
    if (!thread->state) {
        thread->handle = NULL;
        return ENOMEM;
    }
    thread->state->arg = arg;
    thread->state->start_routine = start_routine;
    thread->handle = (HANDLE) _beginthreadex(NULL, 0, worker,thread->state, 0, NULL);
    if (!thread->handle) {
        free(thread->state);
        thread->state = NULL;
        return errno;
    }
    return 0;
}

int ZSTD_pthread_join(ZSTD_pthread_t thread)
{
    DWORD result;

    if (!thread.handle) {
        assert(!thread.state);
        return 0;
    }
    result = WaitForSingleObject(thread.handle, INFINITE);
    CloseHandle(thread.handle);
    free(thread.state);
    // ...
}

@yoniko
Copy link
Contributor Author

yoniko commented Dec 17, 2022

@terrelln - yes, this is the race condition that I was talking about.
I implemented a different solution that doesn't depend on malloc but needs a lock.
It's currently being tested in #3370.

@yoniko yoniko requested a review from terrelln December 17, 2022 02:25
@yoniko yoniko force-pushed the fix-windows-mt-thread-resize-bug branch from f5afaf1 to f592ae4 Compare December 17, 2022 02:34
@yoniko
Copy link
Contributor Author

yoniko commented Dec 17, 2022

See two latest commits - one fixes the bug and the other updates the meson tests to expect to pass on Windows.

1. If threads are resized the threads' `ZSTD_pthread_t` might move
while the worker still holds a pointer into it (see more details in facebook#3120).
2. The join operation was waiting for a thread and then return its `thread.arg`
as a return value, but since the `ZSTD_pthread_t thread` was passed by value it
would have a stale `arg` that wouldn't match the thread's actual return value.

This fix changes the `ZSTD_pthread_join` API and removes support for returning
a value. This means that we are diverging from the `pthread_join` API and this
is no longer just an alias.
In the future, if needed, we could return a Windows thread's return value using
`GetExitCodeThread`, but as this path wouldn't be excised in any case, it's
preferable to not add it right now.
When spawning a Windows thread we have small worker wrapper function that translates
between the interfaces of Windows and POSIX threads.
This wrapper is given a pointer that might get stale before the worker starts running,
resulting in UB and crashes.
This commit adds synchronization so that we know the wrapper has finished reading the data
it needs before we allow the main thread to resume execution.
@yoniko yoniko force-pushed the fix-windows-mt-thread-resize-bug branch from f592ae4 to aaa38b2 Compare December 17, 2022 21:38
@yoniko yoniko changed the title Windows MT layer return value bug fixes Windows MT layer bug fixes Dec 19, 2022
Copy link
Contributor

@terrelln terrelln left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall strategy looks good, just needs some fixes in error conditions.

lib/common/threading.c Show resolved Hide resolved
error |= ZSTD_pthread_mutex_init(&thread_param.initialized_mutex, NULL);
if(error)
return -1;
ZSTD_pthread_mutex_lock(&thread_param.initialized_mutex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why lock this mutex here? I don't see any reason to hold it while creating the thread. So lets minimize the scope and move it to line 84.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, don't know why it's there, may have moved stuff and missed it.


/* Spawn thread */
*thread = (HANDLE)_beginthreadex(NULL, 0, worker, &thread_param, 0, NULL);
if (!thread)
return errno;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to unlock (until it is moved below the thread creation), and destroy the mutex/cond in this case.

Comment on lines 75 to 76
if(error)
return -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend against this pattern here. We need to destroy whichever one we initialized correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two functions can't really fail, so we will never hit this error condition, it's just there because we have a return value (to align with pthreads) and we have to handle the return value to not get compilation errors.
Still, I'll add cleanup for for compatibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could an assert() be used if only to document that this part of the code path, or this condition, can never be reached ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment and handled the errors in any case. If you think an assert would be better let me know.

lib/common/threading.c Show resolved Hide resolved
@yoniko yoniko requested a review from terrelln December 19, 2022 21:49
Copy link
Contributor

@terrelln terrelln left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just need to destroy the condition variable if creating the thread fails.

*thread = (HANDLE)_beginthreadex(NULL, 0, worker, &thread_param, 0, NULL);
if (!thread) {
ZSTD_pthread_mutex_unlock(&thread_param.initialized_mutex);
ZSTD_pthread_mutex_destroy(&thread_param.initialized_mutex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also need to destroy the condition variable.

@terrelln
Copy link
Contributor

terrelln commented Dec 19, 2022

Edit: Wrong PR

@terrelln terrelln closed this Dec 19, 2022
@terrelln terrelln reopened this Dec 19, 2022
@yoniko yoniko force-pushed the fix-windows-mt-thread-resize-bug branch from c4868cf to 60cd3d6 Compare December 19, 2022 22:46
ZSTD_pthread_mutex_unlock(&thread_param->initialized_mutex);
ZSTD_pthread_cond_signal(&thread_param->initialized_cond);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I just realized this issue.

It is incorrect to signal the condition variable after we release the lock. ZSTD_pthread_cond_wait() is allowed to spuriously wake up. So we could get this execution:

Main thread: Launches the worker thread & waits on the condition variable.
Worker thread: Sets initialized to 1 & unlocks the mutex
Main thread: Spuriously wakes, sees that it is initialized, and destroys the condition variable
Worker thread: Signals the already destroyed condition variable.

So we just need to signal the CV before we unlock the mutex. This is allowed, and can actually generally more efficient, because threading libraries optimize this use case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will make the change.

@yoniko yoniko force-pushed the fix-windows-mt-thread-resize-bug branch from 60cd3d6 to 26f1bf7 Compare December 19, 2022 23:13
Copy link
Contributor

@terrelln terrelln left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yoniko yoniko merged commit a8add43 into facebook:dev Dec 19, 2022
@yoniko yoniko deleted the fix-windows-mt-thread-resize-bug branch December 20, 2022 00:43
@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants