Fix race condition in WASM crypto worker #70185

Merged: 1 commit merged into dotnet:main from Fix69806 on Jun 7, 2022
Conversation

eerhardt (Member) commented Jun 2, 2022

When sending a message between LibraryChannel and ChannelWorker, there is a race condition where both threads read and write the shared memory at the same time. This can cause message pages to be skipped.

To fix this, add a shared mutex lock so that only one side reads or writes the shared memory at a time.

Fix #69806
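
For context, here is a minimal sketch (not the PR's actual code) of the kind of Atomics-based mutex described above, assuming comm is an Int32Array view over a SharedArrayBuffer shared by LibraryChannel and ChannelWorker, and that LOCK_IDX, LOCK_UNLOCKED, and LOCK_OWNED are agreed-upon constants. The names mirror the PR; the bodies are illustrative only:

    // Illustrative sketch only - not the exact implementation in this PR.
    const LOCK_IDX = 2;        // slot in the shared comm buffer holding the lock word
    const LOCK_UNLOCKED = 0;   // nobody holds the lock
    const LOCK_OWNED = 1;      // one side holds the lock

    function acquireLock(comm: Int32Array): void {
        // Spin until we atomically flip the lock word from UNLOCKED to OWNED.
        // compareExchange returns the previous value, so anything other than
        // LOCK_UNLOCKED means the other side currently owns the lock.
        while (Atomics.compareExchange(comm, LOCK_IDX, LOCK_UNLOCKED, LOCK_OWNED) !== LOCK_UNLOCKED) {
            // busy-wait; Atomics.wait() is not permitted on the browser main thread
        }
    }

    function releaseLock(comm: Int32Array): void {
        // Hand the lock back so the other side's compareExchange can succeed.
        Atomics.store(comm, LOCK_IDX, LOCK_UNLOCKED);
    }

The idea is that each side takes the lock around its reads and writes of the shared STATE_IDX and MSG_SIZE_IDX slots and releases it before blocking or spinning, so the two critical sections can no longer interleave.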

ghost commented Jun 2, 2022

Tagging subscribers to this area: @dotnet/area-system-security, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details

WIP: Solve race condition in WASM crypto worker

Putting this draft PR up to test in CI.

Fix #69806

Author: eerhardt
Assignees: -
Labels: area-System.Security
Milestone: -

eerhardt (Member, Author) commented Jun 3, 2022

/azp run runtime (Build Browser wasm Linux Release LibraryTests)

azure-pipelines: No pipelines are associated with this pull request.

radical (Member) commented Jun 3, 2022

/azp run runtime-wasm

azure-pipelines: Azure Pipelines successfully started running 1 pipeline(s).

marek-safar added the arch-wasm (WebAssembly architecture) label on Jun 6, 2022
ghost commented Jun 6, 2022

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

When sending a message between LibraryChannel and ChannelWorker, there is a race condition where both threads read and write the shared memory at the same time. This can cause message pages to be skipped.

To fix this, add a shared mutex lock so that only one side reads or writes the shared memory at a time.

Fix #69806

Author: eerhardt
Assignees: eerhardt
Labels: arch-wasm, area-System.Security
Milestone: -

// BEGIN ChannelOwner contract - shared constants.
get STATE_IDX() { return 0; }
get MSG_SIZE_IDX() { return 1; }
get LOCK_IDX() { return 2; }
A reviewer (Member) commented:

This is fine, and probably simplifies things, but I'm surprised the state machine isn't handling it, since it uses atomics for transitions. Do you know how the state machine is getting desynced?

eerhardt (Member, Author) replied Jun 6, 2022:

I haven't worked out the exact way the race condition happens. What I do know is that when the race happens, the ChannelWorker reads 0 for MSG_SIZE. The only way that can happen is if the ChannelWorker runs twice in succession and reads the 0 it wrote itself to the MSG_SIZE field while processing the previous page:

// Reset the size and transition to await state.
Atomics.store(this.comm, this.MSG_SIZE_IDX, 0);

I've verified this by changing that line to store -1: when the race happens, the ChannelWorker reads -1 for MSG_SIZE.

I also logged the total amount of data received by the ChannelWorker when the race occurs. When the test fails, the total is always 1024 chars (one page size) less than when it passes. Upping the test input from one million 'a' characters to ten million, and running the test in a loop on a 2-core machine, seems to consistently repro the problem after 5-10 iterations of the loop. (I set the loop to 50, and it never reached the full 50 before reproing.)

My guess here is that this isn't a single, simple race condition, but a combination of races that have to happen in a specific order for the code to get into this state.

The "write" side:

private send_request(msg: string): void {
    let state;
    const msg_len = msg.length;
    let msg_written = 0;
    for (;;) {
        // Write the message and return how much was written.
        const wrote = this.write_to_msg(msg, msg_written, msg_len);
        msg_written += wrote;
        // Indicate how much was written to the this.msg buffer.
        Atomics.store(this.comm, this.MSG_SIZE_IDX, wrote);
        // Indicate if this was the whole message or part of it.
        state = msg_written === msg_len ? this.STATE_REQ : this.STATE_REQ_P;
        // Notify webworker
        Atomics.store(this.comm, this.STATE_IDX, state);
        Atomics.notify(this.comm, this.STATE_IDX);
        // The send message is complete.
        if (state === this.STATE_REQ)
            break;
        // Wait for the worker to be ready for the next part.
        //  - Atomics.wait() is not permissible on the main thread.
        do {
            state = Atomics.load(this.comm, this.STATE_IDX);
        } while (state !== this.STATE_AWAIT);
    }
}

The "read" side:

_read_request() {
    var request = "";
    for (;;) {
        // Get the current state and message size
        var state = Atomics.load(this.comm, this.STATE_IDX);
        var size_to_read = Atomics.load(this.comm, this.MSG_SIZE_IDX);
        // Append the latest part of the message.
        request += this._read_from_msg(0, size_to_read);
        // The request is complete.
        if (state === this.STATE_REQ)
            break;
        // Shutdown the worker.
        if (state === this.STATE_SHUTDOWN)
            return this.STATE_SHUTDOWN;
        // Reset the size and transition to await state.
        Atomics.store(this.comm, this.MSG_SIZE_IDX, 0);
        Atomics.store(this.comm, this.STATE_IDX, this.STATE_AWAIT);
        Atomics.wait(this.comm, this.STATE_IDX, this.STATE_AWAIT);
    }
    return request;
}

The first race is at the bottom of the two methods: the "write" side doesn't wait to be notified, but instead spins until it reads an AWAIT state. That means the write side can get unblocked before the "read" side calls .wait.

Another race is between lines 139-140 of the "write" side (the Atomics.store of the state and the Atomics.notify) and lines 71-72 of the "read" side (the store of STATE_AWAIT and the Atomics.wait). The "write" side can change the state in between the "read" side storing 'state = AWAIT' and calling .wait. So the "read" side doesn't wait on line 72 (because the state is no longer AWAIT). It then loops around, reads another page of data, sets the state, and calls .wait. Meanwhile, the "write" side is past line 139 but not yet at line 140; now it calls notify and wakes the "read" side one more time - and the "read" side reads 0 for MSG_SIZE.

The part I haven't worked out is exactly how that causes a page to be skipped. But given that you can see these two sides running concurrently and causing unintended state transitions, adding the lock to ensure only one side is in the critical section at a time seems like the most appropriate way to fix it.
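
To illustrate where the lock helps, here is a rough sketch of a locked read loop, based on the hunks visible in this review rather than the full diff; readFromMsg, acquireLock, and releaseLock are hypothetical stand-ins for the channel's real helpers:

    // Approximation of a locked read loop - not the exact change in this PR.
    function readRequestLocked(
        comm: Int32Array,
        readFromMsg: (begin: number, end: number) => string,
        acquireLock: () => void,
        releaseLock: () => void,
        C: { STATE_IDX: number; MSG_SIZE_IDX: number; STATE_REQ: number; STATE_SHUTDOWN: number; STATE_AWAIT: number }
    ): string | number {
        let request = "";
        for (;;) {
            acquireLock(); // only one side may touch STATE_IDX / MSG_SIZE_IDX at a time
            const state = Atomics.load(comm, C.STATE_IDX);
            const sizeToRead = Atomics.load(comm, C.MSG_SIZE_IDX);
            request += readFromMsg(0, sizeToRead);
            if (state === C.STATE_REQ) {       // whole request received
                releaseLock();
                break;
            }
            if (state === C.STATE_SHUTDOWN) {  // worker is being shut down
                releaseLock();
                return C.STATE_SHUTDOWN;
            }
            // Reset the size and move to AWAIT while still holding the lock, so the
            // writer can never observe a half-updated STATE / MSG_SIZE pair.
            Atomics.store(comm, C.MSG_SIZE_IDX, 0);
            Atomics.store(comm, C.STATE_IDX, C.STATE_AWAIT);
            releaseLock();
            // Safe even if the writer races ahead right here: Atomics.wait() only
            // sleeps if the slot still holds STATE_AWAIT, so a state change made
            // after the release is seen immediately instead of being missed.
            Atomics.wait(comm, C.STATE_IDX, C.STATE_AWAIT);
        }
        return request;
    }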

vcsjones (Member) commented Jun 6, 2022

Just to ask, is this worth considering backporting to release/7.0-preview5?

eerhardt (Member, Author) commented Jun 6, 2022

> Just to ask, is this worth considering backporting to release/7.0-preview5?

I believe we have already built the final "runtime" build for preview5. According to the schedule, the build happened on June 1 (which was also the last commit to the preview5 branch).

radical (Member) commented Jun 6, 2022

The wasm/AOT failure is #70304. The WBT ones are #69860.

Both can be ignored for this PR.

@@ -101,6 +116,8 @@ var ChannelWorker = {
// Update the state
Atomics.store(this.comm, this.STATE_IDX, state);

this._release_lock();
A reviewer (Member) commented:

should this be released if there was any error thrown?

eerhardt (Member, Author) replied:

What kind of error are you thinking?

The reviewer replied:

A bug in our code that causes an error to be thrown, for example. Do I understand correctly that the app would be stuck in that case, as the main thread would be spinning?

eerhardt (Member, Author) replied Jun 6, 2022:

> Do I understand correctly that the app would be stuck in that case, as the main thread would be spinning?

Yeah, probably. Note that this is already the case: if an error is thrown without the Worker setting the state to AWAIT, the main thread would be stuck spinning as well.

eerhardt (Member, Author) added:

I think a lot more work would be necessary to make this code "reentrant" on an error. That would probably be better tracked by #69740 than in this PR.
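
For illustration only, the guard the reviewer is hinting at would look roughly like the following (withChannelLock and doWork are hypothetical names, and as noted above, releasing the lock by itself would not make the channel recoverable, since the protocol state in the shared buffer would still need to be reset):

    // Hypothetical sketch - not part of this PR.
    function withChannelLock<T>(acquireLock: () => void, releaseLock: () => void, doWork: () => T): T {
        acquireLock();
        try {
            return doWork(); // the real read/write of the shared buffer would happen here
        } finally {
            // Runs on both the normal and the throwing path, so a bug in doWork()
            // cannot leave the other side spinning on a lock that is never released.
            releaseLock();
        }
    }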

@@ -202,6 +218,19 @@ class LibraryChannel {
return String.fromCharCode.apply(null, slicedMessage);
}

private acquire_lock() {
    while (Atomics.compareExchange(this.comm, this.LOCK_IDX, this.LOCK_UNLOCKED, this.LOCK_OWNED) !== this.LOCK_UNLOCKED) {
A reviewer (Member) commented:

TIL... JavaScript has web workers and shared memory, but no locks? Is this just a polyfill? (Just curious.)

eerhardt (Member, Author) commented Jun 7, 2022

Merging to fix the random CI failures.

eerhardt merged commit 8d34e0a into dotnet:main on Jun 7, 2022
eerhardt deleted the Fix69806 branch on June 7, 2022 at 18:09
vcsjones (Member) commented Jun 7, 2022

> Just to ask, is this worth considering backporting to release/7.0-preview5?
>
> I believe we have already built the final "runtime" build for preview5. According to the schedule, the build happened on June 1 (which was also the last commit to the preview5 branch).

Would it make sense to put this in "known issues" for preview 5 then?

eerhardt (Member, Author) commented Jun 8, 2022

> Would it make sense to put this in "known issues" for preview 5 then?

Good idea. I've opened dotnet/core#7524

ghost locked as resolved and limited conversation to collaborators on Jul 8, 2022
Labels: arch-wasm (WebAssembly architecture), area-System.Security
Projects: None yet
Development: Successfully merging this pull request may close the issue "Test failures in hashing on Browser" (#69806).
7 participants