Fix race condition in WASM crypto worker #70185
Conversation
Tagging subscribers to this area: @dotnet/area-system-security, @vcsjones

Issue Details: WIP: Solve race condition in WASM crypto worker. Putting this draft PR up to test in CI. Fix #69806
/azp run runtime (Build Browser wasm Linux Release LibraryTests)
No pipelines are associated with this pull request.
When sending a message between LibraryChannel and ChannelWorker, there is a race condition where both threads are reading/writing to shared memory at the same time. This can cause message pages to be skipped. To fix this, add a shared mutex lock so only one side is reading/writing to shared memory at the same time. Fix dotnet#69806
/azp run runtime-wasm
Azure Pipelines successfully started running 1 pipeline(s).
Tagging subscribers to 'arch-wasm': @lewing

Issue Details: When sending a message between LibraryChannel and ChannelWorker, there is a race condition where both threads are reading/writing to shared memory at the same time. This can cause message pages to be skipped. To fix this, add a shared mutex lock so only one side is reading/writing to shared memory at the same time. Fix #69806
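To make the rest of the thread easier to follow, here is a minimal sketch of the kind of shared state the two sides communicate through. Only the slot assignments match the constants quoted from the diff below; the buffer sizes, variable names, and the use of a Uint16Array for the message page are assumptions for illustration, not the PR's actual setup code.

// Sketch only: rough shape of the shared channel between LibraryChannel (main
// thread) and ChannelWorker (web worker). Names/sizes other than the *_IDX
// slots are assumptions for illustration.
const STATE_IDX = 0;    // protocol state word (REQ, REQ_P, AWAIT, SHUTDOWN, ...)
const MSG_SIZE_IDX = 1; // number of valid chars in the current message page
const LOCK_IDX = 2;     // added by this PR: slot used as a mutex

// Control words live in an Int32Array over a SharedArrayBuffer so both threads
// can use Atomics.load/store/notify/wait on them.
const commBuffer = new SharedArrayBuffer(3 * Int32Array.BYTES_PER_ELEMENT);
const comm = new Int32Array(commBuffer);

// The message text is exchanged through a separate shared buffer, one
// fixed-size page at a time (1024 chars per page, per the discussion below).
const PAGE_CHARS = 1024;
const msgBuffer = new SharedArrayBuffer(PAGE_CHARS * Uint16Array.BYTES_PER_ELEMENT);
const msg = new Uint16Array(msgBuffer);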
// BEGIN ChannelOwner contract - shared constants.
get STATE_IDX() { return 0; }
get MSG_SIZE_IDX() { return 1; }
get LOCK_IDX() { return 2; }
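Presumably the change also adds companion lock-state constants alongside these; they are not visible in the quoted hunk, but the names show up in the acquire_lock snippet later in this conversation. A hedged guess at their shape, in the same getter style (values are illustrative only):

get LOCK_UNLOCKED() { return 0; } // assumed value; name taken from the acquire_lock diff below
get LOCK_OWNED() { return 1; }    // assumed value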
This is fine, and probably simplifies things, but I'm surprised the state machine isn't handling it, since it uses atomics for transitions. Do you know how the state machine is getting desynced?
I haven't worked out the exact way the race condition happens. What I do know is that when the race happens, the ChannelWorker reads 0 for the MSG_SIZE. The only way that can happen is that the ChannelWorker runs twice in succession, and it is reading the 0 it wrote itself to the MSG_SIZE field in the previous page:
runtime/src/mono/wasm/runtime/dotnet-crypto-worker.js
Lines 69 to 70 in 54f104a
// Reset the size and transition to await state.
Atomics.store(this.comm, this.MSG_SIZE_IDX, 0);
I've verified this by changing that line to -1, and when the race happens, the ChannelWorker reads -1 for the MSG_SIZE.
Then, when the race condition happens, I logged the total amount of data received by the ChannelWorker. When the test fails, the total is always 1024 chars (one page size) less than when the test passes. Upping the test from one million 'a's to ten million 'a's, and running the test in a loop on a 2-core machine, seems to consistently repro the problem after 5-10 iterations of the loop. (I set the loop for 50, and it never reached the full 50 before reproing.)
My guess here is that it isn't a simple race condition, but a combination of race conditions that need to happen in a specific order for the code to get into this state.
The "write" side:
runtime/src/mono/wasm/runtime/crypto-worker.ts
Lines 122 to 152 in 54f104a
private send_request(msg: string): void {
    let state;
    const msg_len = msg.length;
    let msg_written = 0;

    for (; ;) {
        // Write the message and return how much was written.
        const wrote = this.write_to_msg(msg, msg_written, msg_len);
        msg_written += wrote;

        // Indicate how much was written to the this.msg buffer.
        Atomics.store(this.comm, this.MSG_SIZE_IDX, wrote);

        // Indicate if this was the whole message or part of it.
        state = msg_written === msg_len ? this.STATE_REQ : this.STATE_REQ_P;

        // Notify webworker
        Atomics.store(this.comm, this.STATE_IDX, state);
        Atomics.notify(this.comm, this.STATE_IDX);

        // The send message is complete.
        if (state === this.STATE_REQ)
            break;

        // Wait for the worker to be ready for the next part.
        // - Atomics.wait() is not permissible on the main thread.
        do {
            state = Atomics.load(this.comm, this.STATE_IDX);
        } while (state !== this.STATE_AWAIT);
    }
}
The "read" side:
runtime/src/mono/wasm/runtime/dotnet-crypto-worker.js
Lines 51 to 73 in 54f104a
_read_request() {
    var request = "";
    for (;;) {
        // Get the current state and message size
        var state = Atomics.load(this.comm, this.STATE_IDX);
        var size_to_read = Atomics.load(this.comm, this.MSG_SIZE_IDX);

        // Append the latest part of the message.
        request += this._read_from_msg(0, size_to_read);

        // The request is complete.
        if (state === this.STATE_REQ)
            break;

        // Shutdown the worker.
        if (state === this.STATE_SHUTDOWN)
            return this.STATE_SHUTDOWN;

        // Reset the size and transition to await state.
        Atomics.store(this.comm, this.MSG_SIZE_IDX, 0);
        Atomics.store(this.comm, this.STATE_IDX, this.STATE_AWAIT);
        Atomics.wait(this.comm, this.STATE_IDX, this.STATE_AWAIT);
    }
The first race is at the bottom of the two methods: the "write" side doesn't wait to be notified, but instead spins until it reads the AWAIT state. That means the "write" side can get unblocked before the "read" side calls Atomics.wait.
Then another race is between lines 139-140 of the "write" side (the Atomics.store of the state and the Atomics.notify) and lines 71-72 of the "read" side (the store of STATE_AWAIT and the Atomics.wait). The "write" side can change the state in between the "read" side moving to 'state = AWAIT' and calling Atomics.wait. So now the "read" side doesn't wait on line 72 (because the state is no longer 'AWAIT'). It then loops around and reads another page of data, sets the state, and then calls Atomics.wait. Meanwhile, the "write" side is after line 139, but before line 140. Now it runs and calls Atomics.notify to wake up the "read" side one more time - and the "read" side now reads 0 for MSG_SIZE.
The part I haven't worked out is how that causes the skipping of a page. But given that you can see these two sides running concurrently, and causing unintended state transitions, adding the lock to ensure only one side is in the "critical section" at a time seems like the most appropriate way to fix it.
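To make the fix concrete, here is a hedged sketch (in the style of the snippets above, but not the literal PR diff) of the worker's per-page step with the lock around it. The helper names and parameter shapes are illustrative; the point is that everything touching the shared state, size, and message page happens while holding the lock, and the blocking wait happens only after releasing it.

// Sketch only: a simplified version of the "read" side's per-page step once
// the lock is in place (assumed shape, not the literal PR diff).
function consumeOnePage(
    comm: Int32Array,
    readPage: (size: number) => string,
    idx: { STATE: number; MSG_SIZE: number },
    states: { REQ: number; AWAIT: number },
    acquireLock: () => void,
    releaseLock: () => void
): { part: string; done: boolean } {
    acquireLock();
    // While we hold the lock, the main thread cannot publish the next page,
    // so state, size, and the page contents are guaranteed to be consistent.
    const state = Atomics.load(comm, idx.STATE);
    const size = Atomics.load(comm, idx.MSG_SIZE);
    const part = readPage(size);
    const done = state === states.REQ;
    if (!done) {
        // Reset the size and hand the channel back to the main thread.
        Atomics.store(comm, idx.MSG_SIZE, 0);
        Atomics.store(comm, idx.STATE, states.AWAIT);
    }
    releaseLock();
    if (!done) {
        // Blocking happens outside the lock. If the main thread has already
        // moved the state past AWAIT, Atomics.wait returns immediately.
        Atomics.wait(comm, idx.STATE, states.AWAIT);
    }
    return { part, done };
}

The "write" side would be bracketed the same way: lock, copy the page, store the size and state, unlock, and only then notify and spin for AWAIT.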
Just to ask, is this worth considering backporting to
I believe we have already built the final "runtime" build for preview5. According to the schedule, the build happened on June 1 (which was also the last commit to the preview5 branch).
@@ -101,6 +116,8 @@ var ChannelWorker = {
         // Update the state
         Atomics.store(this.comm, this.STATE_IDX, state);
+
+        this._release_lock();
Should this be released if there was any error thrown?
What kind of error are you thinking?
A bug in our code that causes an error to be thrown, for example. Do IIUC that the app would be stuck in that case as the main thread would be spinning?
Do IIUC that the app would be stuck in that case as the main thread would be spinning?
Yeah, probably. Note that this is already the case, since if an error is thrown without the Worker setting the state to AWAIT, the main thread would be stuck spinning as well.
I think a lot more work would be necessary to make this code "reentrant" on an error. It would probably be better tracked by #69740 than in this PR.
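For reference, the usual shape of the follow-up being discussed would be a try/finally around the worker's message handling, so the state word and the lock are restored even if the handler throws. This is only a sketch of that idea under assumed names; it is not something this PR implements (see #69740).

// Sketch only: not part of this PR. One possible way to avoid wedging the main
// thread if the worker's handler throws: always restore the shared state and
// drop the lock in a finally block.
function handleRequestSafely(
    comm: Int32Array,
    idx: { STATE: number; LOCK: number },
    states: { AWAIT: number },
    lockStates: { UNLOCKED: number },
    handler: () => void
): void {
    try {
        handler(); // may throw if there is a bug in the crypto handling
    } finally {
        // Put the channel back into a state the main thread's spin loop can
        // make progress from, then drop the lock and wake any waiter.
        Atomics.store(comm, idx.STATE, states.AWAIT);
        Atomics.store(comm, idx.LOCK, lockStates.UNLOCKED);
        Atomics.notify(comm, idx.STATE);
    }
}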
@@ -202,6 +218,19 @@ class LibraryChannel {
         return String.fromCharCode.apply(null, slicedMessage);
     }
+
+    private acquire_lock() {
+        while (Atomics.compareExchange(this.comm, this.LOCK_IDX, this.LOCK_UNLOCKED, this.LOCK_OWNED) !== this.LOCK_UNLOCKED) {
TIL... JavaScript has web workers and shared memory, but no locks? Is this just a polyfill? (Just curious.)
I learned a lot from reading https://hacks.mozilla.org/2017/06/avoiding-race-conditions-in-sharedarraybuffers-with-atomics/ and the basic lock implementation it links to.
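For context, the quoted diff above is cut off after the compareExchange loop. A compare-exchange spin lock of that shape typically pairs up roughly like the following; this is a sketch assuming LOCK_UNLOCKED and LOCK_OWNED are distinct integer constants, not the PR's exact code.

// Sketch only: the general shape of an acquire/release pair built on
// Atomics.compareExchange, as in the truncated diff above. Constant values
// are assumptions.
const LOCK_UNLOCKED = 0;
const LOCK_OWNED = 1;

function acquireLock(comm: Int32Array, LOCK_IDX: number): void {
    // Busy-wait until we atomically flip UNLOCKED -> OWNED. Atomics.wait is
    // not permitted on the main thread, so the main-thread side has to spin.
    while (Atomics.compareExchange(comm, LOCK_IDX, LOCK_UNLOCKED, LOCK_OWNED) !== LOCK_UNLOCKED) {
        // spin
    }
}

function releaseLock(comm: Int32Array, LOCK_IDX: number): void {
    // Flip OWNED -> UNLOCKED; if the exchange fails, we did not own the lock.
    const previous = Atomics.compareExchange(comm, LOCK_IDX, LOCK_OWNED, LOCK_UNLOCKED);
    if (previous !== LOCK_OWNED) {
        throw new Error("release called without owning the lock");
    }
}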
Merging to fix the random CI failures.
Would it make sense to put this in "known issues" for preview 5 then?
Good idea. I've opened dotnet/core#7524 |