Fix race condition in WASM crypto worker #70185

Merged: 1 commit merged into dotnet:main from Fix69806 on Jun 7, 2022
Conversation

eerhardt (Member) commented Jun 2, 2022

When sending a message between LibraryChannel and ChannelWorker, there is a race condition where both threads read and write the shared memory at the same time. This can cause message pages to be skipped.

To fix this, add a shared mutex lock so that only one side reads or writes the shared memory at a time.

Fix #69806
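
For context, here is a minimal sketch (not the PR's actual code) of the kind of Atomics-based mutex described above, assuming comm is an Int32Array view over a SharedArrayBuffer shared by LibraryChannel and ChannelWorker, and that LOCK_IDX, LOCK_UNLOCKED, and LOCK_OWNED are agreed-upon constants. The names mirror the PR; the bodies are illustrative only:

    // Illustrative sketch only - not the exact implementation in this PR.
    const LOCK_IDX = 2;        // slot in the shared comm buffer holding the lock word
    const LOCK_UNLOCKED = 0;   // nobody holds the lock
    const LOCK_OWNED = 1;      // one side holds the lock

    function acquireLock(comm: Int32Array): void {
        // Spin until we atomically flip the lock word from UNLOCKED to OWNED.
        // compareExchange returns the previous value, so anything other than
        // LOCK_UNLOCKED means the other side currently owns the lock.
        while (Atomics.compareExchange(comm, LOCK_IDX, LOCK_UNLOCKED, LOCK_OWNED) !== LOCK_UNLOCKED) {
            // busy-wait; Atomics.wait() is not permitted on the browser main thread
        }
    }

    function releaseLock(comm: Int32Array): void {
        // Hand the lock back so the other side's compareExchange can succeed.
        Atomics.store(comm, LOCK_IDX, LOCK_UNLOCKED);
    }

The idea is that each side takes the lock around its reads and writes of the shared STATE_IDX and MSG_SIZE_IDX slots and releases it before blocking or spinning, so the two critical sections can no longer interleave.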

ghost commented Jun 2, 2022

Tagging subscribers to this area: @dotnet/area-system-security, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details

WIP: Solve race condition in WASM crypto worker

Putting this draft PR up to test in CI.

Fix #69806

Author: eerhardt
Assignees: -
Labels: area-System.Security
Milestone: -

eerhardt (Member, Author) commented Jun 3, 2022

/azp run runtime (Build Browser wasm Linux Release LibraryTests)

azure-pipelines: No pipelines are associated with this pull request.

radical (Member) commented Jun 3, 2022

/azp run runtime-wasm

azure-pipelines: Azure Pipelines successfully started running 1 pipeline(s).

marek-safar added the arch-wasm (WebAssembly architecture) label on Jun 6, 2022
ghost commented Jun 6, 2022

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

When sending a message between LibraryChannel and ChannelWorker, there is a race condition where both threads read and write the shared memory at the same time. This can cause message pages to be skipped.

To fix this, add a shared mutex lock so that only one side reads or writes the shared memory at a time.

Fix #69806

Author: eerhardt
Assignees: eerhardt
Labels: arch-wasm, area-System.Security
Milestone: -

// BEGIN ChannelOwner contract - shared constants.
get STATE_IDX() { return 0; }
get MSG_SIZE_IDX() { return 1; }
get LOCK_IDX() { return 2; }
A reviewer (Member) commented:

This is fine, and probably simplifies things, but I'm surprised the state machine isn't handling it, since it uses atomics for transitions. Do you know how the state machine is getting desynced?

eerhardt (Member, Author) replied Jun 6, 2022:

I haven't worked out the exact way the race condition happens. What I do know is that when the race happens, the ChannelWorker reads 0 for MSG_SIZE. The only way that can happen is if the ChannelWorker runs twice in succession and reads the 0 it wrote itself to the MSG_SIZE field while processing the previous page:

// Reset the size and transition to await state.
Atomics.store(this.comm, this.MSG_SIZE_IDX, 0);

I've verified this by changing that line to store -1: when the race happens, the ChannelWorker reads -1 for MSG_SIZE.

I also logged the total amount of data received by the ChannelWorker when the race occurs. When the test fails, the total is always 1024 chars (one page size) less than when it passes. Upping the test input from one million 'a' characters to ten million, and running the test in a loop on a 2-core machine, seems to consistently repro the problem after 5-10 iterations of the loop. (I set the loop to 50, and it never reached the full 50 before reproing.)

My guess here is that this isn't a single, simple race condition, but a combination of races that have to happen in a specific order for the code to get into this state.

The "write" side:

private send_request(msg: string): void {
    let state;
    const msg_len = msg.length;
    let msg_written = 0;
    for (;;) {
        // Write the message and return how much was written.
        const wrote = this.write_to_msg(msg, msg_written, msg_len);
        msg_written += wrote;
        // Indicate how much was written to the this.msg buffer.
        Atomics.store(this.comm, this.MSG_SIZE_IDX, wrote);
        // Indicate if this was the whole message or part of it.
        state = msg_written === msg_len ? this.STATE_REQ : this.STATE_REQ_P;
        // Notify webworker
        Atomics.store(this.comm, this.STATE_IDX, state);
        Atomics.notify(this.comm, this.STATE_IDX);
        // The send message is complete.
        if (state === this.STATE_REQ)
            break;
        // Wait for the worker to be ready for the next part.
        //  - Atomics.wait() is not permissible on the main thread.
        do {
            state = Atomics.load(this.comm, this.STATE_IDX);
        } while (state !== this.STATE_AWAIT);
    }
}

The "read" side:

_read_request() {
    var request = "";
    for (;;) {
        // Get the current state and message size
        var state = Atomics.load(this.comm, this.STATE_IDX);
        var size_to_read = Atomics.load(this.comm, this.MSG_SIZE_IDX);
        // Append the latest part of the message.
        request += this._read_from_msg(0, size_to_read);
        // The request is complete.
        if (state === this.STATE_REQ)
            break;
        // Shutdown the worker.
        if (state === this.STATE_SHUTDOWN)
            return this.STATE_SHUTDOWN;
        // Reset the size and transition to await state.
        Atomics.store(this.comm, this.MSG_SIZE_IDX, 0);
        Atomics.store(this.comm, this.STATE_IDX, this.STATE_AWAIT);
        Atomics.wait(this.comm, this.STATE_IDX, this.STATE_AWAIT);
    }
    return request;
}

The first race is at the bottom of the two methods: the "write" side doesn't wait to be notified, but instead spins until it reads an AWAIT state. That means the write side can get unblocked before the "read" side calls .wait.

Another race is between lines 139-140 of the "write" side (the Atomics.store of the state and the Atomics.notify) and lines 71-72 of the "read" side (the store of STATE_AWAIT and the Atomics.wait). The "write" side can change the state in between the "read" side storing 'state = AWAIT' and calling .wait. So the "read" side doesn't wait on line 72 (because the state is no longer AWAIT). It then loops around, reads another page of data, sets the state, and calls .wait. Meanwhile, the "write" side is past line 139 but not yet at line 140; now it calls notify and wakes the "read" side one more time - and the "read" side reads 0 for MSG_SIZE.

The part I haven't worked out is exactly how that causes a page to be skipped. But given that you can see these two sides running concurrently and causing unintended state transitions, adding the lock to ensure only one side is in the critical section at a time seems like the most appropriate way to fix it.
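
To illustrate where the lock helps, here is a rough sketch of a locked read loop, based on the hunks visible in this review rather than the full diff; readFromMsg, acquireLock, and releaseLock are hypothetical stand-ins for the channel's real helpers:

    // Approximation of a locked read loop - not the exact change in this PR.
    function readRequestLocked(
        comm: Int32Array,
        readFromMsg: (begin: number, end: number) => string,
        acquireLock: () => void,
        releaseLock: () => void,
        C: { STATE_IDX: number; MSG_SIZE_IDX: number; STATE_REQ: number; STATE_SHUTDOWN: number; STATE_AWAIT: number }
    ): string | number {
        let request = "";
        for (;;) {
            acquireLock(); // only one side may touch STATE_IDX / MSG_SIZE_IDX at a time
            const state = Atomics.load(comm, C.STATE_IDX);
            const sizeToRead = Atomics.load(comm, C.MSG_SIZE_IDX);
            request += readFromMsg(0, sizeToRead);
            if (state === C.STATE_REQ) {       // whole request received
                releaseLock();
                break;
            }
            if (state === C.STATE_SHUTDOWN) {  // worker is being shut down
                releaseLock();
                return C.STATE_SHUTDOWN;
            }
            // Reset the size and move to AWAIT while still holding the lock, so the
            // writer can never observe a half-updated STATE / MSG_SIZE pair.
            Atomics.store(comm, C.MSG_SIZE_IDX, 0);
            Atomics.store(comm, C.STATE_IDX, C.STATE_AWAIT);
            releaseLock();
            // Safe even if the writer races ahead right here: Atomics.wait() only
            // sleeps if the slot still holds STATE_AWAIT, so a state change made
            // after the release is seen immediately instead of being missed.
            Atomics.wait(comm, C.STATE_IDX, C.STATE_AWAIT);
        }
        return request;
    }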

vcsjones (Member) commented Jun 6, 2022

Just to ask, is this worth considering backporting to release/7.0-preview5?

eerhardt (Member, Author) commented Jun 6, 2022

> Just to ask, is this worth considering backporting to release/7.0-preview5?

I believe we have already built the final "runtime" build for preview5. According to the schedule, the build happened on June 1 (which was also the last commit to the preview5 branch).

radical (Member) commented Jun 6, 2022

The wasm/AOT failure is #70304. The WBT ones are #69860.

Both can be ignored for this PR.

@@ -101,6 +116,8 @@ var ChannelWorker = {
// Update the state
Atomics.store(this.comm, this.STATE_IDX, state);

this._release_lock();
A reviewer (Member) commented:

should this be released if there was any error thrown?

eerhardt (Member, Author) replied:

What kind of error are you thinking?

The reviewer replied:

A bug in our code that causes an error to be thrown, for example. Do I understand correctly that the app would be stuck in that case, as the main thread would be spinning?

eerhardt (Member, Author) replied Jun 6, 2022:

> Do I understand correctly that the app would be stuck in that case, as the main thread would be spinning?

Yeah, probably. Note that this is already the case: if an error is thrown without the Worker setting the state to AWAIT, the main thread would be stuck spinning as well.

eerhardt (Member, Author) added:

I think a lot more work would be necessary to make this code "reentrant" on an error. That would probably be better tracked by #69740 than in this PR.
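
For illustration only, the guard the reviewer is hinting at would look roughly like the following (withChannelLock and doWork are hypothetical names, and as noted above, releasing the lock by itself would not make the channel recoverable, since the protocol state in the shared buffer would still need to be reset):

    // Hypothetical sketch - not part of this PR.
    function withChannelLock<T>(acquireLock: () => void, releaseLock: () => void, doWork: () => T): T {
        acquireLock();
        try {
            return doWork(); // the real read/write of the shared buffer would happen here
        } finally {
            // Runs on both the normal and the throwing path, so a bug in doWork()
            // cannot leave the other side spinning on a lock that is never released.
            releaseLock();
        }
    }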

@@ -202,6 +218,19 @@ class LibraryChannel {
return String.fromCharCode.apply(null, slicedMessage);
}

private acquire_lock() {
    while (Atomics.compareExchange(this.comm, this.LOCK_IDX, this.LOCK_UNLOCKED, this.LOCK_OWNED) !== this.LOCK_UNLOCKED) {
A reviewer (Member) commented:

TIL... JavaScript has web workers and shared memory, but no locks? Is this just a polyfill? (Just curious.)

eerhardt (Member, Author) commented Jun 7, 2022

Merging to fix the random CI failures.

eerhardt merged commit 8d34e0a into dotnet:main on Jun 7, 2022
eerhardt deleted the Fix69806 branch on June 7, 2022 at 18:09
vcsjones (Member) commented Jun 7, 2022

> Just to ask, is this worth considering backporting to release/7.0-preview5?
>
> I believe we have already built the final "runtime" build for preview5. According to the schedule, the build happened on June 1 (which was also the last commit to the preview5 branch).

Would it make sense to put this in "known issues" for preview 5 then?

eerhardt (Member, Author) commented Jun 8, 2022

> Would it make sense to put this in "known issues" for preview 5 then?

Good idea. I've opened dotnet/core#7524

ghost locked as resolved and limited conversation to collaborators on Jul 8, 2022
Labels: arch-wasm (WebAssembly architecture), area-System.Security
Projects: None yet
Development: Successfully merging this pull request may close the issue "Test failures in hashing on Browser" (#69806).
7 participants