Add a queue per handler to keep events in order #555

ankur22 · 2022-09-29T16:37:15Z

Events are sometimes out of order due to use using a goroutine to publish events to the handler concurrently as there's no guarantee when the goroutines will run (the first goroutine to be created isn't necessarily going to be the first to run). To overcome this, a queue has been added per handler to synchronise all publishes of events to that handler as well as a mutex so that only one emitEvent goroutine can read from the queue and write to the handler's channel at any one time.

We tried working with a buffered channel instead of a queue (which is a slice). Unfortunately it's difficult to determine how large the channel needs to be to work with the test -- the more complex the test, the more events that will be emitted. For now we think it is easier to work with a queue that can grow when needed.

After creating the unit tests, we found that although we were receiving events in order we had created a deadlock when a handler was receiving events and emitting new events. To overcome this we had to create a "double queue" system whereby new emitted events can be stored on one queue, while the emitEvent goroutines can read off of a separate queue. When the read queue is depleted, the queues are swapped around.

Closes: #553

common/event_emitter.go

imiric

Hey, nicely done! From my local tests your change seems to work 👍

Do you think it would be possible to have tests for this? The previous behavior was flaky, so not sure how easily reproducible it is. Worth a try?

common/event_emitter.go

inancgumus

Nice idea, @ankur22! 👏 This ensures that the handlers receive emitted events in order.

One question: How can we be sure that this PR fixes the issue in #553 and maybe #482 or #427 or #496? I'm unsure whether we were aiming to fix them with this PR, though :)?

By the way, I ran this PR's tests on CI a few times, and there were errors (1, 2, 3). Also, locally, I received the following error:

time="2022-10-03T10:52:42+03:00" level=error msg="process with PID 72688 unexpectedly ended: signal: killed" category=browser elapsed="0 ms" goroutine=99722
--- FAIL: TestLocator (39.38s)
    --- FAIL: TestLocator/Tap (1.63s)
        locator_test.go:188: 
                Error Trace:    /Users/inanc/grafana/xk6-browser/tests/locator_test.go:188
                Error:          Expected value not to be nil.
                Test:           TestLocator/Tap
    --- FAIL: TestLocator/Click (30.62s)
    cleanup panicked with closing the browser: sending a message to browser: websocket: close 1006 (abnormal closure): unexpected EOFpanic: clicking on "#link": timed out after 30s [recovered]
        panic: clicking on "#link": timed out after 30s

WDYT about these errors? Do you believe they're related to this PR?

common/event_emitter.go

ankur22 · 2022-10-03T17:15:38Z

Do you think it would be possible to have tests for this? The previous behavior was flaky, so not sure how easily reproducible it is. Worth a try?

Good idea, I'll try.

ankur22 · 2022-10-04T16:18:40Z

One question: How can we be sure that this PR fixes the issue in #553 and maybe #482 or #427 or #496? I'm unsure whether we were aiming to fix them with this PR, though :)?

The aim here is to ensure that the ordering of the CDP events is maintained and the handlers receive the events in order. I'll create a test that breaks under what currently exists, and hopefully that same test will pass with whatever the fix is. Although, I think #482 and #496 should be resolved by whatever this fix is 🤞

By the way, I ran this PR's tests on CI a few times

🤔 1 and 3 are worrying, and I'll need to investigate those.

2 and the TestLocator/Tap error are a concern for us but I run into those often when running the Locator tests even on main. I believe they're caused by some sort of race issue, but not to do with the out of order CDP events.

ankur22 · 2022-10-06T09:09:23Z

Do you think it would be possible to have tests for this? The previous behavior was flaky, so not sure how easily reproducible it is. Worth a try?

Good idea, I'll try.

Unit tests added. I verified that out of the two tests, the test that ensures that events are received in order fails in main.

common/event_emitter.go

imiric

Excellent work, Ankur! 👏

It took me a while to get the double queue approach, but it makes sense.

The tests are a great touch, and I also confirmed the order one fails on main. 👍

I ran this with several of our examples, which worked great, and with the Vistaprint script, which needs to be rewritten to use the new async APIs, but at least it didn't regress compared to main. 😅

Let's create a Cloud release when this is merged, so that it can be battle tested.

common/event_emitter_test.go

inancgumus · 2022-10-06T10:17:25Z

@Ankur Awesome work, thanks for this! 👏

I ran the CI tests multiple times, they worked fabulously, thanks to you.

On the 3rd run, I saw the following error in this CI job:

=== CONT  TestWaitForFrameNavigationWithinDocument/anchor
    frame_manager_test.go:46: 
        	Error Trace:	/home/runner/work/xk6-browser/xk6-browser/tests/frame_manager_test.go:46
        	            				/home/runner/work/xk6-browser/xk6-browser/tests/asm_amd64.s:1571
        	Error:      	Expected value not to be nil.
        	Test:       	TestWaitForFrameNavigationWithinDocument/anchor

I believe this one might be related to #560 or the previous related issues. I think it is not related to #553. Then again, I wanted to let you know (in case you may have an idea).

I also saw the following error when I ran the tests locally:

--- FAIL: TestLocator (1.00s)
    locator_test.go:271: 
                Error Trace:    /Users/inanc/grafana/xk6-browser/tests/locator_test.go:271
                Error:          Expected value not to be nil.
                Test:           TestLocator

This one is related to the previous error I explained above. The origin of both errors seems to be page.Goto. Although, #560 is about a timeout error, after this change, the underlying problem with page.Goto might have reached the surface.

inancgumus

Thanks for making all of our lives easier 🎉

ankur22 · 2022-10-06T11:04:32Z

@Ankur Awesome work, thanks for this! 👏

I ran the CI tests multiple times, they worked fabulously, thanks to you.

On the 3rd run, I saw the following error in this CI job:
=== CONT  TestWaitForFrameNavigationWithinDocument/anchor
    frame_manager_test.go:46: 
        	Error Trace:	/home/runner/work/xk6-browser/xk6-browser/tests/frame_manager_test.go:46
        	            				/home/runner/work/xk6-browser/xk6-browser/tests/asm_amd64.s:1571
        	Error:      	Expected value not to be nil.
        	Test:       	TestWaitForFrameNavigationWithinDocument/anchor
I believe this one might be related to #560 or the previous related issues. I think it is not related to #553. Then again, I wanted to let you know (in case you may have an idea).

I also saw the following error when I ran the tests locally:
--- FAIL: TestLocator (1.00s)
    locator_test.go:271: 
                Error Trace:    /Users/inanc/grafana/xk6-browser/tests/locator_test.go:271
                Error:          Expected value not to be nil.
                Test:           TestLocator
This one is related to the previous error I explained above. The origin of both errors seems to be page.Goto. Although, #560 is about a timeout error, after this change, the underlying problem with page.Goto might have reached the surface.

Thanks @inancgumus!

I've taken a look at those errors before actually, and unfortunately they are race conditions unrelated to unordered CDP events. I did start to look into them a couple of months ago before the deadlock cloud issue was prioritised over other issues. I'd be happy to look into these next though as they're a real PITA when stress testing locally, and obviously a pain in CI -- i had to comment out a lot of tests since they kept failing in response being nil when they shouldn't have been 😭

This will help us to diagnose any deadlocks. The unit tests should pass within the 2 second limit. Resolves: #555 (comment)

ankur22 · 2022-10-06T13:18:53Z

🤔 1 and 3 are worrying, and I'll need to investigate those.

These errors occurred prior to adding in the "double queue" system for managing the read and write of events from different goroutines. When using a single queue, it would deadlock, which is what those test runs showed us. Those same deadlocks shouldn't happen now with the "double queue" system (i've done stress tests locally and not had a deadlock so far).

common/event_emitter_test.go

common/event_emitter.go

Co-authored-by: İnanç Gümüş <[email protected]>

This timer is no longer needed since the fixes in PR #555 and #597. These PRs helped to ensure that we receive the CDP events in order and that we correctly handle them when we do receive them. The networkidle timer was in place for a situation where we might not receive the networkidle lifecycle event from the browser via CDP. This could still occur, which would mean that there's a bug outside of our direct control. When this happens we should find a better solution.

The reason this was in place was due to out of order CDP events. Before PR #555, there was a slim chance that the root frame would receive lifecycle events before its child frames. This is now not the case and we can see after extensive stress tests that the child frame first receives CDP events and then the root/main frame.

Each frame (main and the subs) receive their own lifecycle events in the correct order (so the sub frames first, and then the root/main frame). The reason for having this code in the first place I believe is due to out of order CDP event, which has now been fixed in PR #555. After extensive stress testing I can confirm that we no longer need to perform the recalculateLifecycle in a recursive way. There maybe a need to do this in the future if the browser changes its behaviour on how it deals with lifecycle events, but at this point we should consider adding such a recursive call.

Each frame (main and the subs) receive their own lifecycle events in the correct order (so the sub frames first, and then the root/main frame). The reason for having this code in the first place I believe is due to out of order CDP event, which has now been fixed in PR #555. After extensive stress testing I can confirm that we no longer need to perform the recalculateLifecycle in a recursive way. So each frame should be in charge of recalculating its own lifecycle. There maybe a need to do this in the future if the browser changes its behaviour on how it deals with lifecycle events, but at this point we should not consider adding such a recursive call.

This timer is no longer needed since the fixes in PR #555 and #597. These PRs helped to ensure that we receive the CDP events in order and that we correctly handle them when we do receive them. The networkidle timer was in place for a situation where we might not receive the networkidle lifecycle event from the browser via CDP. This could still occur, which would mean that there's a bug outside of our direct control. When this happens we should find a better solution.

Each frame (main and the subs) receive their own lifecycle events in the correct order (so the sub frames first, and then the root/main frame). The reason for having this code in the first place I believe is due to out of order CDP event, which has now been fixed in PR #555. After extensive stress testing I can confirm that we no longer need to perform the recalculateLifecycle in a recursive way. So each frame should be in charge of recalculating its own lifecycle. There maybe a need to do this in the future if the browser changes its behaviour on how it deals with lifecycle events, but at this point we should not consider adding such a recursive call.

This timer is no longer needed since the fixes in PR #555 and #597. These PRs helped to ensure that we receive the CDP events in order and that we correctly handle them when we do receive them. The networkidle timer was in place for a situation where we might not receive the networkidle lifecycle event from the browser via CDP. This could still occur, which would mean that there's a bug outside of our direct control. When this happens we should find a better solution.

Each frame (main and the subs) receive their own lifecycle events in the correct order (so the sub frames first, and then the root/main frame). The reason for having this code in the first place I believe is due to out of order CDP event, which has now been fixed in PR #555. After extensive stress testing I can confirm that we no longer need to perform the recalculateLifecycle in a recursive way. So each frame should be in charge of recalculating its own lifecycle. There maybe a need to do this in the future if the browser changes its behaviour on how it deals with lifecycle events, but at this point we should not consider adding such a recursive call.

ankur22 marked this pull request as draft September 29, 2022 16:37

ankur22 requested review from inancgumus, imiric and mstoykov September 29, 2022 16:37

ankur22 mentioned this pull request Sep 29, 2022

Flaky timeout in Frame.waitForExecutionContext() #482

Closed

imiric reviewed Sep 30, 2022

View reviewed changes

common/event_emitter.go Outdated Show resolved Hide resolved

imiric reviewed Sep 30, 2022

View reviewed changes

common/event_emitter.go Outdated Show resolved Hide resolved

inancgumus reviewed Oct 3, 2022

View reviewed changes

ankur22 commented Oct 3, 2022

View reviewed changes

common/event_emitter.go Show resolved Hide resolved

inancgumus assigned ankur22 Oct 5, 2022

ankur22 force-pushed the fix/553-unordered-events branch from b28c3ed to 6a45010 Compare October 5, 2022 15:27

ankur22 marked this pull request as ready for review October 5, 2022 15:29

ankur22 changed the title ~~Draft: Add a queue per handler to keep events in order~~ Add a queue per handler to keep events in order Oct 5, 2022

ankur22 force-pushed the fix/553-unordered-events branch 2 times, most recently from 0741dec to 5d55090 Compare October 6, 2022 08:57

ankur22 requested review from imiric and inancgumus October 6, 2022 09:48

inancgumus reviewed Oct 6, 2022

View reviewed changes

common/event_emitter.go Outdated Show resolved Hide resolved

common/event_emitter.go Outdated Show resolved Hide resolved

common/event_emitter.go Outdated Show resolved Hide resolved

imiric approved these changes Oct 6, 2022

View reviewed changes

common/event_emitter_test.go Outdated Show resolved Hide resolved

ankur22 force-pushed the fix/553-unordered-events branch from 5c3e2fd to 1e6535c Compare October 6, 2022 10:52

inancgumus approved these changes Oct 6, 2022

View reviewed changes

ankur22 added a commit that referenced this pull request Oct 6, 2022

Add a timeout for the emitter unit tests

22fefb2

This will help us to diagnose any deadlocks. The unit tests should pass within the 2 second limit. Resolves: #555 (comment)

ankur22 added a commit that referenced this pull request Oct 6, 2022

Add a timeout for the emitter unit tests

0047d6c

This will help us to diagnose any deadlocks. The unit tests should pass within the 2 second limit. Resolves: #555 (comment)

ankur22 force-pushed the fix/553-unordered-events branch from 22fefb2 to 0047d6c Compare October 6, 2022 13:15

inancgumus reviewed Oct 7, 2022

View reviewed changes

common/event_emitter_test.go Outdated Show resolved Hide resolved

inancgumus reviewed Oct 7, 2022

View reviewed changes

common/event_emitter_test.go Outdated Show resolved Hide resolved

inancgumus reviewed Oct 7, 2022

View reviewed changes

common/event_emitter.go Outdated Show resolved Hide resolved

ankur22 and others added 2 commits October 7, 2022 09:56

Add something

901eddf

Apply suggestions from code review

e40feb8

Co-authored-by: İnanç Gümüş <[email protected]>

ankur22 merged commit 508bcb8 into main Oct 7, 2022

ankur22 deleted the fix/553-unordered-events branch October 7, 2022 10:03

ankur22 mentioned this pull request Oct 7, 2022

Deadlock in the cloud #496

Closed

imiric mentioned this pull request Oct 11, 2022

Error disposing browser context #558

Closed

imiric linked an issue Oct 11, 2022 that may be closed by this pull request

Error disposing browser context #558

Closed

imiric mentioned this pull request Oct 11, 2022

Timeout error getting DevTools URL during browser launch #559

Closed

imiric linked an issue Oct 11, 2022 that may be closed by this pull request

Timeout error getting DevTools URL during browser launch #559

Closed

imiric mentioned this pull request Oct 11, 2022

Intermittent Page.goto() timeout error #560

Closed

imiric linked an issue Oct 11, 2022 that may be closed by this pull request

Intermittent Page.goto() timeout error #560

Closed

imiric mentioned this pull request Oct 11, 2022

Intermittent ElementHandle.click() timeout error #561

Closed

imiric linked an issue Oct 11, 2022 that may be closed by this pull request

Intermittent ElementHandle.click() timeout error #561

Closed

imiric mentioned this pull request Oct 11, 2022

Temporary execution pauses in the Cloud #502

Closed

imiric linked an issue Oct 11, 2022 that may be closed by this pull request

Temporary execution pauses in the Cloud #502

Closed

ankur22 mentioned this pull request Oct 17, 2022

Fix wait for networkIdle lifecycle event #578

Merged

ankur22 mentioned this pull request Nov 7, 2022

Deadlock in TestPageWaitForFunction tests #635

Closed

ankur22 mentioned this pull request Nov 8, 2022

Refactor the frame's lifecycle event handling code #644

Merged

ankur22 mentioned this pull request Oct 16, 2024

Fix memory leak in queue #1488

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a queue per handler to keep events in order #555

Add a queue per handler to keep events in order #555

ankur22 commented Sep 29, 2022 •

edited

Loading

imiric left a comment

inancgumus left a comment •

edited

Loading

ankur22 commented Oct 3, 2022

ankur22 commented Oct 4, 2022

ankur22 commented Oct 6, 2022

imiric left a comment

inancgumus commented Oct 6, 2022 •

edited

Loading

inancgumus left a comment

ankur22 commented Oct 6, 2022

ankur22 commented Oct 6, 2022

Add a queue per handler to keep events in order #555

Add a queue per handler to keep events in order #555

Conversation

ankur22 commented Sep 29, 2022 • edited Loading

imiric left a comment

Choose a reason for hiding this comment

inancgumus left a comment • edited Loading

Choose a reason for hiding this comment

ankur22 commented Oct 3, 2022

ankur22 commented Oct 4, 2022

ankur22 commented Oct 6, 2022

imiric left a comment

Choose a reason for hiding this comment

inancgumus commented Oct 6, 2022 • edited Loading

inancgumus left a comment

Choose a reason for hiding this comment

ankur22 commented Oct 6, 2022

ankur22 commented Oct 6, 2022

ankur22 commented Sep 29, 2022 •

edited

Loading

inancgumus left a comment •

edited

Loading

inancgumus commented Oct 6, 2022 •

edited

Loading