Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a queue per handler to keep events in order #555

Merged
merged 8 commits into from
Oct 7, 2022

Conversation

ankur22
Copy link
Collaborator

@ankur22 ankur22 commented Sep 29, 2022

Events are sometimes out of order due to use using a goroutine to publish events to the handler concurrently as there's no guarantee when the goroutines will run (the first goroutine to be created isn't necessarily going to be the first to run). To overcome this, a queue has been added per handler to synchronise all publishes of events to that handler as well as a mutex so that only one emitEvent goroutine can read from the queue and write to the handler's channel at any one time.

We tried working with a buffered channel instead of a queue (which is a slice). Unfortunately it's difficult to determine how large the channel needs to be to work with the test -- the more complex the test, the more events that will be emitted. For now we think it is easier to work with a queue that can grow when needed.

After creating the unit tests, we found that although we were receiving events in order we had created a deadlock when a handler was receiving events and emitting new events. To overcome this we had to create a "double queue" system whereby new emitted events can be stored on one queue, while the emitEvent goroutines can read off of a separate queue. When the read queue is depleted, the queues are swapped around.

Closes: #553

common/event_emitter.go Outdated Show resolved Hide resolved
Copy link
Contributor

@imiric imiric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, nicely done! From my local tests your change seems to work 👍

Do you think it would be possible to have tests for this? The previous behavior was flaky, so not sure how easily reproducible it is. Worth a try?

common/event_emitter.go Outdated Show resolved Hide resolved
Copy link
Member

@inancgumus inancgumus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice idea, @ankur22! 👏 This ensures that the handlers receive emitted events in order.

One question: How can we be sure that this PR fixes the issue in #553 and maybe #482 or #427 or #496? I'm unsure whether we were aiming to fix them with this PR, though :)?

By the way, I ran this PR's tests on CI a few times, and there were errors (1, 2, 3). Also, locally, I received the following error:

time="2022-10-03T10:52:42+03:00" level=error msg="process with PID 72688 unexpectedly ended: signal: killed" category=browser elapsed="0 ms" goroutine=99722
--- FAIL: TestLocator (39.38s)
    --- FAIL: TestLocator/Tap (1.63s)
        locator_test.go:188: 
                Error Trace:    /Users/inanc/grafana/xk6-browser/tests/locator_test.go:188
                Error:          Expected value not to be nil.
                Test:           TestLocator/Tap
    --- FAIL: TestLocator/Click (30.62s)
    cleanup panicked with closing the browser: sending a message to browser: websocket: close 1006 (abnormal closure): unexpected EOFpanic: clicking on "#link": timed out after 30s [recovered]
        panic: clicking on "#link": timed out after 30s

WDYT about these errors? Do you believe they're related to this PR?

@ankur22
Copy link
Collaborator Author

ankur22 commented Oct 3, 2022

Do you think it would be possible to have tests for this? The previous behavior was flaky, so not sure how easily reproducible it is. Worth a try?

Good idea, I'll try.

@ankur22
Copy link
Collaborator Author

ankur22 commented Oct 4, 2022

One question: How can we be sure that this PR fixes the issue in #553 and maybe #482 or #427 or #496? I'm unsure whether we were aiming to fix them with this PR, though :)?

The aim here is to ensure that the ordering of the CDP events is maintained and the handlers receive the events in order. I'll create a test that breaks under what currently exists, and hopefully that same test will pass with whatever the fix is. Although, I think #482 and #496 should be resolved by whatever this fix is 🤞

By the way, I ran this PR's tests on CI a few times

🤔 1 and 3 are worrying, and I'll need to investigate those.

2 and the TestLocator/Tap error are a concern for us but I run into those often when running the Locator tests even on main. I believe they're caused by some sort of race issue, but not to do with the out of order CDP events.

@ankur22 ankur22 force-pushed the fix/553-unordered-events branch from b28c3ed to 6a45010 Compare October 5, 2022 15:27
@ankur22 ankur22 marked this pull request as ready for review October 5, 2022 15:29
@ankur22 ankur22 changed the title Draft: Add a queue per handler to keep events in order Add a queue per handler to keep events in order Oct 5, 2022
@ankur22 ankur22 force-pushed the fix/553-unordered-events branch 2 times, most recently from 0741dec to 5d55090 Compare October 6, 2022 08:57
@ankur22
Copy link
Collaborator Author

ankur22 commented Oct 6, 2022

Do you think it would be possible to have tests for this? The previous behavior was flaky, so not sure how easily reproducible it is. Worth a try?

Good idea, I'll try.

Unit tests added. I verified that out of the two tests, the test that ensures that events are received in order fails in main.

@ankur22 ankur22 requested review from imiric and inancgumus October 6, 2022 09:48
common/event_emitter.go Outdated Show resolved Hide resolved
common/event_emitter.go Outdated Show resolved Hide resolved
common/event_emitter.go Outdated Show resolved Hide resolved
Copy link
Contributor

@imiric imiric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work, Ankur! 👏

It took me a while to get the double queue approach, but it makes sense.

The tests are a great touch, and I also confirmed the order one fails on main. 👍

I ran this with several of our examples, which worked great, and with the Vistaprint script, which needs to be rewritten to use the new async APIs, but at least it didn't regress compared to main. 😅

Let's create a Cloud release when this is merged, so that it can be battle tested. :shipit:

common/event_emitter_test.go Outdated Show resolved Hide resolved
@inancgumus
Copy link
Member

inancgumus commented Oct 6, 2022

@Ankur Awesome work, thanks for this! 👏

I ran the CI tests multiple times, they worked fabulously, thanks to you.

On the 3rd run, I saw the following error in this CI job:

=== CONT  TestWaitForFrameNavigationWithinDocument/anchor
    frame_manager_test.go:46: 
        	Error Trace:	/home/runner/work/xk6-browser/xk6-browser/tests/frame_manager_test.go:46
        	            				/home/runner/work/xk6-browser/xk6-browser/tests/asm_amd64.s:1571
        	Error:      	Expected value not to be nil.
        	Test:       	TestWaitForFrameNavigationWithinDocument/anchor

I believe this one might be related to #560 or the previous related issues. I think it is not related to #553. Then again, I wanted to let you know (in case you may have an idea).

I also saw the following error when I ran the tests locally:

--- FAIL: TestLocator (1.00s)
    locator_test.go:271: 
                Error Trace:    /Users/inanc/grafana/xk6-browser/tests/locator_test.go:271
                Error:          Expected value not to be nil.
                Test:           TestLocator

This one is related to the previous error I explained above. The origin of both errors seems to be page.Goto. Although, #560 is about a timeout error, after this change, the underlying problem with page.Goto might have reached the surface.

@ankur22 ankur22 force-pushed the fix/553-unordered-events branch from 5c3e2fd to 1e6535c Compare October 6, 2022 10:52
Copy link
Member

@inancgumus inancgumus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making all of our lives easier 🎉

@ankur22
Copy link
Collaborator Author

ankur22 commented Oct 6, 2022

@Ankur Awesome work, thanks for this! 👏

I ran the CI tests multiple times, they worked fabulously, thanks to you.

On the 3rd run, I saw the following error in this CI job:

=== CONT  TestWaitForFrameNavigationWithinDocument/anchor
    frame_manager_test.go:46: 
        	Error Trace:	/home/runner/work/xk6-browser/xk6-browser/tests/frame_manager_test.go:46
        	            				/home/runner/work/xk6-browser/xk6-browser/tests/asm_amd64.s:1571
        	Error:      	Expected value not to be nil.
        	Test:       	TestWaitForFrameNavigationWithinDocument/anchor

I believe this one might be related to #560 or the previous related issues. I think it is not related to #553. Then again, I wanted to let you know (in case you may have an idea).

I also saw the following error when I ran the tests locally:

--- FAIL: TestLocator (1.00s)
    locator_test.go:271: 
                Error Trace:    /Users/inanc/grafana/xk6-browser/tests/locator_test.go:271
                Error:          Expected value not to be nil.
                Test:           TestLocator

This one is related to the previous error I explained above. The origin of both errors seems to be page.Goto. Although, #560 is about a timeout error, after this change, the underlying problem with page.Goto might have reached the surface.

Thanks @inancgumus!

I've taken a look at those errors before actually, and unfortunately they are race conditions unrelated to unordered CDP events. I did start to look into them a couple of months ago before the deadlock cloud issue was prioritised over other issues. I'd be happy to look into these next though as they're a real PITA when stress testing locally, and obviously a pain in CI -- i had to comment out a lot of tests since they kept failing in response being nil when they shouldn't have been 😭

ankur22 added a commit that referenced this pull request Oct 6, 2022
This will help us to diagnose any deadlocks. The unit tests should
pass within the 2 second limit.

Resolves: #555 (comment)
ankur22 added a commit that referenced this pull request Oct 6, 2022
This will help us to diagnose any deadlocks. The unit tests should
pass within the 2 second limit.

Resolves: #555 (comment)
@ankur22 ankur22 force-pushed the fix/553-unordered-events branch from 22fefb2 to 0047d6c Compare October 6, 2022 13:15
@ankur22
Copy link
Collaborator Author

ankur22 commented Oct 6, 2022

🤔 1 and 3 are worrying, and I'll need to investigate those.

These errors occurred prior to adding in the "double queue" system for managing the read and write of events from different goroutines. When using a single queue, it would deadlock, which is what those test runs showed us. Those same deadlocks shouldn't happen now with the "double queue" system (i've done stress tests locally and not had a deadlock so far).

common/event_emitter.go Outdated Show resolved Hide resolved
@ankur22 ankur22 merged commit 508bcb8 into main Oct 7, 2022
@ankur22 ankur22 deleted the fix/553-unordered-events branch October 7, 2022 10:03
@ankur22 ankur22 mentioned this pull request Oct 7, 2022
@imiric imiric linked an issue Oct 11, 2022 that may be closed by this pull request
@imiric imiric linked an issue Oct 11, 2022 that may be closed by this pull request
@imiric imiric linked an issue Oct 11, 2022 that may be closed by this pull request
@imiric imiric linked an issue Oct 11, 2022 that may be closed by this pull request
@imiric imiric linked an issue Oct 11, 2022 that may be closed by this pull request
ankur22 added a commit that referenced this pull request Nov 8, 2022
This timer is no longer needed since the fixes in PR #555 and #597.
These PRs helped to ensure that we receive the CDP events in order
and that we correctly handle them when we do receive them. The
networkidle timer was in place for a situation where we might not
receive the networkidle lifecycle event from the browser via CDP.
This could still occur, which would mean that there's a bug outside
of our direct control. When this happens we should find a better
solution.
ankur22 added a commit that referenced this pull request Nov 8, 2022
The reason this was in place was due to out of order CDP events.
Before PR #555, there was a slim chance that the root frame would
receive lifecycle events before its child frames. This is now not the
case and we can see after extensive stress tests that the child frame
first receives CDP events and then the root/main frame.
ankur22 added a commit that referenced this pull request Nov 8, 2022
Each frame (main and the subs) receive their own lifecycle events in
the correct order (so the sub frames first, and then the root/main
frame). The reason for having this code in the first place I believe
is due to out of order CDP event, which has now been fixed in PR #555.

After extensive stress testing I can confirm that we no longer need to
perform the recalculateLifecycle in a recursive way.

There maybe a need to do this in the future if the browser changes its
behaviour on how it deals with lifecycle events, but at this point we
should consider adding such a recursive call.
ankur22 added a commit that referenced this pull request Nov 8, 2022
Each frame (main and the subs) receive their own lifecycle events in
the correct order (so the sub frames first, and then the root/main
frame). The reason for having this code in the first place I believe
is due to out of order CDP event, which has now been fixed in PR #555.

After extensive stress testing I can confirm that we no longer need to
perform the recalculateLifecycle in a recursive way. So each frame
should be in charge of recalculating its own lifecycle.

There maybe a need to do this in the future if the browser changes its
behaviour on how it deals with lifecycle events, but at this point we
should not consider adding such a recursive call.
ankur22 added a commit that referenced this pull request Nov 11, 2022
This timer is no longer needed since the fixes in PR #555 and #597.
These PRs helped to ensure that we receive the CDP events in order
and that we correctly handle them when we do receive them. The
networkidle timer was in place for a situation where we might not
receive the networkidle lifecycle event from the browser via CDP.
This could still occur, which would mean that there's a bug outside
of our direct control. When this happens we should find a better
solution.
ankur22 added a commit that referenced this pull request Nov 11, 2022
Each frame (main and the subs) receive their own lifecycle events in
the correct order (so the sub frames first, and then the root/main
frame). The reason for having this code in the first place I believe
is due to out of order CDP event, which has now been fixed in PR #555.

After extensive stress testing I can confirm that we no longer need to
perform the recalculateLifecycle in a recursive way. So each frame
should be in charge of recalculating its own lifecycle.

There maybe a need to do this in the future if the browser changes its
behaviour on how it deals with lifecycle events, but at this point we
should not consider adding such a recursive call.
ankur22 added a commit that referenced this pull request Nov 11, 2022
This timer is no longer needed since the fixes in PR #555 and #597.
These PRs helped to ensure that we receive the CDP events in order
and that we correctly handle them when we do receive them. The
networkidle timer was in place for a situation where we might not
receive the networkidle lifecycle event from the browser via CDP.
This could still occur, which would mean that there's a bug outside
of our direct control. When this happens we should find a better
solution.
ankur22 added a commit that referenced this pull request Nov 11, 2022
Each frame (main and the subs) receive their own lifecycle events in
the correct order (so the sub frames first, and then the root/main
frame). The reason for having this code in the first place I believe
is due to out of order CDP event, which has now been fixed in PR #555.

After extensive stress testing I can confirm that we no longer need to
perform the recalculateLifecycle in a recursive way. So each frame
should be in charge of recalculating its own lifecycle.

There maybe a need to do this in the future if the browser changes its
behaviour on how it deals with lifecycle events, but at this point we
should not consider adding such a recursive call.
@ankur22 ankur22 mentioned this pull request Oct 16, 2024
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants