-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a queue per handler to keep events in order #555
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, nicely done! From my local tests your change seems to work 👍
Do you think it would be possible to have tests for this? The previous behavior was flaky, so not sure how easily reproducible it is. Worth a try?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice idea, @ankur22! 👏 This ensures that the handlers receive emitted events in order.
One question: How can we be sure that this PR fixes the issue in #553 and maybe #482 or #427 or #496? I'm unsure whether we were aiming to fix them with this PR, though :)?
By the way, I ran this PR's tests on CI a few times, and there were errors (1, 2, 3). Also, locally, I received the following error:
time="2022-10-03T10:52:42+03:00" level=error msg="process with PID 72688 unexpectedly ended: signal: killed" category=browser elapsed="0 ms" goroutine=99722
--- FAIL: TestLocator (39.38s)
--- FAIL: TestLocator/Tap (1.63s)
locator_test.go:188:
Error Trace: /Users/inanc/grafana/xk6-browser/tests/locator_test.go:188
Error: Expected value not to be nil.
Test: TestLocator/Tap
--- FAIL: TestLocator/Click (30.62s)
cleanup panicked with closing the browser: sending a message to browser: websocket: close 1006 (abnormal closure): unexpected EOFpanic: clicking on "#link": timed out after 30s [recovered]
panic: clicking on "#link": timed out after 30s
WDYT about these errors? Do you believe they're related to this PR?
Good idea, I'll try. |
The aim here is to ensure that the ordering of the CDP events is maintained and the handlers receive the events in order. I'll create a test that breaks under what currently exists, and hopefully that same test will pass with whatever the fix is. Although, I think #482 and #496 should be resolved by whatever this fix is 🤞
🤔 1 and 3 are worrying, and I'll need to investigate those. 2 and the |
b28c3ed
to
6a45010
Compare
0741dec
to
5d55090
Compare
Unit tests added. I verified that out of the two tests, the test that ensures that events are received in order fails in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work, Ankur! 👏
It took me a while to get the double queue approach, but it makes sense.
The tests are a great touch, and I also confirmed the order one fails on main
. 👍
I ran this with several of our examples, which worked great, and with the Vistaprint script, which needs to be rewritten to use the new async APIs, but at least it didn't regress compared to main
. 😅
Let's create a Cloud release when this is merged, so that it can be battle tested.
@Ankur Awesome work, thanks for this! 👏 I ran the CI tests multiple times, they worked fabulously, thanks to you. On the 3rd run, I saw the following error in this CI job:
I believe this one might be related to #560 or the previous related issues. I think it is not related to #553. Then again, I wanted to let you know (in case you may have an idea). I also saw the following error when I ran the tests locally:
This one is related to the previous error I explained above. The origin of both errors seems to be |
5c3e2fd
to
1e6535c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making all of our lives easier 🎉
Thanks @inancgumus! I've taken a look at those errors before actually, and unfortunately they are race conditions unrelated to unordered CDP events. I did start to look into them a couple of months ago before the deadlock cloud issue was prioritised over other issues. I'd be happy to look into these next though as they're a real PITA when stress testing locally, and obviously a pain in CI -- i had to comment out a lot of tests since they kept failing in response being nil when they shouldn't have been 😭 |
This will help us to diagnose any deadlocks. The unit tests should pass within the 2 second limit. Resolves: #555 (comment)
This will help us to diagnose any deadlocks. The unit tests should pass within the 2 second limit. Resolves: #555 (comment)
22fefb2
to
0047d6c
Compare
These errors occurred prior to adding in the "double |
Co-authored-by: İnanç Gümüş <[email protected]>
This timer is no longer needed since the fixes in PR #555 and #597. These PRs helped to ensure that we receive the CDP events in order and that we correctly handle them when we do receive them. The networkidle timer was in place for a situation where we might not receive the networkidle lifecycle event from the browser via CDP. This could still occur, which would mean that there's a bug outside of our direct control. When this happens we should find a better solution.
The reason this was in place was due to out of order CDP events. Before PR #555, there was a slim chance that the root frame would receive lifecycle events before its child frames. This is now not the case and we can see after extensive stress tests that the child frame first receives CDP events and then the root/main frame.
Each frame (main and the subs) receive their own lifecycle events in the correct order (so the sub frames first, and then the root/main frame). The reason for having this code in the first place I believe is due to out of order CDP event, which has now been fixed in PR #555. After extensive stress testing I can confirm that we no longer need to perform the recalculateLifecycle in a recursive way. There maybe a need to do this in the future if the browser changes its behaviour on how it deals with lifecycle events, but at this point we should consider adding such a recursive call.
Each frame (main and the subs) receive their own lifecycle events in the correct order (so the sub frames first, and then the root/main frame). The reason for having this code in the first place I believe is due to out of order CDP event, which has now been fixed in PR #555. After extensive stress testing I can confirm that we no longer need to perform the recalculateLifecycle in a recursive way. So each frame should be in charge of recalculating its own lifecycle. There maybe a need to do this in the future if the browser changes its behaviour on how it deals with lifecycle events, but at this point we should not consider adding such a recursive call.
This timer is no longer needed since the fixes in PR #555 and #597. These PRs helped to ensure that we receive the CDP events in order and that we correctly handle them when we do receive them. The networkidle timer was in place for a situation where we might not receive the networkidle lifecycle event from the browser via CDP. This could still occur, which would mean that there's a bug outside of our direct control. When this happens we should find a better solution.
Each frame (main and the subs) receive their own lifecycle events in the correct order (so the sub frames first, and then the root/main frame). The reason for having this code in the first place I believe is due to out of order CDP event, which has now been fixed in PR #555. After extensive stress testing I can confirm that we no longer need to perform the recalculateLifecycle in a recursive way. So each frame should be in charge of recalculating its own lifecycle. There maybe a need to do this in the future if the browser changes its behaviour on how it deals with lifecycle events, but at this point we should not consider adding such a recursive call.
This timer is no longer needed since the fixes in PR #555 and #597. These PRs helped to ensure that we receive the CDP events in order and that we correctly handle them when we do receive them. The networkidle timer was in place for a situation where we might not receive the networkidle lifecycle event from the browser via CDP. This could still occur, which would mean that there's a bug outside of our direct control. When this happens we should find a better solution.
Each frame (main and the subs) receive their own lifecycle events in the correct order (so the sub frames first, and then the root/main frame). The reason for having this code in the first place I believe is due to out of order CDP event, which has now been fixed in PR #555. After extensive stress testing I can confirm that we no longer need to perform the recalculateLifecycle in a recursive way. So each frame should be in charge of recalculating its own lifecycle. There maybe a need to do this in the future if the browser changes its behaviour on how it deals with lifecycle events, but at this point we should not consider adding such a recursive call.
Events are sometimes out of order due to use using a goroutine to publish events to the handler concurrently as there's no guarantee when the goroutines will run (the first goroutine to be created isn't necessarily going to be the first to run). To overcome this, a
queue
has been added per handler to synchronise all publishes of events to that handler as well as amutex
so that only oneemitEvent
goroutine can read from thequeue
and write to the handler's channel at any one time.We tried working with a buffered channel instead of a
queue
(which is a slice). Unfortunately it's difficult to determine how large the channel needs to be to work with the test -- the more complex the test, the more events that will be emitted. For now we think it is easier to work with aqueue
that can grow when needed.After creating the unit tests, we found that although we were receiving events in order we had created a deadlock when a handler was receiving events and emitting new events. To overcome this we had to create a "double
queue
" system whereby new emitted events can be stored on onequeue
, while theemitEvent
goroutines can read off of a separatequeue
. When the readqueue
is depleted, the queues are swapped around.Closes: #553