-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix wait for networkIdle lifecycle event #578
Conversation
c818b3c
to
ff9f89d
Compare
ff9f89d
to
519c86f
Compare
90b87ab
to
0887571
Compare
Hey, Ankur! I couldn't reproduce the issue (neither before the fix nor after)—there was no timeout. I've changed the URL to the complex website's root URL. Could you help me to reproduce the problem? |
@ankur22 I'm not sure I understand the description, and will have to try reproducing it myself to confirm. 🤯 But just wanted to say: kudos on trying to understand this mess of an informal state machine... 😮💨 IMO |
31f0952
to
37a64e3
Compare
I find all of this super confusing 😅 My main question is: why do we need Frame lifecycle states are sequential and ordered: So I vote for a larger refactor to drop I looked at the proposed changes here, and they only raise more questions. E.g. why would increasing As for your questions:
Yes, we need to be able to process those as well. Imagine a SPA that emits
I think we should remove all of it, and not have any blocking waits. Easier said than done, of course, given how central this functionality is. But I think we would end up with a simpler, less racy, and more intuitive way of keeping track of frame states. BTW, let's create an issue for this where we can discuss possible solutions, instead of doing this in a PR. |
Me too 😞 There's a lot of guess work as to why it's done this way and I think we should just remove most of it and and fix any issues that we may find -- which i doubt we will
Very good questions, here's what I think they all mean. NOTE: my answers are based on the fact that once upon a time CDP events were received out of order prior to #555 and a lot these are "fixes" to try and bring it back in order. I also think we can remove them all and just forward all incoming frame lifecycle CDP events to the handlers that are interested in them:
Agreed, and this is exactly how we're receiving events from the browser via CDP -- they're in order and all looks good. Why would we need a state machine?
Yep, I like this suggestion. Let me know if i should go ahead with it 🙂 Also, why do we need a FSM if the order of CDP events from the browser is correct?
I only increased it since it was firing too early. Yep, you're correct, it could still fire too early. I would prefer to remove it.
What's a SPA?
👍 One thing to point out, I don't know why we would need to keep track of the frame state. As long as we forward all relevant CDP lifecycle events to the frames, then they can unblock on waits and all's well. Am i missing something? Further discussion for removing and simplifying lifecycle event in frames can be done in #593. For now, it would be good if this was either approved if we're happy that the waits are unblocked when we receive a |
@ankur22 Thanks for the summary of the functionality. 👍 Some notes:
So, yeah, I'm uncertain about many of these assertions and hypotheses, so take them with a grain of salt, but I think we should try to simplify and remove all of this mess.
I mean, it's implicitly a state machine anyway, which is what this spaghetti tries to model. We might as well make that explicit and be able to fail if, e.g., a wrong state transition is attempted. But you're right, the implementation doesn't have to be an FSM for this to work.
Some structure around this would be helpful, if nothing else so that we can query in what state the frame is at any point in time. But you're right that we might not need it, so maybe the first step should be giving that barebones approach a try first.
I would rather not merge this as is, since we're not sure about the overall impact it might have on other sites, given the fragile state of this core feature. So my vote is to close and proceed with a larger refactor. |
@imiric I think we're on the same page. Lots of what you have said are things that I've discovered myself, and somethings which are new (e.g. the links to puppeteer gives us some information as to why 500ms was possibly used in the first place). I'm definitely up for removing as much as we can while maintaining the current behaviour. We will need to create new issues for SPAs when we come across them and can easily replicate them. @inancgumus what do you think? |
@inancgumus I think this will help you understand what the issues were and why the fixes were done. This is a table of events which I observed before any fixes:
After the fixes have been applied we see this:
Yes that does fix this issue. By doing this what will happen in this scenario (google.com) is that we will receive the last request response which the page has made (success or failure) and the timeout timer will start. After 500ms it will fire and Although this works, I don't think we should be relying on the timeout in this scenario (i can see a scenario where a timeout would be necessary and I'm working on a test for it). We should be working directly with the CDP browser events where possible. We will definitely need to improve this implementation though where there are websites which poll frequently or are quite chatty for other reasons -- and in those scenarios it might be best to allow the user to determine what a safe timeout is instead of forcing 500ms.
I don't believe you need to track them, and that solution has just makes things more complicated. If we clear There are many issues with lifecycle events. In this PR i've tried to fix the I would like to close this PR and create a new PR with three fixes (two of which you have already pointed out in your patch in this comment:
All fixes will have a test to lock in the behaviour. Once that is done I will move onto refactoring the lifecycle events code, adding more tests where necessary to ensure that we don't regresse. WDYT @inancgumus & @imiric? |
The diff I shared in this comment was about demonstrating one part of the issue that I discovered after printing the events as you did. It wasn't a solution at all ;) I appreciated the detailed explanation ❤️ And #594 was a proper possible fix for our "current pubsub architecture". However, I agree that the solution was as complicated as the current architecture. So I'm OK to refactor and simplify the current architecture as long as we have proper tests in place 👍
It's up to you, but IMO, it's OK to merge this PR just with the diff I shared. Then you can continue where you left off in another PR. By the way, I believe it's more effective to create issues first for laying out a problem, its possible solutions, and discussions around it. Then we can send as many PRs as possible until the issue closes. |
Cool, i'll just do the fix here and clean up the commits so that we only have the most relevant fixes and tests.
Yep, I will endeavour to open a new issue next time, even if the fix initially feels trivial. |
37a64e3
to
98f7557
Compare
5828d3e
to
4485c6d
Compare
e8186bf
to
d81d9aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll review this probably tomorrow: It seems like we have good tests now 👏 After skimming it a little, the only trivial thing that caught my eye was counter
s. IMO, we can make them atomic.Int64
or something instead of a sync.RWMutex
.
When we test using WaitUntil, it is transformed from a go struct into a goja value. If we don't provide a struct tag, goja will transform the name to snake case (wait_until). Adding the js struct tag fixes this.
When a site that is navigated to emits a DOMContentLoaded well before load, this is a signal that we're waiting on assets to load over the network. In such a case (e.g. navigating to google.com) when we goto with waitUntil = networkIdle, due to how we handle such lifecycle events, we end up in a situation where the networkIdle event is ignored. This change ensure that we action on these events too. Partly fixes: https://github.com/orgs/grafana/projects/80/views/14
When we navigate to a page, lifecycle events were being stored. If a new navigation occurred, the previous events would be used and fired off to the internal handles. This is incorrect as it means that WaitUntil will fail to wait for the correct event after the initial navigation to a page. Partly fixes: https://github.com/orgs/grafana/projects/80/views/14
Some websites may have many network requests which are performed after the page has fully loaded and we receive a FrameStoppedLoading. For now we want to wait for the networkIdle event, and in the future if we need, we should add a timer for very chatty websites where we may never see a networkIdle event. Partly fixes: https://github.com/orgs/grafana/projects/80/views/14
f5158dd
to
4a55816
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's awesome that we have tests for this behavior now 👏 🥳 LGTM!
I made some opinionated changes to make tests more readable and maintainable in #603. Please feel free to skip them, I don't mind. I thought it would be better to create another branch instead of reviewing the PR in place.
Changes:
- Uses an atomic counter.
- Moves home assertion to a helper closure.
- IMO, editor support is useful when dealing with long HTMLs and scripts (especially when debugging later). So I moved two of the home page tests to static files.
Thanks for these changes.
What's the benefit of this change?
Looks good thanks, i'll cherry pick this change from your PR.
Not too sure about this, tbh. I prefer things inline. I'll make the change though if you have a very strong opinion on this. |
Sure 👍
IMHO, with HTMLs in a static folder, the test flow becomes clear, and we get editor support like syntax highlighting, completion, etc.—which can be useful when debugging it later. The downside is we will need to open another tab to see what's in the HTMLs. However, that's an easy problem to solve, we can always open the HTML in a side-by-side editor. It seems like a minor problem, at least for me. All of my changes were suggestions, not strong opinions. As I explained in my previous comment, feel free to apply or skip them. When I have a strong opinion on something, I request changes instead of approving PRs ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @ankur22! 👏
I'm not sure I understand the flow for all of these changes, but the sample script in #597 passes on this branch, but not on main
, so it seems like an improvement.
FWIW the doesn't_timeout_waiting_for_networkIdle
test passes for me on main
, so it seems susceptible to timing. Nice job on the tests though 👍
Thanks for the review @imiric.
This is an annoying issue, which has caught me a third time now 😅 (including just now when i was debugging it after you mentioned it). The |
Fixes: #597
Details
There are three
lifecycle
types that are emitted from the browser over CDP:load
DOMContentLoaded
networkIdle
We should be able to to wait on any of these and proceed with the test once we receive them and action on them at an appropriate time. Currently (4485c6d) the wait for
networkIdle
waits indefinitely on some websites e.g. google.com.IMO, we should be able to rely on the browser to send us the CDP lifecycle events, which in my tests it does do. What we seem to have (and I can only guess the reasoning behind it) are three redundancies in place incase we don't receive these events from the browser at all or within a set timeout (the timeout scenario is only for
networkIdle
) or not in the order we expect (if a webpage has a hierarchy of frames, then currently we wait for the child frames to receive and action on the lifecycle events before the main one).The first redundancy we have is the timeout for
networkIdle
. This timer should fire when:networkIdle
lifecycle event.Unfortunately what happens in this issue is that another event is received (EventFrameStoppedLoading) which sets
networkIdle
in the frame'slifecycleEvents
, and when all requests finally complete (which is after we receiveEventFrameStoppedLoading
) the time cannot be started. Therefore there is nothing in place to recalculate the frame's lifecycle, and unblock waiting actions. I also think 500ms is not long enough for a timeout and I've seen situations where the timeout has fired, but then we receive thenetworkIdle
lifecycle CDP event a very short while later.The second redundancy in place is listening to
EventFrameStoppedLoading
and setting all the lifecycle types to the frame'slifecycleEvents
, and when a subsequent new lifecycle event is received from the browser it should recalculateLifecycle which would unblock any waits. The problem with this currently is that this event can be received but we then don't receive and action on any other lifecycle events after -- we actually do receivenetworkIdle
after but we're not callingrecalculateLifecycle
for an unknown reason.The fix for unblocking when waiting on
networkIdle
is:recalculateLifecycle
once we seenetworkIdle
lifecycle events.After this fix, I found another issue -- if we navigate to google.com, we correctly receive
networkIdle
andrecalculateLifecycle
. If we now navigate again to google.com, we unblock the wait as soon as we receive any lifecycle event. We should be waiting on the correctnetworkIdle
event before unblocking waits. The issue is in clearLifecycle. We're looping throughlifecycleEvents
and setting the entry values tofalse
. This actually doesn't help later whenrecalculateLifecycle
is called as it doesn't take into account the value of thefalse
entry, and instead emits all the lifecycle events it has seen so far (which were from a previous navigation) and unblocks waits too early.The fix is:
lifecycleEvents
inclearLifecycle
. This will ensure we unblock waits when a new lifecycle event is received for the new navigation.There is now a third redundancy and a third question that needs answering -- if we navigate to youtube.com, it contains a hierarchy of frames, and so the "main" frame currently needs to wait for its child frames to receive the lifecycle events separately before it will itself action on lifecycle events and unblock waits. Again, i've not seen evidence where the child frames don't receive the events but the main frame does or vice-a-verse. So I think having this dependency on the main/child frames is unrequired, and we should add it back in once we have a real reason to do so.
Unanswered questions
DOMContentLoaded
) when navigating to a page (e.g. youtube.com). We're currently receiving and actioning on the first event, but ignore all the same subsequent events -- should we be doing something with all the other events? I think we can only action on the first event as we can't predicate how the browser will behave and how many times theDOMContentLoaded
will be fired.Issue
Test script:
Version of xk6-browser where this test fails with the following error: 4485c6d
After the fix we don't see this timeout and xk6-browser correctly unblocks on wait when it receives the
networkIdle
lifecycle event.