Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix wait for networkIdle lifecycle event #578

Merged
merged 6 commits into from
Oct 26, 2022
Merged

Fix wait for networkIdle lifecycle event #578

merged 6 commits into from
Oct 26, 2022

Conversation

ankur22
Copy link
Collaborator

@ankur22 ankur22 commented Oct 10, 2022

Fixes: #597

Details

There are three lifecycle types that are emitted from the browser over CDP:

  • load
  • DOMContentLoaded
  • networkIdle

We should be able to to wait on any of these and proceed with the test once we receive them and action on them at an appropriate time. Currently (4485c6d) the wait for networkIdle waits indefinitely on some websites e.g. google.com.

IMO, we should be able to rely on the browser to send us the CDP lifecycle events, which in my tests it does do. What we seem to have (and I can only guess the reasoning behind it) are three redundancies in place incase we don't receive these events from the browser at all or within a set timeout (the timeout scenario is only for networkIdle) or not in the order we expect (if a webpage has a hierarchy of frames, then currently we wait for the child frames to receive and action on the lifecycle events before the main one).

The first redundancy we have is the timeout for networkIdle. This timer should fire when:

  1. 500ms have passed;
  2. There are 0 inflight requests (actually this condition is needed for the timer to start)
  3. We've not already received the networkIdle lifecycle event.

Unfortunately what happens in this issue is that another event is received (EventFrameStoppedLoading) which sets networkIdle in the frame's lifecycleEvents, and when all requests finally complete (which is after we receive EventFrameStoppedLoading) the time cannot be started. Therefore there is nothing in place to recalculate the frame's lifecycle, and unblock waiting actions. I also think 500ms is not long enough for a timeout and I've seen situations where the timeout has fired, but then we receive the networkIdle lifecycle CDP event a very short while later.

The second redundancy in place is listening to EventFrameStoppedLoading and setting all the lifecycle types to the frame's lifecycleEvents, and when a subsequent new lifecycle event is received from the browser it should recalculateLifecycle which would unblock any waits. The problem with this currently is that this event can be received but we then don't receive and action on any other lifecycle events after -- we actually do receive networkIdle after but we're not calling recalculateLifecycle for an unknown reason.

The fix for unblocking when waiting on networkIdle is:

  • Call recalculateLifecycle once we see networkIdle lifecycle events.
  • Increase the timeout to 1sec so that we're not timing out too early.

After this fix, I found another issue -- if we navigate to google.com, we correctly receive networkIdle and recalculateLifecycle. If we now navigate again to google.com, we unblock the wait as soon as we receive any lifecycle event. We should be waiting on the correct networkIdle event before unblocking waits. The issue is in clearLifecycle. We're looping through lifecycleEvents and setting the entry values to false. This actually doesn't help later when recalculateLifecycle is called as it doesn't take into account the value of the false entry, and instead emits all the lifecycle events it has seen so far (which were from a previous navigation) and unblocks waits too early.

The fix is:

  • Completely clear the frame's lifecycleEvents in clearLifecycle. This will ensure we unblock waits when a new lifecycle event is received for the new navigation.

There is now a third redundancy and a third question that needs answering -- if we navigate to youtube.com, it contains a hierarchy of frames, and so the "main" frame currently needs to wait for its child frames to receive the lifecycle events separately before it will itself action on lifecycle events and unblock waits. Again, i've not seen evidence where the child frames don't receive the events but the main frame does or vice-a-verse. So I think having this dependency on the main/child frames is unrequired, and we should add it back in once we have a real reason to do so.

Unanswered questions

  1. The browser may send multiple lifecycle events to the same frame with the same type (e.g. DOMContentLoaded) when navigating to a page (e.g. youtube.com). We're currently receiving and actioning on the first event, but ignore all the same subsequent events -- should we be doing something with all the other events? I think we can only action on the first event as we can't predicate how the browser will behave and how many times the DOMContentLoaded will be fired.
  2. By removing all three of the redundancies, it would make the lifecycle code a lot simpler, and we would just need to rely on the CDP events to dictate when to unblock waits. I have a feeling that this was added in due to out of order CDP events, which has now been resolved. If everyone agrees then I can go ahead and remove those redundancies and make things a lot simpler.

Issue

Test script:

import { chromium } from 'k6/x/browser';

export const options = {
  vus: 1,
  iterations: 1,
};

export default function () {
  const browser = chromium.launch({
    headless: false,
  });

  const context = browser.newContext();

  const page = context.newPage();

  page.goto('https://www.google.com', { waitUntil: 'networkidle' }).then(() => {
    console.log('done')
    return page.goto('https://www.google.com', { waitUntil: 'networkidle' }).then(() => {
      console.log('done')
    });
  }).finally(() => {
    browser.close();
  });
}

Version of xk6-browser where this test fails with the following error: 4485c6d

ERRO[0030] communicating with browser: websocket: close 1006 (abnormal closure): unexpected EOF  category=cdp elapsed="0 ms" goroutine=53
ERRO[0030] process with PID 41718 unexpectedly ended: signal: killed  category=browser elapsed="0 ms" goroutine=75
ERRO[0030] Uncaught (in promise) navigating to "https://www.google.com": timed out after 30s  executor=shared-iterations scenario=default

After the fix we don't see this timeout and xk6-browser correctly unblocks on wait when it receives the networkIdle lifecycle event.

@ankur22 ankur22 marked this pull request as draft October 10, 2022 17:04
@ankur22 ankur22 changed the title Fix wait in networkIdle when event already fired Fix wait for networkIdle when event already fired Oct 10, 2022
@ankur22 ankur22 added the bug Something isn't working label Oct 10, 2022
@ankur22 ankur22 self-assigned this Oct 10, 2022
@ankur22 ankur22 added this to the v0.6.0 milestone Oct 10, 2022
@ankur22 ankur22 marked this pull request as ready for review October 11, 2022 13:00
@ankur22 ankur22 changed the title Fix wait for networkIdle when event already fired Fix wait for networkIdle for complex scenarios Oct 11, 2022
@ankur22 ankur22 force-pushed the fix/networkidle branch 2 times, most recently from 90b87ab to 0887571 Compare October 11, 2022 13:18
@inancgumus
Copy link
Member

inancgumus commented Oct 11, 2022

Before the fix the test would time out since the networkIdle event had already fired before the frame registered to the event loop. After the fix, the test passes.

Hey, Ankur! I couldn't reproduce the issue (neither before the fix nor after)—there was no timeout. I've changed the URL to the complex website's root URL. Could you help me to reproduce the problem?

@ankur22 ankur22 marked this pull request as draft October 11, 2022 15:01
@imiric
Copy link
Contributor

imiric commented Oct 11, 2022

@ankur22 I'm not sure I understand the description, and will have to try reproducing it myself to confirm. 🤯

But just wanted to say: kudos on trying to understand this mess of an informal state machine... 😮‍💨 IMO Frame.recalculateLifecycle(), Frame.onLifecycleEvent(), Frame.hasLifecycleEventFired(), Frame.hasSubtreeLifecycleEventFired(), and all related methods should be either heavily refactored or removed altogether. They're inherently racy, and incredibly difficult to reason about. It only becomes apparent they're broken when testing a complex site like that script has, but it's surprising they work even for simpler scripts.

@ankur22 ankur22 changed the title Fix wait for networkIdle for complex scenarios Draft: Fix wait for networkIdle for complex scenarios Oct 12, 2022
@ankur22 ankur22 force-pushed the fix/networkidle branch 3 times, most recently from 31f0952 to 37a64e3 Compare October 14, 2022 10:29
@ankur22 ankur22 changed the title Draft: Fix wait for networkIdle for complex scenarios Fix wait for networkIdle lifecycle event Oct 14, 2022
@ankur22 ankur22 marked this pull request as ready for review October 14, 2022 14:15
@imiric
Copy link
Contributor

imiric commented Oct 17, 2022

I find all of this super confusing 😅

My main question is: why do we need {start,stop}NetworkIdleTimer() / clearLifecycle() / recalculateLifecycle() / has{Subtree,}LifecycleEventFired() at all? All of this seems to be done so we can keep track of the lifecycle state of each frame.

Frame lifecycle states are sequential and ordered: commit -> DOMContentLoaded -> load -> networkAlmostIdle -> networkIdle. This can be easily expressed as a state machine. They can be interrupted and loop again, as in the case of SPAs (also mentioned below), but this order should always be maintained. If we don't receive an event, or skip an event, then that's an error state and should be raised.

So I vote for a larger refactor to drop Frame.{lifecycle,subtreeLifecycle}Events and all these racy state calculation functions (Frame.onLoadingStopped() seems super hacky...), in favor of a simpler FSM implementation.

I looked at the proposed changes here, and they only raise more questions. E.g. why would increasing LifeCycleNetworkIdleTimeout to 1s be enough, and couldn't we run into the same issues when it's exceeded?

As for your questions:

  1. The browser may send multiple lifecycle events to the same frame with the same type [...] should we be doing something with all the other events?

Yes, we need to be able to process those as well. Imagine a SPA that emits DOMContentLoaded, or maybe even networkIdle. Any internal navigation triggered by JS would cause us to receive those events again. (I assume so, and haven't confirmed it, but would be surprised if that isn't the case. We should have a SPA example site in that examples repo we have so that this can be reproduced.)

  1. By removing all three of the redundancies, it would make the lifecycle code a lot simpler, and we would just need to rely on the CDP events to dictate when to unblock waits.

I think we should remove all of it, and not have any blocking waits. Easier said than done, of course, given how central this functionality is. But I think we would end up with a simpler, less racy, and more intuitive way of keeping track of frame states.

BTW, let's create an issue for this where we can discuss possible solutions, instead of doing this in a PR.

@ankur22
Copy link
Collaborator Author

ankur22 commented Oct 17, 2022

I find all of this super confusing 😅

Me too 😞 There's a lot of guess work as to why it's done this way and I think we should just remove most of it and and fix any issues that we may find -- which i doubt we will

My main question is: why do we need {start,stop}NetworkIdleTimer() / clearLifecycle() / recalculateLifecycle() / has{Subtree,}LifecycleEventFired() at all? All of this seems to be done so we can keep track of the lifecycle state of each frame.

Very good questions, here's what I think they all mean. NOTE: my answers are based on the fact that once upon a time CDP events were received out of order prior to #555 and a lot these are "fixes" to try and bring it back in order. I also think we can remove them all and just forward all incoming frame lifecycle CDP events to the handlers that are interested in them:

  • {start,stop}NetworkIdleTimer() -- The timer is started when we stop seeing requests being made by the page the test is navigating to. After 500ms it will try to emit networkIdle event to unblock any waiting navigation actions. The timer is stopped when the frame is destroyed, cleared, detached, and when the web page makes another request within the 500ms timeout. It's just a fail safe in place in case networkIdle isn't received via CDP from the browser after the last request made by the page.
  • clearLifecycle -- it's called once the page navigation ends. The assumption is that once the page has been navigated to, there's no need for that data now, and it prepares the frame for the next navigation.
  • recalculateLifecycle -- this will receive lifecycle CDP events, and allow these events to be forwarded to handlers if:
    • Its child frames have already received the lifecycle event in question and emitted the event themselves first.
    • It itself hasn't already forwarded (emitted) the lifecycle event to handlers.
    • Forwards (emits) the lifecycle events to the main page -- Probably can remove this since no handlers are listening to this event.
  • has{Subtree,}LifecycleEventFired() -- this is needed to prevent emitting more than one of the lifecycle events to handlers, and also to help ensure that the child frames emit an event first before the main frame emits the event.

Frame lifecycle states are sequential and ordered: commit -> DOMContentLoaded -> load -> networkAlmostIdle -> networkIdle. This can be easily expressed as a state machine. They can be interrupted and loop again, as in the case of SPAs (also mentioned below), but this order should always be maintained. If we don't receive an event, or skip an event, then that's an error state and should be raised.

Agreed, and this is exactly how we're receiving events from the browser via CDP -- they're in order and all looks good. Why would we need a state machine?

So I vote for a larger refactor to drop Frame.{lifecycle,subtreeLifecycle}Events and all these racy state calculation functions (Frame.onLoadingStopped() seems super hacky...), in favor of a simpler FSM implementation.

Yep, I like this suggestion. Let me know if i should go ahead with it 🙂 Also, why do we need a FSM if the order of CDP events from the browser is correct?

I looked at the proposed changes here, and they only raise more questions. E.g. why would increasing LifeCycleNetworkIdleTimeout to 1s be enough, and couldn't we run into the same issues when it's exceeded?

I only increased it since it was firing too early. Yep, you're correct, it could still fire too early. I would prefer to remove it.

Yes, we need to be able to process those as well. Imagine a SPA that emits DOMContentLoaded, or maybe even networkIdle. Any internal navigation triggered by JS would cause us to receive those events again. (I assume so, and haven't confirmed it, but would be surprised if that isn't the case. We should have a SPA example site in that examples repo we have so that this can be reproduced.)

What's a SPA?

I think we should remove all of it, and not have any blocking waits. Easier said than done, of course, given how central this functionality is. But I think we would end up with a simpler, less racy, and more intuitive way of keeping track of frame states.

👍 One thing to point out, I don't know why we would need to keep track of the frame state. As long as we forward all relevant CDP lifecycle events to the frames, then they can unblock on waits and all's well. Am i missing something?

Further discussion for removing and simplifying lifecycle event in frames can be done in #593.

For now, it would be good if this was either approved if we're happy that the waits are unblocked when we receive a networkIdle lifecycle event or closed to go ahead with the larger refactor.

@imiric
Copy link
Contributor

imiric commented Oct 17, 2022

@ankur22 Thanks for the summary of the functionality. 👍

Some notes:

  • What's a SPA?

    A single-page application. Though repeated lifecycle events could also be sent in a traditional multi-page application, if the frame navigates normally.

  • {start,stop}NetworkIdleTimer(): Right, so the timer is a workaround for either a) us missing the event, as seems to happen for google.com, which is why increasing the timeout works, or b) the case when the browser doesn't send the event, which could theoretically happen for SPA frames, where the network is never really idle.

    The second is a tough problem we don't handle well at all now, and it would be good seeing how Playwright / Puppeteer handle this. Puppeteer, for example, has additional logic and custom networkidle0 and networkidle2 states to use depending on the site. This is understandably not flexible enough, and users were asking to be able to specify a custom time or number of responses to wait for. There are some JS workarounds in that thread, but we can consider a better implementation of this.

    In any case, I think this is a secondary issue to what we're discussing here, and deserves a separate GH issue we can work on after this.

  • clearLifecycle(): Right, this seems to only be needed to "reset" the state, but it calls recalculateLifecycle(), so who really knows what the resulting state will be... 😕

    In any case, it shouldn't be needed if we just leave the state according to whatever the last received event was, and assume that this could change at any point after a navigation event.

  • recalculateLifecycle(): Yeah, I would drop this altogether. Instead, just store the state according to whatever event was received. Unless I'm mistaken, no calculations or complex logic should be required...

  • has{Subtree,}LifecycleEventFired(): I'm still not sure why this is needed. 😕 We should emit as many lifecycle events as CDP events are received, and I'm not sure why we need the distinction of having to emit events from child frames first... We need some example sites and E2E tests for this, as I'm not following why we would need to do this.

    Anyway, the whole idea of a method to determine whether a lifecycle event was fired seems redundant and racy. Redundant in the sense that it should be obvious that if the current state is networkIdle, then it can be assumed that DOMContentLoaded was received before it (otherwise we would've raised an error). So that check boils down to frame.state >= DOMContentLoaded. And racy in the sense that it doesn't specify for which document the event was expected. I.e. like we mentioned above, an event can be received multiple times, so are we checking that DOMContentLoaded was fired for the current document, or that it fired at any point in the past? From what I remember, I've run into bugs where this check passed because the event was received for some document in the past, but since we didn't clear it on time, it returned true for the current document, even though we hadn't received it yet. 😓

So, yeah, I'm uncertain about many of these assertions and hypotheses, so take them with a grain of salt, but I think we should try to simplify and remove all of this mess.

Agreed, and this is exactly how we're receiving events from the browser via CDP -- they're in order and all looks good. Why would we need a state machine?

I mean, it's implicitly a state machine anyway, which is what this spaghetti tries to model. We might as well make that explicit and be able to fail if, e.g., a wrong state transition is attempted. But you're right, the implementation doesn't have to be an FSM for this to work.

One thing to point out, I don't know why we would need to keep track of the frame state. As long as we forward all relevant CDP lifecycle events to the frames, then they can unblock on waits and all's well. Am i missing something?

Some structure around this would be helpful, if nothing else so that we can query in what state the frame is at any point in time. But you're right that we might not need it, so maybe the first step should be giving that barebones approach a try first.

For now, it would be good if this was either approved if we're happy that the waits are unblocked when we receive a networkIdle lifecycle event or closed to go ahead with the larger refactor.

I would rather not merge this as is, since we're not sure about the overall impact it might have on other sites, given the fragile state of this core feature. So my vote is to close and proceed with a larger refactor.

@ankur22
Copy link
Collaborator Author

ankur22 commented Oct 17, 2022

@imiric I think we're on the same page. Lots of what you have said are things that I've discovered myself, and somethings which are new (e.g. the links to puppeteer gives us some information as to why 500ms was possibly used in the first place).

I'm definitely up for removing as much as we can while maintaining the current behaviour. We will need to create new issues for SPAs when we come across them and can easily replicate them.

@inancgumus what do you think?

@ankur22
Copy link
Collaborator Author

ankur22 commented Oct 18, 2022

@inancgumus I think this will help you understand what the issues were and why the fixes were done. This is a table of events which I observed before any fixes:

| Log                                                                                        | nav to      | description                                                                                                                                                                                                                                                                                                                                                                               |
|--------------------------------------------------------------------------------------------|-------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| FrameSession:onPageLifecycle event:DOMContentLoaded                                        | about:blank | We recieve a DOMContentLoaded CDP event from the browser.                                                                                                                                                                                                                                                                                                                                 |
| Frame:recalculateLifecycle emit EventFrameAddLifecycle event:domcontentloaded              | about:blank | We now emit domcontentloaded to any listening handlers.                                                                                                                                                                                                                                                                                                                                   |
| FrameSession:onPageLifecycle event:load                                                    | about:blank | We recieve a load CDP event from the browser.                                                                                                                                                                                                                                                                                                                                             |
| Frame:recalculateLifecycle will not emit event:domcontentloaded                            | about:blank | Since we loop through all the events that we've recieved (which i believe is a reduancy that is unnecessary if we just work with the CDP events and not keep track of them) we won't emit this again.                                                                                                                                                                                     |
| Frame:recalculateLifecycle emit EventFrameAddLifecycle event:load                          | about:blank | We now emit load to any listening handler.                                                                                                                                                                                                                                                                                                                                                |
| FrameSession:onPageLifecycle event:networkIdle                                             | about:blank | We recieve a networkIdle CDP event from the browser.                                                                                                                                                                                                                                                                                                                                      |
| clearLifecycle                                                                             | about:blank | Navigation to about:blank is complete so clear the frame's lifecycle which in turn calls recalculateLifecycle.                                                                                                                                                                                                                                                                            |
| Frame:recalculateLifecycle will not emit SubtreeLifecycleEventFired event:domcontentloaded | about:blank | No need to emit domcontentloaded as we've cleared the lifecycle. If we stop tracking these events internally and just work with the CDP events we can completely remove this clear method.                                                                                                                                                                                                |
| Frame:recalculateLifecycle will not emit SubtreeLifecycleEventFired event:load             | about:blank | No need to emit load as we've cleared the lifecycle.                                                                                                                                                                                                                                                                                                                                      |
| FrameSession:onPageLifecycle event:DOMContentLoaded                                        | google.com  | We recieve a DOMContentLoaded CDP event from the browser.                                                                                                                                                                                                                                                                                                                                 |
| Frame:recalculateLifecycle emit EventFrameAddLifecycle event:domcontentloaded              | google.com  | We now emit domcontentloaded to any listening handler.                                                                                                                                                                                                                                                                                                                                    |
| Frame:recalculateLifecycle emit EventFrameAddLifecycle event:load                          | google.com  | We now emit load to any listening handler. This is wrong, we've not received the load CDP event yet. This issue exists because we're not clearing the lifecycleEvents correctly (2a6f6f63db74d0224d2baeeb408b67c43c890884 fixes this).                                                                                                                                                    |
| FrameSession:onPageLifecycle event:load                                                    | google.com  | We recieve a load CDP event from the browser.                                                                                                                                                                                                                                                                                                                                             |
| Frame:recalculateLifecycle will not emit SubtreeLifecycleEventFired event:domcontentloaded | google.com  | Obviously not going to emit domcontentloaded again as we've already done so.                                                                                                                                                                                                                                                                                                              |
| Frame:recalculateLifecycle will not emit EventFrameAddLifecycle event:load                 | google.com  | Not going to emit load again as we've already done so, but that's wrong.                                                                                                                                                                                                                                                                                                                  |
| FrameSession:onFrameStoppedLoading                                                         | google.com  | This is also a bug. We shouldn't rely on onFrameStoppedLoading which will set load, domcontentloaded and networkidle in lifecycleEvents, which it shouldn't need to do. Soon after this we get the response from the last request made by the page, but the networkIdle timeout cannot be started since networkIdle in lifecycleEvents has already been set due to onFrameStoppedLoading. |
| FrameSession:onPageLifecycle event:networkIdle                                             | google.com  | We ge this, but do nothing with it. We should call `fs.manager.frameLifecycleEvent(event.FrameID, LifecycleEventDOMContentLoad)` from `FrameSession` (fixed in 45cd802e94205e617eaf2bbd9870b4cbb5e06de1).                                                                                                                                                                                 |
| **TIMEOUT after 30secs**                                                                   |             |                                                                                                                                                                                                                                                                                                                                                                                           |

After the fixes have been applied we see this:

| log                                                                           | nav to      | description                                                                                                                                                                                                                                                   |
|-------------------------------------------------------------------------------|-------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| FrameSession:onPageLifecycle event:DOMContentLoaded                           | about:blank | We receive a DOMContentLoaded CDP event from the browser.                                                                                                                                                                                                     |
| Frame:recalculateLifecycle emit EventFrameAddLifecycle event:domcontentloaded | about:blank | We now emit domcontentloaded to any listening handlers.                                                                                                                                                                                                       |
| FrameSession:onPageLifecycle event:load                                       | about:blank | We receive a load CDP event from the browser.                                                                                                                                                                                                                 |
| Frame:recalculateLifecycle will not emit event:domcontentloaded               | about:blank | We don't re-emit domcontentloaded. Shouldn't even need to loop over received CDP events like this.                                                                                                                                                            |
| Frame:recalculateLifecycle emit EventFrameAddLifecycle event:load             | about:blank | We now emit load to any listening handlers.                                                                                                                                                                                                                   |
| FrameSession:onPageLifecycle event:networkIdle                                | about:blank | We receive a networkIdle CDP event from the browser. We're now doing something with this event.                                                                                                                                                               |
| Frame:recalculateLifecycle emit EventFrameAddLifecycle event:networkidle      | about:blank | We now emit networkidle to any listening handlers.                                                                                                                                                                                                            |
| Frame:recalculateLifecycle will not emit event:domcontentloaded               | about:blank | We don't re-emit domcontentloaded.                                                                                                                                                                                                                            |
| Frame:recalculateLifecycle will not emit event:load                           | about:blank | We don't re-emit load.                                                                                                                                                                                                                                        |
| clearLifecycle                                                                | about:blank | This time when we recalculateLifecycle is called, it has no events to loop over since lifecycleEvent has been completed cleared. In recalculateLifecycle, subtreeLifecycleEvents is left empty and not populated like before since the events slice is empty. |
| FrameSession:onPageLifecycle event:DOMContentLoaded                           | google.com  | We receive a DOMContentLoaded CDP event from the browser.                                                                                                                                                                                                     |
| Frame:recalculateLifecycle emit EventFrameAddLifecycle event:domcontentloaded | google.com  | We now emit domcontentloaded to any listening handlers.                                                                                                                                                                                                       |
| FrameSession:onPageLifecycle event:load                                       | google.com  | We receive a load CDP event from the browser.                                                                                                                                                                                                                 |
| Frame:recalculateLifecycle will not emit event:domcontentloaded               | google.com  | We don't re-emit domcontentloaded.                                                                                                                                                                                                                            |
| Frame:recalculateLifecycle emit EventFrameAddLifecycle event:load             | google.com  | We now emit load to any listening handlers.                                                                                                                                                                                                                   |
| FrameSession:onFrameStoppedLoading                                            | google.com  | In my local version of the this fix i've stopped it from setting load, domcontentloaded and networkIdle in lifecycleEvents. This solves another problem to do with simple sites which perform lots of requests.                                               |
| FrameSession:onPageLifecycle event:networkIdle                                | google.com  | We receive a networkIdle CDP event from the browser.                                                                                                                                                                                                          |
| Frame:recalculateLifecycle will not emit event:load                           | google.com  | We don't re-emit load.                                                                                                                                                                                                                                        |
| Frame:recalculateLifecycle emit EventFrameAddLifecycle event:networkidle      | google.com  | We now emit networkidle to any listening handlers.                                                                                                                                                                                                            |
| Frame:recalculateLifecycle will not emit event:domcontentloaded               | google.com  | We don't re-emit domcontentloaded.                                                                                                                                                                                                                            |
| clearLifecycle                                                                | google.com  | lifecycleEvent and subtreeLifecycleEvents completely cleared for the frame when navigation ends.                                                                                                                                                              |
| FrameSession:onPageLifecycle event:DOMContentLoaded                           | google.com  | We receive a DOMContentLoaded CDP event from the browser.                                                                                                                                                                                                     |
| Frame:recalculateLifecycle emit EventFrameAddLifecycle event:domcontentloaded | google.com  | We now emit domcontentloaded to any listening handlers.                                                                                                                                                                                                       |
| FrameSession:onPageLifecycle event:load                                       | google.com  | We receive a load CDP event from the browser.                                                                                                                                                                                                                 |
| Frame:recalculateLifecycle will not emit event:domcontentloaded               | google.com  | We don't re-emit domcontentloaded.                                                                                                                                                                                                                            |
| Frame:recalculateLifecycle emit EventFrameAddLifecycle event:load             | google.com  | We now emit load to any listening handlers.                                                                                                                                                                                                                   |
| FrameSession:onFrameStoppedLoading                                            | google.com  | In the local version this does nothing now.                                                                                                                                                                                                                   |
| FrameSession:onPageLifecycle event:networkIdle                                | google.com  | We receive a networkIdle CDP event from the browser.                                                                                                                                                                                                          |
| Frame:recalculateLifecycle will not emit event:domcontentloaded               | google.com  | We don't re-emit domcontentloaded.                                                                                                                                                                                                                            |
| Frame:recalculateLifecycle will not emit event:load                           | google.com  | We don't re-emit load.                                                                                                                                                                                                                                        |
| Frame:recalculateLifecycle emit EventFrameAddLifecycle event:networkidle      | google.com  | We now emit networkIdle to any listening handlers.                                                                                                                                                                                                            |
| wait in unblocked | google.com | test ends |

I also noticed that even without the fix in this PR, we could make the script pass by applying the following diff on 4485c6d.

Yes that does fix this issue. By doing this what will happen in this scenario (google.com) is that we will receive the last request response which the page has made (success or failure) and the timeout timer will start. After 500ms it will fire and recalculateLifecycle will be called, which will emit the networkIdle event, and will unblock the wait. Inbetween the last response and the timeout firing, we will have received a CDP networkIdle event from the browser, which will be ignored.

Although this works, I don't think we should be relying on the timeout in this scenario (i can see a scenario where a timeout would be necessary and I'm working on a test for it). We should be working directly with the CDP browser events where possible. We will definitely need to improve this implementation though where there are websites which poll frequently or are quite chatty for other reasons -- and in those scenarios it might be best to allow the user to determine what a safe timeout is instead of forcing 500ms.

Out of curiosity, I experimented with tracking frame lifecycle event addition and removal in this #594.

I don't believe you need to track them, and that solution has just makes things more complicated. If we clear lifecycleEvents by overwriting it with a new map (as i've done in 2a6f6f6) then when recalculateLifecycle is called straight after, the subtreeLifecycleEvents is cleared, allowing future navigations to correctly emit new load, DOMContentLoad and networkIdle events to the internal handlers, which in turn will unblock waits.

There are many issues with lifecycle events. In this PR i've tried to fix the networkIdle issue without a complete refactor. But, I've also come to the realisation that this lifecycle events code we currently have is too complicated and most of it should be deleted. I just need to create the tests to prove it 🙂

I would like to close this PR and create a new PR with three fixes (two of which you have already pointed out in your patch in this comment:

  1. Clear the lifecycleEvents in clearLifecycle -- this will ensure that we don't emit the wrong events when navigating from about:blank to google.com or google.com to another site.
  2. Call fs.manager.frameLifecycleEvent(event.FrameID, LifecycleEventNetworkIdle) from onPageLifecycle when we receive networkIdle CDP event -- this will ensure that we emit networkidle events to the internal handlers, which will unblock waits.
  3. Not do anything with onFrameStoppedLoading -- this will ensure that we don't start emitting events to early before they're needed.

All fixes will have a test to lock in the behaviour.

Once that is done I will move onto refactoring the lifecycle events code, adding more tests where necessary to ensure that we don't regresse.

WDYT @inancgumus & @imiric?

@inancgumus
Copy link
Member

inancgumus commented Oct 18, 2022

The diff I shared in this comment was about demonstrating one part of the issue that I discovered after printing the events as you did. It wasn't a solution at all ;) I appreciated the detailed explanation ❤️

And #594 was a proper possible fix for our "current pubsub architecture". However, I agree that the solution was as complicated as the current architecture. So I'm OK to refactor and simplify the current architecture as long as we have proper tests in place 👍

I would like to close this PR and create a new PR with three fixes (two of which you have already pointed out in your patch in this comment:

It's up to you, but IMO, it's OK to merge this PR just with the diff I shared. Then you can continue where you left off in another PR.


By the way, I believe it's more effective to create issues first for laying out a problem, its possible solutions, and discussions around it. Then we can send as many PRs as possible until the issue closes.

@ankur22
Copy link
Collaborator Author

ankur22 commented Oct 19, 2022

It's up to you, but IMO, it's OK to merge this PR just with the #578 (comment) I shared. Then you can continue where you left off in another PR.

Cool, i'll just do the fix here and clean up the commits so that we only have the most relevant fixes and tests.

By the way, I believe it's more effective to create issues first for laying out a problem, its possible solutions, and discussions around it. Then we can send as many PRs as possible until the issue closes.

Yep, I will endeavour to open a new issue next time, even if the fix initially feels trivial.

Copy link
Member

@inancgumus inancgumus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll review this probably tomorrow: It seems like we have good tests now 👏 After skimming it a little, the only trivial thing that caught my eye was counters. IMO, we can make them atomic.Int64 or something instead of a sync.RWMutex.

When we test using WaitUntil, it is transformed from a go struct into
a goja value. If we don't provide a struct tag, goja will transform
the name to snake case (wait_until). Adding the js struct tag fixes
this.
When a site that is navigated to emits a DOMContentLoaded well before
load, this is a signal that we're waiting on assets to load over the
network. In such a case (e.g. navigating to google.com) when we goto
with waitUntil = networkIdle, due to how we handle such lifecycle
events, we end up in a situation where the networkIdle event is
ignored. This change ensure that we action on these events too.

Partly fixes: https://github.com/orgs/grafana/projects/80/views/14
When we navigate to a page, lifecycle events were being stored. If a
new navigation occurred, the previous events would be used and fired
off to the internal handles. This is incorrect as it means that
WaitUntil will fail to wait for the correct event after the initial
navigation to a page.

Partly fixes: https://github.com/orgs/grafana/projects/80/views/14
Some websites may have many network requests which are performed after
the page has fully loaded and we receive a FrameStoppedLoading. For
now we want to wait for the networkIdle event, and in the future if we
need, we should add a timer for very chatty websites where we may
never see a networkIdle event.

Partly fixes: https://github.com/orgs/grafana/projects/80/views/14
Copy link
Member

@inancgumus inancgumus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's awesome that we have tests for this behavior now 👏 🥳 LGTM!

I made some opinionated changes to make tests more readable and maintainable in #603. Please feel free to skip them, I don't mind. I thought it would be better to create another branch instead of reviewing the PR in place.

Changes:

  • Uses an atomic counter.
  • Moves home assertion to a helper closure.
  • IMO, editor support is useful when dealing with long HTMLs and scripts (especially when debugging later). So I moved two of the home page tests to static files.

@ankur22
Copy link
Collaborator Author

ankur22 commented Oct 24, 2022

Thanks for these changes.

Uses an atomic counter.

What's the benefit of this change?

Moves home assertion to a helper closure.

Looks good thanks, i'll cherry pick this change from your PR.

IMO, editor support is useful when dealing with long HTMLs and scripts (especially when debugging later). So I moved two of the home page tests to static files.

Not too sure about this, tbh. I prefer things inline. I'll make the change though if you have a very strong opinion on this.

@inancgumus
Copy link
Member

inancgumus commented Oct 24, 2022

Sure 👍

Not too sure about this, tbh. I prefer things inline. I'll make the change though if you have a very strong opinion on this.

IMHO, with HTMLs in a static folder, the test flow becomes clear, and we get editor support like syntax highlighting, completion, etc.—which can be useful when debugging it later. The downside is we will need to open another tab to see what's in the HTMLs. However, that's an easy problem to solve, we can always open the HTML in a side-by-side editor. It seems like a minor problem, at least for me.

All of my changes were suggestions, not strong opinions. As I explained in my previous comment, feel free to apply or skip them. When I have a strong opinion on something, I request changes instead of approving PRs ;)

Copy link
Contributor

@imiric imiric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @ankur22! 👏

I'm not sure I understand the flow for all of these changes, but the sample script in #597 passes on this branch, but not on main, so it seems like an improvement.

FWIW the doesn't_timeout_waiting_for_networkIdle test passes for me on main, so it seems susceptible to timing. Nice job on the tests though 👍

@ankur22
Copy link
Collaborator Author

ankur22 commented Oct 26, 2022

Thanks for the review @imiric.

FWIW the doesn't_timeout_waiting_for_networkIdle test passes for me on main, so it seems susceptible to timing. Nice job on the tests though 👍

This is an annoying issue, which has caught me a third time now 😅 (including just now when i was debugging it after you mentioned it). The WaitUntil: common.LifecycleEventNetworkIdle, isn't being serialised to waitUntil and is instead being serialised to wait_until. On main that test isn't waiting for networkidle, and is instead waiting on load so that's why it passes. I had to first fix that serialisation issue in cd3f227 before being able to replicate the timeout issue.

@ankur22 ankur22 merged commit 8f704ab into main Oct 26, 2022
@ankur22 ankur22 deleted the fix/networkidle branch October 26, 2022 12:52
@imiric imiric linked an issue Oct 27, 2022 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working events CDP or internal events related.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix networkIdle waitUntil option Navigation timeout with device_emulation.js example
3 participants