Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WebVitals metrics measurements are incorrect for "slow" loading sites #960

Closed
ka3de opened this issue Jun 30, 2023 · 4 comments
Closed

WebVitals metrics measurements are incorrect for "slow" loading sites #960

ka3de opened this issue Jun 30, 2023 · 4 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@ka3de
Copy link
Collaborator

ka3de commented Jun 30, 2023

Brief summary

After modifications made in #949 and #943, and comparisons done in #950, we concluded that, in some cases, k6 browser reports incorrect Web Vitals metrics measurements. This happens especially for sites that are more dynamic and take longer to load, as in this cases k6 browser seems to report "intermediate" values corresponding to samples taken before the page is fully loaded, event for which k6 browser seems to not wait for long enough. This affects particularly FCP and LCP metric.

xk6-browser version

commit: c959b57

OS

Ubuntu 20.04.5 LTS

Chrome version

113.0.5672.126 (Official Build) (64-bit)

Docker version and image (if applicable)

No response

Steps to reproduce the problem

Run the following test:

import { browser } from 'k6/x/browser';

export const options = {
  scenarios: {
    ui: {
      executor: 'shared-iterations',
      options: {
        browser: {
            type: 'chromium',
        },
      },
    },
  },
};
  

export default async function () {
  const page = browser.newPage();

  try {
    await page.goto('https://grafana.com')
  } finally {
    page.close();
  } 
}

Expected behaviour

Web Vitals metrics should be similar to the ones reported by Google Lighthouse:
Screenshot from 2023-06-30 11-43-34

Actual behaviour

Web Vitals metrics reported by k6 browser:
Screenshot from 2023-06-30 11-47-33
Notice the differences in FCP and LCP.

@ka3de ka3de added the bug Something isn't working label Jun 30, 2023
@ka3de ka3de added this to the v0.11.0 milestone Jun 30, 2023
@ka3de ka3de self-assigned this Jul 6, 2023
@ka3de
Copy link
Collaborator Author

ka3de commented Jul 10, 2023

I've been investigating this issue further. My main concern was to clarify if k6 browser implementation was incorrect at some point during the Web Vitals metrics gathering, parsing or reporting; from the tooling script injected in the web site, to being reported to k6 browser through a set up binding and finally pushed to k6 core.

For this I defined a set of hypothetical issues that might be producing these metric mismatches and explored them individually. This was done by collecting timestamped events for the moments on which:

  • Web Vitals JS library is loaded in the web site (from Web Site context)
  • Web Vitals JS libary reports a sample (from Web Site context)
  • k6 browser receives a sample through binding call
  • k6 browser pushes a sample to VU samples channel (specifying whether this one was open or not)
  • k6 browser removes the Web Vitals binding as part of page close
  • k6 browser finishes an iteration

A clarification first on reportAllChanges flag for Web Vitals library

In our effort to achieve more consistent Web Vitals reporting, a change was introduced in #949 to force the Web Vitals library to report metrics on every value change. Early on during the tests performed for this issue investigation, we observed that this was an incorrect configuration, especially for LCP metric. LCP measures the time it takes to render the largest image or text block visible within the viewport, relative to when the page first started loading. Because there can only be one "largest" element in a page, LCP should only provide one sample per page load. Instead, if we use the reportAllChanges flag, LCP is reported multiple times when rendering intermediate elements that, up until that point, are the largest rendered in the page. This makes k6 browser record incorrect values for LCP on every new page.
Therefore, we should not use reportAllChanges flag for Web Vitals library. Instead, and in order to keep the reporting consistency that this flag was providing, the change implemented in #953 proved to be a better fit and produces a similar effect as recommendations given in the Web Vitals library doc:

Note that some of these metrics will not report until the user has interacted with the page, switched tabs, or the page starts to unload. If you don't see the values logged to the console immediately, try reloading the page (with preserve log enabled) or switching tabs and then switching back.

For this reason, all tests executed as part of the investigation for this issue have been done reverting #949 and including changes implemented in #953. More on this in conclusions and next steps.

Issue hypothesis

Metric samples are lost due to premature iteration end
As per our tests, no metrics samples were lost due to premature iteration end. This hypothesis tried to clarify if there were samples being reported by the Web Vitals metrics script in the web site context that would not be catched by k6 browser implementation. This issue was particularly plausible in iterations that only perform a page.goto action, which, by default, would guarantee waiting until the page load event is fired, but this might not have been enough to receive Web Vitals metrics samples reported from JS script. In practice this was not an issue.

Mismatched time measurements due to injecting Web Vitals script too late
This hypothesis tried to clarify if, due to how the Web Vitals script is injected in every page through CDP, might have an effect in time measurements, maybe because it was being injected too late in the page load process.
In practice this was not an issue as Web Vitals library is injected before any other script in the page. Also, even when instrumenting a website through "standard means" (not through a browser automation tool), the library is also injected as an included script in the page, therefore this should not be a critical point with observed mismatched metrics.

Mismatched measurements due to browser flags
k6 browser uses a set of default flags when launching chrome/chromium browser in order to minimize undesired "background processes" and favor test automation consistency. This hypothesis tried to clarify if these flags could produce significant differences on Web Vitals metrics measurements vs a "default browser execution".
In practice there were not consistent significant differences observed between both "execution modes".

Conclusions

Overall this issue has been useful to discard any mishandling of Web Vitals metrics from k6 browser implementation. Instead what we have observed is that the metrics reported by the Web Vitals library script are the ones that end up being processed and reported by k6 browser. Which means that any sort of mismatched measurements should, in theory, be produced by the library itself.
In general I think we have to accept the particularities of k6 browser execution "context" and the effect that these might have in these measurements. These differences are not consistent across different web sites and, in practice, these are not significantly different than variations observed between different web performance measuring tools (see tests section). Also if we put these differences, which mainly affect FCP and LCP metrics, in perspective with the thresholds defined for these metrics, they very rarely (at least per our tests) produce a change of rating. E.g.: The differences observed don't end up impacting the rating for these metrics as they are still lower than the good threshold value for FCP (1.8s) and LCP (2.5s).

Other takeaways
reportAllChanges flag should not be used for Web Vitals library
As explained earlier, this flag produces incorrect aggregated metrics for LCP in our tests and should not be used.

Use pagehide event on page close
Also mentioned earlier, the changes implemented in #953 has proven to be a good fix in order to achieve the consistency provided by the reportAllChanges flag for Web Vitals library without the undesired drawbacks. This also proves to be a good fix for premature iteration end in tests that just perform a page.goto.

Headless flag for k6 browser has an impact on page rendering
In some cases we could observe that executing the browser in headless/headful mode produced significant changes for FCP and LCP metrics (see zara test case). This again shows that there are many particularities in the execution context hat have an impact on the Web Vitals metrics measurements, and that these should be most probably accepted. It is yet to see if the new headless mode in chrome has an impact on these differences (see #948).

Tests

See the tests performed. These have been run through N executions and values reported in the tables correspond to one execution which was producing measurements similar to the more consistent results through all executions. E.g.: We have verified that the measurements in these tables do not belong to a particular execution that for some reason might have produced significantly different results compared to the majority of other executions.

Tests
grafana.com k6 Browser headful Lighthouse PageSpeed WebPageTest
TTF 105.1ms 100.7ms - 500ms 722ms
FCP 317.6ms 353.9ms 800ms 1800ms 2074ms
LCP 521.3ms 554ms 1600ms 2100ms 3989ms
CLS 0.001913 0.001959 0.003 0.02 0.029
slack.com k6 Browser headful Lighthouse PageSpeed WebPageTest
TTF 143.84ms 248.59ms _ 700ms 899ms
FCP 573.6ms 629.19ms 900ms 1500ms 2821ms
LCP 603.1ms 608.19ms 1600ms 2100ms 2821ms
CLS 0.057071 0.061046 0.056 0.15 0.236

Headless/Headful mode produced huge differences between runs for slack.com. From 600ms to 5.38s for LCP and up to 2.5s for FCP in a few executions in headful mode.

github.com k6 Browser headful Lighthouse PageSpeed PageWebTest
TTF 133.8ms 142.09ms - 1000ms 845ms
FCP 359.6ms 450.39ms 700ms 1900ms 3216ms
LCP 375ms 406.19ms 900ms 2700ms 3843ms
CLS 0.002217 0.001416 0 0.11 0
news.ycombinator.com k6 Browser headful Lighthouse PageSpeed PageWebTest
TTF 515.39ms 509.7ms - 600ms 895ms
FCP 721.59ms 735.1ms 600ms 600ms 1464ms
LCP 721.59ms 735.1ms 800ms 600ms 1464ms
CLS 0 0 0 0 0
youtube.com k6 Browser headful Lighthouse PageSpeed PageWebTest
TTF 250.35ms 209.89ms - 700ms 2093ms
FCP 439.1ms 412.3ms 700ms 2600ms 4234ms
LCP 1650ms 1640ms 2600ms 4000ms 6187ms
CLS 0.000816 0.000737 0.001 0.31 0
www.zara.com k6 Browser headful Lighthouse PageSpeed PageWebTest
TTF 148.1ms 171.5ms - 400ms 950ms
FCP 174.89ms 431.6ms 600ms 1100ms 3052ms
LCP 174.89ms 464.9ms 700ms 3000ms 9160ms
CLS 0 0 0.002 0.14 0

See also charts.

Next steps

@ankur22
Copy link
Collaborator

ankur22 commented Jul 10, 2023

Great bit of analysis 👏 🎉

It feels like there's not much we can do at the moment (apart from what is listed in the Next Steps, which I agree with), we are bound by the Web Vital library and the behaviours of the all the different factors (the browser module, website under test, the OS, and the compute resources, the network, etc).

I think what will be important is how consistent the measurements are from one test run to another -- a test today against grafana.com should result in a similar test result where the margin of error (or delta) is small assuming the factors remain the same.

@ka3de
Copy link
Collaborator Author

ka3de commented Jul 11, 2023

See comparison charts for FCP and LCP metrics for sites under test:
(Take into account that PageSpeed caches results, therefore the consistent measurements)

grafana.com

Screenshot from 2023-07-11 09-39-24
Screenshot from 2023-07-11 09-39-38

slack.com

Screenshot from 2023-07-11 09-39-47
Screenshot from 2023-07-11 09-39-54

github.com

Screenshot from 2023-07-11 09-40-01
Screenshot from 2023-07-11 09-40-08

news.ycombinator.com

Screenshot from 2023-07-11 09-40-17
Screenshot from 2023-07-11 09-40-22

youtube.com

Screenshot from 2023-07-11 09-40-29
Screenshot from 2023-07-11 09-40-40

zara.com

Screenshot from 2023-07-11 09-40-48
Screenshot from 2023-07-11 09-40-54

We can see that the trends between k6 browser and Lighthouse usually match, and the difference between them in absolute values are not so significant compared to differences with other tools.

@ka3de
Copy link
Collaborator Author

ka3de commented Jul 14, 2023

An interesting test comparison between k6 Browser and Grafana Faro:

This test was done by instrumenting k6-docs site with Grafana Faro, running the site locally and executing k6 browser tests against it. The interesting point about this test is that it allows us to obtain metrics calculated by k6 browser and Grafana Faro in the same "environment conditions", including hardware, network etc, also taking into account that in this test both measurements for k6 browser and Grafana faro for each sample are obtained as a result of the same page request/page load, as the k6 browser script just executes a page.goto() action, from which it obtains the Web Vitals metrics, and is this same action the one that triggers the page load and the metrics measurements from Faro side due to the website being instrumented.

FCP

Screenshot from 2023-07-14 08-29-14

Sample k6 Browser Faro
1 1070 1080
2 999.4 1060
3 1010 1010
4 962.7 1010
5 1000 1010
6 937.2 986
7 1030 1020
8 951.19 992
9 953.9 974
10 999.5 1010

LCP

Screenshot from 2023-07-14 08-29-14

Sample k6 Browser Faro
1 1070 1080
2 999.4 1060
3 1010 1010
4 1040 1030
5 1000 1030
6 937.2 1030
7 1030 1020
8 951.19 992
9 953.9 974
10 999.5 1010

Try it yourself

k6-docs
Git clone k6-docs repo. In our case we were using commit e8c334f6.

git clone [email protected]:grafana/k6-docs.git
cd k6-docs

Instrument web app with Faro:

npm i -S @grafana/faro-web-sdk
Apply diff
diff --git a/src/components/pages/doc-welcome/faro/faro.js b/src/components/pages/doc-welcome/faro/faro.js
new file mode 100644
index 00000000..0ee82df8
--- /dev/null
+++ b/src/components/pages/doc-welcome/faro/faro.js
@@ -0,0 +1,16 @@
+import { initializeFaro as coreInit } from '@grafana/faro-web-sdk';
+
+export function initializeFaro() {
+  const faro = coreInit({
+    url: 'http://localhost:8027/collect',
+    apiKey: 'api_key',
+    app: {
+      name: 'k6-docs-test',
+      version: '1.0.0',
+    },
+  });
+
+  faro.api.pushLog(['Faro was initialized']);
+}
diff --git a/src/components/pages/doc-welcome/faro/index.js b/src/components/pages/doc-welcome/faro/index.js
new file mode 100644
index 00000000..00cebe8e
--- /dev/null
+++ b/src/components/pages/doc-welcome/faro/index.js
@@ -0,0 +1 @@
+export * from './faro';
diff --git a/src/components/pages/doc-welcome/index.js b/src/components/pages/doc-welcome/index.js
index 6fbdb036..06203d92 100644
--- a/src/components/pages/doc-welcome/index.js
+++ b/src/components/pages/doc-welcome/index.js
@@ -1,3 +1,5 @@
+import { initializeFaro } from './faro';
+
 export * from './cloud';
 export * from './features';
 export * from './help';
@@ -6,3 +8,6 @@ export * from './showcase';
 export * from './what-is';
 export * from './whats-new';
 export * from './manifesto';
+export * from './faro';
+
+initializeFaro();

Run it:

docker-compose up --build

Faro
We can use the faro-web-sdk demo in order to deply the Faro components locally. In our case we were using commit 54618b1.

git clone [email protected]:grafana/faro-web-sdk.git
cd faro-web-sdk
docker compose --profile demo up

Website will be available at http://localhost:8100 and Grafana at http://localhost:300. To see the Web Vitals metrics go to Dashboards -> Grafana Agent -> Frontend.

Test
We were using xk6-browser with commit a8ebe8c.
Which you can build with xk6 such as:

K6_VERSION=master xk6 build --with github.com/grafana/xk6-browser@a8ebe8c

And run the following script:

Simple k6 browser test being used.
import { browser } from 'k6/x/browser';

export const options = {
  scenarios: {
    ui: {
      executor: 'shared-iterations',
      options: {
        browser: {
            type: 'chromium',
        },
      },
    },
  },
};
  

export default async function () {
  try {
    await page.goto('http://localhost:8100/')
  } finally {
    page.close();
  } 
}

Conclusion

So, all in all is good to verify that both k6 browser and Grafana Faro produce almost identical results when executed in the same conditions for a "real world" website (not a simple test site), even if in the end they serve for different purposes. Still, as mentioned in the previous comment, it's important to remark again the impact that the "execution environment" can have in Web Vitals measurements, considering things like hardware and network conditions, but also headless/headful mode, viewport and screen resolution, and also the user interaction with the page.

@ka3de ka3de closed this as completed Jul 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants