Skip to content
This repository has been archived by the owner on Jan 4, 2023. It is now read-only.

Create new scripts to detect importScripts() and usage of SW methods inside pwa.js #203

Open
demianrenzulli opened this issue May 29, 2021 · 1 comment

Comments

@demianrenzulli
Copy link
Contributor

demianrenzulli commented May 29, 2021

Hi folks,

As you know, for the 2021 Web Almanac PWA section, we need some additional metrics, to cover two use cases:

  1. Detect all the calls to importScripts(), so we can replace the 2020 query used to create a ranking of popular libraries used by service workers.
  2. Getting all the service worker events and properties, to replace the 2020 SW Event Query.

Adding new scripts to pwa.js

To start testing this, I decided to use a "brute force" approach, by just taking Rick's code for Workbox detection, and applying it to regular expressions for importScripts() and service worker events and properties.

These are the resulting snippets added to pwa.js. The code is obviously not good, but it helped me obtain some quick conclusions:

// We should use serviceWorkerInitiatedURLs here, but SW detection has some false negatives.
function getInfoForPattern(regexPattern) {
  return response_bodies.filter(har => {
    return regexPattern.test(har.response_body);
  }).map(har => {
    return [har.url, Array.from(har.response_body.matchAll(regexPattern)).map(m => m[0])];
  });
}

const workboxPattern = /(?:workbox:[a-z\-]+:[\d.]+|workbox\.[a-zA-Z]+\.?[a-zA-Z]*)/g;
const workboxInfo = getInfoForPattern(workboxPattern);

const importScriptsPattern = /importScripts\(.*\);/g;
const importScriptsInfo = getInfoForPattern(importScriptsPattern);

const swEventListenersPattern = /addEventListener\(\s*[\'"](install|activate|fetch|push|notificationclick|notificationclose|sync|canmakepayment|paymentrequest|message|messageerror|periodicsync|backgroundfetchsuccess|backgroundfetchfailure|backgroundfetchabort|backgroundfetchclick)[\'"]/g;
const swEventListenersInfo = getInfoForPattern(swEventListenersPattern);

const swPropertiesPattern = /\.on(install|activate|fetch|push|notificationclick|notificationclose|sync|canmakepayment|paymentrequest|message|messageerror|periodicsync|backgroundfetchsuccess|backgroundfetchfailure|backgroundfetchabort|backgroundfetchclick)\s*=/g;
const wPropertiesInfo = getInfoForPattern(swPropertiesPattern);

The object to return has three new properties: importScriptsInfo, swEventListenersInfo, wPropertiesInfo.

return {
    serviceWorkers: Object.fromEntries(serviceWorkers),
    manifests: Object.fromEntries(manifests),
    serviceWorkerInitiated: Object.keys(Object.fromEntries(serviceWorkerInitiated)),
    workboxInfo: Object.fromEntries(workboxInfo),
    importScriptsInfo: Object.fromEntries(importScriptsInfo),
    swEventListenersInfo: Object.fromEntries(swEventListenersInfo),
    wPropertiesInfo: Object.fromEntries(wPropertiesInfo)
};

Testing URL detection

You can take a look at this test for https://www.naranja.com (output at $.data.runs[1].firstView.pwa.importScriptsInfo).

The problem with the script is that it's detecting every importScripts(), even inside libraries, like Workbox. The good thing is that since importScrpts() is a worker method, it doesn't match client-side libraries, so we might be able to filter this better.

Something similar happens with the service worker events and properties (output at $.data.runs[1].firstView.pwa.swEventListenersInfo and $.data.runs[1].firstView.pwa.wPropertiesInfo). This one is matching all events including those written inside libraries (e.g. Workbox's install) and, in many cases, the onmessage listener, which can be used in client-side libraries as well.

Matching only importScripts() and service worker events inside the service worker file.

As Rick left in one comment in the script: "We should use serviceWorkerInitiatedURLs here, but SW detection has some false negatives".

I'm a bit concerned that these false negatives might be actually quite frequent (minification might be one cause, as Rick mentioned here).

Here are some example sites that have service workers, where the test returns it empty for the serviceWorkers field, but have values for swEventListenersInfo:

So, on one hand, if we end up taking into account only the service worker URLs, we might be able to reduce the noise in the results, but, given that there are potentially so many false negatives, it might not be a good idea.

I'm pretty new to all this, but I wanted to see if we could get these things done by Monday, when the crawl takes place, but based on these early results, I think we might need to discuss some of the limitations the pwa.js script a little bit more.

cc // @tunetheweb @rviscomi @OBTo

@tunetheweb
Copy link
Member

tunetheweb commented May 30, 2021

As discussed onSlack, I think these limitations are fine and similar to what we had last year.

The problem with the script is that it's detecting every importScripts(), even inside libraries, like Workbox. The good thing is that since importScrpts() is a worker method, it doesn't match client-side libraries, so we might be able to filter this better.

The intent of the importScripts query “How are developers using PWAs? Are they writing it from scratch or using libraries and, if so, which ones?” So if WorkBox is using some other library then the page is (indirectly) using that so good to track. So I don't think this is an issue.

Yes there is a concern if this pattern was used:

if (some test) {
  importScript(...)
}

And the page doesn't actually pass that test and use that library but I'd imagine that's small and arguably it is using that library even though it's not actually using it - if that contradictory message makes sense 😁

Something similar happens with the service worker events and properties (output at $.data.runs[1].firstView.pwa.swEventListenersInfo and $.data.runs[1].firstView.pwa.wPropertiesInfo). This one is matching all events including those written inside libraries (e.g. Workbox's install) and, in many cases, the onmessage listener, which can be used in client-side libraries as well.

Again I think that's fine. Workbox in particular breaks it's libraries into small files so we'll only find the ones for workbox functionality we use - even if we don't use them all. And again arguably by importing that code we are "using" those events - or at least having access to use them. If we remove them from the web platform we'd likely need to rewrite this code so that's a "use" in my convoluted, contrived mind here to answer this concern 😁

We can see how many message events are logged for non-service worker pages after the run and decide whether this is a concern or not. Suspect it won't be.

Matching only importScripts() and service worker events inside the service worker file.

As Rick left in one comment in the script: "We should use serviceWorkerInitiatedURLs here SW detection but it has some false negatives".

I'm a bit concerned that these false negatives might be actually quite frequent (minification might be one cause, as Rick mentioned here).

Here are some example sites that have service workers, where the test returns it empty for the serviceWorkers field, but have values for swEventListenersInfo:

Those are weird. Had a look at them myself and can't see how the service worker is registered! So not surprised the custom metric can't figure this out. I think we can accept the limitation of this for really obfuscated code.

As suggested on Slack, we can also look at sites that register service worker event listeners (e.g. install, fetch...etc.) and see how many pages with those don't have the serviceWorkers object defined to see the scale of the problem. Can then decide if we need to include those pages or, if it's small enough, to just ignore.

WPT also detected they were service worker calls as they are in blue. So could also ask @pmeenan how it does that and use that potentially?

So, on one hand, if we end up taking into account only the service worker URLs, we might be able to reduce the noise in the results, but, given that there are potentially so many false negatives, it might not be a good idea.

I think that because we are mostly looking for service worker specific methods and text, we should look at everything like we did last year otherwise we exclude too much from importScripts. I think the likelihood of false positives is reasonably small and the risk of false negatives if we limited just to the main service worker.js is much larger.

I'm pretty new to all this, but I wanted to see if we could get these things done by Monday, when the crawl takes place, but based on these early results, I think we might need to discuss some of the limitations the pwa.js script a little bit more.

As discussed, I think your changes are good to propose in a PR now. Not sure when @rviscomi or @OBTo will get a chance to look at this since it's a weekend (and a holiday weekend at that!), but even if it doesn't make the start of the June crawl, we can still hopefully get it in for some of that crawl to give us enough to have a look at and make any amendments before the main July crawl we will use for the Web Almanac.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants