Add the source text back to the script struct #3316

jungkees · 2018-01-02T09:52:37Z

Service workers reference the script's source text to compare two service workers' version. The script's source text has been removed from HTML at some point. Please add it back. We plan to use it for both the classic script comparison and the module script comparison.

Related issue: w3c/ServiceWorker#1212.

/cc @domenic

domenic · 2018-01-02T17:39:30Z

Thanks for filing over here; this kept slipping off the top of my mind. I'll try to get to it soon!!

For my personal curiousity, I'd love to hear from implementers how they plan to implement this, and maybe add a note to the spec about it. Because keeping the script source in memory all the time seems like a lot of overhead. Would implementers do that only for service worker scripts? Or would they page them out to disk, perhaps? /cc @whatwg/modules @wanderview

jungkees · 2018-01-03T11:19:57Z

Chromium stores the actual script resources in the disk cache and maintains a separate in-memory map that only has url to resource record info (id and such that's tied to the actual resource.) I believe Firefox, Edge, and Safari have similar implementations already. Adding a note to the spec seems like a good idea. /cc @mattto @aliams @cdumez

jungkees · 2018-01-03T11:33:13Z

To clarify, Chromium doesn't support module scripts for service workers yet.

annevk · 2018-01-03T14:01:04Z

(It seems implementations could store a SHA-2 hash as well and use that for comparison against a new script.)

jungkees · 2018-01-04T01:55:13Z

Chromium's comparing the existing script progressively with the new script as it is read from network so when it detects a byte difference it starts writing the new one to the disk right away.

I think using hash could be one of the ways to implement this too. It seems like implementation options.

wanderview · 2018-01-04T02:06:36Z

We currently keep the script source in memory while performing a SW update. We haven't really implemented any early abort optimizations, etc.

I'm not sure what we do for script sources in spidermonkey currently.

aliams · 2018-01-05T00:34:17Z

We use a hash for comparison which is generated after we retrieve the new script file.

nyaxt · 2018-01-05T02:12:04Z

@hiroshige-g @otherdaniel who was poking at the Blink code around the area.
I believe we (Blink+V8) keep the script source text in memory, even after execution, so that we can answer Function.prototype.toString().

One question: Is this about decoded source text, or raw undecoded bytestream? Keeping raw undecoded bytestream around can come with additional overhead.

annevk · 2018-01-05T08:17:16Z

What does decoded source text mean? Is that only text decoding or something more? Note that if Edge computes the hash based on the byte stream they'll get different results, so we should actually make this very clear (and add tests for the difference).

otherdaniel · 2018-01-05T10:01:23Z

@nyaxt: For V8+Blink:

V8 keeps a reference to all source text as v8::String.
The exact shape of that string depends on how the source was handed down from Blink. Usually, it's External(One|Two)ByteString, meaning it wraps a piece of memory owned by Blink. ServiceWorkers might handle this differently; not sure about that.
The script string is kept until the GC no longer finds any direct or indirect references.
- (indirect refs include: the script object; any compiled function; any substring that references the original; etc.)
The raw bytestream is only temporarily used (while streaming), and we won't keep a reference past the initial compile (which occurs when streaming is finished).

@annevk:

decoded == ASCII or UCS-2, derived from whichever text encoding the resource originally was in.

annevk · 2018-01-05T10:11:32Z

Okay, so different byte sequences that decode to the same code points end up not counting as a different resource in Chrome. This could be due to a leading BOM being stripped, a different encoding being used, and different byte sequences decoding to the same code point (most common with errors becoming U+FFFD).

hiroshige-g · 2018-01-05T12:53:28Z

Looks fine.

According to @jungkees's comments and offline discussion with @horo-t, I expect V8+Blink is not affected because the update algorithm referenced in w3c/ServiceWorker#1212 is implemented in the browser process (which is quite far from Blink's #concept-script impl blink::Script).

Even when we add source text to blink::Script, I expect it doesn't cause memory regression (+1 to @nyaxt @otherdaniel comments).
But I think it's better not to add source text to blink::Script for implementation flexibility, and thus it might be better to limit the source text's usage (e.g. by add a note "comparison only").

domenic · 2018-01-09T00:56:45Z

So in terms of actually working on this, do we want to store the response bytes, or the source text? If we store source text, then e.g. two scripts should compare the same even if their bytes are different, per all the factors @annevk lists.

Based on what I'm hearing so far in this thread:

Edge: unclear; @aliams is the hash being done on the source bytes or the decoded/BOM-stripped source text?
Gecko: @wanderview said "script source" implying to me source text, not bytes
Blink: source text
WebKit: ?

I hope someone can help write tests for these cases.

jungkees · 2018-01-09T02:35:54Z

do we want to store the response bytes, or the source text? If we store source text, then e.g. two scripts should compare the same even if their bytes are different, per all the factors @annevk lists.

I think we should store the decoded/BOM-stripped source text as the factors @annevk listed can change. What we'd want to check here is the difference of the source texts. So, I believe the hash should be computed against the decoded source text too.

jungkees · 2018-01-09T02:51:36Z

But I think it's better not to add source text to blink::Script for implementation flexibility, and thus it might be better to limit the source text's usage (e.g. by add a note "comparison only").

If we really want to avoid adding the source text to the script struct, we perhaps can think about having fetch a * worker script return the decoded source text separately from the resulting script. But adding the source text to the script struct and leaving such a note seems to be simpler to understand I think.

mfalken · 2018-01-09T08:04:30Z

So in terms of actually working on this, do we want to store the response bytes, or the source text?

In terms of storing service worker scripts, Chrome stores and compares the raw bytes from the HTTP response, and it does this in the browser process separately from Blink. So Blink/V8 is not really involved in how Chrome stores and updates service workers.

It might be difficult for us to store decoded/BOM-stripped source text as currently implemented, since I believe that sort of logic is implemented in the renderer process in Blink.

jungkees · 2018-01-09T09:27:12Z

Okay, so different byte sequences that decode to the same code points end up not counting as a different resource in Chrome.

So, this wasn't the case actually. Different byte sequences will always ~~abort~~go on with the update in Chrome.

I'd like to hear from EdgeHTML, Gecko, and WebKit whether they do the same as Chrome or actually use decoded source text.

BTW, how likely do we encounter "different byte sequences that decode to the same code points" case in the wild?

EDIT: fixed the confused part indicated by strikethrough.

wanderview · 2018-01-09T14:30:22Z

Gecko gets the response body from our network stack and Cache API as wide characters. We then do a simple equality comparison on them:

https://searchfox.org/mozilla-central/rev/cf149b7b63ff97023e28723167725e38cf5df757/dom/workers/ServiceWorkerScriptCache.cpp#804

Note that things like compression are removed before this step, but we don't do any other decoding/BOM-stripping.

aliams · 2018-01-10T00:35:42Z

@domenic, we are currently hashing the source bytes, but we'll probably need to update it to hash the decoded bytes considering that service worker scripts can be non-UTF-8.

domenic · 2018-01-10T00:45:23Z

Thankfully they cannot; they are always decoded as UTF-8 for all worker scripts (classic or, in general, for all module script fetches). It's mostly about BOM-stripping and replacement characters.

This is all very edge-casey, of course, but I guess I can imagine some developer toggling between Visual Studio (which adds BOMs) and something else (which strips them) and causing their site to update in one browser and not update in another. So we should try to get interop.

So what I'm hearing so far is that everyone is doing source bytes. I'll work on updating the spec to store those for comparison. I hope someone writes tests for me that e.g. updating the file to remove/add a BOM causes an update, or that moving from one unpaired surrogate to another unpaired surrogate still causes an update despite both being normalized to U+FFFD after decoding.

As for my original question about implementation strategies, I'll probably add a note like

NOTE: As of the time of this writing, the only use of this field is for comparison with future script fetching response bodies as part of the service worker update mechanism. As such, implementations may be able to use a more efficient strategy than storing all of the bytes in memory, e.g. reusing the disk cache or comparing hashes of the data.

aliams · 2018-01-10T01:08:26Z

We are hashing the encoded bytes meaning that if the script happened to change encodings, but not content, we’ll generate a new version. That feels like it might be unintended behavior.

domenic · 2018-01-10T02:48:37Z

Well, if it changes encodings, it'll change content, since interpreting any other encoding as UTF-8 (which is required per spec) will give different text.

aliams · 2018-01-10T03:02:33Z

Does the content encoding header not impact this? For instance, if you serve the same underlying bytes with a different encoding that gets read as UTF-8 by the browser.

domenic · 2018-01-10T03:23:46Z

Maybe we are talking past each other. You are talking about Content-Encoding (gzip/br/deflate/etc.), not Content-Type's charset parameter (utf-8/windows-1252/etc.)? Content-Encoding indeed matters; it sounds like Chrome and Firefox compare the post-decompression version. Whereas Content-Type does not matter; charset is completely ignored for service worker scripts.

aliams · 2018-01-10T05:56:36Z

Great, thanks for investigating @domenic! We also are also comparing the decoded (post-decompression) bytes before converting to UTF-8 since that's what's being hashed. That said, I think your previous summary regarding how to spec this makes sense.

This change includes/considers the following: - Include imported scripts to byte-check (for classic scripts). - Compare responses' body instead of source text as per whatwg/html#3316. - Handle duplicate importScripts() as per #1041. - Replace *imported scripts updated flag* referenced in importScripts() by using service worker's state item. - Have Update's perform the fetch steps cover module scripts. - Avoid dobule-download of imported scripts pointed out in #1023 (comment). This change basically makes it check out if the main script resource is identical to the existing resource. If so, it returns; otherwise, it creates a new service worker and evalute it to check out if any imported scripts are changed. It continues with Install only when any of the resources has been changed. With the change, importScripts() returns resources from the cache for any duplicated requests including the request for the main script. Fixes #1041, #1212, #1023.

domenic · 2018-03-06T08:24:01Z

If I have understood w3c/ServiceWorker#1283 correctly, @jungkees taking a response body-based approach means we don't need to store this on the script struct. So, I'll close this :)

This change includes/considers the following: - Include imported scripts to byte-check (for classic scripts). - Compare responses' body instead of source text as per whatwg/html#3316. - Handle duplicate importScripts() as per #1041. - Remove imported scripts updated flag referenced in importScripts() by using service worker's state item instead. - Have Update's perform the fetch steps cover module scripts. - Confirm dobule-download of imported scripts pointed out in #1023 (comment) never happens; importing a script again always gets the result from the cache. This change basically gets the imported scripts included to the byte-check for updates. Update continues with Install if any of the (main or imported) resources has been changed. This change also makes any duplicated requests (including the main script) from importScripts() return the result from the cache. Fixes #839, #1041, #1212, #1023.

domenic self-assigned this Jan 2, 2018

domenic added the topic: script label Jan 2, 2018

domenic mentioned this issue Jan 12, 2018

Getting all bytes in a body whatwg/fetch#661

Closed

jungkees mentioned this issue Mar 5, 2018

Improve service worker script caching and update w3c/ServiceWorker#1283

Merged

domenic closed this as completed Mar 6, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add the source text back to the script struct #3316

Add the source text back to the script struct #3316

jungkees commented Jan 2, 2018

domenic commented Jan 2, 2018

jungkees commented Jan 3, 2018

jungkees commented Jan 3, 2018

annevk commented Jan 3, 2018

jungkees commented Jan 4, 2018

wanderview commented Jan 4, 2018

aliams commented Jan 5, 2018

nyaxt commented Jan 5, 2018

annevk commented Jan 5, 2018

otherdaniel commented Jan 5, 2018

annevk commented Jan 5, 2018

hiroshige-g commented Jan 5, 2018

domenic commented Jan 9, 2018

jungkees commented Jan 9, 2018

jungkees commented Jan 9, 2018

mfalken commented Jan 9, 2018

jungkees commented Jan 9, 2018 •

edited

Loading

wanderview commented Jan 9, 2018

aliams commented Jan 10, 2018

domenic commented Jan 10, 2018 •

edited

Loading

aliams commented Jan 10, 2018

domenic commented Jan 10, 2018

aliams commented Jan 10, 2018

domenic commented Jan 10, 2018

aliams commented Jan 10, 2018

domenic commented Mar 6, 2018

Add the source text back to the script struct #3316

Add the source text back to the script struct #3316

Comments

jungkees commented Jan 2, 2018

domenic commented Jan 2, 2018

jungkees commented Jan 3, 2018

jungkees commented Jan 3, 2018

annevk commented Jan 3, 2018

jungkees commented Jan 4, 2018

wanderview commented Jan 4, 2018

aliams commented Jan 5, 2018

nyaxt commented Jan 5, 2018

annevk commented Jan 5, 2018

otherdaniel commented Jan 5, 2018

annevk commented Jan 5, 2018

hiroshige-g commented Jan 5, 2018

domenic commented Jan 9, 2018

jungkees commented Jan 9, 2018

jungkees commented Jan 9, 2018

mfalken commented Jan 9, 2018

jungkees commented Jan 9, 2018 • edited Loading

wanderview commented Jan 9, 2018

aliams commented Jan 10, 2018

domenic commented Jan 10, 2018 • edited Loading

aliams commented Jan 10, 2018

domenic commented Jan 10, 2018

aliams commented Jan 10, 2018

domenic commented Jan 10, 2018

aliams commented Jan 10, 2018

domenic commented Mar 6, 2018

jungkees commented Jan 9, 2018 •

edited

Loading

domenic commented Jan 10, 2018 •

edited

Loading