Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the source text back to the script struct #3316

Closed
jungkees opened this issue Jan 2, 2018 · 26 comments
Closed

Add the source text back to the script struct #3316

jungkees opened this issue Jan 2, 2018 · 26 comments
Assignees

Comments

@jungkees
Copy link
Contributor

jungkees commented Jan 2, 2018

Service workers reference the script's source text to compare two service workers' version. The script's source text has been removed from HTML at some point. Please add it back. We plan to use it for both the classic script comparison and the module script comparison.

Related issue: w3c/ServiceWorker#1212.

/cc @domenic

@domenic domenic self-assigned this Jan 2, 2018
@domenic
Copy link
Member

domenic commented Jan 2, 2018

Thanks for filing over here; this kept slipping off the top of my mind. I'll try to get to it soon!!

For my personal curiousity, I'd love to hear from implementers how they plan to implement this, and maybe add a note to the spec about it. Because keeping the script source in memory all the time seems like a lot of overhead. Would implementers do that only for service worker scripts? Or would they page them out to disk, perhaps? /cc @whatwg/modules @wanderview

@jungkees
Copy link
Contributor Author

jungkees commented Jan 3, 2018

Chromium stores the actual script resources in the disk cache and maintains a separate in-memory map that only has url to resource record info (id and such that's tied to the actual resource.) I believe Firefox, Edge, and Safari have similar implementations already. Adding a note to the spec seems like a good idea. /cc @mattto @aliams @cdumez

@jungkees
Copy link
Contributor Author

jungkees commented Jan 3, 2018

To clarify, Chromium doesn't support module scripts for service workers yet.

@annevk
Copy link
Member

annevk commented Jan 3, 2018

(It seems implementations could store a SHA-2 hash as well and use that for comparison against a new script.)

@jungkees
Copy link
Contributor Author

jungkees commented Jan 4, 2018

Chromium's comparing the existing script progressively with the new script as it is read from network so when it detects a byte difference it starts writing the new one to the disk right away.

I think using hash could be one of the ways to implement this too. It seems like implementation options.

@wanderview
Copy link
Member

We currently keep the script source in memory while performing a SW update. We haven't really implemented any early abort optimizations, etc.

I'm not sure what we do for script sources in spidermonkey currently.

@aliams
Copy link
Member

aliams commented Jan 5, 2018

We use a hash for comparison which is generated after we retrieve the new script file.

@nyaxt
Copy link
Member

nyaxt commented Jan 5, 2018

@hiroshige-g @otherdaniel who was poking at the Blink code around the area.
I believe we (Blink+V8) keep the script source text in memory, even after execution, so that we can answer Function.prototype.toString().

One question: Is this about decoded source text, or raw undecoded bytestream? Keeping raw undecoded bytestream around can come with additional overhead.

@annevk
Copy link
Member

annevk commented Jan 5, 2018

What does decoded source text mean? Is that only text decoding or something more? Note that if Edge computes the hash based on the byte stream they'll get different results, so we should actually make this very clear (and add tests for the difference).

@otherdaniel
Copy link
Contributor

@nyaxt: For V8+Blink:

  • V8 keeps a reference to all source text as v8::String.
  • The exact shape of that string depends on how the source was handed down from Blink. Usually, it's External(One|Two)ByteString, meaning it wraps a piece of memory owned by Blink. ServiceWorkers might handle this differently; not sure about that.
  • The script string is kept until the GC no longer finds any direct or indirect references.
    • (indirect refs include: the script object; any compiled function; any substring that references the original; etc.)
  • The raw bytestream is only temporarily used (while streaming), and we won't keep a reference past the initial compile (which occurs when streaming is finished).

@annevk:

  • decoded == ASCII or UCS-2, derived from whichever text encoding the resource originally was in.

@annevk
Copy link
Member

annevk commented Jan 5, 2018

Okay, so different byte sequences that decode to the same code points end up not counting as a different resource in Chrome. This could be due to a leading BOM being stripped, a different encoding being used, and different byte sequences decoding to the same code point (most common with errors becoming U+FFFD).

@hiroshige-g
Copy link
Contributor

Looks fine.

According to @jungkees's comments and offline discussion with @horo-t, I expect V8+Blink is not affected because the update algorithm referenced in w3c/ServiceWorker#1212 is implemented in the browser process (which is quite far from Blink's #concept-script impl blink::Script).

Even when we add source text to blink::Script, I expect it doesn't cause memory regression (+1 to @nyaxt @otherdaniel comments).
But I think it's better not to add source text to blink::Script for implementation flexibility, and thus it might be better to limit the source text's usage (e.g. by add a note "comparison only").

@domenic
Copy link
Member

domenic commented Jan 9, 2018

So in terms of actually working on this, do we want to store the response bytes, or the source text? If we store source text, then e.g. two scripts should compare the same even if their bytes are different, per all the factors @annevk lists.

Based on what I'm hearing so far in this thread:

  • Edge: unclear; @aliams is the hash being done on the source bytes or the decoded/BOM-stripped source text?
  • Gecko: @wanderview said "script source" implying to me source text, not bytes
  • Blink: source text
  • WebKit: ?

I hope someone can help write tests for these cases.

@jungkees
Copy link
Contributor Author

jungkees commented Jan 9, 2018

do we want to store the response bytes, or the source text? If we store source text, then e.g. two scripts should compare the same even if their bytes are different, per all the factors @annevk lists.

I think we should store the decoded/BOM-stripped source text as the factors @annevk listed can change. What we'd want to check here is the difference of the source texts. So, I believe the hash should be computed against the decoded source text too.

@jungkees
Copy link
Contributor Author

jungkees commented Jan 9, 2018

But I think it's better not to add source text to blink::Script for implementation flexibility, and thus it might be better to limit the source text's usage (e.g. by add a note "comparison only").

If we really want to avoid adding the source text to the script struct, we perhaps can think about having fetch a * worker script return the decoded source text separately from the resulting script. But adding the source text to the script struct and leaving such a note seems to be simpler to understand I think.

@mfalken
Copy link
Contributor

mfalken commented Jan 9, 2018

So in terms of actually working on this, do we want to store the response bytes, or the source text?

In terms of storing service worker scripts, Chrome stores and compares the raw bytes from the HTTP response, and it does this in the browser process separately from Blink. So Blink/V8 is not really involved in how Chrome stores and updates service workers.

It might be difficult for us to store decoded/BOM-stripped source text as currently implemented, since I believe that sort of logic is implemented in the renderer process in Blink.

@jungkees
Copy link
Contributor Author

jungkees commented Jan 9, 2018

Okay, so different byte sequences that decode to the same code points end up not counting as a different resource in Chrome.

So, this wasn't the case actually. Different byte sequences will always abortgo on with the update in Chrome.

I'd like to hear from EdgeHTML, Gecko, and WebKit whether they do the same as Chrome or actually use decoded source text.

BTW, how likely do we encounter "different byte sequences that decode to the same code points" case in the wild?

EDIT: fixed the confused part indicated by strikethrough.

@wanderview
Copy link
Member

Gecko gets the response body from our network stack and Cache API as wide characters. We then do a simple equality comparison on them:

https://searchfox.org/mozilla-central/rev/cf149b7b63ff97023e28723167725e38cf5df757/dom/workers/ServiceWorkerScriptCache.cpp#804

Note that things like compression are removed before this step, but we don't do any other decoding/BOM-stripping.

@aliams
Copy link
Member

aliams commented Jan 10, 2018

@domenic, we are currently hashing the source bytes, but we'll probably need to update it to hash the decoded bytes considering that service worker scripts can be non-UTF-8.

@domenic
Copy link
Member

domenic commented Jan 10, 2018

Thankfully they cannot; they are always decoded as UTF-8 for all worker scripts (classic or, in general, for all module script fetches). It's mostly about BOM-stripping and replacement characters.

This is all very edge-casey, of course, but I guess I can imagine some developer toggling between Visual Studio (which adds BOMs) and something else (which strips them) and causing their site to update in one browser and not update in another. So we should try to get interop.

So what I'm hearing so far is that everyone is doing source bytes. I'll work on updating the spec to store those for comparison. I hope someone writes tests for me that e.g. updating the file to remove/add a BOM causes an update, or that moving from one unpaired surrogate to another unpaired surrogate still causes an update despite both being normalized to U+FFFD after decoding.

As for my original question about implementation strategies, I'll probably add a note like

NOTE: As of the time of this writing, the only use of this field is for comparison with future script fetching response bodies as part of the service worker update mechanism. As such, implementations may be able to use a more efficient strategy than storing all of the bytes in memory, e.g. reusing the disk cache or comparing hashes of the data.

@aliams
Copy link
Member

aliams commented Jan 10, 2018

We are hashing the encoded bytes meaning that if the script happened to change encodings, but not content, we’ll generate a new version. That feels like it might be unintended behavior.

@domenic
Copy link
Member

domenic commented Jan 10, 2018

Well, if it changes encodings, it'll change content, since interpreting any other encoding as UTF-8 (which is required per spec) will give different text.

@aliams
Copy link
Member

aliams commented Jan 10, 2018

Does the content encoding header not impact this? For instance, if you serve the same underlying bytes with a different encoding that gets read as UTF-8 by the browser.

@domenic
Copy link
Member

domenic commented Jan 10, 2018

Maybe we are talking past each other. You are talking about Content-Encoding (gzip/br/deflate/etc.), not Content-Type's charset parameter (utf-8/windows-1252/etc.)? Content-Encoding indeed matters; it sounds like Chrome and Firefox compare the post-decompression version. Whereas Content-Type does not matter; charset is completely ignored for service worker scripts.

@aliams
Copy link
Member

aliams commented Jan 10, 2018

Great, thanks for investigating @domenic! We also are also comparing the decoded (post-decompression) bytes before converting to UTF-8 since that's what's being hashed. That said, I think your previous summary regarding how to spec this makes sense.

jungkees added a commit to w3c/ServiceWorker that referenced this issue Mar 5, 2018
This change includes/considers the following:
 - Include imported scripts to byte-check (for classic scripts).
 - Compare responses' body instead of source text as per whatwg/html#3316.
 - Handle duplicate importScripts() as per #1041.
 - Replace *imported scripts updated flag* referenced in importScripts()
   by using service worker's state item.
 - Have Update's perform the fetch steps cover module scripts.
 - Avoid dobule-download of imported scripts pointed out in #1023 (comment).

This change basically makes it check out if the main script resource is
identical to the existing resource. If so, it returns; otherwise, it
creates a new service worker and evalute it to check out if any imported
scripts are changed. It continues with Install only when any of the
resources has been changed. With the change, importScripts() returns
resources from the cache for any duplicated requests including the
request for the main script.

Fixes #1041, #1212, #1023.
@domenic
Copy link
Member

domenic commented Mar 6, 2018

If I have understood w3c/ServiceWorker#1283 correctly, @jungkees taking a response body-based approach means we don't need to store this on the script struct. So, I'll close this :)

@domenic domenic closed this as completed Mar 6, 2018
jungkees added a commit to w3c/ServiceWorker that referenced this issue Mar 13, 2018
This change includes/considers the following:

 - Include imported scripts to byte-check (for classic scripts).
 - Compare responses' body instead of source text as per whatwg/html#3316.
 - Handle duplicate importScripts() as per #1041.
 - Remove imported scripts updated flag referenced in importScripts() by
   using service worker's state item instead.
 - Have Update's perform the fetch steps cover module scripts.
 - Confirm dobule-download of imported scripts pointed out in #1023 (comment)
   never happens; importing a script again always gets the result from
   the cache.

This change basically gets the imported scripts included to the
byte-check for updates. Update continues with Install if any of the
(main or imported) resources has been changed. This change also makes
any duplicated requests (including the main script) from importScripts()
return the result from the cache.

Fixes #839, #1041, #1212, #1023.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

9 participants