Add the source text back to the script struct #3316
Comments
Thanks for filing over here; this kept slipping off the top of my mind. I'll try to get to it soon!! For my personal curiosity, I'd love to hear from implementers how they plan to implement this, and maybe add a note to the spec about it, because keeping the script source in memory all the time seems like a lot of overhead. Would implementers do that only for service worker scripts? Or would they page them out to disk, perhaps? /cc @whatwg/modules @wanderview
Chromium stores the actual script resources in the disk cache and maintains a separate in-memory map that only has URL-to-resource-record info (an id and such that's tied to the actual resource). I believe Firefox, Edge, and Safari have similar implementations already. Adding a note to the spec seems like a good idea. /cc @mattto @aliams @cdumez
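For readers unfamiliar with that layout, here is a rough TypeScript sketch of the kind of URL-to-record map described above. The names (`ResourceRecord`, `resourceId`) are invented for illustration and are not Chromium's actual types; the point is only that the map holds small records pointing into the disk cache, not the script source itself.

```ts
// Hypothetical sketch: an in-memory map from script URL to a small record
// whose resourceId points at the full body stored in the disk cache, so the
// source text never has to stay resident in memory.
interface ResourceRecord {
  resourceId: number; // key into the disk-cache entry holding the bytes
  sizeBytes: number;  // size of the stored response body
}

class ScriptResourceMap {
  private records = new Map<string, ResourceRecord>();

  set(url: string, record: ResourceRecord): void {
    this.records.set(url, record);
  }

  get(url: string): ResourceRecord | undefined {
    return this.records.get(url);
  }
}
```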
To clarify, Chromium doesn't support module scripts for service workers yet.
(It seems implementations could store a SHA-2 hash as well and use that for comparison against a new script.)
Chromium compares the existing script progressively with the new script as it is read from the network, so when it detects a byte difference it starts writing the new one to disk right away. I think using a hash could be one of the ways to implement this too. It seems like an implementation option.
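A minimal sketch of that progressive-comparison idea, assuming a hypothetical `persist` callback that writes chunks to a disk cache. This is not Chromium's code, just an illustration of comparing incoming chunks against the stored bytes and switching to write-through as soon as a difference appears.

```ts
// Compare network chunks against the previously stored bytes; once any byte
// differs (or the lengths differ), flush what was read so far and keep
// streaming the new body to disk. Returns whether the script changed.
async function updateIfChanged(
  storedBytes: Uint8Array,
  networkChunks: AsyncIterable<Uint8Array>,
  persist: (chunk: Uint8Array) => Promise<void>, // hypothetical disk-cache writer
): Promise<boolean> {
  let offset = 0;
  let changed = false;
  const buffered: Uint8Array[] = []; // identical prefix, held until a change is seen

  for await (const chunk of networkChunks) {
    if (!changed) {
      const slice = storedBytes.subarray(offset, offset + chunk.length);
      const differs =
        slice.length !== chunk.length ||
        chunk.some((byte, i) => byte !== slice[i]);
      if (differs) {
        changed = true;
        for (const earlier of buffered) await persist(earlier);
      } else {
        buffered.push(chunk);
      }
    }
    if (changed) await persist(chunk);
    offset += chunk.length;
  }

  // An identical prefix but a different total length also counts as a change.
  if (!changed && offset !== storedBytes.length) {
    changed = true;
    for (const earlier of buffered) await persist(earlier);
  }
  return changed;
}
```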
We currently keep the script source in memory while performing a SW update. We haven't really implemented any early-abort optimizations, etc. I'm not sure what we do for script sources in SpiderMonkey currently.
We use a hash for comparison, which is generated after we retrieve the new script file.
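For illustration, a hash-based variant might look like the following. The SHA-256 choice and the function names are assumptions for the sketch, not necessarily what Edge or any other engine actually uses.

```ts
// Digest the newly fetched body and compare it against the hash stored for
// the previous version; a mismatch means the script changed.
async function scriptChanged(
  newBody: Uint8Array,
  storedHashHex: string,
): Promise<boolean> {
  const digest = await crypto.subtle.digest("SHA-256", newBody);
  const hex = Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
  return hex !== storedHashHex;
}
```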
@hiroshige-g @otherdaniel, who was poking at the Blink code around this area. One question: is this about the decoded source text, or the raw undecoded byte stream? Keeping the raw undecoded byte stream around can come with additional overhead.
What does decoded source text mean? Is that only text decoding or something more? Note that if Edge computes the hash based on the byte stream they'll get different results, so we should actually make this very clear (and add tests for the difference).
@nyaxt: For V8+Blink:
Okay, so different byte sequences that decode to the same code points end up not counting as a different resource in Chrome. This could be due to a leading BOM being stripped, a different encoding being used, or different byte sequences decoding to the same code point (most common with errors becoming U+FFFD).
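To make the edge cases concrete, here is a small example using the standard TextDecoder, showing two of the factors above: a leading BOM being stripped, and distinct invalid bytes both becoming U+FFFD.

```ts
// UTF-8 decoding removes a leading BOM and turns any invalid byte into U+FFFD,
// so distinct byte sequences can decode to identical code points.
const decoder = new TextDecoder("utf-8");

const withBom = new Uint8Array([0xef, 0xbb, 0xbf, 0x68, 0x69]); // BOM + "hi"
const withoutBom = new Uint8Array([0x68, 0x69]);                // "hi"
console.log(decoder.decode(withBom) === decoder.decode(withoutBom)); // true

const badByteA = new Uint8Array([0x68, 0xff]); // "h" + invalid byte 0xFF
const badByteB = new Uint8Array([0x68, 0xfe]); // "h" + invalid byte 0xFE
console.log(decoder.decode(badByteA) === decoder.decode(badByteB)); // true: both "h\uFFFD"
```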
Looks fine. According to @jungkees's comments and an offline discussion with @horo-t, I expect V8+Blink is not affected, because the update algorithm referenced in w3c/ServiceWorker#1212 is implemented in the browser process (which is quite far from Blink's …). Even when we add source text to the script struct, …
So in terms of actually working on this, do we want to store the response bytes, or the source text? If we store source text, then e.g. two scripts should compare the same even if their bytes are different, per all the factors @annevk lists. Based on what I'm hearing so far in this thread:
I hope someone can help write tests for these cases.
I think we should store the decoded/BOM-stripped source text, as the factors @annevk listed can change. What we'd want to check here is the difference between the source texts. So, I believe the hash should be computed against the decoded source text too.
If we really want to avoid adding the source text to the script struct, we could perhaps think about having fetch a * worker script return the decoded source text separately from the resulting script. But adding the source text to the script struct and leaving such a note seems simpler to understand, I think.
In terms of storing service worker scripts, Chrome stores and compares the raw bytes from the HTTP response, and it does this in the browser process separately from Blink. So Blink/V8 is not really involved in how Chrome stores and updates service workers. It might be difficult for us to store decoded/BOM-stripped source text as currently implemented, since I believe that sort of logic is implemented in the renderer process in Blink.
So, this wasn't the case actually. Different byte sequences will always count as a different resource in Chrome. I'd like to hear from EdgeHTML, Gecko, and WebKit whether they do the same as Chrome or actually use decoded source text. BTW, how likely are we to encounter the "different byte sequences that decode to the same code points" case in the wild? EDIT: fixed the confused part indicated by strikethrough.
Gecko gets the response body from our network stack and the Cache API as wide characters. We then do a simple equality comparison on them. Note that things like compression are removed before this step, but we don't do any other decoding/BOM-stripping.
@domenic, we are currently hashing the source bytes, but we'll probably need to update it to hash the decoded bytes considering that service worker scripts can be non-UTF-8.
Thankfully they cannot; they are always decoded as UTF-8 for all worker scripts (classic or module) and, in general, for all module script fetches. It's mostly about BOM-stripping and replacement characters. This is all very edge-casey, of course, but I guess I can imagine some developer toggling between Visual Studio (which adds BOMs) and something else (which strips them) and causing their site to update in one browser and not update in another. So we should try to get interop.

So what I'm hearing so far is that everyone is doing source bytes. I'll work on updating the spec to store those for comparison. I hope someone writes tests for me that e.g. updating the file to remove/add a BOM causes an update, or that moving from one unpaired surrogate to another unpaired surrogate still causes an update despite both being normalized to U+FFFD after decoding.

As for my original question about implementation strategies, I'll probably add a note like
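As one concrete instance of the test cases mentioned above, two different unpaired-surrogate byte sequences both decode to replacement characters, so only a byte-level comparison would see an update. This is just a sketch of the scenario, not an actual web-platform test.

```ts
// Invalid UTF-8 encodings of two different unpaired surrogates both decode to
// "\uFFFD\uFFFD\uFFFD", so decoded-text comparison misses the change while
// byte comparison catches it.
const decoder = new TextDecoder("utf-8");
const surrogateA = new Uint8Array([0xed, 0xa0, 0x80]); // would be U+D800 if allowed
const surrogateB = new Uint8Array([0xed, 0xb0, 0x80]); // would be U+DC00 if allowed

const sameBytes =
  surrogateA.length === surrogateB.length &&
  surrogateA.every((byte, i) => byte === surrogateB[i]);

console.log(sameBytes);                                                // false: byte comparison triggers an update
console.log(decoder.decode(surrogateA) === decoder.decode(surrogateB)); // true: decoded texts are identical
```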
We are hashing the encoded bytes, meaning that if the script happened to change encodings, but not content, we'll generate a new version. That feels like it might be unintended behavior.
Well, if it changes encodings, it'll change content, since interpreting any other encoding as UTF-8 (which is required per spec) will give different text.
Does the content encoding header not impact this? For instance, if you serve the same underlying bytes with a different encoding that gets read as UTF-8 by the browser.
Maybe we are talking past each other. You are talking about Content-Encoding (gzip/br/deflate/etc.), not Content-Type's charset parameter (utf-8/windows-1252/etc.)? Content-Encoding indeed matters; it sounds like Chrome and Firefox compare the post-decompression version. Whereas Content-Type does not matter; charset is completely ignored for service worker scripts.
Great, thanks for investigating @domenic! We are also comparing the decoded (post-decompression) bytes before converting to UTF-8, since that's what's being hashed. That said, I think your previous summary regarding how to spec this makes sense.
This change includes/considers the following:
- Include imported scripts in the byte-check (for classic scripts).
- Compare responses' bodies instead of source text as per whatwg/html#3316.
- Handle duplicate importScripts() as per #1041.
- Replace the *imported scripts updated flag* referenced in importScripts() by using service worker's state item.
- Have Update's perform the fetch steps cover module scripts.
- Avoid the double-download of imported scripts pointed out in #1023 (comment).

This change basically makes it check whether the main script resource is identical to the existing resource. If so, it returns; otherwise, it creates a new service worker and evaluates it to check whether any imported scripts have changed. It continues with Install only when any of the resources has changed. With the change, importScripts() returns resources from the cache for any duplicated requests, including the request for the main script.

Fixes #1041, #1212, #1023.
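As a hypothetical illustration of the update decision that change describes, the sketch below byte-checks the main script and every imported script against the stored versions and only proceeds to Install when something changed. The types and helper names are invented for illustration and are not spec or browser code.

```ts
// URL -> response body bytes stored for the currently installed worker.
interface StoredScripts {
  [url: string]: Uint8Array;
}

function bytesEqual(a: Uint8Array, b: Uint8Array): boolean {
  return a.length === b.length && a.every((byte, i) => byte === b[i]);
}

function shouldInstall(oldScripts: StoredScripts, newScripts: StoredScripts): boolean {
  const urls = new Set([...Object.keys(oldScripts), ...Object.keys(newScripts)]);
  for (const url of urls) {
    const oldBody = oldScripts[url];
    const newBody = newScripts[url];
    // A script that was added, removed, or whose bytes changed triggers Install.
    if (!oldBody || !newBody || !bytesEqual(oldBody, newBody)) return true;
  }
  return false; // every (main or imported) resource is byte-identical
}
```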
If I have understood w3c/ServiceWorker#1283 correctly, @jungkees taking a response body-based approach means we don't need to store this on the script struct. So, I'll close this :)
This change includes/considers the following:
- Include imported scripts in the byte-check (for classic scripts).
- Compare responses' bodies instead of source text as per whatwg/html#3316.
- Handle duplicate importScripts() as per #1041.
- Remove the imported scripts updated flag referenced in importScripts() by using service worker's state item instead.
- Have Update's perform the fetch steps cover module scripts.
- Confirm the double-download of imported scripts pointed out in #1023 (comment) never happens; importing a script again always gets the result from the cache.

This change basically gets the imported scripts included in the byte-check for updates. Update continues with Install if any of the (main or imported) resources has changed. This change also makes any duplicated requests (including the main script) from importScripts() return the result from the cache.

Fixes #839, #1041, #1212, #1023.
Service workers reference the script's source text to compare two service workers' versions. The script's source text was removed from HTML at some point; please add it back. We plan to use it for both the classic script comparison and the module script comparison.
Related issue: w3c/ServiceWorker#1212.
/cc @domenic