Getting all bytes in a body #661
Comments
I guess I should also mention that, since this is blocking whatwg/html#3316, which is quite an old issue (going back to w3c/ServiceWorker#1212), maybe we should implement my minimal-delta proposal first, and then worry about cleaning up this larger space (especially the Encoding byte streams mess) later. I'd like @annevk's blessing on that route first, though, before sending PRs.
Your approach sounds reasonable, though I'd appreciate it if you could read through whatwg/html#1077 (comment) (and maybe earlier comments) and whatwg/encoding#128 to see if that changes your mind. I agree that it would be good to have a shorthand for the body.
Note that we do have https://fetch.spec.whatwg.org/#concept-read-all-bytes-from-readablestream which goes from a ReadableStream to a byte sequence. And it's rejected if the stream somehow errors (which may not be what we always want, I guess; that's a little unclear to me).
In the short term, it looks like a bit of cut-and-paste of "reading the body" language from the Fetch standard to the HTML standard will solve the immediate issue. Is that what you are proposing? What I'd like to avoid is adding explicit support for synchronous reading to the Streams standard. That's a big can of worms. My understanding is that the glue between the JavaScript world and the HTML world generally happens in the Web IDL standard. So in the longer term I'd like to see the ReadableStream glue moved from Fetch to Web IDL. I'd also like to add some extra glue for TransformStream, so we can avoid things like explicitly creating promises and calling abstract operations in the Encoding standard. In the next few days I'm planning to try a search-and-replace from "streams" to "token sequences" in the Encoding standard. Once we see what that looks like we can make a judgement as to whether that is the way forward to resolve the streams/streams confusion.
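(As a rough JS-level analogue of that "read all bytes" operation, an illustrative sketch rather than spec text: it amounts to draining the stream with a reader and concatenating the chunks, and rejecting if the stream errors.)

```js
// Illustrative sketch of "read all bytes from a ReadableStream":
// drain the stream with a reader and concatenate the Uint8Array chunks.
// The returned promise rejects if the stream errors.
async function readAllBytes(stream) {
  const reader = stream.getReader();
  const chunks = [];
  let total = 0;
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    chunks.push(value);
    total += value.byteLength;
  }
  const bytes = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    bytes.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return bytes;
}
```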
Basically yes, although I'd like to locate it somewhere more reusable. Also, it would bridge away from promises and into "spec async". But yes, I think that's the right short-term plan. I'm more confused about the long-term vision.
What about if it were done using a magic "wait" primitive which was only available "in parallel" (i.e. on background threads)? Well, still a pretty big can of worms, given you'd essentially be talking about manipulating the stream JS objects in parallel :-/. Yeah, let's not do that, at least not until I've really psyched myself up into believing it's clearly the best path forward...
Where it lives is less concerning to me than the model. Do we try to emulate Web IDL's style by treating streams as JS objects and then translating them into spec language? E.g. wrapping promises up into "spec async", or Uint8Arrays into Infra byte sequences. Or, do we try to say that there is some "byte stream" or "byte token sequences" concept that is more fundamental, and the JS API lives on top of?
I think that'll be a great first step. However I'm also worried about the byte sequences/token sequences confusion I discovered, where lots of other specs treat Encoding's algorithms as if they operate on static byte sequences, whereas Encoding defines them as operating on streams/token sequences that can change over time. I think if we had more explicit static byte-sequence-taking endpoints in Encoding, it'd be clearer how much of encoding really needs "streaming" (i.e. token sequences) and thus how much of Encoding needs asynchronicity and all the attendant complexity.
I hadn't thought about that aspect. There needs to be a clear separation between parts that are notionally taking place on some other thread and those parts that may be calling into JS. But it naively appears to be doable with the right wording; something like "await with JavaScript context Foo ...".
What's the cost of getting it wrong? My inclination is just to start with the text from the Fetch standard and expand on that. But obviously we shouldn't go down a dead-end if it's going to be hard to get back out.
I think they are probably just missing the:
step from decode() and the:
step from encode(). Since I don't find it attractive to export the Encoding Standard's "stream" concept, I think it would be preferable to add a couple of algorithmic wrappers around the run algorithm for use by other standards.
Everything in Encoding is synchronous, so there's no problem there. The kind of asynchronicity I need is an implicit conversion from a synchronous algorithm to an asynchronous algorithm, so that the algorithms for transform stream support can just say "return" and the Streams standard will see a promise resolved with undefined. Encoding's concept of a "stream" is technically speaking a double-ended queue (deque). It's almost a simple queue; it only needs to be double-ended to support the "prepend" operation. No "streaming" as such is required, and I don't think browsers actually implement it this way; it just makes the algorithms clearer. There's also the confusingly-named "stream" option to decode(), which doesn't "need" streaming but enables callers to implement it.
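(To illustrate that last point, here is how the "stream" option to decode() behaves at the API level; the decoder buffers an incomplete byte sequence across calls rather than doing anything asynchronous. A small sketch using TextDecoder:)

```js
// The "stream" option to TextDecoder.decode(): a chunk may end mid-sequence,
// so the decoder buffers the trailing bytes until the next call.
const decoder = new TextDecoder();
const part1 = new Uint8Array([0xe2, 0x82]); // first two bytes of "€" (U+20AC)
const part2 = new Uint8Array([0xac]);       // final byte
let text = decoder.decode(part1, { stream: true }); // "" (incomplete sequence buffered)
text += decoder.decode(part2);                      // "€" (buffered bytes flushed)
console.log(text); // "€"
```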
FWIW, I think "wait" is fine here. If some HTML feature invokes fetch, that'll go in parallel. The JavaScript involved would run in a service worker, but as the JavaScript writes things there, it'll then get copied over and be out of JavaScript's control, so we can perfectly well wait for it in some HTML feature definition. What scenario are you thinking of where that would not be the case?
I wonder if we should solve this together with #536, where, if all you need are the headers and the response body as a byte sequence, you'd simply register for specification callbacks that give you exactly that. The tasks fetch queues were essentially meant to provide this, but the design was done before streams.
As far as I can tell we need two operations here and both would make the overall setup more verbose...
Both 1 and 2 could be used by XMLHttpRequest so it would no longer have to talk about streams directly. The algorithms would be invoked from a parallel queue so the caller will have to use "queue a global task" as needed. We could create a wrapper that does that as well I suppose. (Maybe that's better, we add "in parallel" to the names of the algorithms above and the names above would queue tasks instead. What realm though? The current realm? The magical realm we might add for class construction and such?) However, Fetch has "process response end-of-body" which I suppose we could remove. But Fetch also needs the entire body to be readable for integrity metadata checks. That's what https://fetch.spec.whatwg.org/#concept-body-wait is for. Should that basically tee (and wait, to avoid returning it before we know if it's safe)? Otherwise we cannot read it twice, right?
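(On the "cannot read it twice" point: the JS-level analogue would be ReadableStream's tee(). A hedged sketch, with a placeholder URL:)

```js
// A stream can only be read once, so checking integrity while still handing
// the body to the caller would need two branches, e.g. via tee().
const response = await fetch("https://example.com/resource"); // placeholder URL
const [forIntegrity, forCaller] = response.body.tee();
// Drain forIntegrity to compute the digest; only surface forCaller once the check passes.
```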
Hmm. Yeah, I think this is unavoidable. In particular, the typical flow even for simple cases like fetching scripts or for
Note that body-reading failure is not handled today in specs, so this is an improvement.
Oh, that's annoying. It would add an indentation level to the above I guess? Maybe instead we could do
I think we can safely assume the task source is always the networking task source. And yeah, when we have a magic global we could maybe omit the global... I should prioritize that.
Umm, hmm. I kind of doubt implementations are making a full tee of the stream just to do integrity checking? That is:
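(A purely hypothetical sketch of that incremental alternative, using Node-style crypto since WebCrypto has no incremental digest; the helper name and the bare base64 digest argument are made up for illustration:)

```js
// Hypothetical sketch: fold each chunk into a digest as it passes through,
// and only error the stream at the end if the digest does not match.
// expectedBase64 is the bare base64 part of the integrity metadata
// (illustrative, not how the spec phrases it).
import { createHash } from "node:crypto";

function integrityCheckingStream(expectedBase64, algorithm = "sha512") {
  const hash = createHash(algorithm);
  return new TransformStream({
    transform(chunk, controller) {
      hash.update(chunk);        // fold the chunk into the digest
      controller.enqueue(chunk); // and pass it through unchanged
    },
    flush(controller) {
      if (hash.digest("base64") !== expectedBase64) {
        controller.error(new TypeError("Integrity check failed"));
      }
    },
  });
}
```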
For fetch(), as currently specified, we don't run the process response callback until we check the SRI. For scripts, we start running some tasks (e.g., compiling) before calculating the hash, but such tasks should be side-effect free, and we abort everything when we find an SRI error.
@domenic and I discussed this on IRC as well and my tentative plan is to fully read response's body's stream and then set response's body's stream to a new stream that user agents can enqueue chunks on. A buffering proxy that does SRI, if you will. This would not allow for doing work in parallel, but I think that is fine as that would be an optimization.
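(A rough sketch of that buffering-proxy idea at the JS level, reusing the readAllBytes helper sketched earlier; the names and the verify callback are illustrative, not from the spec:)

```js
// Buffering proxy sketch: fully read the original body, verify it, then
// expose a fresh stream containing the same bytes. Throws before returning
// a stream if verification fails.
async function bufferAndVerify(body, verify) {
  const bytes = await readAllBytes(body); // see the readAllBytes sketch above
  if (!(await verify(bytes))) {
    throw new TypeError("Integrity check failed");
  }
  return new ReadableStream({
    start(controller) {
      controller.enqueue(bytes);
      controller.close();
    },
  });
}
```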
This refactoring allows specifications building on top of Fetch to avoid having to directly integrate with Streams. It makes these changes:
* Body's transmitted bytes is removed. Body's total bytes will likely be removed in #604.
* Body's done and wait concepts are removed.
* It introduces three primitives to read a body: incrementally read (process each chunk), fully read (get all), and fully reading body as promise (get all as a promise).
* Removed transmit request body in favor of HTTP-network fetch using incrementally read directly to transmit the request body.
* Fetch's processRequestBody and processRequestEndOfBody are no longer passed a request. The former is passed the chunk length for progress reporting. (Needed by XMLHttpRequest.)
* Fetch's processResponseEndOfBody, when passed, will now cause the body to be fully read and will pass the result in the second argument. (This means that callers can no longer read the body themselves.)
* Subresource Integrity changed slightly, with Fetch handing it the bytes and integrity metadata and no longer a response. Essentially fetch fully reads the body, does the check, and then creates a new body from the same bytes to hand to the caller.

Subresource Integrity PR: w3c/webappsec-subresource-integrity#97. XMLHttpRequest PR: whatwg/xhr#313. Closes #661.
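(For orientation, and only as a rough mapping, the JS-visible counterpart of "fully reading body as promise" is what the Body mixin methods already expose, while "incrementally read" corresponds to consuming response.body chunk by chunk. For example:)

```js
// Rough JS-visible counterpart of "fully reading body as promise":
// the Body mixin methods hand back the whole payload at once. Placeholder URL.
const buffer = await (await fetch("https://example.com/data")).arrayBuffer();
// "Incrementally read" corresponds instead to pulling chunks off response.body.
```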
Hmm, I saw some integrity question earlier on another repo. I really thought that it was possible to process the readable stream before the integrity was even calculated, and that it would just throw an error at the very end of the stream if it did not match, instead of successfully closing the stream. Is this really how we want it to work?
// this only console logs 1 final chunk
fetch('https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.4.2/p5.min.js', {
integrity: 'sha512-rCZdHNB0AePry6kAnKAVFMRfWPmUXSo+/vlGtrOUvhsxD0Punm/xWbEh+8vppPIOzKB9xnk42yCRZ5MD/jvvjQ=='
}).then(res => {
const reader = res.body.getReader()
function pump() {
reader.read().then(val => {
console.log(val)
if (!val.done) pump()
})
}
pump()
})
// whereas this logs multiple times
fetch('https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.4.2/p5.min.js').then(res => {
const reader = res.body.getReader()
function pump() {
reader.read().then(val => {
console.log(val)
if (!val.done) pump()
})
}
pump()
})
@jimmywarting please don't comment on long-closed issues with novel questions. I believe I've called this out a couple of times now; it's starting to get a little annoying. Perhaps discuss it in https://whatwg.org/chat first next time around, or ask on Stack Overflow.
(This spans HTML, Encoding, Streams, and Fetch. Let's tag at least @ricea to help while we're here, and @jakearchibald since he was helpful in whatwg/infra#181.)
I need help figuring out how at a spec level to get all bytes in a body. This is part of solving whatwg/html#3316.
Right now specs are using a few patterns:
In general the problem here is Streams' use of the JS formalism, including promises, and how that interacts poorly with spec-level code and its different conventions and type system. The confusion between Encoding's streams and Streams's streams is also tricky.
Looking at https://html.spec.whatwg.org/#fetch-a-classic-script my first thought is that it should be written as:
This doesn't quite work on a few levels:
So here is my proposal, which is essentially trying for a minimal delta from today:
An alternate approach, which you might prefer, is to double down on the spec-level concept of a byte stream. We'd provide some way of translating a ReadableStream, and thus a body, into a spec-level byte stream. It creates a reader, locks the ReadableStream, and from then on specs only manipulate the byte stream.
The hardest part of this, I think, is putting the asynchronicity of "reading" from the byte stream on solid ground. It seems very hand-wavey in Encoding, and most of Encoding's clients, right now. We could either:
This ties back to whatwg/infra#181 again. The problem with synchronous reads is that you can only do them from in-parallel sections, and I think in most cases "decode" does not run in those sections.
Indeed, the larger issue where encode/decode are often used on strings/byte sequences, instead of on "character streams"/"byte streams", seems pretty prevalent: see e.g. https://html.spec.whatwg.org/#form-submission-algorithm:encode or https://html.spec.whatwg.org/#navigating-across-documents:utf-8-decode. So maybe we need to do something about that anyway. Those cases wouldn't need to run in parallel.
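(For contrast with the streaming case, the one-shot pattern those callers rely on looks like this at the API level, a trivial sketch:)

```js
// One-shot encode/decode on complete values: no streaming involved.
const bytes = new TextEncoder().encode("héllo"); // Uint8Array of UTF-8 bytes
const text = new TextDecoder().decode(bytes);    // back to "héllo"
console.log(text === "héllo"); // true
```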