Image upload sometimes stalls with HTTP/2 #3559

luqmana · 2023-07-11T18:43:53Z

This was the image upload failing on Firefox/macOS bug that @david-crespo was running into.

I've looked into it some more and from what I can tell, at some point the browser gets stalled while uploading some chunks.
On the console side we split up the file into 384KiB chunks which we try to upload 6 at a time (to not hit browser concurrency limits). It doesn't happen every time, but every so often there will be a chunk or two where it seems like the browser made the request but there's no response. (At least from the browser Web Developer Tools Network tab).

I changed the console side to add a query parameter on each individual chunk upload and for the stalled chunks I saw no mention of such a request in the Nexus logs. Next I tried a packet capture with Wireshark and after setting up SSLKEYLOGFILE (because this will only repro with https; we'll get back to that) I did see the browser making those requests. And in fact, I could see it sending until at some point it stops with still no response.

After not being able to repro without TLS and then seeing the packet cap, I realized we're using HTTP/2 for compatible clients. The browser is maintaining a single connection and using multiple HTTP/2 streams to make the different requests. Ok, so is something else blocking our image upload somehow? Cue some more reading about HTTP/2 and it definitely has a concept of flow control that each peer maintains separately.

Basically, for each side, there's a connection level flow control window size as well as a per-stream size. Every byte sent decrements the available bytes in the window. If the sender has exhausted the window size, they must not send anymore until a WINDOW_UPDATE is received from the peer that tells it there's more space.

I need to look into it some more but it seems like the browser might think its exhausted the window but there's no window update from the nexus size. Hacked in a tracing subscriber to see anything useful from hyper and I do see mentions of the stalled streams but not super familier with hyper enough to decode them yet. (hyperium/hyper#2899 seems relevant maybe?)

The text was updated successfully, but these errors were encountered:

luqmana · 2023-07-11T19:04:27Z

I've found that navigating around and doing other things on the console shortly before trying the upload makes it more likely to fail. The HTTP/2 connection reuse might figure into that. So force reloading the image upload page could help?

Anecdotally, Chrome and the cli haven't run into this and the packet capture for those show Nexus sending WINDOW_UPDATE frames more regularly. So for a workaround that can be noted.

luqmana · 2023-07-11T21:03:08Z

I took a peek at how the cli was doing image uploads and compared it to what the console was doing and noticed there was a difference in the size of chunks both used (cli, console).

As a quick test I changed the console to use the same chunk size:

index 879342f4..93210bc0 100644
--- a/app/forms/image-upload.tsx
+++ b/app/forms/image-upload.tsx
@@ -136,7 +136,8 @@ const ABORT_ERROR = new Error('Upload canceled')
  * line breaks can introduce up 4% more overhead, but that doesn't apply here
  * because we are only sending the encoded data itself.
  */
-const CHUNK_SIZE = Math.floor((512 * KiB * 3) / 4)
+const CHUNK_SIZE = 512 * KiB

With that, I've started seeing regular WINDOW_UPDATE frames to firefox now and I can't reproduce the bug anymore.

It's hard to say exactly why that works but the HTTP/2 spec does not mandate any specifics inhow a sender or receiver needs to manage these windows:

HTTP/2 defines only the format and semantics of the WINDOW_UPDATE frame (Section 6.9). This document does not stipulate how a receiver decides when to send this frame or the value that it sends, nor does it specify how a sender chooses to send packets. Implementations are able to select any algorithm that suits their needs.¹

It could be that increasing CHUNK_SIZE is enough to trigger hyper's threshold for sending WINDOW_UPDATE's.

Regardless, I think we definitely want to think about adjusting the server side initial window frame size instead of just relying on the default hyper selects. There's also a disabled-by-default "adaptive window size" feature that implements Bandwidth-delay product to dynamically adjust the windows size (hyperium/hyper#1960).

RFC 7540 § 5.2.1 ↩

augustuswm · 2024-01-26T21:18:31Z

We received a report of a user hitting this on Windows 11 + Chrome (latest versions as of this update) when uploading to the colo rack release 5 (stuck at 100% upload). Specifically attempting to upload this image: https://download.fedoraproject.org/pub/fedora/linux/releases/39/Cloud/x86_64/images/Fedora-Cloud-Base-39-1.5.x86_64.raw.xz

Going to check logs on the rack to see if I can find anything relevant.

jmpesp · 2024-01-29T16:07:28Z

Should we be attempting to prevent users from uploading compressed images? There's no way they can boot from a .raw.xz file, as no where in the stack do we decompress anything.

david-crespo · 2024-01-29T16:10:37Z

We can certainly check the extension client-side at the very least. Actually, it would have to be client-side because the API never sees the filename. We have an issue (oxidecomputer/console#1846) to at least say something about this on the form, but I'll add a note about enforcement.

jmpesp · 2024-01-29T16:11:54Z

Maybe a warning instead of outright not allowing?

jmpesp · 2024-01-29T16:20:14Z

I just tried uploading Fedora-Cloud-Base-39-1.5.x86_64.raw.xz to a locally running Pantry (no control plane or anything, just some Python), and saw

16:18:28.009Z INFO crucible-pantry (dropshot): request completed
    error_message_external = Data length not block size multiple
    error_message_internal = Data length not block size multiple
    latency_us = 1386
    local_addr = [::]:12345
    method = POST
    remote_addr = [::1]:49244
    req_id = 4c866850-7e60-4d48-99f7-a632cefde750
    response_code = 400
    uri = //crucible/pantry/0/volume/914ad140-5f54-4480-891c-857a5be89864/bulk-write

This makes sense: that URL currently returns Content-Length: 446022784 which is not divisible by 512.

augustuswm · 2024-01-29T16:27:07Z

I just got an email back from them and confirmed that they were not extracting the image first. Also confirmed that they were actually getting an error during upload. I'll spin this off into a new ticket so that we can possibly better handle this error.

david-crespo · 2024-01-29T16:34:10Z

That's a relief. That likely means one of their chunk requests errored out and the built-in retries weren't enough. We could probably stand to pull the error message out of the API response and display it there. The fact that the size still got up to 100% is weird, though. The chunk size is 512 KiB, so if one failed we should see the amount uploaded down by half a MiB. Could be a mistake in our error handling — maybe it failed, then retried, then succeeded, and we still show it as an error.

david-crespo · 2024-12-18T04:51:09Z

We haven't seen this for a very long time. I think #6690 addressed it.

luqmana added the known issue To include in customer documentation and training label Jul 11, 2023

luqmana added this to the FCS milestone Jul 11, 2023

david-crespo mentioned this issue Jul 11, 2023

Image upload hang (only in FF on macOS) oxidecomputer/console#1469

Closed

david-crespo mentioned this issue Jul 11, 2023

Change image upload chunk size from 512 * 3/4 KiB to 512 KiB oxidecomputer/console#1650

Merged

luqmana mentioned this issue Jul 12, 2023

Allow configuring http server parameters. oxidecomputer/dropshot#726

Open

david-crespo mentioned this issue Jul 29, 2023

Firefox: Image upload is back to being stuck after hitting 100% oxidecomputer/console#1709

Closed

morlandi7 modified the milestones: FCS, MVP Aug 15, 2023

david-crespo mentioned this issue Jan 29, 2024

Show (and maybe enforce) allowed formats on image upload form oxidecomputer/console#1846

Closed

david-crespo removed this from the MVP milestone May 10, 2024

david-crespo closed this as completed Dec 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Image upload sometimes stalls with HTTP/2 #3559

Image upload sometimes stalls with HTTP/2 #3559

luqmana commented Jul 11, 2023

luqmana commented Jul 11, 2023

luqmana commented Jul 11, 2023

augustuswm commented Jan 26, 2024

jmpesp commented Jan 29, 2024

david-crespo commented Jan 29, 2024 •

edited

Loading

jmpesp commented Jan 29, 2024

jmpesp commented Jan 29, 2024

augustuswm commented Jan 29, 2024

david-crespo commented Jan 29, 2024 •

edited

Loading

david-crespo commented Dec 18, 2024

Image upload sometimes stalls with HTTP/2 #3559

Image upload sometimes stalls with HTTP/2 #3559

Comments

luqmana commented Jul 11, 2023

luqmana commented Jul 11, 2023

luqmana commented Jul 11, 2023

Footnotes

augustuswm commented Jan 26, 2024

jmpesp commented Jan 29, 2024

david-crespo commented Jan 29, 2024 • edited Loading

jmpesp commented Jan 29, 2024

jmpesp commented Jan 29, 2024

augustuswm commented Jan 29, 2024

david-crespo commented Jan 29, 2024 • edited Loading

david-crespo commented Dec 18, 2024

david-crespo commented Jan 29, 2024 •

edited

Loading

david-crespo commented Jan 29, 2024 •

edited

Loading