
Datagram vs stream & relative stream prioritization #62

Closed
vasilvv opened this issue Sep 26, 2019 · 61 comments

@vasilvv
Contributor

vasilvv commented Sep 26, 2019

Currently, we specify no particular order in which stream data and datagrams should go.

The way the current scheme works in Google's implementation is roughly:

  1. A timer or a received packet causes the connection to become writable.
  2. The connection notifies the application so that it can send its datagrams.
  3. The connection checks the send buffers for the open streams and sends stream data.

This works only if all operations are synchronous, and we can't make step 2 synchronous since the JS callback would be asynchronous. Thus, we need to come up with a way to provide an opening for the JS to send datagrams while there's always stream data available to send.

A "version 0" approach would be to add a fixed-size datagram buffer that is always drained before stream data. That works, but if the buffer is too small, it can hurt performance, and if it's too big, it can lead to overbuffering.

One possible improvement would be to dynamically size the datagram buffer based on the congestion control window or some other criteria. One could possibly let the user set that buffer size manually. There is still, however, an implicit race between the JS application adding datagrams into the buffer and the network process deciding to send pending stream data instead.

Another idea I had is that the user can "reserve" a portion of the congestion window for datagrams and leave the rest for stream data. That would look roughly like this: a user asks for a "token" that sets a portion of the congestion window aside for a datagram. When the user sends the datagram, they return that token, since the reserved portion of the window is now used by the newly sent datagram.
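
A rough sketch of what that could look like from JS, assuming hypothetical reserveDatagramCapacity() and token.send() methods (neither exists in any spec; the names are invented purely for illustration):

```js
// Hypothetical API sketch: reserveDatagramCapacity() and token.send()
// are illustrative names, not part of the WebTransport spec.
const wt = new WebTransport("https://example.com/wt");
await wt.ready;

// Ask the transport to set aside ~1200 bytes of congestion window.
const token = await wt.reserveDatagramCapacity(1200);

// Later, when a datagram is ready, spend the token. The reservation is
// implicitly returned, because the reserved window is now occupied by
// the datagram that was just sent.
token.send(new Uint8Array(1200));
```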

@Atrius

Atrius commented Sep 27, 2019

I can reserve datagram tokens even if the cwnd is full, right?

Otherwise, I think there's the same race, just on the feedback side: whenever an ack arrives, the transport has to fire a callback to the app telling it that there's more room in the cwnd, and in the meantime, it can use that space to send stream data.

I could imagine this working pretty well for my use case, which is real-time media. I imagine I would just keep about 1 frame's worth of datagram tokens on hand, so that I can send the next frame as soon as it's ready, and any excess cwnd space can be used for stream data.

@vasilvv
Contributor Author

vasilvv commented Sep 30, 2019

Yup, that's the idea. I imagine that some applications would want to adjust the reservation based on the current CWND size, but without knowing specifically what those look like, I don't know what the API should be.

@jan-ivar
Member

jan-ivar commented Mar 30, 2021

Meeting:

  • Still an issue. E.g. with pull-based stream producers sending data.
  • FYI, Chrome always attempts to send datagrams before stream data.

@martinthomson
Member

martinthomson commented Mar 30, 2021

My sense here is that we want to have a priority API that would allow us to manage relative priorities.

There might be a case here for having multiple datagram queues in order to allow datagrams to operate at different priorities. An alternative might be to allow the application to manage changing datagram priority (Edit: which I have just realized doesn't really work very well without some allowance for that being kept in the datagram queue, but I guess it could be made to work.)

@wilaw wilaw added this to the Minimum viable ship milestone Jul 7, 2021
@wilaw wilaw added the Discuss at next meeting label Jul 7, 2021
@yutakahirano
Contributor

@vasilvv can you state your current thoughts, as this issue is very old?

I don't think you're planning to introduce the feature for the initial release (please correct me if I'm wrong), but it's possible that we need to change our current specified APIs to support this use-case.

One thing I want to clarify is whether we want to handle datagrams of various sizes.

If we need to handle only one size, assuming there were a good priority API, we can cover the use case with DatagramDuplexStream.outgoingHighWaterMark.

If we need to handle datagrams of various sizes, the current DatagramDuplexStream.outgoingHighWaterMark (https://w3c.github.io/webtransport/#dom-datagramduplexstream-outgoinghighwatermark) is insufficient because it counts the number of datagrams. We need something that counts the number of bytes of queued datagrams. This can affect the discussion at #305 as we may want to make the APIs more flexible, before the initial release.
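
For context, a minimal sketch of how the current datagram-count-based knob is used today (the endpoint URL is made up):

```js
const wt = new WebTransport("https://example.com/wt");
await wt.ready;

// Allow up to 20 datagrams to be queued before writes see backpressure.
// The unit is "number of datagrams", not bytes, so 20 tiny datagrams and
// 20 MTU-sized datagrams count the same against this limit.
wt.datagrams.outgoingHighWaterMark = 20;

const writer = wt.datagrams.writable.getWriter();
await writer.write(new Uint8Array(1000));
```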

@ricea FYI

@jan-ivar
Member

jan-ivar commented Aug 3, 2021

Meeting:

  • Maybe prioritize datagrams over streams? But if you have a pull-based source that might swamp non-datagrams.
  • Struggling to define language around how user agents should work here.
  • Open to API suggestions. Ideas: a field with values like "fairness" or other ways to distribute priority; fields like low/medium/high.

yutakahirano added a commit that referenced this issue Aug 31, 2021
Discussed at #62. Explicitly state that outgoing datagrams are
prioritized for now. We will introduce a way to change this
default behavior, as part of the prioritization API.
@jan-ivar
Member

jan-ivar commented Aug 31, 2021

Meeting:

  • With Prioritize outgoing datagrams over stream data #340 marked as "Editors can integrate", we've changed the Milestone of this to sometime in the future. Basically this says datagrams take priority over other (WebTransport stream) traffic, until further notice. We may come back to add more detailed prioritization APIs later.

yutakahirano added a commit that referenced this issue Sep 1, 2021
Discussed at #62. Explicitly state that outgoing datagrams are
prioritized for now. We will introduce a way to change this
default behavior, as part of the prioritization API.
@jan-ivar
Member

Meeting:

  • @vasilvv & @ddragana mentioned related work https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-priority, but it is at the HTTP level, not the stream level. It uses headers with a number 0-7, where 0 is highest.
    • Split off because it was progressing more slowly, owing to questions over whether it is useful in practice.
    • In HTTP it's just a suggestion to the server on how to prioritize, which is different from WebTransport, which would also need to affect client sending.
  • Revisit once we have more developer feedback.

@LPardue

LPardue commented Sep 28, 2021

FWIW, the Extensible priorities spec allows the priority information to be used to influence local client decisions, but that's a client implementation choice and there's really not much more to say at the protocol level.

The I-D https://datatracker.ietf.org/doc/html/draft-pardue-masque-dgram-priority is a response to people asking how Extensible Priorities might extend to Datagrams. This is not adopted by any group (yet?) and illustrates how this scheme can be extended to support a different data-bearing type alongside streams.

The high-order question, though, really is: is there any priority API that allows expressing send and receive priority of fully duplex flows? Without that, you bake in some assumptions.

@wilaw wilaw removed the Discuss at next meeting label Sep 29, 2021
@wilaw wilaw added the Discuss at next meeting label Jan 26, 2022
@jan-ivar jan-ivar changed the title Datagram tokens as a mechanism of datagram vs stream prioritization Datagram vs stream & relative stream prioritization Jan 26, 2022
@jan-ivar
Member

jan-ivar commented Feb 9, 2022

So if you're the producer of data being written to the WT sender, you can throttle that way per stream.

But I'm not sure we made the right decision that datagrams should always trump streams. One example would be a case where media is sent over datagrams but with a stream control channel. We don't want to swamp the control channel.

May I propose we at least add a boolean to flip this? Maybe

wt.datagrams.priorityOverStreams = true (default) | false

...except the web platform apparently frowns on booleans that default to true, so how about:

wt.datagrams.priority = "high" | "low"

?
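
To make that concrete, a sketch of how an app with a stream-based control channel might use the proposed attribute (datagrams.priority here is the proposal above, not an existing API):

```js
// "priority" on the datagrams object is the attribute proposed above;
// it does not exist in the current spec.
const wt = new WebTransport("https://example.com/wt");
await wt.ready;

// Media goes over datagrams, but a control channel runs over a stream.
const control = await wt.createBidirectionalStream();

// Don't let the datagram flood swamp the control channel: let queued
// stream data win over queued datagrams.
wt.datagrams.priority = "low";
```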

@martinthomson
Member

Whatever we do here should be made consistent with whatever we do for stream priorities. Why not do the same for both? There might be a natural advantage given to datagrams over streams, or a different default, but any deliberate choice should deal with the potential for starvation (inter-stream, inter-session, stream-vs-datagram, etc...).

@LPardue

LPardue commented Feb 10, 2022

In my (non-WebTransport) experience, there's always a fairly decent chance of picking the wrong data type to be sent first and causing starvation. This becomes more apparent in more complex or bandwidth-intensive workloads. Some kind of timeshare is required to avoid that. In draft-pardue-masque-dgram-priority, the simple suggestion I made was a 50/50 share between streams and datagrams that were deemed equal priority.

@jan-ivar
Member

FYI, in the current W3C spec (and Chrome implementation) sent datagrams have 100% priority over streams: "7. Wait for transport’s [[Datagrams]]'s [[OutgoingDatagramsQueue]] to be empty. Note: This means outgoing datagrams are prioritized over stream data."

@martinthomson
Member

I think that 100% is inadvisable, but the general idea of datagrams pre-empting streams is fine.

Overall though, I'm concerned that a point solution for the starvation problem, without considering inter-stream prioritization as part of that, might not be the best way to approach this. Prioritization of streams into a small number of slots, with datagram priorities interleaved between those is probably where I would look here. I'm not sure about how easy it would be to reliably implement a 50/50 share in our implementation.

@LPardue

LPardue commented Feb 10, 2022

Yeah, no need for hard proportions here. What has worked for me is, when preparing a packet to send, look at the queues of stream data and datagram messages (slots if you like), and alternate between which one to pop off. If the preferred type is empty, move on to the other for this round to avoid wasted capacity.
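
A sketch of that alternation as a sender-side helper (the queue shapes and names are invented for illustration):

```js
// Alternate between the stream-data queue and the datagram queue when
// picking what goes in the next packet; if the preferred queue is empty,
// fall back to the other one so capacity isn't wasted.
const sched = { preferDatagrams: true, datagramQueue: [], streamQueue: [] };

function nextChunk(s) {
  const first  = s.preferDatagrams ? s.datagramQueue : s.streamQueue;
  const second = s.preferDatagrams ? s.streamQueue   : s.datagramQueue;
  s.preferDatagrams = !s.preferDatagrams; // flip preference each round
  if (first.length > 0)  return first.shift();
  if (second.length > 0) return second.shift();
  return null; // nothing queued this round
}
```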

@aboba
Collaborator

aboba commented Feb 16, 2022

Mathis Engelbart has looked at the impact of prioritizing datagrams within a QUIC connection operating with NewReno. This results in high datagram latency, since the reliable stream fills the congestion window when datagrams aren't being sent.
Results are here.

@kixelated

kixelated commented May 23, 2022

Here's a quick summary of how (live) video works.

Encoding

I-frames do not depend on other frames while P-frames depend on one or more previous frames. The decoder will crash or produce artifacts if these dependencies are not maintained.

[diagram]

HLS/DASH

HLS and DASH split video into segments that start with an I-frame. These segments are downloaded sequentially and are virtually never dropped.

This is effectively FIFO, as the oldest segments are prioritized. Higher numbers are delivered first, in strict order.

[diagram]

Warp

Warp is simple; we flip the priority around such that the newest segments are delivered first.

This is basically LIFO, but it gets more complicated when audio and metadata are involved. Warp uses the fact that streams are delivered in order to maintain P-frame ordering. Same as before, higher numbers are delivered first, in strict order.

[diagram]

RUSH/WebRTC

RUSH and WebRTC deliver each frame independently, but still need to enforce some form of frame ordering during congestion. RUSH uses QUIC streams while WebRTC uses RTP packets.

This is normally done by aggressively dropping frames based on a timer, but it can be generalized using prioritization.

[diagram]

Encoding v2

Higher quality media will use B-frames, which annoyingly reference one or more future frames and zero or more past frames. In an ideal world, these media dependencies could be mapped to network dependencies. Here's an example, but there are many different types of GOP structures:

[diagram]

WebRTC doesn't use B-frames because they introduce latency. Warp supports B-frames, but the entire segment is delivered over a single QUIC stream (decode order) adding some head-of-line blocking. I don't know what RUSH does.

Summary

It's an open problem how best to map media to QUIC streams. Congestion is inevitable given the nature of video streams, and media will be starved.

I think the best approach is network prioritization that matches the underlying media structure. A simple integer with strict prioritization can accomplish that. The alternative is to aggressively cancel streams based on feedback, but I think that produces a strictly worse experience and involves (delayed) moving parts.

@LPardue

LPardue commented May 23, 2022

Thanks Luke.

To explain extensible priorities in similar terms, the spec recommends something that, when all streams are non-incremental, equates to 8 FIFO queues, each serving stream IDs lower to higher. It would be fairly trivial to add some extension parameter to make the ordering of these queues more configurable (either a blanket change, a per-queue change, or explicit ordering based on some number other than stream ID).
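
A simplified sketch of that dispatch model: 8 urgency buckets (0 is most urgent), and within a bucket, non-incremental items served lowest stream ID first (a simplification of what draft-ietf-httpbis-priority recommends):

```js
// 8 urgency levels as in Extensible Priorities; within a level, dispatch
// non-incremental streams in stream-ID order.
const buckets = Array.from({ length: 8 }, () => []);

function enqueue(urgency, streamId, data) {
  buckets[urgency].push({ streamId, data });
  buckets[urgency].sort((a, b) => a.streamId - b.streamId);
}

function next() {
  for (const bucket of buckets) {
    if (bucket.length > 0) return bucket.shift();
  }
  return null; // all buckets empty
}
```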

@jan-ivar
Member

jan-ivar commented May 25, 2022

Meeting:

  • int32 priority is definitely one of the options to consider.
  • Some concerns still that starvation and buffer bloat are paper cuts that might afflict apps owning this problem
  • Implementation complexity is unclear.
  • Would it be fair for an implementer to explore this in a trial, so we can learn if this works well?
  • May have been able to volunteer @vasilvv for this ;)

@martinthomson
Member

So it looks like @LPardue is pointing at the decision made in HTTP to establish a common baseline of 8 FIFO queues. Our experience there was that more complex priority signaling schemes were not consistently implemented, or not implemented at all in some cases.

At some level, any number of FIFOs can be used to implement the warp model (HLS works with a single priority that operates on a FIFO basis). Depending on how many streams are outstanding, you need to reallocate priorities as your supply of priorities is exhausted. At just 8, it might be that the rate at which you need to perform this reshuffle is pretty high. This is worse if you consider the need to maintain separate priorities for audio and video within that same space. So I can appreciate how $2^3$ might be a little too parsimonious.

I'm extremely skeptical of the prospects for more complex signaling, which includes schemes that includes more numbers. @vasilvv makes a compelling argument for the browser taking the complexity hit. Unlike in HTTP, we have the advantage of a relatively small set of implementers, with a relatively high tolerance for complexity (I say that like it is a good thing, which feels dirty). It's possible that a more complex interface is appropriate here. But I'd be uncomfortable committing to something more complex without first trying it on.

@LPardue

LPardue commented May 25, 2022

I think that's a close summary of what I'm saying.

(HLS works with a single priority that operates on a FIFO basis)

FWIW I've seen first hand problems caused by manifest updates being blocked behind segment requests. So even there different queues can help.

Depending on how many streams are outstanding, you need to reallocate priorities as your supply of priorities is exhausted. At just 8, it might be that the rate at which you need to perform this reshuffle is pretty high.

I think that holds if you ignore the order dimension. However, within a single extensible priority queue, the ordering is recommended to be based on stream ID. So my thinking is you can have 8 coarse buckets of $thing (say based on media type or control data) and within each bucket you can deterministically dispatch individual items based on an ordering dimension. For use cases where FIFO is not preferred, some other thing (like a new order param) can modulate dispatch behaviour.

Using a single value like uint32 for priority would immediately cause me to ask - "What is the expected behaviour when two items have the same priority?"

@alvestrand

The HTTP priority scheme seems to default to "resources are sent one by one" - it offers the "incremental" flag to communicate that there's an advantage to offering parts of multiple responses.

This is in contrast to the media-transmission situation, where it is normal to have multiple conceptual streams that must be interleaved for sanity (frames may still be represented as streams or as datagrams; YMMV)

In the WebRTC priority discussions, the WG ended up specifying a weighted round robin scheme for handling priority - trying to ensure that lower priority streams do not get completely starved out by higher priority streams - the HTTP priority scheme doesn't seem to be prescriptive in that regard.
(FWIW, as of May 2022, the WRR scheme for webrtc priority has never been implemented.)
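
For comparison, a toy version of that weighted-round-robin idea; the weights are arbitrary, the point being that lower-priority queues still get turns rather than being starved:

```js
// Weighted round robin over priority queues: each pass, a queue may send
// up to `weight` items, so high priority wins most of the time but low
// priority is never starved outright.
const queues = [
  { name: "high",   weight: 4, items: [] },
  { name: "medium", weight: 2, items: [] },
  { name: "low",    weight: 1, items: [] },
];

function* wrr() {
  for (;;) {
    let sent = false;
    for (const q of queues) {
      for (let i = 0; i < q.weight && q.items.length > 0; i++) {
        sent = true;
        yield q.items.shift();
      }
    }
    if (!sent) return; // everything drained
  }
}
```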

@LPardue

LPardue commented May 25, 2022

And the valid point on WRR takes us somewhat back to earlier part of this issue about how to prioritize streams vs datagrams. 😃

I agree that more experimentation seems important no matter what's picked

And while you might be able to justify the browsers taking a hit due to the scope of work in W3C, it would be really nice* if the conceptual model were something that could be mirrored elsewhere without oodles of pain. For example, if you stick any form of intermediary in the flow between client and target, being able to bridge those hops would be great.

  * I'm aware the "it would be nice" card is easily ignored

@jan-ivar
Member

jan-ivar commented Jun 1, 2022

I'm hearing general agreement for at least 3 levels of strict priority — by which I only mean no-one seems satisfied by less, which means we should continue discussion. Agreement seems to hold up to ~8 levels, and then drops above that. So I think it'd be useful to explore if WARP could be pulled off with fewer levels:

Depending on how many streams are outstanding, you need to reallocate priorities as your supply of priorities is exhausted. At just 8, it might be that the rate at which you need to perform this reshuffle is pretty high.

An overly simplistic example situation relying on int32 would seem to go something like this (assume 1 second segments):

  1. I write() my first segment A at priority 1 (lowest). I know from the promise when it's been sent
  2. But my next segment B is ready sooner, so I write(B) with priority 2 before then. This pauses A to send B instead
  3. My next segment C after that is also ready sooner, so I write(C) with priority 3, pausing B to send C
  4. and so on...
  5. At some point I cancel streams that are too old to avoid send buffer bloat

Never mind that this describes starvation atm — assume enough time passes eventually to avoid it — my aim is merely to illustrate a situation that relies on increasing priority levels to keep things going.

But look at step 2 where we're effectively down-prioritizing A. What if we did this literally instead? Using only 2 levels:

  1. I write() my first segment A at priority "medium". I know from the promise when it's been sent
  2. But segment B is ready sooner, so I set A.priority = "low" and write(B) at "medium", pausing A to send B
  3. Segment C is also ready sooner, so I set B.priority = "low" and write(C) at "medium" pausing B to send C
  4. and so on...
  5. At some point I cancel streams that are too old to avoid send buffer bloat

This seems comparable in that it always prioritizes the latest 1-second segment (C), and only differs in its prioritizing of missed media (A vs B) since we only have 2 levels. With 8 levels an app could prioritize up to 7 seconds of missed playback.

The key here is instead of relying on rigidly increasing numbers, we allow the app to change the priority of ongoing streams. @kixelated thoughts?
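
A sketch of steps 1-3, assuming streams can be created with a priority and have it changed afterwards (both are hypothetical; no such API exists today):

```js
// Hypothetical API: a priority option on createUnidirectionalStream() and
// a mutable .priority on the resulting stream are the proposal being
// discussed here, not part of the current spec.
let previous = null;

async function sendSegment(wt, segmentBytes) {
  if (previous) previous.priority = "low";    // demote the older segment
  const stream = await wt.createUnidirectionalStream({ priority: "medium" });
  previous = stream;

  const writer = stream.getWriter();
  await writer.write(segmentBytes);
  await writer.close(); // FIN; per the example above, the promise tells us when it's done
}
```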

@LPardue

LPardue commented Jun 1, 2022

Changing priorities locally (or simply just pre-empting existing prioritised data with something more important) can be done quite reactively (aside: in TCP, large write buffers can neutralise this because you're already committed; WebTransport should look out for that).

Some of the concerns about reprioritization come about because it's the remote peer that wants to reprioritize. When chasing low latency transfer, the latency of reprioritization signals and reaction often works against the goals, with a poor outcome.

@LPardue

LPardue commented Jun 1, 2022

Forgot to make the critical point: when we talk about prioritization changes after initiation, we need to be clear who exactly the target actor is.

@jan-ivar
Member

jan-ivar commented Jun 1, 2022

Not sure I follow. I was purposely avoiding externalities above, by observing that, all else being equal, rigidly increasing numbers and changeable priorities are two ways to express the same thing: a change in priority of an ongoing outgoing stream.

@LPardue

LPardue commented Jun 1, 2022

You're not wrong. I'm just highlighting that discussion in other venues has shown that changing priorities of open and active streams, whether local or remote, can be too "slow" for some use cases.

I get that the WebTransport API has to focus on its own local needs in the web platform. But it will probably need to be compatible with, or adaptable to, the rest of the ecosystem of clients and servers. So your choice here has effects.

@kixelated

@jan-ivar your example is good, although there may always be an active stream so you can't reset the priority back to 1. In the video case, we only close one stream (FIN) at the same time as creating a new stream (at I-frame boundaries), and both frames can be flushed at the same time. In practice I've never run into a situation where there are no active streams. And of course, any persistent congestion is just going to eat through the tiny number space.

The ability to reprioritize local streams definitely improves things. There's quite a bit more bookkeeping, but we could effectively park all but the highest priority stream. But it's not ideal when RTT is high, as your feedback that a stream has been finished is delayed (waiting for the ACK). In the meantime you'll be round-robining every other stream, only promoting the next highest priority stream once the ACK has been received.

The fix is to add more priority levels so you can signal the stream order ahead of time. Warp would need at least 4 levels (new audio, new video, prev video, prev audio) to handle RTTs <2s. However, RUSH would need at least 120 levels at 30fps given how frequently it creates streams.

When it comes to strict prioritization, I think there should be as many levels as active streams. After all, the goal is to be able to order all of the active streams. When it comes to weighted prioritization or hints (e.g. high, medium, low), a smaller number like 8 levels makes sense because the behavior itself is vague.
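
As a concrete reading of the 4-level count, a sketch of how Warp segments might map onto strict levels (the numbers and function name are illustrative, higher meaning sent first):

```js
// Illustrative mapping of the four levels mentioned above onto integers;
// higher value = sent first: new audio, new video, prev video, prev audio.
function warpLevel(kind, isNewest) {
  if (kind === "audio") return isNewest ? 3 : 0; // new audio first, prev audio last
  return isNewest ? 2 : 1;                       // then new video, then prev video
}
```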

@martinthomson
Member

What Jan-Ivar is describing is effectively what the browser would have to do if the API had more levels. So it is clearly possible, modulo the part where network issues cause the whole thing to break down. As Luke says, this is easier on the app if the number of levels >= the number of active streams, but it is possible to have approximations with fewer.

@jan-ivar
Member

jan-ivar commented Jun 21, 2022

Warp would need at least 4 levels (new audio, new video, prev video, prev audio) to handle RTTs <2s

Great!

However, RUSH would need at least 120 levels at 30fps given how frequently it creates streams.

Why not 4 levels? Assuming Warp is also 30fps, if there's no new network info for 2 seconds, what's wrong with either Warp or RUSH filling the send buffer with 30 frames at "new video" priority and 30 frames at "prev video" priority, however interleaved (and however many audio segments)? Hypothetically, just to set up the next point:

Why not 6 levels? To improve on that baseline comparison: If we want to send 30 fps video but anticipate bad congestion (more appropriate for, say, 5 fps), why not e.g. send every 3rd frame at +1 priority, and every 6th frame at +2 priority? This would appear to give us 30 fps with fallback to 10 fps and then 5fps if there's congestion.

Disclaimer: the above are thought experiments off of presented diagrams, since I know nothing about how RUSH actually works. My point is merely to question whether JS needs finer-grained order control.

Lastly, I appreciate using conservative numbers, but I also measure WebTransport bidi stream response at ~75 ms in Chrome.

@kixelated

kixelated commented Jun 23, 2022

Unfortunately you can't just arbitrarily drop frames like that. Video is delta-encoded, so a frame can only be decoded if any frames that it depends on have also been decoded.

My diagram only showed 4 frames per segment, but in reality there's going to be one I-frame followed by at least 59 P-frames (30fps with 2s GoPs) that all depend on the previous frame. For example, you could deprioritize the 5th frame, but then the following 54 frames cannot be decoded if they were received first.

Technically, the video could be encoded to support this 6-level scheme; however, VQ will suffer, and it's too custom for hardware encoding support. My point is that the network layer doesn't get to decide the frame dependency graph.

But ideally the network layer is able to match the encoding dependencies. There's no point receiving a frame before it can be decoded. It becomes possible to support any encoding structure with a sufficiently large number space, as the application can determine the delivery order of each frame.

@alvestrand

Encoding a deep dependency tree as priorities is going to give you trouble.
Encoding (for instance) L2T3 with I-frames on demand rather than programmatic as priorities is, I think, impossible in a finite number space.

@kixelated

I absolutely agree that it would be a mess to encode arbitrary dependency trees in an integer space. I'm thinking something simpler, like the top x bits indicate the GoP priority while the bottom x bits indicate the frame priority (within a GoP). Anyway, I'm not actually doing frame-level prioritization, so it's kind of a moot point.
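
A sketch of that split, assuming (arbitrarily) 16 bits for the GoP and 16 bits for the frame within it:

```js
// Pack a GoP priority into the top 16 bits and a frame priority into the
// bottom 16 bits of one 32-bit value; the 16/16 split is arbitrary.
function packPriority(gop, frame) {
  return (((gop & 0xffff) << 16) | (frame & 0xffff)) >>> 0;
}

function unpackPriority(p) {
  return { gop: (p >>> 16) & 0xffff, frame: p & 0xffff };
}
```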

I made a demo for Warp and specifically wanted to point out how segments are prioritized. I'm hosting a crude public demo (Chrome only) at https://quic.video/demo, but it might be down at the moment; I'll have to investigate.

@jan-ivar
Member

jan-ivar commented Aug 2, 2022

See also #365 (comment)

@jan-ivar
Member

jan-ivar commented Aug 30, 2022

Meeting:

  • Some discussion about whether send priorities should also be signaled. Agreed this was more of an IETF issue that can be left orthogonal to this issue about exposing a local JS send priority API.
  • Some concern expressed that the discussion has strayed from the OP need for rudimentary prioritization of long-lived streams (e.g. datagram vs data stream vs control-channel) vs. ordering of (frame/segment-per-stream) superchunks. Would separating into a fresh issue help?

@jan-ivar
Member

jan-ivar commented Jan 4, 2023

With #438 merged the remaining issue here is #451. Closing.

@jan-ivar jan-ivar closed this as completed Jan 4, 2023