Stuck in a loop when VOD segments timeout #7368

Closed
joeyparrish opened this issue Sep 24, 2024 · 12 comments · Fixed by #7369 or #7257
Labels
  • platform: Cast (Issues affecting Cast devices)
  • priority: P1 (Big impact or workaround impractical; resolve before feature release)
  • status: archived (Archived and locked; will not be updated)
  • type: bug (Something isn't working correctly)

Comments

@joeyparrish
Member

Have you read the FAQ and checked for duplicate open issues?
Yes

If the problem is related to FairPlay, have you read the tutorial?

N/A

What version of Shaka Player are you using?

4.9.2-caf1 through 4.9.28 and main

Can you reproduce the issue with our latest release version?
Did not try

Can you reproduce the issue with the latest code from main?
Yes

Are you using the demo app or your own custom app?
Custom app (partner Cast app)

If custom app, can you reproduce the issue using our demo app?
Did not try

What browser and OS are you using?
Chromecast Gen 3

For embedded devices (smart TVs, etc.), what model and firmware version are you using?
Cast 1.52.something (a bit behind)

What are the manifest and license server URIs?

Shared privately by partner (b/368055424 internally)

What configuration are you using? What is the output of player.getNonDefaultConfiguration()?

Default on CAF

What did you do?

Load with poor network throughput. You can limit throughput to 20kB/s in the debugger to see this quickly.

What did you expect to happen?
Playback should eventually fail with a timeout.

What actually happened?

StreamingEngine disables a stream (audio or video, whichever times out first), then the player chooses a new variant. 30s later, the first stream is re-enabled, then the new one times out, etc. Forever in a loop, with no fatal error.

Are you planning to send a PR to fix it?
Yes

Misc
This was introduced in #5057 and v4.4.0, which moved the logic and dropped a condition (maybe by accident?). Prior to this, the disable logic only fired on code == HTTP_ERROR. Now it fires on all category == NETWORK errors.
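
A minimal sketch of the two conditions, to make the difference concrete. The `Category` and `Code` objects are local stand-ins invented for this example (not the real `shaka.util.Error` enums), and the function names are hypothetical.

```javascript
// Hypothetical stand-ins for the player's error categories and codes.
// These are local constants for illustration, not the real enum values.
const Category = Object.freeze({NETWORK: 'NETWORK', MEDIA: 'MEDIA'});
const Code = Object.freeze({
  BAD_HTTP_STATUS: 'BAD_HTTP_STATUS',
  HTTP_ERROR: 'HTTP_ERROR',
  TIMEOUT: 'TIMEOUT',
});

// Condition as described for v4.3.x: only HTTP_ERROR disables a stream.
function disablesStreamBefore(error) {
  return error.code === Code.HTTP_ERROR;
}

// Condition after the regression: any NETWORK-category error disables a stream.
function disablesStreamAfter(error) {
  return error.category === Category.NETWORK;
}

const timeoutError = {category: Category.NETWORK, code: Code.TIMEOUT};
console.log(disablesStreamBefore(timeoutError));  // false
console.log(disablesStreamAfter(timeoutError));   // true
```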

One solution is to add that condition back.

Another is to only disable streams in live content. I think it doesn't make sense to do it in VOD, since VOD is static and doesn't "recover".

@joeyparrish joeyparrish added type: bug Something isn't working correctly priority: P1 Big impact or workaround impractical; resolve before feature release platform: Cast Issues affecting Cast devices labels Sep 24, 2024
@joeyparrish joeyparrish self-assigned this Sep 24, 2024
@shaka-bot shaka-bot added this to the v4.12 milestone Sep 24, 2024
joeyparrish added a commit to joeyparrish/shaka-player that referenced this issue Sep 24, 2024
joeyparrish added a commit to joeyparrish/shaka-player that referenced this issue Sep 24, 2024
@joeyparrish
Member Author

@zangue, who sent the PR that caused this regression, and @avelad, who reviewed it:

I have two PRs that both solve this issue, but in very different ways. I don't know if the change to the logic (removing the check for code == HTTP_ERROR) was intentional. Please let me know your thoughts on these fixes, either of which would satisfy me:

@zangue
Contributor

zangue commented Sep 25, 2024

@joeyparrish, I don’t recall any particular reason for the removal. I probably just failed to refactor that part of the logic properly or misunderstood it. As for the fix, I’d be more in favour of restoring the logic because that’s how it was originally introduced (I’m assuming for a good reason or use case).

@zangue
Contributor

zangue commented Sep 25, 2024

Thinking about it more, with the first fix, we’d still end up with the player stuck in a loop in case an HTTP_ERROR occurs instead of a TIMEOUT, no? The recovery logic disables the failed stream and falls back to another one. By the time the player exhausts all alternate stream choices (in case the network problem persists), the previously disabled streams may have been re-enabled, and so on, putting the player in a recovery loop. In this case, it might be best to go with the second fix. But I can think of at least one use case where this feature may be relevant for VOD:

  • For a particular stream some segments might be missing resulting in HTTP 404 errors

So, for VOD it may make sense instead to:

  • Permanently disable the stream for the rest of the presentation
  • During recovery perform network requests without retry parameters? (This would speed up the recovery process)
  • Give up when all choices are exhausted

Also, in general, maybe it makes sense to have a longer disable time (minutes?).

What do you think?

@joeyparrish
Member Author

> Thinking about it more, with the first fix, we’d still end up with the player stuck in a loop in case an HTTP_ERROR occurs instead of a TIMEOUT no?

That shouldn't be an issue, because HTTP_ERROR is a specific failure mode where there's a CORS failure or some other non-response. These don't just resolve themselves, ever. That may be why that was the conservative original logic. If there's a 4xx response or similar, you get BAD_HTTP_STATUS. See https://shaka-player-demo.appspot.com/docs/api/shaka.util.Error.html#value:1001 and https://shaka-player-demo.appspot.com/docs/api/shaka.util.Error.html#value:1002

> Permanently disable the stream for the rest of the presentation

I think permanently disabling a VOD stream, in the case of a timeout, could be a problem. In particular, for mobile networks whose connectivity may be inconsistent.

Another option would be to exclude timeout errors from the stream-disable logic.

@avelad
Member

avelad commented Sep 25, 2024

We also need to accept errors like TIMEOUT or BAD_HTTP_STATUS. For low latency, these errors are common and we need to deal with them as well. Also, sometimes there is a timeout when changing networks (mobile data vs WiFi). In this case, I prefer to change to another stream before throwing an error.

@joeyparrish
Member Author

I think being stuck in a loop loading forever is a bug, and it's a regression compared to v4.3.x.

The loop is composed of these elements:

  1. an error that triggers disabling a stream (currently any network error, previously only HTTP_ERROR for CORS configuration or bad URLs)
  2. re-enabling a stream after 30 seconds
  3. an error that doesn't resolve itself over time
  4. an error that is slow to trigger (like a network timeout), giving enough time for streams to be re-enabled
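
The interaction of these elements can be sketched with a small simulation. This is a simplified model, not the actual StreamingEngine logic: `simulate` and its parameters are hypothetical names, every stream always fails, and the player is assumed to fail fatally only when no enabled stream is left.

```javascript
// Simplified model of the disable/re-enable cycle.
//   streamCount:      number of alternate streams to fall back to
//   failAfterSeconds: how long each request takes to fail
//   disableSeconds:   how long a failed stream stays disabled
//   maxFailures:      cap on simulated failures, so the loop case terminates
function simulate({streamCount, failAfterSeconds, disableSeconds, maxFailures}) {
  const disabledUntil = new Array(streamCount).fill(0);
  let now = 0;
  for (let failures = 0; failures < maxFailures; failures++) {
    // Pick the first stream whose disable window has expired.
    const current = disabledUntil.findIndex((until) => until <= now);
    if (current === -1) {
      // Every stream is disabled: nothing to fall back to, so fail for real.
      return {outcome: 'fatal', failures};
    }
    now += failAfterSeconds;  // The request on this stream eventually fails.
    disabledUntil[current] = now + disableSeconds;
  }
  return {outcome: 'looping', failures: maxFailures};
}

// Slow failures (40s) with a 30s disable window: stream 0 is re-enabled
// before stream 1 fails, so the streams are never exhausted.
console.log(simulate({
  streamCount: 2, failAfterSeconds: 40, disableSeconds: 30, maxFailures: 10,
}).outcome);  // looping

// Fast failures (1s) exhaust both streams within the disable window.
console.log(simulate({
  streamCount: 2, failAfterSeconds: 1, disableSeconds: 30, maxFailures: 10,
}).outcome);  // fatal
```

In this model, lengthening `disableSeconds` well past the time it takes to fail on every stream also ends in a fatal error rather than a loop.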

There are many ways to resolve this:

  1. revert the logic to v4.3.x to only disable on HTTP_ERROR
  2. disable streams for longer so we eventually exhaust them and fail instead of looping
  3. exclude TIMEOUT when disabling streams, since only TIMEOUT takes enough time for streams to be re-enabled and cause a loop
  4. don't disable streams at all for VOD

Arguments in favor of restoring the original logic:

  1. it was removed accidentally
  2. it resolves the regression
  3. we can add new logic later to address other concerns or handle other errors more carefully

Arguments in favor of excluding TIMEOUT:

  1. it resolves the regression
  2. it targets the only error code likely to cause a loop because it's the only error that takes a long time to trigger

Arguments in favor of excluding VOD:

  1. it resolves the regression
  2. we already have a policy to retry forever on live, but we intend VOD failures to be failures
  3. live streams are a moving target, so if something is wrong with segment A, even in all streams, segment A will eventually fall out of the availability window and we'll try another segment
  4. apps can choose to try again on VOD through the streaming error callback
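
As a rough illustration of the last point, an app could install a bounded-retry callback. This is only a sketch: `makeBoundedRetryCallback` and `fakePlayer` are hypothetical names for this example, and the only thing assumed of the player object is a `retryStreaming()` method.

```javascript
// Builds a streaming failure callback that retries a bounded number of
// times and then stops, so a VOD failure eventually stays a failure.
function makeBoundedRetryCallback(player, maxRetries) {
  let retries = 0;
  return (error) => {
    if (retries >= maxRetries) {
      return;  // Give up: let the failure stand.
    }
    retries++;
    player.retryStreaming();
  };
}

// Stand-in for a real player, used only to exercise the callback here.
const fakePlayer = {
  retryCount: 0,
  retryStreaming() {
    this.retryCount++;
    return true;
  },
};

const onStreamingFailure = makeBoundedRetryCallback(fakePlayer, 3);
for (let i = 0; i < 5; i++) {
  onStreamingFailure({code: 'TIMEOUT'});
}
console.log(fakePlayer.retryCount);  // 3
```

With a real player instance, the callback would, as far as I know, be installed through the `streaming.failureCallback` configuration field.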

@joeyparrish
Member Author

I'd like to reach a consensus, but this regression is blocking the Cast release. We can't have an infinite loading loop.

If there's no consensus, I'll have to choose a conservative change by the end of the week to unblock the release. We can make additional changes to the behavior later to try to satisfy everyone's more nuanced needs/wants around error-handling.

I think the cheapest and most conservative things we can do are either to revert to v4.3.x behavior or to exclude TIMEOUT.

@joeyparrish
Member Author

@avelad is going on leave and won't participate here for a little while, but sent me this message regarding this issue: "About the disable stream, I’d prefer exclude only TIMEOUT then. Or change maxDisabledTime for VoD."

@zangue
Contributor

zangue commented Sep 25, 2024

@joeyparrish I took a look at the history again and I noticed that while the original logic only considered HTTP_ERROR, it got extended such that by the time #5057 was introduced, the following errors were actually considered by the recovery logic:

Reverting back to only consider HTTP_ERROR may cause further regression. So, imo, at least BAD_HTTP_STATUS & SEGMENT_MISSING should be preserved (as they were added for a reason) and TIMEOUT should be handled specially e.g. either:

@joeyparrish
Member Author

Thank you for finding these details. It looks like all of those changes went into v4.4.0, so from a regression perspective, we went from HTTP_ERROR only in v4.3.0 to all network errors in v4.4.0. These details show, however, that the change you made to remove the individual error codes was more permissive, but not by as much as I thought. Thank you!

@joeyparrish
Member Author

I don't think excluding TIMEOUT would be a regression with respect to #4764. When you read the details around TIMEOUT there, you see that someone suggested it not because it was part of the issue, but because they felt it should be there. It was added seemingly out of caution. The PR #4769 mentions that TIMEOUT was added because of that issue, but it's not mentioned in the issue by the OP at all, only by the eventual PR author suggesting it. So I think it was never actually critical to anyone.

Also, my analysis of this issue shows that it's only TIMEOUT that can cause an infinite loop such as this.

To @avelad's point about timeouts on mobile, if the timeout is because of a transient network event, there is no reason to exclude a stream. The stream itself isn't broken in this case. There is also a retry mechanism in the network config that has to be exhausted before we give up and fail. So I think in all these cases, it's safe to skip disabling the stream on timeout.
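
To get a feel for how long the retry mechanism can take before a timeout becomes a failure, here is a rough worst-case estimate. It is a simplified model of exponential backoff with fuzz, not the player's actual implementation; the field names follow the retry parameters in the network config, and the values are illustrative assumptions rather than confirmed defaults.

```javascript
// Worst-case time until all attempts are exhausted, assuming every attempt
// runs to its full timeout and every backoff delay gets the maximum fuzz.
function worstCaseFailureMs(params) {
  const {maxAttempts, timeout, baseDelay, backoffFactor, fuzzFactor} = params;
  let total = maxAttempts * timeout;
  let delay = baseDelay;
  // There is one backoff delay between each pair of consecutive attempts.
  for (let retry = 1; retry < maxAttempts; retry++) {
    total += delay * (1 + fuzzFactor);
    delay *= backoffFactor;
  }
  return total;
}

// Illustrative values only (assumed for this example).
const retryParameters = {
  maxAttempts: 2,
  timeout: 30000,     // ms per attempt
  baseDelay: 1000,    // ms before the first retry
  backoffFactor: 2,
  fuzzFactor: 0.5,
};
console.log(worstCaseFailureMs(retryParameters));  // 61500
```

Under these assumed values, a single stream takes about a minute to fail on timeouts, which is longer than the 30-second re-enable window discussed above.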

So I move that we exclude TIMEOUT specifically from this disabling logic. If there are any objections, please let me know. Otherwise, I'll move forward with that solution in less than 24 hours. Thanks!

joeyparrish added a commit to joeyparrish/shaka-player that referenced this issue Sep 26, 2024
@joeyparrish
Member Author

Pushing forward with #7369, which I have modified to exclude TIMEOUT only.
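
The shape of that condition, as a minimal sketch. This is not the code from the PR: `NETWORK_CATEGORY`, `TIMEOUT_CODE`, and `shouldDisableStream` are hypothetical names, and the string values are local stand-ins for the real error enums.

```javascript
// Local stand-ins for the player's error category and code values.
const NETWORK_CATEGORY = 'NETWORK';
const TIMEOUT_CODE = 'TIMEOUT';

// Disable a stream on network errors, except for timeouts.
function shouldDisableStream(error) {
  return error.category === NETWORK_CATEGORY && error.code !== TIMEOUT_CODE;
}

console.log(shouldDisableStream({category: 'NETWORK', code: 'TIMEOUT'}));          // false
console.log(shouldDisableStream({category: 'NETWORK', code: 'BAD_HTTP_STATUS'}));  // true
console.log(shouldDisableStream({category: 'MEDIA', code: 'OTHER'}));              // false
```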

joeyparrish added a commit that referenced this issue Sep 26, 2024
In #7368, we get stuck in a loop loading forever. This regression was
introduced in v4.4.0 and affects all v4.4, v4.5, v4.6, v4.7, and v4.8
releases, as well as v4.9.0-28, v4.9.2-caf1, v4.10.0-20, and v4.11.0-6.

The loop is composed of these elements:

1. an error that triggers disabling a stream
2. an error that doesn't resolve itself over time
3. an error that is slow enough to trigger that the first streams get
re-enabled
4. VOD content that doesn't change while we sit in the loop
5. enough streams to avoid exhausting them during the cycle

Only `TIMEOUT` errors can trigger this bug AFAICT, so we should exclude
those from the logic to disable streams. Note also that live streaming
already retries indefinitely by default, and that normal ABR logic will
change streams for us if we timeout due to a lack of bandwidth.

Disabling streams on `TIMEOUT` was suggested initially in #4764, but was
not a requirement of the OP. It was added out of caution in #4769, but
not really vetted. Because it was not ever explicitly needed, excluding
it is not a regression.

Closes #7368

Backported to v4.9.2-caf

Release-As: 4.9.2-caf2
joeyparrish added a commit that referenced this issue Sep 26, 2024
joeyparrish added a commit that referenced this issue Sep 26, 2024
joeyparrish added a commit that referenced this issue Sep 26, 2024
@shaka-bot shaka-bot added the status: archived Archived and locked; will not be updated label Nov 25, 2024
@shaka-project shaka-project locked as resolved and limited conversation to collaborators Nov 25, 2024
4 participants