-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stuck in a loop when VOD segments timeout #7368
Comments
@zangue, who sent the PR that caused this regression, and @avelad, who reviewed it: I have two PRs that both solve this issue, but in very different ways. I don't know if the change to the logic (removing the check for |
@joeyparrish, I don’t recall any particular reason for the removal. I probably just failed to refactor that part of the logic properly or misunderstood it. As for the fix, I’d be more in favour of restoring the logic because that’s how it was originally introduced (I’m assuming for a good reason or use case). |
Thinking about it more, with the first fix, we’d still end up with the player stuck in a loop in case an
So, for VOD it may make sense instead to:
Also, in general, maybe it makes sense to have a longer disable time (minutes?). What do you think? |
That shouldn't be an issue, because HTTP_ERROR is a specific failure mode where there's a CORS failure or some other non-response. These don't just resolve themselves, ever. That may be why that was the conservative original logic. If there's a 4xx response or similar, you get
I think permanently disabling a VOD stream, in the case of a timeout, could be a problem. In particular, for mobile networks whose connectivity may be inconsistent. Another option would be to exclude timeout errors from the stream-disable logic. |
We also need to accept errors like TIMEOUT or BAD_HTTP_STATUS, for low latency these errors are common and we need to deal with them as well. Also sometimes there is a timeout when changing networks (mobile data vs WiFi), in this case I prefer to change to another stream before throwing an error. |
I think being stuck in a loop loading forever is a bug, and it's a regression compared to v4.3.x. The loop is composed of these elements:
There are many ways to resolve this:
Arguments in favor of restoring the original logic:
Arguments in favor of excluding TIMEOUT:
Arguments in favor of excluding VOD:
|
I'd like to reach a consensus, but this regression is blocking the Cast release. We can't have an infinite loading loop. If there's no consensus, I'll have to choose a conservative change by the end of the week to unblock the release. We can make additional changes to the behavior later to try to satisfy everyone's more nuanced needs/wants around error-handling. I think the cheapest and most conservative things we can do are either revert to v4.3.x behavior or to exclude TIMEOUT. |
@avelad is going on leave and won't participate here for a little while, but sent me this message regarding this issue: "About the disable stream, I’d prefer exclude only TIMEOUT then. Or change maxDisabledTime for VoD." |
@joeyparrish I took a look at the history again and I noticed that while the original logic only considered
Reverting back to only consider
|
Thank you for finding these details. It looks like all of those changes went into v4.4.0, so from a regression perspective, we went from |
I don't think excluding Also, my analysis of this issue shows that it's only To @avelad's point about timeouts on mobile, if the timeout is because of a transient network event, there is no reason to exclude a stream. The stream itself isn't broken in this case. There is also a retry mechanism in the network config that has to be exhausted before we give up and fail. So I think in all these cases, it's safe to skip disabling the stream on timeout. So I move that we exclude |
Pushing forward with #7369, which I have modified to exclude TIMEOUT only. |
In #7368, we get stuck in a loop loading forever. This regression was introduced in v4.4.0 and affects all v4.4, v4.5, v4.6, v4.7, and v4.8 releases, as well as v4.9.0-28, v4.9.2-caf1, v4.10.0-20, and v4.11.0-6. The loop is composed of these elements: 1. an error that triggers disabling a stream 2. an error that doesn't resolve itself over time 3. an error that is slow enough to trigger that the first streams get re-enabled 4. VOD content that doesn't change while we sit in the loop 5. enough streams to avoid exhausting them during the cycle Only `TIMEOUT` errors can trigger this bug AFAICT, so we should exclude those from the logic to disable streams. Note also that live streaming already retries indefinitely by default, and that normal ABR logic will change streams for us if we timeout due to a lack of bandwidth. Disabling streams on `TIMEOUT` was suggested initially in #4764, but was not a requirement of the OP. It was added out of caution in #4769, but not really vetted. Because it was not ever explicitly needed, excluding it is not a regression. Closes #7368 Backported to v4.9.2-caf Release-As: 4.9.2-caf2
In #7368, we get stuck in a loop loading forever. This regression was introduced in v4.4.0 and affects all v4.4, v4.5, v4.6, v4.7, and v4.8 releases, as well as v4.9.0-28, v4.9.2-caf1, v4.10.0-20, and v4.11.0-6. The loop is composed of these elements: 1. an error that triggers disabling a stream 2. an error that doesn't resolve itself over time 3. an error that is slow enough to trigger that the first streams get re-enabled 4. VOD content that doesn't change while we sit in the loop 5. enough streams to avoid exhausting them during the cycle Only `TIMEOUT` errors can trigger this bug AFAICT, so we should exclude those from the logic to disable streams. Note also that live streaming already retries indefinitely by default, and that normal ABR logic will change streams for us if we timeout due to a lack of bandwidth. Disabling streams on `TIMEOUT` was suggested initially in #4764, but was not a requirement of the OP. It was added out of caution in #4769, but not really vetted. Because it was not ever explicitly needed, excluding it is not a regression. Closes #7368
In #7368, we get stuck in a loop loading forever. This regression was introduced in v4.4.0 and affects all v4.4, v4.5, v4.6, v4.7, and v4.8 releases, as well as v4.9.0-28, v4.9.2-caf1, v4.10.0-20, and v4.11.0-6. The loop is composed of these elements: 1. an error that triggers disabling a stream 2. an error that doesn't resolve itself over time 3. an error that is slow enough to trigger that the first streams get re-enabled 4. VOD content that doesn't change while we sit in the loop 5. enough streams to avoid exhausting them during the cycle Only `TIMEOUT` errors can trigger this bug AFAICT, so we should exclude those from the logic to disable streams. Note also that live streaming already retries indefinitely by default, and that normal ABR logic will change streams for us if we timeout due to a lack of bandwidth. Disabling streams on `TIMEOUT` was suggested initially in #4764, but was not a requirement of the OP. It was added out of caution in #4769, but not really vetted. Because it was not ever explicitly needed, excluding it is not a regression. Closes #7368
In #7368, we get stuck in a loop loading forever. This regression was introduced in v4.4.0 and affects all v4.4, v4.5, v4.6, v4.7, and v4.8 releases, as well as v4.9.0-28, v4.9.2-caf1, v4.10.0-20, and v4.11.0-6. The loop is composed of these elements: 1. an error that triggers disabling a stream 2. an error that doesn't resolve itself over time 3. an error that is slow enough to trigger that the first streams get re-enabled 4. VOD content that doesn't change while we sit in the loop 5. enough streams to avoid exhausting them during the cycle Only `TIMEOUT` errors can trigger this bug AFAICT, so we should exclude those from the logic to disable streams. Note also that live streaming already retries indefinitely by default, and that normal ABR logic will change streams for us if we timeout due to a lack of bandwidth. Disabling streams on `TIMEOUT` was suggested initially in #4764, but was not a requirement of the OP. It was added out of caution in #4769, but not really vetted. Because it was not ever explicitly needed, excluding it is not a regression. Closes #7368
Have you read the FAQ and checked for duplicate open issues?
Yes
If the problem is related to FairPlay, have you read the tutorial?
N/A
What version of Shaka Player are you using?
4.9.2-caf1 through 4.9.28 and main
Can you reproduce the issue with our latest release version?
Did not try
Can you reproduce the issue with the latest code from
main
?Yes
Are you using the demo app or your own custom app?
Custom app (partner Cast app)
If custom app, can you reproduce the issue using our demo app?
Did not try
What browser and OS are you using?
Chromecast Gen 3
For embedded devices (smart TVs, etc.), what model and firmware version are you using?
Cast 1.52.something (a bit behind)
What are the manifest and license server URIs?
Shared privately by partner (b/368055424 internally)
What configuration are you using? What is the output of
player.getNonDefaultConfiguration()
?Default on CAF
What did you do?
Load with poor network throughput. You can limit throughput to 20kB/s in the debugger to see this quickly.
What did you expect to happen?
Playback should eventually fail with a timeout.
What actually happened?
StreamingEngine disables a stream (audio or video, whichever times out first), then the player chooses a new variant. 30s later, the first stream is re-enabled, then the new one times out, etc. Forever in a loop, with no fatal error.
Are you planning send a PR to fix it?
Yes
Misc
This was introduced in #5057 and v4.4.0, which moved the logic and dropped a condition (maybe on accident?). Prior to this, the disable logic only fired on
code == HTTP_ERROR
. Now it fires on allcategory == NETWORK
errors.One solution is to add that condition back.
Another is to only disable streams in live content. I think it doesn't make sense to do it in VOD, since VOD is static and doesn't "recover".
The text was updated successfully, but these errors were encountered: