Improve caption extraction to handle corrections #133

Closed
slifty opened this issue Jul 2, 2021 · 1 comment · Fixed by #142

slifty commented Jul 2, 2021

Task

Description

In order to get letter-by-letter captions we use the CCExtractor rollup feature, which extracts each new caption line as it comes in.

This works well for some captions, but in cases where captions are being corrected it can make a huge mess:

image

We need to figure out how to detect when this kind of correction is happening and either (A) ignore the corrections or (B) handle them and emit them in a more effective way (e.g. maybe there is a data flag as part of the emit that marks the new atom as a correction).
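
A rough sketch of what option (B) could look like, assuming we can compare two successive snapshots of the same rollup line. Everything here is hypothetical: `CaptionUpdate`, `diffSnapshots`, and the `isCorrection` flag are illustrative names, not part of TV Kitchen or CCExtractor.

```typescript
// Hypothetical shape for illustration; not the actual TV Kitchen payload format.
interface CaptionUpdate {
  text: string;          // characters to append after any retraction
  removeCount: number;   // how many previously emitted characters to retract
  isCorrection: boolean; // true when the update rewrites earlier output
}

// Compare two successive snapshots of the same rollup line.
// If `next` simply extends `previous`, only the new suffix is emitted.
// Otherwise everything after the shared prefix is treated as corrected.
function diffSnapshots(previous: string, next: string): CaptionUpdate {
  let shared = 0;
  const limit = Math.min(previous.length, next.length);
  while (shared < limit && previous[shared] === next[shared]) {
    shared += 1;
  }
  const removeCount = previous.length - shared;
  return {
    text: next.slice(shared),
    removeCount,
    isCorrection: removeCount > 0,
  };
}
```

With this approach a downstream consumer would still need to know how to retract `removeCount` characters, which is the complexity option (A) avoids.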

Relevant Resources / Research

None yet

slifty changed the title from "Improve caption extraction" to "Improve caption extraction to handle corrections" Jul 2, 2021

slifty commented Aug 17, 2021

I bought the closed captioning handbook and read through it. There is a section about control characters which exist in the CC spec. These allow a broadcaster to move the cursor back in a given rollup line, or wipe out all characters after a certain point in the buffer.
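
To make the effect of those control characters concrete, here is a simplified model of a single caption row. The event names (`char`, `backspace`, `deleteToEndOfRow`, `cursorTo`) are assumptions for illustration only; real caption data encodes these operations as control codes, and this is not how CCExtractor represents them.

```typescript
// Simplified, hypothetical model of caption events for one row.
type CaptionEvent =
  | { kind: "char"; value: string }
  | { kind: "backspace" }
  | { kind: "deleteToEndOfRow" }
  | { kind: "cursorTo"; column: number };

// Replay the events against a row buffer and return the resulting text.
function applyEvents(events: CaptionEvent[]): string {
  const row: string[] = [];
  let cursor = 0;
  for (const event of events) {
    switch (event.kind) {
      case "char":
        // Write at the cursor, overwriting any existing character.
        row[cursor] = event.value;
        cursor += 1;
        break;
      case "backspace":
        // Step back one column and remove that character.
        if (cursor > 0) {
          cursor -= 1;
          row.splice(cursor, 1);
        }
        break;
      case "deleteToEndOfRow":
        // Wipe everything from the cursor onward.
        row.length = cursor;
        break;
      case "cursorTo":
        // Clamp so the cursor never leaves the written part of the row.
        cursor = Math.max(0, Math.min(event.column, row.length));
        break;
    }
  }
  return row.join("");
}
```

A consumer that only sees the characters, and not the cursor moves or deletions, would end up with the original text and the correction concatenated together.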

CCExtractor knows how to process these characters, but so far I don't see any indication that it will emit those characters. Really the issue is that CCExtractor doesn't directly handle the "stream of individual characters" use case. It does the magic behind the scenes, and what TV Kitchen has done is take all of that processing and then go back to simulate a rollup.

This is all a long way of saying that I believe what we need to do is ditch the simulated rollup, and instead just have the caption extractor emit payloads one line at a time. I think it can still break the lines into ATOM payloads, which should stay more backwards compatible down the line if we decide to ditch CCExtractor and parse caption streams directly.
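
As a minimal sketch of that idea, a finalized line could be split into per-character atoms once it is complete. The `AtomPayload` shape, the `"TEXT.ATOM"` type string, and spreading positions evenly across the line's time span are all assumptions made for illustration, not the actual payload format.

```typescript
// Hypothetical payload shape; field names and type string are assumptions.
interface AtomPayload {
  type: "TEXT.ATOM";
  data: string;     // a single character from the finalized line
  position: number; // milliseconds from the start of the stream
}

// Split a finalized caption line into one payload per character,
// spacing the positions evenly between the line's start and end times.
function lineToAtoms(line: string, startMs: number, endMs: number): AtomPayload[] {
  const chars = line.split("");
  if (chars.length === 0) {
    return [];
  }
  const step = (endMs - startMs) / chars.length;
  return chars.map((data, index) => ({
    type: "TEXT.ATOM" as const,
    data,
    position: Math.round(startMs + index * step),
  }));
}
```

Because atoms are only produced after the line is final, no correction ever needs to be sent downstream.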

By doing this I believe we will fix this bug, and possibly #139 as well.

slifty added a commit that referenced this issue Aug 17, 2021
The previous caption extractor logic attempted to parse out CCExtractor
captions as they were generated.  This had the benefit of getting the
data as fast as possible, but in reality this could create a lot of
downstream complexity because captions have a mechanism for correction.

For instance, if a stenographer of a live program makes a correction to
a word we would require downstream appliances to know how to remove the
previous data and potentially even issue corrections of their own to
their downstream.

Things get much simpler if we only extract finalized lines.  This
introduces a minor delay since no content is rendered until the line's
completion, but for now that is a trade-off we are comfortable with.

Issue #133