Improve caption extraction to handle corrections #133

slifty · 2021-07-02T16:11:28Z

Task

Description

To get letter-by-letter output we use the CCExtractor rollup feature, which extracts each new caption line as it comes in.

This works well for some captions, but in cases where captions are being corrected it can make a huge mess:

We need to figure out how to detect when this kind of correction is happening and either (A) ignore the corrections or (B) handle them and emit them in a more effective way (e.g. maybe a data flag in the emit that marks the new atom as a correction).
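To make option (B) concrete, here is a minimal sketch of how a correction could be detected and flagged. It assumes the extractor sees successive snapshots of the same caption line as plain strings. `CaptionAtom`, `diff_snapshots`, and the `offset` / `is_correction` fields are hypothetical names for illustration, not existing TV Kitchen or CCExtractor APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionAtom:
    """Hypothetical emitted unit; `is_correction` is the flag from option (B)."""
    text: str
    offset: int  # column in the line where this text begins
    is_correction: bool = False


def diff_snapshots(previous: str, current: str) -> List[CaptionAtom]:
    """Compare two successive snapshots of the same caption line.

    If `current` simply extends `previous`, each new character becomes a
    plain atom. Otherwise the line was rewritten, so everything after the
    shared prefix is emitted as one atom flagged as a correction.
    """
    if current.startswith(previous):
        start = len(previous)
        return [CaptionAtom(ch, start + i) for i, ch in enumerate(current[start:])]

    # Count how many leading characters survived the rewrite.
    shared = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        shared += 1
    return [CaptionAtom(current[shared:], shared, is_correction=True)]


appended = diff_snapshots("THE QUICK", "THE QUICK BR")
corrected = diff_snapshots("THE QUIK BR", "THE QUICK BR")
```

A downstream appliance that receives a flagged atom would discard everything it holds for that line from `offset` onward and replace it with the atom's text. Option (A) would amount to dropping flagged atoms instead of emitting them.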

Relevant Resources / Research

None yet

slifty · 2021-08-17T18:33:26Z

I bought the closed captioning handbook and read through it. There is a section about control characters which exist in the CC spec. These allow a broadcaster to move the cursor back in a given rollup line, or wipe out all characters after a certain point in the buffer.
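As a rough illustration of why these control characters cause trouble for a character-by-character consumer, the sketch below replays a stream of events against a single row buffer. `CURSOR_BACK` and `ERASE_TO_END` are stand-in tokens invented for this example; real caption streams encode such commands as control codes, which this sketch does not parse, and the exact semantics in the spec may differ from this simplified model.

```python
from typing import Iterable, List

# Stand-in tokens for illustration only (assumptions, not real byte codes).
CURSOR_BACK = "<CURSOR_BACK>"    # move the cursor back one column
ERASE_TO_END = "<ERASE_TO_END>"  # wipe everything from the cursor onward


def apply_row_events(events: Iterable[str]) -> str:
    """Replay printable characters and control tokens against one row."""
    row: List[str] = []
    cursor = 0
    for event in events:
        if event == CURSOR_BACK:
            cursor = max(0, cursor - 1)
        elif event == ERASE_TO_END:
            del row[cursor:]
        elif cursor < len(row):
            # Printable character typed over existing text.
            row[cursor] = event
            cursor += 1
        else:
            # Printable character appended at the end of the row.
            row.append(event)
            cursor += 1
    return "".join(row)


# A stenographer types "THE QUIK", backs up one column, erases, and retypes.
events = list("THE QUIK") + [CURSOR_BACK, ERASE_TO_END] + list("CK")
final_row = apply_row_events(events)
```

A consumer that only sees the printable characters would receive "THE QUIKCK", while the row a viewer actually sees ends up as "THE QUICK".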

CCExtractor knows how to process these characters, but so far I don't see any indication that it will emit those characters. Really the issue is that CCExtractor doesn't directly handle the "stream of individual characters" use case. It does the magic behind the scenes, and what TV Kitchen has done is take all of that processing and then go back to simulate a rollup.

This is all a long way of saying that I believe what we need to do is ditch the simulated rollup, and instead just have the caption extractor emit payloads one line at a time. I think it can still break the lines into ATOM payloads, which should keep us backwards compatible down the line if we decide to ditch CCExtractor and parse caption streams directly.

By doing this I believe we will fix this bug, and possibly #139 as well.
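A minimal sketch of that approach, assuming the extractor receives each finalized line together with a timestamp in milliseconds. The payload shape (`type`, `data`, `position`), the `"TEXT.ATOM"` label, and the function names are placeholders chosen for this example rather than the project's confirmed schema.

```python
from typing import Dict, Iterable, Iterator, List, Tuple, Union

Payload = Dict[str, Union[str, int]]


def line_to_atoms(line: str, timestamp_ms: int) -> List[Payload]:
    """Break one finalized caption line into per-character payloads.

    A trailing newline atom marks the end of the line so that consumers
    can rebuild line boundaries if they need them.
    """
    return [
        {"type": "TEXT.ATOM", "data": ch, "position": timestamp_ms}
        for ch in line + "\n"
    ]


def extract(lines: Iterable[Tuple[int, str]]) -> Iterator[Payload]:
    """Emit atoms only for completed lines; nothing is sent mid-line."""
    for timestamp_ms, line in lines:
        yield from line_to_atoms(line, timestamp_ms)


payloads = list(extract([(0, "HI"), (1500, "OK")]))
```

Because atoms are only produced once a line is complete, any cursor moves or erasures inside the line have already been resolved, and consumers still receive the same atom-sized units they would get from a direct caption stream parser.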

The previous caption extractor logic attempted to parse out CCExtractor captions as they were generated. This had the benefit of getting the data as fast as possible, but in reality this could create a lot of downstream complexity because captions have a mechanism for correction. For instance, if a stenographer of a live program makes a correction to a word, we would require downstream appliances to know how to remove the previous data and potentially even issue corrections of their own to their downstream. Things get much simpler if we only extract finalized lines. This introduces a minor delay since no content is rendered until the line's completion, but for now that is a trade-off we are comfortable with. Issue #133

slifty changed the title ~~Improve caption extraction~~ Improve caption extraction to handle corrections Jul 2, 2021

slifty mentioned this issue Jul 3, 2021

SRTs sometimes roll across text #139

Open

slifty mentioned this issue Aug 17, 2021

Load CCExtractor lines all at once #142

Merged

slifty closed this as completed in #142 Aug 18, 2021
