-
-
Notifications
You must be signed in to change notification settings - Fork 354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for WebVTT #658
Conversation
core.js
Outdated
if ( | ||
this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x00]) // EOF | ||
|| this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x0A]) // LF | ||
|| this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x0D]) // CR | ||
|| this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x09]) // Tab | ||
|| this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x20]) // Space | ||
) { | ||
return { | ||
ext: 'vtt', | ||
mime: 'text/vtt', | ||
}; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this okay, or should I check for the common byte sequence first ([0x57, 0x45, 0x42, 0x56, 0x54, 0x54]
) and then the last byte in a nested statement? (Couldn't find examples of how to do that.)
Also, I added four fixtures, one for each possibility. Is this okay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend to first check WEBVTT
as a string. Makes the code more readable.
If that is match, you can check the last character. That way you prevent the WEBVTT
need be checked 5 times.
Other remark:
In the description of the PR, change Issue: #657
to Resolves: #657
1
Footnotes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, thanks! I'll make these changes.
Btw, is there any code formatter set up for this project (or one that I could run as a standalone with npx that wouldn't change the whole file)? I couldn't find anything in the docs or package.json. Fixing the tests is a bit frustrating as it's getting hung up on code formatting.
Edit: Ended up just copying similar code and changing it to match the expected formatting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
npx xo --fix
core.js
Outdated
&& ( | ||
this.check([0x00], {offset: 6}) // EOF | ||
|| this.check([0x0A], {offset: 6}) // LF | ||
|| this.check([0x0D], {offset: 6}) // CR | ||
|| this.check([0x09], {offset: 6}) // Tab | ||
|| this.check([0x20], {offset: 6}) // Space | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that I'm asking so many questions: Should I be checking some of these as strings/escape sequences too since they're all ASCII? Also, is there a way to avoid repeating the offset? I tried doing an await tokenizer.ignore(6)
after the WEBVTT
check but that didn't seem to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you call await tokenizer.ignore(6)
, you permanently move the offset with 6 bytes. That also require you need to re-read the buffer used by this.check
, starting on that offset.
But, that approach is irreversible, meaning all following tests will fail, as they can not ,longer read from offset 0.
Therefor, if you read of (or ignore) from the tokenizer, you have to return the file type or return undefined. You essentially can only do that, if you are sure that remaining tests are pointless to run.
Your code looks fine to me, maybe a bit lengthy as a result of the repeating offset argument, but very readable.
If you prefere you could shorten it by putting the special character in an array, and calling Array.prototype.some()
(not tested):
[0x00, 0x0A, 0x0D, 0x09, 0x20].some(lastChar => this.check([lastChar], {offset: 6}))
Should I be checking some of these as strings/escape sequences
I think this is fine without, but you can try, and see if gets more elegant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the help!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 2 tests before your tests are actually not testing within the 7 first bytes. But that is different issue.
fixture/fixture-vtt-eof.vtt
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no 0x00
/ \0
null character after WEBVTT
This is called the null character not the EOF character as far as I know. 1
Footnotes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see—although that makes me wonder: why does the test pass for the EOF fixture? Maybe I'm misunderstanding how the null character behaves in this.check
.
Is there a method that would tell me if the tokenizer has reached the end of the file? I see it has a position
property as well as some peekX
methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was puzzled but that one as well one, so I dived into that. Chunks of the file are read, from smaller to larger (that is why we want to have texted ordered by fixture, starting count from 0 offset).
If a file is read, smaller then the provided buffer size (EOF is ignored with mayBeLess
).
Line 174 in 988bf4b
await tokenizer.peekBuffer(this.buffer, {length: 12, mayBeLess: true}); |
Remaining buffer is set to zero, hence you zero, hence your test match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. What would you recommend we do here? I want to make sure I still test for the EOF edge case, even though it's admittedly rare in practice (nobody should upload an empty caption file). That said, I don't think it would be enough to test for just WEBVTT
as a file that starts with WEBVTT-not-webvtt
, for example, would pass the test but doesn't comply with the spec. Is there a way to do an exact-length check?
fixture/fixture-vtt-space.vtt
Outdated
@@ -0,0 +1,38 @@ | |||
WEBVTT 00:11.000 --> 00:13.000 | |||
<v Roger Bingham>We are in New York City |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you reverse engineer a sample, which you can argue is the best test, I suggest to abuse the subtitle to explain why the file is there, and remove remaining content.
<v Roger Bingham>We are in New York City | |
<v file-type>Test WEBVTT prefix followed by a tab character |
Did you test the file in it's actual context, does it work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't test the file in an actual <track>
(don't think I need to) but I just grabbed it from the W3 doc. See A simple caption file.
readme.md
Outdated
@@ -507,6 +507,7 @@ console.log(fileType); | |||
- [`vcf`](https://en.wikipedia.org/wiki/VCard) - vCard | |||
- [`voc`](https://wiki.multimedia.cx/index.php/Creative_Voice) - Creative Voice File | |||
- [`vsdx`](https://en.wikipedia.org/wiki/Microsoft_Visio) - Microsoft Visio File | |||
- [`vtt`](https://www.w3.org/TR/webvtt1/) - WebVTT File (for video captions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the convention is to link to wikipedia
- [`vtt`](https://www.w3.org/TR/webvtt1/) - WebVTT File (for video captions) | |
- [`vtt`](https://en.wikipedia.org/wiki/WebVTT) - Web Video Text Tracks, a W3C standard file for subtitles or captures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I noticed that too. The only reason I didn't do that is because the Wikipedia article doesn't list the magic numbers for this file format, so I thought it would be more helpful to link to the w3 spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also the formal standard, so I fully get you. But the W3C link remains as part of this PR.
I think the function of this link, in addition as a reference, the links servers to give an context to ensure there can be no ambiguity which file type we are talking about. README is mostly focused on end users.
@AleksandrHovhannisyan, I propose to keep the single fixture as you got it from WRC and drop the reverse engineered fixtures to aiming to match each signature cases. Keep it simple. |
@Borewit Okay, happy to do that, although I will say I found the fixtures helpful when writing the code, as I wrote those first and ran tests as I made adjustments. Without those fixtures, there's no way to guarantee that the code works for edge cases. Also, I think it would help both of us if I could get a single set of change requests for this PR. Happy to push whatever changes are needed to get it merged, but I do want to limit back and forth. |
fixture/fixture-vtt-eof.vtt
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no leading null character, and a null character is not the same as a EOF.
Hmm, not sure why tests are failing for Node 20+. Node 18 works fine locally for me. What is the procedure for running those same CI tests locally? |
You essentially need to install a different version of Node or use some fancy mechanism to switch Node versions. Some tests are only performed on Node >= 20 |
Okay that's fine, I have nvm/fnm so I'll just use one of those. Thanks |
core.js
Outdated
// Try-catch to handle valid edge case of "WEBVTT<EOF>" | ||
try { | ||
// The WebVTT standard says that the first line should be "WEBVTT" followed by a single ASCII whitespace character | ||
const whitespaceToken = await tokenizer.readToken(new Token.StringType(1, 'ascii')); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you consuming (read
) from the stream, all following on test will fail, so you have to determine the file type.
The reason you see the following tests did not break is because this rare situation only occures on files starting with WEBVTT
, but that does not make it right.
The 7th character you are interested in has already been peeked, the EOF has already been handled.
I know you lost the info where the EOF is, but both \0
and EOF are acceptance, there is no is issue simply testing for \0
is fine.
Line 174 in 988bf4b
await tokenizer.peekBuffer(this.buffer, {length: 12, mayBeLess: true}); |
It is this.buffer
which is being tested by every signature test.
Keep it simple. Even without testing for 7th character, the initial 6 characters is pretty reliable, more reliable then many other signature tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason you see the following tests did not break is because this rare situation only occures on files starting with WEBVTT, but that does not make it right... Keep it simple. Even without testing for 7th character, the initial 6 characters is pretty reliable, more reliable then many other signature tests.
What if someone creates a file that has WEBVTTx
as the first seven bytes? The first six bytes aren't the full signature per the spec: "An optional UTF-8 BOM, the ASCII string "WEBVTT", and finally a space, tab, line break, or the end of the file." So I don't think it's rigorous enough to just check the first six bytes as it doesn't tell us anything about the seventh byte. With my current code, a file that starts with WEBVTT
but is not followed by a whitespace character would return undefined
, which I think is the expected behavior (because that file does not comply with the signature format for valid WebVTT files).
The 7th character you are interested in has already been peeked, the EOF has already been handled.
I know you lost the info where the EOF is, but both \0 and EOF are acceptance, there is no is issue simply testing for \0 is fine.
Sorry, I must have misunderstood you before. I thought you didn't want me to test for \0
since it's not the same as an EOF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if someone creates a file that has WEBVTTx as the first seven bytes? The first six bytes aren't the full signature per the spec: "An optional UTF-8 BOM, the ASCII string "WEBVTT", and finally a space, tab, line break, or the end of the file." So I don't think it's rigorous enough to just check the first six bytes as it doesn't tell us anything about the seventh byte. With my current code, a file that starts with WEBVTT but is not followed by a whitespace character would return undefined, which I think is the expected behavior (because that file does not comply with the signature format for valid WebVTT files).
That is indeed the best solution, but you cannot do it by reading from stream, as that irreversible consumes the stream. If you do it by reading, you break the case of "The first six bytes aren't the full signature".
Sorry, I must have misunderstood you before. I thought you didn't want me to test for \0 since it's not the same as an EOF.
I am sorry, as I thought \0
was defined as a formal separator. But it is not, and you are testing on \0
as that is the way the EOF translates in file-type
. That is why complained about the EOF fixture file, I thought it had to to have \0
separator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is, lets assume we have the following example:
WEBVTT78WEBP123XXXX
This should be not be recognized as WebVTT
, but it should be recognized as WEBP
.
But will no longer recognized as the WEBP, as the WEBP test will detect XXXX
as the signature instead of WEBP
, as you advanced the pointer with 7 positions. It will read the signature from positions 7+8 instead of position 8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WEBVTT78WEBP123XXXX
. This should be not be recognized asWebVTT
, but it should be recognized asWEBP
.
Oh, I see. Does this mean the file-type
parser ignores unrecognized byte sequences at the start of a file and continues searching until it eventually finds a valid signature? I thought magic numbers had to appear at the start of a file.
I am sorry, as I thought \0 was defined as a formal separator. But it is not, and you are testing on \0 as that is the way the EOF translates in file-type. That is why complained about the EOF fixture file, I thought it had to to have \0 separator.
All good, I think I understand what you meant now.
Edit: So just to be clear, do you want me to simplify the code to just check WEBVTT
, without checking the 7th byte, and remove the other fixtures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean the file-type parser ignores unrecognized byte sequences at the start of a file and continues searching until it eventually finds a valid signature?
Exactly.
Edit: So just to be clear, do you want me to simplify the code to just check WEBVTT, without checking the 7th byte, and remove the other fixtures?
No I agree with you to include the 7th.
This commit was very close to perfection 😉 : 53be4b4
- Rename variable
whitespace
tochar7
- change
\r\n
into\r
- Using
tokenizer.fileInfo.size === 6
is tricky, it is possible to let file-size read from a stream with an unknown length. Safer to use the0
, assuming that is the result of an "EOF".
, and I think it is good to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Resolves: #657
Docs/magic numbers: https://www.w3.org/TR/webvtt1/#iana-text-vtt