Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for WebVTT #658

Merged
merged 12 commits into from
Sep 7, 2024
Merged

Conversation

AleksandrHovhannisyan
Copy link
Contributor

@AleksandrHovhannisyan AleksandrHovhannisyan commented Aug 14, 2024

Resolves: #657

Docs/magic numbers: https://www.w3.org/TR/webvtt1/#iana-text-vtt

core.js Outdated
Comment on lines 1142 to 1153
if (
this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x00]) // EOF
|| this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x0A]) // LF
|| this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x0D]) // CR
|| this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x09]) // Tab
|| this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x20]) // Space
) {
return {
ext: 'vtt',
mime: 'text/vtt',
};
}
Copy link
Contributor Author

@AleksandrHovhannisyan AleksandrHovhannisyan Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this okay, or should I check for the common byte sequence first ([0x57, 0x45, 0x42, 0x56, 0x54, 0x54]) and then the last byte in a nested statement? (Couldn't find examples of how to do that.)

Also, I added four fixtures, one for each possibility. Is this okay?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend to first check WEBVTT as a string. Makes the code more readable.

If that is match, you can check the last character. That way you prevent the WEBVTT need be checked 5 times.

Other remark:
In the description of the PR, change Issue: #657 to Resolves: #657 1

Footnotes

  1. Using keywords in issues and pull requests

Copy link
Contributor Author

@AleksandrHovhannisyan AleksandrHovhannisyan Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, thanks! I'll make these changes.

Btw, is there any code formatter set up for this project (or one that I could run as a standalone with npx that wouldn't change the whole file)? I couldn't find anything in the docs or package.json. Fixing the tests is a bit frustrating as it's getting hung up on code formatting.

Edit: Ended up just copying similar code and changing it to match the expected formatting.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

npx xo --fix

@AleksandrHovhannisyan AleksandrHovhannisyan marked this pull request as ready for review August 14, 2024 23:52
core.js Outdated
Comment on lines 1144 to 1150
&& (
this.check([0x00], {offset: 6}) // EOF
|| this.check([0x0A], {offset: 6}) // LF
|| this.check([0x0D], {offset: 6}) // CR
|| this.check([0x09], {offset: 6}) // Tab
|| this.check([0x20], {offset: 6}) // Space
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that I'm asking so many questions: Should I be checking some of these as strings/escape sequences too since they're all ASCII? Also, is there a way to avoid repeating the offset? I tried doing an await tokenizer.ignore(6) after the WEBVTT check but that didn't seem to work.

Copy link
Collaborator

@Borewit Borewit Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you call await tokenizer.ignore(6), you permanently move the offset with 6 bytes. That also require you need to re-read the buffer used by this.check, starting on that offset.

But, that approach is irreversible, meaning all following tests will fail, as they can not ,longer read from offset 0.
Therefor, if you read of (or ignore) from the tokenizer, you have to return the file type or return undefined. You essentially can only do that, if you are sure that remaining tests are pointless to run.

Your code looks fine to me, maybe a bit lengthy as a result of the repeating offset argument, but very readable.

If you prefere you could shorten it by putting the special character in an array, and calling Array.prototype.some() (not tested):

[0x00, 0x0A, 0x0D, 0x09, 0x20].some(lastChar => this.check([lastChar], {offset: 6}))

Should I be checking some of these as strings/escape sequences

I think this is fine without, but you can try, and see if gets more elegant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the help!

Copy link
Collaborator

@Borewit Borewit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 2 tests before your tests are actually not testing within the 7 first bytes. But that is different issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no 0x00 / \0 null character after WEBVTT
This is called the null character not the EOF character as far as I know. 1

Footnotes

  1. https://unicodeplus.com/U+0000

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see—although that makes me wonder: why does the test pass for the EOF fixture? Maybe I'm misunderstanding how the null character behaves in this.check.

Is there a method that would tell me if the tokenizer has reached the end of the file? I see it has a position property as well as some peekX methods.

Copy link
Collaborator

@Borewit Borewit Aug 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was puzzled but that one as well one, so I dived into that. Chunks of the file are read, from smaller to larger (that is why we want to have texted ordered by fixture, starting count from 0 offset).

If a file is read, smaller then the provided buffer size (EOF is ignored with mayBeLess).

await tokenizer.peekBuffer(this.buffer, {length: 12, mayBeLess: true});

Remaining buffer is set to zero, hence you zero, hence your test match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. What would you recommend we do here? I want to make sure I still test for the EOF edge case, even though it's admittedly rare in practice (nobody should upload an empty caption file). That said, I don't think it would be enough to test for just WEBVTT as a file that starts with WEBVTT-not-webvtt, for example, would pass the test but doesn't comply with the spec. Is there a way to do an exact-length check?

@@ -0,0 +1,38 @@
WEBVTT 00:11.000 --> 00:13.000
<v Roger Bingham>We are in New York City
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you reverse engineer a sample, which you can argue is the best test, I suggest to abuse the subtitle to explain why the file is there, and remove remaining content.

Suggested change
<v Roger Bingham>We are in New York City
<v file-type>Test WEBVTT prefix followed by a tab character

Did you test the file in it's actual context, does it work?

Copy link
Contributor Author

@AleksandrHovhannisyan AleksandrHovhannisyan Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't test the file in an actual <track> (don't think I need to) but I just grabbed it from the W3 doc. See A simple caption file.

readme.md Outdated
@@ -507,6 +507,7 @@ console.log(fileType);
- [`vcf`](https://en.wikipedia.org/wiki/VCard) - vCard
- [`voc`](https://wiki.multimedia.cx/index.php/Creative_Voice) - Creative Voice File
- [`vsdx`](https://en.wikipedia.org/wiki/Microsoft_Visio) - Microsoft Visio File
- [`vtt`](https://www.w3.org/TR/webvtt1/) - WebVTT File (for video captions)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the convention is to link to wikipedia

Suggested change
- [`vtt`](https://www.w3.org/TR/webvtt1/) - WebVTT File (for video captions)
- [`vtt`](https://en.wikipedia.org/wiki/WebVTT) - Web Video Text Tracks, a W3C standard file for subtitles or captures

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I noticed that too. The only reason I didn't do that is because the Wikipedia article doesn't list the magic numbers for this file format, so I thought it would be more helpful to link to the w3 spec.

Copy link
Collaborator

@Borewit Borewit Aug 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also the formal standard, so I fully get you. But the W3C link remains as part of this PR.

I think the function of this link, in addition as a reference, the links servers to give an context to ensure there can be no ambiguity which file type we are talking about. README is mostly focused on end users.

@Borewit
Copy link
Collaborator

Borewit commented Aug 20, 2024

@AleksandrHovhannisyan, I propose to keep the single fixture as you got it from WRC and drop the reverse engineered fixtures to aiming to match each signature cases. Keep it simple.

@AleksandrHovhannisyan
Copy link
Contributor Author

@Borewit Okay, happy to do that, although I will say I found the fixtures helpful when writing the code, as I wrote those first and ran tests as I made adjustments. Without those fixtures, there's no way to guarantee that the code works for edge cases.

Also, I think it would help both of us if I could get a single set of change requests for this PR. Happy to push whatever changes are needed to get it merged, but I do want to limit back and forth.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no leading null character, and a null character is not the same as a EOF.

@AleksandrHovhannisyan
Copy link
Contributor Author

Hmm, not sure why tests are failing for Node 20+. Node 18 works fine locally for me.

What is the procedure for running those same CI tests locally? npm run test only runs on my local Node 18.

@Borewit
Copy link
Collaborator

Borewit commented Aug 29, 2024

What is the procedure for running those same CI tests locally?

You essentially need to install a different version of Node or use some fancy mechanism to switch Node versions.

Some tests are only performed on Node >= 20

@AleksandrHovhannisyan
Copy link
Contributor Author

Okay that's fine, I have nvm/fnm so I'll just use one of those. Thanks

core.js Outdated
// Try-catch to handle valid edge case of "WEBVTT<EOF>"
try {
// The WebVTT standard says that the first line should be "WEBVTT" followed by a single ASCII whitespace character
const whitespaceToken = await tokenizer.readToken(new Token.StringType(1, 'ascii'));
Copy link
Collaborator

@Borewit Borewit Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you consuming (read) from the stream, all following on test will fail, so you have to determine the file type.

The reason you see the following tests did not break is because this rare situation only occures on files starting with WEBVTT, but that does not make it right.

The 7th character you are interested in has already been peeked, the EOF has already been handled.
I know you lost the info where the EOF is, but both \0 and EOF are acceptance, there is no is issue simply testing for \0 is fine.

await tokenizer.peekBuffer(this.buffer, {length: 12, mayBeLess: true});

It is this.buffer which is being tested by every signature test.

Keep it simple. Even without testing for 7th character, the initial 6 characters is pretty reliable, more reliable then many other signature tests.

Copy link
Contributor Author

@AleksandrHovhannisyan AleksandrHovhannisyan Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason you see the following tests did not break is because this rare situation only occures on files starting with WEBVTT, but that does not make it right... Keep it simple. Even without testing for 7th character, the initial 6 characters is pretty reliable, more reliable then many other signature tests.

What if someone creates a file that has WEBVTTx as the first seven bytes? The first six bytes aren't the full signature per the spec: "An optional UTF-8 BOM, the ASCII string "WEBVTT", and finally a space, tab, line break, or the end of the file." So I don't think it's rigorous enough to just check the first six bytes as it doesn't tell us anything about the seventh byte. With my current code, a file that starts with WEBVTT but is not followed by a whitespace character would return undefined, which I think is the expected behavior (because that file does not comply with the signature format for valid WebVTT files).

The 7th character you are interested in has already been peeked, the EOF has already been handled.
I know you lost the info where the EOF is, but both \0 and EOF are acceptance, there is no is issue simply testing for \0 is fine.

Sorry, I must have misunderstood you before. I thought you didn't want me to test for \0 since it's not the same as an EOF.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if someone creates a file that has WEBVTTx as the first seven bytes? The first six bytes aren't the full signature per the spec: "An optional UTF-8 BOM, the ASCII string "WEBVTT", and finally a space, tab, line break, or the end of the file." So I don't think it's rigorous enough to just check the first six bytes as it doesn't tell us anything about the seventh byte. With my current code, a file that starts with WEBVTT but is not followed by a whitespace character would return undefined, which I think is the expected behavior (because that file does not comply with the signature format for valid WebVTT files).

That is indeed the best solution, but you cannot do it by reading from stream, as that irreversible consumes the stream. If you do it by reading, you break the case of "The first six bytes aren't the full signature".

Sorry, I must have misunderstood you before. I thought you didn't want me to test for \0 since it's not the same as an EOF.

I am sorry, as I thought \0 was defined as a formal separator. But it is not, and you are testing on \0 as that is the way the EOF translates in file-type. That is why complained about the EOF fixture file, I thought it had to to have \0 separator.

Copy link
Collaborator

@Borewit Borewit Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is, lets assume we have the following example:

WEBVTT78WEBP123XXXX

This should be not be recognized as WebVTT, but it should be recognized as WEBP.
But will no longer recognized as the WEBP, as the WEBP test will detect XXXX as the signature instead of WEBP, as you advanced the pointer with 7 positions. It will read the signature from positions 7+8 instead of position 8.

Copy link
Contributor Author

@AleksandrHovhannisyan AleksandrHovhannisyan Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WEBVTT78WEBP123XXXX. This should be not be recognized as WebVTT, but it should be recognized as WEBP.

Oh, I see. Does this mean the file-type parser ignores unrecognized byte sequences at the start of a file and continues searching until it eventually finds a valid signature? I thought magic numbers had to appear at the start of a file.

I am sorry, as I thought \0 was defined as a formal separator. But it is not, and you are testing on \0 as that is the way the EOF translates in file-type. That is why complained about the EOF fixture file, I thought it had to to have \0 separator.

All good, I think I understand what you meant now.

Edit: So just to be clear, do you want me to simplify the code to just check WEBVTT, without checking the 7th byte, and remove the other fixtures?

Copy link
Collaborator

@Borewit Borewit Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the file-type parser ignores unrecognized byte sequences at the start of a file and continues searching until it eventually finds a valid signature?

Exactly.

Edit: So just to be clear, do you want me to simplify the code to just check WEBVTT, without checking the 7th byte, and remove the other fixtures?

No I agree with you to include the 7th.

This commit was very close to perfection 😉 : 53be4b4

  • Rename variable whitespace to char7
  • change \r\n into \r
  • Using tokenizer.fileInfo.size === 6 is tricky, it is possible to let file-size read from a stream with an unknown length. Safer to use the 0, assuming that is the result of an "EOF".

, and I think it is good to go.

Copy link
Collaborator

@Borewit Borewit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Borewit Borewit added the enhancement Add new functionality label Sep 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Add new functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for WebVTT files (text/vtt)
3 participants