Add support for WebVTT #658

AleksandrHovhannisyan · 2024-08-14T22:02:20Z

Resolves: #657

Docs/magic numbers: https://www.w3.org/TR/webvtt1/#iana-text-vtt

AleksandrHovhannisyan · 2024-08-14T22:04:39Z

core.js

+		if (
+			this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x00])		// EOF
+			|| this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x0A])	// LF
+			|| this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x0D])	// CR
+			|| this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x09])	// Tab
+			|| this.check([0x57, 0x45, 0x42, 0x56, 0x54, 0x54, 0x20])	// Space
+		) {
+			return {
+				ext: 'vtt',
+				mime: 'text/vtt',
+			};
+		}


Is this okay, or should I check for the common byte sequence first ([0x57, 0x45, 0x42, 0x56, 0x54, 0x54]) and then the last byte in a nested statement? (Couldn't find examples of how to do that.)

Also, I added four fixtures, one for each possibility. Is this okay?

I recommend first checking WEBVTT as a string; that makes the code more readable.

If that matches, you can then check the last character. That way you avoid checking WEBVTT five times.

Other remark:
In the description of the PR, change Issue: #657 to Resolves: #657 ¹

Footnotes

Using keywords in issues and pull requests ↩

That makes sense, thanks! I'll make these changes.

Btw, is there any code formatter set up for this project (or one that I could run as a standalone with npx that wouldn't change the whole file)? I couldn't find anything in the docs or package.json. Fixing the tests is a bit frustrating as it's getting hung up on code formatting.

Edit: Ended up just copying similar code and changing it to match the expected formatting.

npx xo --fix

AleksandrHovhannisyan · 2024-08-15T20:10:14Z

core.js

+			&& (
+				this.check([0x00], {offset: 6})		// EOF
+				|| this.check([0x0A], {offset: 6})	// LF
+				|| this.check([0x0D], {offset: 6})	// CR
+				|| this.check([0x09], {offset: 6})	// Tab
+				|| this.check([0x20], {offset: 6})	// Space
+			)


Sorry that I'm asking so many questions: Should I be checking some of these as strings/escape sequences too since they're all ASCII? Also, is there a way to avoid repeating the offset? I tried doing an await tokenizer.ignore(6) after the WEBVTT check but that didn't seem to work.

If you call await tokenizer.ignore(6), you permanently move the offset forward by 6 bytes. That also requires you to re-read the buffer used by this.check, starting at that offset.

But that approach is irreversible, meaning all following tests will fail, as they can no longer read from offset 0.
Therefore, if you read from (or ignore bytes on) the tokenizer, you have to either return the file type or return undefined. You can essentially only do that if you are sure the remaining tests are pointless to run.

Your code looks fine to me, maybe a bit lengthy as a result of the repeating offset argument, but very readable.

If you prefer, you could shorten it by putting the special characters in an array and calling Array.prototype.some() (not tested):

[0x00, 0x0A, 0x0D, 0x09, 0x20].some(lastChar => this.check([lastChar], {offset: 6}))

Should I be checking some of these as strings/escape sequences

I think this is fine without, but you can try and see if it gets more elegant.

Thanks for the help!

Borewit

The 2 tests before your tests are actually not testing within the first 7 bytes. But that is a different issue.

Borewit · 2024-08-16T15:50:34Z

fixture/fixture-vtt-eof.vtt

There is no 0x00 / \0 null character after WEBVTT
This is called the null character, not the EOF character, as far as I know. ¹

Footnotes

https://unicodeplus.com/U+0000 ↩

I see—although that makes me wonder: why does the test pass for the EOF fixture? Maybe I'm misunderstanding how the null character behaves in this.check.

Is there a method that would tell me if the tokenizer has reached the end of the file? I see it has a position property as well as some peekX methods.

I was puzzled by that one as well, so I dived into it. Chunks of the file are read, from smaller to larger (that is why we want the tests ordered, starting from offset 0).

If the file being read is smaller than the provided buffer size, the EOF is ignored thanks to mayBeLess:

file-type/core.js

Line 174 in 988bf4b

await tokenizer.peekBuffer(this.buffer, {length: 12, mayBeLess: true});

The remainder of the buffer stays set to zero, hence your test matches.
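To illustrate why a check for 0x00 at offset 6 matches a file that contains only WEBVTT, here is a minimal sketch. It assumes the peek buffer starts out zero-filled, as described above. `peekIntoBuffer` and `check` are hypothetical stand-ins written for this example, not the actual file-type or tokenizer API.

```javascript
// Hypothetical stand-in for the peek in core.js: copy up to `length` bytes of
// the file into a zero-initialised buffer, tolerating files that are shorter.
function peekIntoBuffer(fileBytes, length) {
	const buffer = new Uint8Array(length); // Every byte starts out as 0x00
	buffer.set(fileBytes.subarray(0, length)); // A short file leaves the tail untouched
	return buffer;
}

// Simplified signature check at a given offset (assumption: mirrors the idea of this.check).
function check(buffer, signature, offset = 0) {
	return signature.every((byte, index) => buffer[offset + index] === byte);
}

const fileBytes = new TextEncoder().encode('WEBVTT'); // 6-byte file, nothing after the prefix
const buffer = peekIntoBuffer(fileBytes, 12);

console.log(check(buffer, [0x00], 6)); // true: byte 7 is padding, not file content
```

In this sketch the buffer cannot distinguish "the file ended after 6 bytes" from "the file contains a literal null byte at offset 6"; both look the same to the check.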

Gotcha. What would you recommend we do here? I want to make sure I still test for the EOF edge case, even though it's admittedly rare in practice (nobody should upload an empty caption file). That said, I don't think it would be enough to test for just WEBVTT as a file that starts with WEBVTT-not-webvtt, for example, would pass the test but doesn't comply with the spec. Is there a way to do an exact-length check?

Borewit · 2024-08-16T15:57:26Z

fixture/fixture-vtt-space.vtt

@@ -0,0 +1,38 @@
+WEBVTT 00:11.000 --> 00:13.000
+<v Roger Bingham>We are in New York City


If you reverse engineer a sample, which you can argue is the best test, I suggest using the subtitle text to explain why the file is there, and removing the remaining content.

Suggested change

<v Roger Bingham>We are in New York City

<v file-type>Test WEBVTT prefix followed by a tab character

Did you test the file in its actual context? Does it work?

I didn't test the file in an actual <track> (don't think I need to) but I just grabbed it from the W3 doc. See A simple caption file.

Borewit · 2024-08-16T16:09:52Z

readme.md

@@ -507,6 +507,7 @@ console.log(fileType);
 - [`vcf`](https://en.wikipedia.org/wiki/VCard) - vCard
 - [`voc`](https://wiki.multimedia.cx/index.php/Creative_Voice) - Creative Voice File
 - [`vsdx`](https://en.wikipedia.org/wiki/Microsoft_Visio) - Microsoft Visio File
+- [`vtt`](https://www.w3.org/TR/webvtt1/) - WebVTT File (for video captions)


It looks like the convention is to link to wikipedia

Suggested change

- [`vtt`](https://www.w3.org/TR/webvtt1/) - WebVTT File (for video captions)

- [`vtt`](https://en.wikipedia.org/wiki/WebVTT) - Web Video Text Tracks, a W3C standard file for subtitles or captions

Yup, I noticed that too. The only reason I didn't do that is because the Wikipedia article doesn't list the magic numbers for this file format, so I thought it would be more helpful to link to the w3 spec.

It is also the formal standard, so I fully get you. But the W3C link remains as part of this PR.

I think that, in addition to being a reference, the link serves to give context so there can be no ambiguity about which file type we are talking about. The README is mostly focused on end users.

Borewit · 2024-08-20T19:29:52Z

@AleksandrHovhannisyan, I propose keeping the single fixture as you got it from W3C and dropping the reverse-engineered fixtures that aim to match each signature case. Keep it simple.

AleksandrHovhannisyan · 2024-08-20T20:01:36Z

@Borewit Okay, happy to do that, although I will say I found the fixtures helpful when writing the code, as I wrote those first and ran tests as I made adjustments. Without those fixtures, there's no way to guarantee that the code works for edge cases.

Also, I think it would help both of us if I could get a single set of change requests for this PR. Happy to push whatever changes are needed to get it merged, but I do want to limit back and forth.

Borewit · 2024-08-28T09:49:34Z

fixture/fixture-vtt-eof.vtt

There is no leading null character, and a null character is not the same as an EOF.

AleksandrHovhannisyan · 2024-08-29T16:31:15Z

Hmm, not sure why tests are failing for Node 20+. Node 18 works fine locally for me.

What is the procedure for running those same CI tests locally? npm run test only runs on my local Node 18.

Borewit · 2024-08-29T17:29:30Z

What is the procedure for running those same CI tests locally?

You essentially need to install a different version of Node or use some fancy mechanism to switch Node versions.

Some tests are only performed on Node >= 20

AleksandrHovhannisyan · 2024-08-29T17:58:50Z

Okay that's fine, I have nvm/fnm so I'll just use one of those. Thanks

Borewit · 2024-09-03T19:33:38Z

core.js

+			// Try-catch to handle valid edge case of "WEBVTT<EOF>"
+			try {
+				// The WebVTT standard says that the first line should be "WEBVTT" followed by a single ASCII whitespace character
+				const whitespaceToken = await tokenizer.readToken(new Token.StringType(1, 'ascii'));


If you consume (read) from the stream, all following tests will fail, so you have to determine the file type.

The reason the following tests did not break is that this rare situation only occurs on files starting with WEBVTT, but that does not make it right.

The 7th character you are interested in has already been peeked, and the EOF has already been handled.
I know you lost the information about where the EOF is, but both \0 and EOF are acceptable, so there is no issue; simply testing for \0 is fine.

file-type/core.js

Line 174 in 988bf4b

await tokenizer.peekBuffer(this.buffer, {length: 12, mayBeLess: true});

It is this.buffer which is being tested by every signature test.

Keep it simple. Even without testing for the 7th character, the initial 6 characters are pretty reliable, more reliable than many other signature tests.

The reason the following tests did not break is that this rare situation only occurs on files starting with WEBVTT, but that does not make it right... Keep it simple. Even without testing for the 7th character, the initial 6 characters are pretty reliable, more reliable than many other signature tests.

What if someone creates a file that has WEBVTTx as the first seven bytes? The first six bytes aren't the full signature per the spec: "An optional UTF-8 BOM, the ASCII string "WEBVTT", and finally a space, tab, line break, or the end of the file." So I don't think it's rigorous enough to just check the first six bytes as it doesn't tell us anything about the seventh byte. With my current code, a file that starts with WEBVTT but is not followed by a whitespace character would return undefined, which I think is the expected behavior (because that file does not comply with the signature format for valid WebVTT files).

The 7th character you are interested in has already been peeked, and the EOF has already been handled.
I know you lost the information about where the EOF is, but both \0 and EOF are acceptable, so there is no issue; simply testing for \0 is fine.

Sorry, I must have misunderstood you before. I thought you didn't want me to test for \0 since it's not the same as an EOF.

What if someone creates a file that has WEBVTTx as the first seven bytes? The first six bytes aren't the full signature per the spec: "An optional UTF-8 BOM, the ASCII string "WEBVTT", and finally a space, tab, line break, or the end of the file." So I don't think it's rigorous enough to just check the first six bytes as it doesn't tell us anything about the seventh byte. With my current code, a file that starts with WEBVTT but is not followed by a whitespace character would return undefined, which I think is the expected behavior (because that file does not comply with the signature format for valid WebVTT files).

That is indeed the best solution, but you cannot do it by reading from the stream, as that irreversibly consumes the stream. If you do it by reading, you break the case of "The first six bytes aren't the full signature".

Sorry, I must have misunderstood you before. I thought you didn't want me to test for \0 since it's not the same as an EOF.

I am sorry, I thought \0 was defined as a formal separator. But it is not, and you are testing for \0 because that is how the EOF translates in file-type. That is why I complained about the EOF fixture file; I thought it had to have a \0 separator.

What I mean is, let's assume we have the following example:

WEBVTT78WEBP123XXXX

This should not be recognized as WebVTT, but it should be recognized as WEBP.
But it will no longer be recognized as WEBP, because the WEBP test will detect XXXX as the signature instead of WEBP, since you advanced the pointer by 7 positions. It will read the signature from position 7+8 instead of position 8.
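The offset shift in that example can be sketched as follows. `SketchTokenizer` is a hypothetical, heavily simplified class written only for this illustration; it is not the real tokenizer API, and it only models the difference between peeking and consuming.

```javascript
const bytes = new TextEncoder().encode('WEBVTT78WEBP123XXXX');

// Hypothetical minimal tokenizer: `peek` leaves the position alone,
// `ignore` advances it permanently.
class SketchTokenizer {
	constructor(data) {
		this.data = data;
		this.position = 0;
	}

	// Decode `length` bytes starting `offset` bytes after the current position.
	peek(offset, length) {
		const start = this.position + offset;
		return new TextDecoder().decode(this.data.subarray(start, start + length));
	}

	// Skip `length` bytes; there is no way back in this sketch.
	ignore(length) {
		this.position += length;
	}
}

const tokenizer = new SketchTokenizer(bytes);
const before = tokenizer.peek(8, 4); // Reads absolute positions 8..11
tokenizer.ignore(7); // Consume "WEBVTT7"
const after = tokenizer.peek(8, 4); // Now reads absolute positions 15..18

console.log(before, after); // WEBP XXXX
```

A test that expects its signature at offset 8 sees WEBP before the bytes are consumed and XXXX afterwards, which is why a check that consumes bytes must end detection rather than fall through to later tests.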

WEBVTT78WEBP123XXXX. This should not be recognized as WebVTT, but it should be recognized as WEBP.

Oh, I see. Does this mean the file-type parser ignores unrecognized byte sequences at the start of a file and continues searching until it eventually finds a valid signature? I thought magic numbers had to appear at the start of a file.

I am sorry, I thought \0 was defined as a formal separator. But it is not, and you are testing for \0 because that is how the EOF translates in file-type. That is why I complained about the EOF fixture file; I thought it had to have a \0 separator.

All good, I think I understand what you meant now.

Edit: So just to be clear, do you want me to simplify the code to just check WEBVTT, without checking the 7th byte, and remove the other fixtures?

Does this mean the file-type parser ignores unrecognized byte sequences at the start of a file and continues searching until it eventually finds a valid signature?

Exactly.

Edit: So just to be clear, do you want me to simplify the code to just check WEBVTT, without checking the 7th byte, and remove the other fixtures?

No I agree with you to include the 7th.

This commit was very close to perfection 😉 : 53be4b4

Rename the variable whitespace to char7.

Change \r\n into \r.

Using tokenizer.fileInfo.size === 6 is tricky, as it is possible to let file-type read from a stream with an unknown length. It is safer to use the 0, assuming that is the result of an "EOF".

With those changes, I think it is good to go.
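For reference, the approach agreed on here (match the WEBVTT prefix, then accept only a whitespace byte or the zero that stands in for EOF as the 7th byte) could be sketched roughly as below. This is not the merged code: `looksLikeWebVtt` and `padded` are hypothetical helpers, and the sketch assumes a zero-filled buffer of at least 7 bytes that has already been peeked rather than read.

```javascript
// Sketch: does a peeked, zero-padded buffer start with the WebVTT signature?
function looksLikeWebVtt(buffer) {
	const prefix = String.fromCharCode(...buffer.subarray(0, 6));
	if (prefix !== 'WEBVTT') {
		return false;
	}

	// The 7th byte must be LF, CR, tab, space, or the zero padding left by an EOF.
	const char7 = String.fromCharCode(buffer[6]);
	return ['\n', '\r', '\t', ' ', '\0'].includes(char7);
}

// Hypothetical helper: emulate a 12-byte peek into a zero-filled buffer.
function padded(text, length = 12) {
	const buffer = new Uint8Array(length);
	buffer.set(new TextEncoder().encode(text).subarray(0, length));
	return buffer;
}

console.log(looksLikeWebVtt(padded('WEBVTT\n00:11.000'))); // true
console.log(looksLikeWebVtt(padded('WEBVTT'))); // true
console.log(looksLikeWebVtt(padded('WEBVTTx'))); // false
```

Because nothing is consumed, a buffer that fails this check can still be handed to the signature tests that follow.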

Borewit

LGTM

AleksandrHovhannisyan added 2 commits August 14, 2024 17:00

Validate WebVTT files sindresorhus#657

45bea47

Replace tab with space in fixture

294c6be

AleksandrHovhannisyan commented Aug 14, 2024

AleksandrHovhannisyan marked this pull request as ready for review August 14, 2024 23:52

AleksandrHovhannisyan added 2 commits August 14, 2024 19:01

Improve comments for vtt checks

daa1b38

Refactor VTT-checking code to only check 'WEBVTT' prefix once sindresorhus#657

0e7b5d2

AleksandrHovhannisyan commented Aug 15, 2024

Combine separate whitespace checks for VTT files sindresorhus#657

d399775

AleksandrHovhannisyan requested a review from Borewit August 16, 2024 13:45

Borewit reviewed Aug 16, 2024

Borewit requested changes Aug 16, 2024

AleksandrHovhannisyan and others added 2 commits August 16, 2024 14:42

Simplify vtt fixtures sindresorhus#657

3ff2a7f

Replace VTT doc link with wikipedia link for consistency sindresorhus#657

43c7255

Borewit requested changes Aug 28, 2024

Correctly handle EOF detection for VTT sindresorhus#657

344c420

Attempt to fix Node 20+ errors: replace this.tokenizer -> tokenizer for VTT

53be4b4

Fix VTT EOF handling sindresorhus#657

546e086

AleksandrHovhannisyan requested a review from Borewit August 29, 2024 20:05

Borewit requested changes Sep 3, 2024

AleksandrHovhannisyan added 2 commits September 5, 2024 14:00

Revert "Fix VTT EOF handling sindresorhus#657"

3dd6554

This reverts commit 546e086.

Address change requests for vtt check sindresorhus#657

9816a18

AleksandrHovhannisyan requested a review from Borewit September 5, 2024 19:11

Borewit approved these changes Sep 7, 2024

Borewit added the enhancement Add new functionality label Sep 7, 2024

Borewit requested a review from sindresorhus September 7, 2024 10:45

sindresorhus merged commit 21ed763 into sindresorhus:main Sep 7, 2024
3 checks passed

AleksandrHovhannisyan deleted the vtt branch September 7, 2024 19:34

andrewgremlich mentioned this pull request Sep 29, 2024

[Snyk] Upgrade file-type from 19.4.1 to 19.5.0 andrewgremlich/socket-print#2

Open

KaliforniaShell mentioned this pull request Nov 7, 2024

[Snyk] Upgrade file-type from 19.4.1 to 19.6.0 KaliforniaShell/docs#21

Merged
