Question about Videos #2

Closed
valeriechen opened this issue May 24, 2018 · 5 comments

@valeriechen

How do you know whether these videos are specifically the ones in the 8M dataset? It seems like your script only queries the categories. However, there might be more videos in a category that are not part of the selected dataset.

Thanks!

@gsssrao
Owner

gsssrao commented May 25, 2018

@valeriechen The script downloads only the videos specified in the YouTube-8M dataset. I should have put some information in the README explaining how it works.

Basically, the repository works in the following way (I will link this issue from the README for reference):

If you go to this page, it displays all 3862 classes of the YouTube-8M dataset. By inspecting the HTML code, you can obtain the links to the JavaScript files in the Google database that correspond to each of these classes. I have stored the useful part of this in selectedcategories.txt.

Now, for the class Games, the list of the corresponding tf-record files can be accessed via the following link:
https://storage.googleapis.com/data.yt8m.org/2/j/v/03bt1gh.js
Here, 03bt1gh is the value stored for Games in selectedcategories.txt.
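
As a minimal sketch of the step above, the per-class link can be built from the category value. The function name `category_url` is hypothetical and is not taken from the repository; only the URL pattern and the Games value come from this comment.

```python
# Sketch only: builds the per-class link described above.
# `category_url` is a hypothetical helper name, not part of the repository.
CATEGORY_URL_TEMPLATE = "https://storage.googleapis.com/data.yt8m.org/2/j/v/{category_id}.js"


def category_url(category_id: str) -> str:
    """Return the link that lists the record ids for one class."""
    return CATEGORY_URL_TEMPLATE.format(category_id=category_id)


# 03bt1gh is the value for Games quoted in this thread.
print(category_url("03bt1gh"))
```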

Each field in the JSON array returned by the above link contains a 4-character <tf-record-id>, one for each video of the Games class. There should be a total of 788288 such ids for Games.
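
A rough sketch of pulling the ids out of such a response follows. It assumes the body contains exactly one JSON array of strings, possibly wrapped in some JavaScript, and that the ids are the 4-character entries. The exact response format is not shown in this thread, so the sample payload and the helper name `extract_record_ids` are hypothetical.

```python
import json


def extract_record_ids(body: str) -> list:
    """Return the 4-character strings found in the JSON array inside `body`.

    Assumption: the response holds a single JSON array, optionally wrapped in
    JavaScript, so the array is taken as the text from the first '[' to the
    last ']'.
    """
    start = body.index("[")
    end = body.rindex("]") + 1
    items = json.loads(body[start:end])
    return [item for item in items if isinstance(item, str) and len(item) == 4]


# Hypothetical payload, made up for illustration only.
sample = '["19Mn", "aB3d"]'
print(extract_record_ids(sample))
```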

Each of these ids can then be translated to the corresponding YouTube id by inserting the <tf-record-id> into a lookup link in a specific way (Reference). Say the record id is 19Mn; then you need to access the following link:
https://storage.googleapis.com/data.yt8m.org/2/j/i/19/19Mn.js
to get the actual YouTube id for that tf-record id (note: the first 2 characters of the id are used as the directory name, followed by / and 19Mn.js).
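
The path rule above can be sketched as follows. The helper name `lookup_url` is hypothetical, and the 4-character length check is an assumption based on the ids described in this comment.

```python
# Sketch only: `lookup_url` is a hypothetical helper, not repository code.
LOOKUP_BASE = "https://storage.googleapis.com/data.yt8m.org/2/j/i"


def lookup_url(record_id: str) -> str:
    """Build the lookup link for one tf-record id.

    The first two characters of the id form the directory, and the full id
    plus ".js" forms the file name.
    """
    if len(record_id) != 4:
        # Assumption: ids are always 4 characters, as stated in the thread.
        raise ValueError("expected a 4-character record id")
    return f"{LOOKUP_BASE}/{record_id[:2]}/{record_id}.js"


print(lookup_url("19Mn"))
```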

Once you have the video id (which is cmh9FnLbE5s in the above case), you just need to pass it to the youtube-dl command to download it. The actual YouTube link would be:
http://www.youtube.com/watch?v=cmh9FnLbE5s
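
For illustration, the last step could look roughly like this. The helper names are hypothetical, the command is only built and printed rather than executed, and the repository's own script may invoke youtube-dl with different options.

```python
# Sketch only: helper names are hypothetical and nothing is downloaded here.
def watch_url(video_id: str) -> str:
    """Return the YouTube watch link for a video id."""
    return f"http://www.youtube.com/watch?v={video_id}"


def download_command(video_id: str) -> list:
    """Return a plain youtube-dl invocation as an argument list."""
    return ["youtube-dl", watch_url(video_id)]


# To actually download, the list could be handed to subprocess.run,
# assuming youtube-dl is installed and on PATH.
print(" ".join(download_command("cmh9FnLbE5s")))
```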

Hope that this answers your question.

PS: Thanks for pointing this out. Because of it, I checked the YouTube-8M website, and it seems that they have updated the dataset to a newer version. So I took the chance to update the repo to support the newer version.

@kli-casia

super useful, thank you very much

@hellowodex

good job!

@anavc94

anavc94 commented Oct 19, 2018

@gsssrao just wanted to thank you for this repository! Thanks a lot!

@wsuen

wsuen commented Feb 28, 2019

Thanks for this! It's great!
