Buildout YouTubeResource functionality #333

Open
ivanistheone opened this issue Jan 9, 2019 · 2 comments

Comments

@ivanistheone

@kollivier Please check this extra functionality added by Alejandro around the YouTubeResource class and see if there aren't any parts we might want to incorporate into pressurecooker:
https://github.com/elaeon/sushi-chef-science-ahmed-al-hoot-ar/blob/master/sushichef.py#L162-L279

^ all seems very useful and reusable

ivanistheone commented Jul 13, 2020

The YouTubeResource class is currently limited in its ability to process playlists and channels/usernames. However, the functionality for videos has proven to be very useful (robust to all kinds of errors and with support for proxy servers). Recently PR#278 was opened, which provides additional caching functionality.

It is time to revisit the functionality in pressurecooker.youtube and implement a general-purpose scraper that all chef code can use.

Requirements

  • robust to all errors and exceptions
  • maintain proxy functionality (rotate to a new proxy server on network errors such as HTTP 429)
  • maintain backward compatibility of YouTubeResource for existing chefs (only used for videos)
  • add support for caching (using json files saved to filesystem)
  • add support for playlist > videos and channel > playlist > videos
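
The proxy rotation requirement could be sketched roughly as follows. This is only an illustration: `fetch_with_proxy_rotation`, `TransientNetworkError`, and the injected `fetch` callable are hypothetical names, not part of pressurecooker or YoutubeDL.

```python
import itertools


class TransientNetworkError(Exception):
    """Hypothetical stand-in for a network failure such as HTTP 429."""


def fetch_with_proxy_rotation(fetch, url, proxies, max_attempts=3):
    """Call fetch(url, proxy), moving to the next proxy after each network error.

    `fetch` is any callable supplied by the caller (an assumption of this sketch).
    """
    if not proxies:
        raise ValueError("at least one proxy is required")
    proxy_cycle = itertools.cycle(proxies)
    last_error = None
    for _ in range(max_attempts):
        proxy = next(proxy_cycle)
        try:
            return fetch(url, proxy)
        except TransientNetworkError as err:
            last_error = err  # remember the failure, then rotate to the next proxy
    raise RuntimeError(
        "all %d attempts failed for %s" % (max_attempts, url)
    ) from last_error
```

Injecting `fetch` keeps the retry logic testable without network access; in chef code it would wrap whatever actually performs the request.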

Design

  • Continue to do proxy selection and automatic proxy use based on ENV variables
  • Maintain data as close to "native" info dict format used by YoutubeDL (use for caching and allow
  • Allow users access to the raw info json
  • Add to_node functions to return data formatted for use in ricecooker
    • For videos to_node returns metadata suitable for VideoNode + YouTubeVideoFile (and optionally subtitles)
    • For playlists to_node returns metadata suitable for TopicNode containing VideoNode children
    • For channel/playlists to_node returns a two-level topic hierarchy with VideoNode leaf nodes
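
To make the to_node idea concrete, here is a minimal sketch that maps a YoutubeDL-style info dict to a plain metadata dict. The function names and the output keys (`source_id`, `children`, and so on) are assumptions chosen for illustration; the real output would need whatever fields VideoNode and TopicNode require (for example a license).

```python
def video_info_to_node(info):
    """Map a video info dict to a metadata dict (keys are illustrative)."""
    return {
        "source_id": info["id"],
        "title": info.get("title") or "",
        "description": info.get("description") or "",
        "thumbnail": info.get("thumbnail"),
    }


def playlist_info_to_node(info):
    """Map a playlist info dict to a topic-like dict with video children."""
    entries = info.get("entries") or []
    return {
        "source_id": info["id"],
        "title": info.get("title") or "",
        "description": info.get("description") or "",
        # Skip empty entries, e.g. videos that could not be extracted.
        "children": [video_info_to_node(entry) for entry in entries if entry],
    }
```

A channel-level function would nest one more level in the same way, with playlist dicts as children.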

Classes

YouTubeResource

  • maintain current interface for backward compatibility
  • implementation just calls the new YouTubeVideo
  • raise error if used with playlist or channel URL
  • get_resource_info returns json formatted for ricecooker = YouTubeVideo.to_node
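
One possible way to reject playlist and channel URLs is a simple check on the URL path. This is a rough sketch only: `classify_youtube_url` and `require_video_url` are hypothetical helpers, and the path prefixes are assumptions about common YouTube URL shapes rather than an exhaustive list.

```python
from urllib.parse import parse_qs, urlparse


def classify_youtube_url(url):
    """Return 'video', 'playlist', 'channel', or 'unknown' for a YouTube URL."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    if parsed.path.startswith("/playlist"):
        return "playlist"
    if parsed.path.startswith(("/channel/", "/user/", "/c/")):
        return "channel"
    if parsed.netloc.endswith("youtu.be") or "v" in query:
        return "video"
    return "unknown"


def require_video_url(url):
    """Raise ValueError unless the URL looks like a single-video URL."""
    kind = classify_youtube_url(url)
    if kind != "video":
        raise ValueError("expected a video URL, got %s URL: %s" % (kind, url))
    return url
```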

YouTubeBase

  • provide interface similar to underlying YoutubeDL class
  • handles auto proxy selection and rotation on network errors
  • provides robust error handling for all errors and exceptions
  • get_info method that returns same data as ydl.extract_info(url, download=False, process=True)
  • does not do any "packaging" for ricecooker (see subclasses)
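
The caching requirement (json files saved to the filesystem) could sit around get_info roughly like this. It is a sketch under assumptions: `cached_get_info` and the `extract` callable are hypothetical, the cache key is simply a hash of the URL, and the info dict is assumed to be JSON-serializable. In chef code `extract` might wrap `ydl.extract_info(url, download=False)`.

```python
import hashlib
import json
import os


def cached_get_info(url, extract, cache_dir):
    """Return extract(url), reusing a JSON file in cache_dir when one exists."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(path):
        # Cache hit: no request is made at all.
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    info = extract(url)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(info, f)
    return info
```

This sketch never expires entries; a real implementation would probably need some way to invalidate stale cache files.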

YouTubeVideo(YouTubeBase)

  • method to_ricecooker_node returns metadata suitable for VideoNode + YouTubeVideoFile
  • add get_subtitle_languages method (see here), including caching
  • implements download method (in case chef needs direct access to video files)
  • (future) to_studio_node returns metadata required to create Studio ContentNode

Example usage to download video_url and all available subs:

yt_vid = YouTubeVideo(url=video_url)
vid_metadata = yt_vid.to_ricecooker_node()
vid_node = VideoNode(**vid_metadata)
# language=None is a placeholder: how the video language is chosen is still open
vid_node.add_file(YouTubeVideoFile(youtube_id=vid_metadata['id'], language=None))
lang_codes = yt_vid.get_subtitle_languages()
for lang_code in lang_codes:
    vid_node.add_file(YouTubeSubtitleFile(youtube_id=vid_metadata['id'], language=lang_code))
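
For get_subtitle_languages, the language codes can be read from the `subtitles` (and optionally `automatic_captions`) keys of the info dict that YoutubeDL returns. A minimal sketch, with `subtitle_languages_from_info` as a hypothetical helper name:

```python
def subtitle_languages_from_info(info, include_auto=False):
    """Return sorted subtitle language codes found in a video info dict."""
    langs = set(info.get("subtitles") or {})
    if include_auto:
        # Auto-generated captions are listed separately from uploaded subtitles.
        langs |= set(info.get("automatic_captions") or {})
    return sorted(langs)
```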

YouTubePlaylist(YouTubeBase)

  • We don't want to call YouTubeBase directly on a playlist URL because this results in O(n) calls to YouTube (one per video) and leads to requests being blocked
  • Instead use a "lightweight" playlist downloader based on extract_flat and calls to YouTubeVideo
  • to_ricecooker_node method returns metadata suitable for TopicNode containing VideoNode children
  • (future) to_studio_node returns metadata required to create Studio ContentNode
  • (optional) download method that downloads all videos in playlist to a folder

Example usage to download playlist_url:

yt_pl = YouTubePlaylist(url=playlist_url)
pl_metadata = yt_pl.to_ricecooker_node(options={"extract_flat": True})
video_urls = pl_metadata.pop('children')
topic_node = TopicNode(**pl_metadata)
for video_url in video_urls:
    yt_vid = YouTubeVideo(url=video_url)
    vid_metadata = yt_vid.to_ricecooker_node()
    vid_node = VideoNode(**vid_metadata)
    # language=None is a placeholder: how the video language is chosen is still open
    vid_node.add_file(YouTubeVideoFile(youtube_id=vid_metadata['id'], language=None))
    topic_node.add_child(vid_node)
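
With extract_flat, the playlist info dict contains lightweight entries instead of fully extracted videos. The sketch below turns such entries into video URLs. It assumes each entry carries either a full URL or a bare video id under `url` or `id`; the exact shape depends on the YoutubeDL version, and `flat_entries_to_video_urls` is a hypothetical helper.

```python
def flat_entries_to_video_urls(playlist_info):
    """Return watch URLs for the entries of a flat playlist info dict."""
    urls = []
    for entry in playlist_info.get("entries") or []:
        if not entry:
            continue  # skip missing entries
        ref = entry.get("url") or entry.get("id")
        if not ref:
            continue
        if ref.startswith("http"):
            urls.append(ref)
        else:
            # Bare video id: build the standard watch URL.
            urls.append("https://www.youtube.com/watch?v=" + ref)
    return urls
```

Each resulting URL can then go through its own YouTubeVideo call, so every request can be assigned a fresh proxy.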

YouTubeChannel(YouTubeBase)

  • to_ricecooker_node returns info required to create the TopicNode for that channel
  • get_playlists_flat = list of URLs of all playlists for that YouTube channel (or username)
  • see example code, but note this code is problematic since it results in thousands of YouTube calls using the same proxy server. This is why we need to replace it with calls that use the "extract_flat":True option and separate calls to YouTubeBase/YoutubeDL, so that each request gets assigned a new proxy server.

Example usage to download the videos from all the playlists of the YouTube user KhanAcademyKiswahili:

channel_node = Channel(name="KA Swahili", source_id=...)  # other channel arguments omitted
yt_ch = YouTubeChannel(id="KhanAcademyKiswahili")
playlist_urls = yt_ch.get_playlists_flat()  # == get_info(options={"extract_flat": True})['entries']
for playlist_url in playlist_urls:
    yt_pl = YouTubePlaylist(url=playlist_url)
    pl_metadata = yt_pl.to_ricecooker_node(options={"extract_flat": True})
    video_urls = pl_metadata.pop('children')
    topic_node = TopicNode(**pl_metadata)
    for video_url in video_urls:
        yt_vid = YouTubeVideo(url=video_url)
        vid_metadata = yt_vid.to_ricecooker_node()
        vid_node = VideoNode(**vid_metadata)
        # language=None is a placeholder: how the video language is chosen is still open
        vid_node.add_file(YouTubeVideoFile(youtube_id=vid_metadata['id'], language=None))
        topic_node.add_child(vid_node)
    channel_node.add_child(topic_node)

@WenyuZhang1992 ^ note the usage examples above refer to code that doesn't exist yet. This is just my proposal for classes and methods that would handle all the use cases and would be easy to use in chef code. This is a kind of readme-driven-programming ;)

rtibbles transferred this issue from learningequality/pressurecooker on May 11, 2021