diff --git a/.gitignore b/.gitignore index 9d371d9978f..065a14f49be 100644 --- a/.gitignore +++ b/.gitignore @@ -60,3 +60,5 @@ venv/ # VS Code related files .vscode + +cookies.txt diff --git a/README.md b/README.md index 2a0cf3a48c3..2d8bd9b8524 100644 --- a/README.md +++ b/README.md @@ -26,6 +26,14 @@ youtube-dlc is a fork of youtube-dl with the intention of getting features teste - [Adobe Pass Options:](#adobe-pass-options) - [Post-processing Options:](#post-processing-options) - [Extractor Options:](#extractor-options) +- [CONFIGURATION](#configuration) + - [Authentication with `.netrc` file](#authentication-with-netrc-file) +- [OUTPUT TEMPLATE](#output-template) + - [Output template and Windows batch files](#output-template-and-windows-batch-files) + - [Output template examples](#output-template-examples) +- [FORMAT SELECTION](#format-selection) + - [Format selection examples](#format-selection-examples) +- [VIDEO SELECTION](#video-selection-1) # INSTALLATION @@ -474,3 +482,297 @@ Then simply type this ## Extractor Options: --ignore-dynamic-mpd Do not process dynamic DASH manifests +# CONFIGURATION + +You can configure youtube-dlc by placing any supported command line option to a configuration file. On Linux and macOS, the system wide configuration file is located at `/etc/youtube-dlc.conf` and the user wide configuration file at `~/.config/youtube-dlc/config`. On Windows, the user wide configuration file locations are `%APPDATA%\youtube-dlc\config.txt` or `C:\Users\<user name>\youtube-dlc.conf`. Note that by default configuration file may not exist so you may need to create it yourself. 
+ +For example, with the following configuration file youtube-dlc will always extract the audio, not copy the mtime, use a proxy and save all videos under `Movies` directory in your home directory: +``` +# Lines starting with # are comments + +# Always extract audio +-x + +# Do not copy the mtime +--no-mtime + +# Use this proxy +--proxy 127.0.0.1:3128 + +# Save all videos under Movies directory in your home directory +-o ~/Movies/%(title)s.%(ext)s +``` + +Note that options in configuration file are just the same options aka switches used in regular command line calls thus there **must be no whitespace** after `-` or `--`, e.g. `-o` or `--proxy` but not `- o` or `-- proxy`. + +You can use `--ignore-config` if you want to disable the configuration file for a particular youtube-dlc run. + +You can also use `--config-location` if you want to use custom configuration file for a particular youtube-dlc run. + +### Authentication with `.netrc` file + +You may also want to configure automatic credentials storage for extractors that support authentication (by providing login and password with `--username` and `--password`) in order not to pass credentials as command line arguments on every youtube-dlc execution and prevent tracking plain text passwords in the shell command history. You can achieve this using a [`.netrc` file](https://stackoverflow.com/tags/.netrc/info) on a per extractor basis. 
For that you will need to create a `.netrc` file in your `$HOME` and restrict permissions to read/write by only you: +``` +touch $HOME/.netrc +chmod a-rwx,u+rw $HOME/.netrc +``` +After that you can add credentials for an extractor in the following format, where *extractor* is the name of the extractor in lowercase: +``` +machine <extractor> login <login> password <password> +``` +For example: +``` +machine youtube login myaccount@gmail.com password my_youtube_password +machine twitch login my_twitch_account_name password my_twitch_password +``` +To activate authentication with the `.netrc` file you should pass `--netrc` to youtube-dlc or place it in the [configuration file](#configuration). + +On Windows you may also need to setup the `%HOME%` environment variable manually. For example: +``` +set HOME=%USERPROFILE% +``` + +# OUTPUT TEMPLATE + +The `-o` option allows users to indicate a template for the output file names. + +**tl;dr:** [navigate me to examples](#output-template-examples). + +The basic usage is not to set any template arguments when downloading a single file, like in `youtube-dlc -o funny_video.flv "https://some/video"`. However, it may contain special sequences that will be replaced when downloading each video. The special sequences may be formatted according to [python string formatting operations](https://docs.python.org/2/library/stdtypes.html#string-formatting). For example, `%(NAME)s` or `%(NAME)05d`. To clarify, that is a percent symbol followed by a name in parentheses, followed by formatting operations. 
Allowed names along with sequence type are: + + - `id` (string): Video identifier + - `title` (string): Video title + - `url` (string): Video URL + - `ext` (string): Video filename extension + - `alt_title` (string): A secondary title of the video + - `display_id` (string): An alternative identifier for the video + - `uploader` (string): Full name of the video uploader + - `license` (string): License name the video is licensed under + - `creator` (string): The creator of the video + - `release_date` (string): The date (YYYYMMDD) when the video was released + - `timestamp` (numeric): UNIX timestamp of the moment the video became available + - `upload_date` (string): Video upload date (YYYYMMDD) + - `uploader_id` (string): Nickname or id of the video uploader + - `channel` (string): Full name of the channel the video is uploaded on + - `channel_id` (string): Id of the channel + - `location` (string): Physical location where the video was filmed + - `duration` (numeric): Length of the video in seconds + - `view_count` (numeric): How many users have watched the video on the platform + - `like_count` (numeric): Number of positive ratings of the video + - `dislike_count` (numeric): Number of negative ratings of the video + - `repost_count` (numeric): Number of reposts of the video + - `average_rating` (numeric): Average rating give by users, the scale used depends on the webpage + - `comment_count` (numeric): Number of comments on the video + - `age_limit` (numeric): Age restriction for the video (years) + - `is_live` (boolean): Whether this video is a live stream or a fixed-length video + - `start_time` (numeric): Time in seconds where the reproduction should start, as specified in the URL + - `end_time` (numeric): Time in seconds where the reproduction should end, as specified in the URL + - `format` (string): A human-readable description of the format + - `format_id` (string): Format code specified by `--format` + - `format_note` (string): Additional info about the 
format + - `width` (numeric): Width of the video + - `height` (numeric): Height of the video + - `resolution` (string): Textual description of width and height + - `tbr` (numeric): Average bitrate of audio and video in KBit/s + - `abr` (numeric): Average audio bitrate in KBit/s + - `acodec` (string): Name of the audio codec in use + - `asr` (numeric): Audio sampling rate in Hertz + - `vbr` (numeric): Average video bitrate in KBit/s + - `fps` (numeric): Frame rate + - `vcodec` (string): Name of the video codec in use + - `container` (string): Name of the container format + - `filesize` (numeric): The number of bytes, if known in advance + - `filesize_approx` (numeric): An estimate for the number of bytes + - `protocol` (string): The protocol that will be used for the actual download + - `extractor` (string): Name of the extractor + - `extractor_key` (string): Key name of the extractor + - `epoch` (numeric): Unix epoch when creating the file + - `autonumber` (numeric): Number that will be increased with each download, starting at `--autonumber-start` + - `playlist` (string): Name or id of the playlist that contains the video + - `playlist_index` (numeric): Index of the video in the playlist padded with leading zeros according to the total length of the playlist + - `playlist_id` (string): Playlist identifier + - `playlist_title` (string): Playlist title + - `playlist_uploader` (string): Full name of the playlist uploader + - `playlist_uploader_id` (string): Nickname or id of the playlist uploader + +Available for the video that belongs to some logical chapter or section: + + - `chapter` (string): Name or title of the chapter the video belongs to + - `chapter_number` (numeric): Number of the chapter the video belongs to + - `chapter_id` (string): Id of the chapter the video belongs to + +Available for the video that is an episode of some series or programme: + + - `series` (string): Title of the series or programme the video episode belongs to + - `season` (string): 
Title of the season the video episode belongs to + - `season_number` (numeric): Number of the season the video episode belongs to + - `season_id` (string): Id of the season the video episode belongs to + - `episode` (string): Title of the video episode + - `episode_number` (numeric): Number of the video episode within a season + - `episode_id` (string): Id of the video episode + +Available for the media that is a track or a part of a music album: + + - `track` (string): Title of the track + - `track_number` (numeric): Number of the track within an album or a disc + - `track_id` (string): Id of the track + - `artist` (string): Artist(s) of the track + - `genre` (string): Genre(s) of the track + - `album` (string): Title of the album the track belongs to + - `album_type` (string): Type of the album + - `album_artist` (string): List of all artists appeared on the album + - `disc_number` (numeric): Number of the disc or other physical medium the track belongs to + - `release_year` (numeric): Year (YYYY) when the album was released + +Each aforementioned sequence when referenced in an output template will be replaced by the actual value corresponding to the sequence name. Note that some of the sequences are not guaranteed to be present since they depend on the metadata obtained by a particular extractor. Such sequences will be replaced with `NA`. + +For example for `-o %(title)s-%(id)s.%(ext)s` and an mp4 video with title `youtube-dlc test video` and id `BaW_jenozKcj`, this will result in a `youtube-dlc test video-BaW_jenozKcj.mp4` file created in the current directory. + +For numeric sequences you can use numeric related formatting, for example, `%(view_count)05d` will result in a string with view count padded with zeros up to 5 characters, like in `00042`. + +Output templates can also contain arbitrary hierarchical path, e.g. 
`-o '%(playlist)s/%(playlist_index)s - %(title)s.%(ext)s'` which will result in downloading each video in a directory corresponding to this path template. Any missing directory will be automatically created for you. + +To use percent literals in an output template use `%%`. To output to stdout use `-o -`. + +The current default template is `%(title)s-%(id)s.%(ext)s`. + +In some cases, you don't want special characters such as 中, spaces, or &, such as when transferring the downloaded filename to a Windows system or the filename through an 8bit-unsafe channel. In these cases, add the `--restrict-filenames` flag to get a shorter title: + +#### Output template and Windows batch files + +If you are using an output template inside a Windows batch file then you must escape plain percent characters (`%`) by doubling, so that `-o "%(title)s-%(id)s.%(ext)s"` should become `-o "%%(title)s-%%(id)s.%%(ext)s"`. However you should not touch `%`'s that are not plain characters, e.g. environment variables for expansion should stay intact: `-o "C:\%HOMEPATH%\Desktop\%%(title)s.%%(ext)s"`. + +#### Output template examples + +Note that on Windows you may need to use double quotes instead of single. 
+ +```bash +$ youtube-dlc --get-filename -o '%(title)s.%(ext)s' BaW_jenozKc +youtube-dlc test video ''_ä↭𝕐.mp4 # All kinds of weird characters + +$ youtube-dlc --get-filename -o '%(title)s.%(ext)s' BaW_jenozKc --restrict-filenames +youtube-dlc_test_video_.mp4 # A simple file name + +# Download YouTube playlist videos in separate directory indexed by video order in a playlist +$ youtube-dlc -o '%(playlist)s/%(playlist_index)s - %(title)s.%(ext)s' https://www.youtube.com/playlist?list=PLwiyx1dc3P2JR9N8gQaQN_BCvlSlap7re + +# Download all playlists of YouTube channel/user keeping each playlist in separate directory: +$ youtube-dlc -o '%(uploader)s/%(playlist)s/%(playlist_index)s - %(title)s.%(ext)s' https://www.youtube.com/user/TheLinuxFoundation/playlists + +# Download Udemy course keeping each chapter in separate directory under MyVideos directory in your home +$ youtube-dlc -u user -p password -o '~/MyVideos/%(playlist)s/%(chapter_number)s - %(chapter)s/%(title)s.%(ext)s' https://www.udemy.com/java-tutorial/ + +# Download entire series season keeping each series and each season in separate directory under C:/MyVideos +$ youtube-dlc -o "C:/MyVideos/%(series)s/%(season_number)s - %(season)s/%(episode_number)s - %(episode)s.%(ext)s" https://videomore.ru/kino_v_detalayah/5_sezon/367617 + +# Stream the video being downloaded to stdout +$ youtube-dlc -o - BaW_jenozKc +``` + +# FORMAT SELECTION + +By default youtube-dlc tries to download the best available quality, i.e. if you want the best quality you **don't need** to pass any special options, youtube-dlc will guess it for you by **default**. + +But sometimes you may want to download in a different format, for example when you are on a slow or intermittent connection. The key mechanism for achieving this is so-called *format selection* based on which you can explicitly specify desired format, select formats based on some criterion or criteria, setup precedence and much more. 
+ +The general syntax for format selection is `--format FORMAT` or shorter `-f FORMAT` where `FORMAT` is a *selector expression*, i.e. an expression that describes format or formats you would like to download. + +**tl;dr:** [navigate me to examples](#format-selection-examples). + +The simplest case is requesting a specific format, for example with `-f 22` you can download the format with format code equal to 22. You can get the list of available format codes for particular video using `--list-formats` or `-F`. Note that these format codes are extractor specific. + +You can also use a file extension (currently `3gp`, `aac`, `flv`, `m4a`, `mp3`, `mp4`, `ogg`, `wav`, `webm` are supported) to download the best quality format of a particular file extension served as a single file, e.g. `-f webm` will download the best quality format with the `webm` extension served as a single file. + +You can also use special names to select particular edge case formats: + + - `best`: Select the best quality format represented by a single file with video and audio. + - `worst`: Select the worst quality format represented by a single file with video and audio. + - `bestvideo`: Select the best quality video-only format (e.g. DASH video). May not be available. + - `worstvideo`: Select the worst quality video-only format. May not be available. + - `bestaudio`: Select the best quality audio only-format. May not be available. + - `worstaudio`: Select the worst quality audio only-format. May not be available. + +For example, to download the worst quality video-only format you can use `-f worstvideo`. + +If you want to download multiple videos and they don't have the same formats available, you can specify the order of preference using slashes. Note that slash is left-associative, i.e. 
formats on the left hand side are preferred, for example `-f 22/17/18` will download format 22 if it's available, otherwise it will download format 17 if it's available, otherwise it will download format 18 if it's available, otherwise it will complain that no suitable formats are available for download. + +If you want to download several formats of the same video use a comma as a separator, e.g. `-f 22,17,18` will download all these three formats, of course if they are available. Or a more sophisticated example combined with the precedence feature: `-f 136/137/mp4/bestvideo,140/m4a/bestaudio`. + +You can also filter the video formats by putting a condition in brackets, as in `-f "best[height=720]"` (or `-f "[filesize>10M]"`). + +The following numeric meta fields can be used with comparisons `<`, `<=`, `>`, `>=`, `=` (equals), `!=` (not equals): + + - `filesize`: The number of bytes, if known in advance + - `width`: Width of the video, if known + - `height`: Height of the video, if known + - `tbr`: Average bitrate of audio and video in KBit/s + - `abr`: Average audio bitrate in KBit/s + - `vbr`: Average video bitrate in KBit/s + - `asr`: Audio sampling rate in Hertz + - `fps`: Frame rate + +Also filtering work for comparisons `=` (equals), `^=` (starts with), `$=` (ends with), `*=` (contains) and following string meta fields: + + - `ext`: File extension + - `acodec`: Name of the audio codec in use + - `vcodec`: Name of the video codec in use + - `container`: Name of the container format + - `protocol`: The protocol that will be used for the actual download, lower-case (`http`, `https`, `rtsp`, `rtmp`, `rtmpe`, `mms`, `f4m`, `ism`, `http_dash_segments`, `m3u8`, or `m3u8_native`) + - `format_id`: A short description of the format + +Any string comparison may be prefixed with negation `!` in order to produce an opposite comparison, e.g. `!*=` (does not contain). 
+ +Note that none of the aforementioned meta fields are guaranteed to be present since this solely depends on the metadata obtained by particular extractor, i.e. the metadata offered by the video hoster. + +Formats for which the value is not known are excluded unless you put a question mark (`?`) after the operator. You can combine format filters, so `-f "[height <=? 720][tbr>500]"` selects up to 720p videos (or videos where the height is not known) with a bitrate of at least 500 KBit/s. + +You can merge the video and audio of two formats into a single file using `-f <video-format>+<audio-format>` (requires ffmpeg or avconv installed), for example `-f bestvideo+bestaudio` will download the best video-only format, the best audio-only format and mux them together with ffmpeg/avconv. + +Format selectors can also be grouped using parentheses, for example if you want to download the best mp4 and webm formats with a height lower than 480 you can use `-f '(mp4,webm)[height<480]'`. + +Since the end of April 2015 and version 2015.04.26, youtube-dlc uses `-f bestvideo+bestaudio/best` as the default format selection (see [#5447](https://github.com/ytdl-org/youtube-dl/issues/5447), [#5456](https://github.com/ytdl-org/youtube-dl/issues/5456)). If ffmpeg or avconv are installed this results in downloading `bestvideo` and `bestaudio` separately and muxing them together into a single file giving the best overall quality available. Otherwise it falls back to `best` and results in downloading the best available quality served as a single file. `best` is also needed for videos that don't come from YouTube because they don't provide the audio and video in two different files. If you want to only download some DASH formats (for example if you are not interested in getting videos with a resolution higher than 1080p), you can add `-f bestvideo[height<=?1080]+bestaudio/best` to your configuration file. Note that if you use youtube-dlc to stream to `stdout` (and most likely to pipe it to your media player then), i.e. 
you explicitly specify output template as `-o -`, youtube-dlc still uses `-f best` format selection in order to start content delivery immediately to your player and not to wait until `bestvideo` and `bestaudio` are downloaded and muxed. + +If you want to preserve the old format selection behavior (prior to youtube-dlc 2015.04.26), i.e. you want to download the best available quality media served as a single file, you should explicitly specify your choice with `-f best`. You may want to add it to the [configuration file](#configuration) in order not to type it every time you run youtube-dlc. + +#### Format selection examples + +Note that on Windows you may need to use double quotes instead of single. + +```bash +# Download best mp4 format available or any other best if no mp4 available +$ youtube-dlc -f 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best' + +# Download best format available but no better than 480p +$ youtube-dlc -f 'bestvideo[height<=480]+bestaudio/best[height<=480]' + +# Download best video only format but no bigger than 50 MB +$ youtube-dlc -f 'best[filesize<50M]' + +# Download best format available via direct link over HTTP/HTTPS protocol +$ youtube-dlc -f '(bestvideo+bestaudio/best)[protocol^=http]' + +# Download the best video format and the best audio format without merging them +$ youtube-dlc -f 'bestvideo,bestaudio' -o '%(title)s.f%(format_id)s.%(ext)s' +``` +Note that in the last example, an output template is recommended as bestvideo and bestaudio may have the same file name. + + +# VIDEO SELECTION + +Videos can be filtered by their upload date using the options `--date`, `--datebefore` or `--dateafter`. They accept dates in two formats: + + - Absolute dates: Dates in the format `YYYYMMDD`. 
+ - Relative dates: Dates in the format `(now|today)[+-][0-9](day|week|month|year)(s)?` + +Examples: + +```bash +# Download only the videos uploaded in the last 6 months +$ youtube-dlc --dateafter now-6months + +# Download only the videos uploaded on January 1, 1970 +$ youtube-dlc --date 19700101 + +$ # Download only the videos uploaded in the 200x decade +$ youtube-dlc --dateafter 20000101 --datebefore 20091231 +``` \ No newline at end of file diff --git a/youtube_dlc/extractor/bandcamp.py b/youtube_dlc/extractor/bandcamp.py index 2022e69f896..9dbafe86dd6 100644 --- a/youtube_dlc/extractor/bandcamp.py +++ b/youtube_dlc/extractor/bandcamp.py @@ -99,7 +99,6 @@ def _real_extract(self, url): webpage, 'track info', default='{}') track_info = self._parse_json(trackinfo_block, title) - if track_info: file_ = track_info.get('file') if isinstance(file_, dict): @@ -115,7 +114,7 @@ def _real_extract(self, url): 'acodec': ext, 'abr': int_or_none(abr_str), }) - track = track_info.get('title') + track_id = str_or_none(track_info.get('track_id') or track_info.get('id')) track_number = int_or_none(track_info.get('track_num')) duration = float_or_none(track_info.get('duration')) @@ -126,6 +125,7 @@ def extract(key): webpage, key, default=None, group='value') return data.replace(r'\"', '"').replace('\\\\', '\\') if data else data + track = extract('title') artist = extract('artist') album = extract('album_title') timestamp = unified_timestamp( diff --git a/youtube_dlc/extractor/bet.py b/youtube_dlc/extractor/bet.py index d7ceaa85e45..2c714423503 100644 --- a/youtube_dlc/extractor/bet.py +++ b/youtube_dlc/extractor/bet.py @@ -3,6 +3,8 @@ from .mtv import MTVServicesInfoExtractor from ..utils import unified_strdate +# TODO Remove - Reason: Outdated Site + class BetIE(MTVServicesInfoExtractor): _VALID_URL = r'https?://(?:www\.)?bet\.com/(?:[^/]+/)+(?P<id>.+?)\.html' diff --git a/youtube_dlc/extractor/cmt.py b/youtube_dlc/extractor/cmt.py index e701fbeab82..a4ddb91609f 100644 --- 
a/youtube_dlc/extractor/cmt.py +++ b/youtube_dlc/extractor/cmt.py @@ -2,6 +2,8 @@ from .mtv import MTVIE +# TODO Remove - Reason: Outdated Site + class CMTIE(MTVIE): IE_NAME = 'cmt.com' @@ -39,7 +41,7 @@ class CMTIE(MTVIE): 'only_matching': True, }] - def _extract_mgid(self, webpage): + def _extract_mgid(self, webpage, url): mgid = self._search_regex( r'MTVN\.VIDEO\.contentUri\s*=\s*([\'"])(?P<mgid>.+?)\1', webpage, 'mgid', group='mgid', default=None) @@ -50,5 +52,5 @@ def _extract_mgid(self, webpage): def _real_extract(self, url): video_id = self._match_id(url) webpage = self._download_webpage(url, video_id) - mgid = self._extract_mgid(webpage) + mgid = self._extract_mgid(webpage, url) return self.url_result('http://media.mtvnservices.com/embed/%s' % mgid) diff --git a/youtube_dlc/extractor/comedycentral.py b/youtube_dlc/extractor/comedycentral.py index d08b909a68e..f54c4adeb9f 100644 --- a/youtube_dlc/extractor/comedycentral.py +++ b/youtube_dlc/extractor/comedycentral.py @@ -48,7 +48,7 @@ class ComedyCentralFullEpisodesIE(MTVServicesInfoExtractor): def _real_extract(self, url): playlist_id = self._match_id(url) webpage = self._download_webpage(url, playlist_id) - mgid = self._extract_triforce_mgid(webpage, data_zone='t2_lc_promo1') + mgid = self._extract_mgid(webpage, url, data_zone='t2_lc_promo1') videos_info = self._get_videos_info(mgid) return videos_info diff --git a/youtube_dlc/extractor/expressen.py b/youtube_dlc/extractor/expressen.py index f79365038d9..dc8b855d233 100644 --- a/youtube_dlc/extractor/expressen.py +++ b/youtube_dlc/extractor/expressen.py @@ -15,7 +15,7 @@ class ExpressenIE(InfoExtractor): _VALID_URL = r'''(?x) https?:// - (?:www\.)?expressen\.se/ + (?:www\.)?(?:expressen|di)\.se/ (?:(?:tvspelare/video|videoplayer/embed)/)? 
tv/(?:[^/]+/)* (?P<id>[^/?#&]+) @@ -42,13 +42,16 @@ class ExpressenIE(InfoExtractor): }, { 'url': 'https://www.expressen.se/videoplayer/embed/tv/ditv/ekonomistudion/experterna-har-ar-fragorna-som-avgor-valet/?embed=true&external=true&autoplay=true&startVolume=0&partnerId=di', 'only_matching': True, + }, { + 'url': 'https://www.di.se/videoplayer/embed/tv/ditv/borsmorgon/implantica-rusar-70--under-borspremiaren-hor-styrelsemedlemmen/?embed=true&external=true&autoplay=true&startVolume=0&partnerId=di', + 'only_matching': True, }] @staticmethod def _extract_urls(webpage): return [ mobj.group('url') for mobj in re.finditer( - r'<iframe[^>]+\bsrc=(["\'])(?P<url>(?:https?:)?//(?:www\.)?expressen\.se/(?:tvspelare/video|videoplayer/embed)/tv/.+?)\1', + r'<iframe[^>]+\bsrc=(["\'])(?P<url>(?:https?:)?//(?:www\.)?(?:expressen|di)\.se/(?:tvspelare/video|videoplayer/embed)/tv/.+?)\1', webpage)] def _real_extract(self, url): diff --git a/youtube_dlc/extractor/iprima.py b/youtube_dlc/extractor/iprima.py index 53a550c11e4..648ae6741f1 100644 --- a/youtube_dlc/extractor/iprima.py +++ b/youtube_dlc/extractor/iprima.py @@ -86,7 +86,8 @@ def _real_extract(self, url): (r'<iframe[^>]+\bsrc=["\'](?:https?:)?//(?:api\.play-backend\.iprima\.cz/prehravac/embedded|prima\.iprima\.cz/[^/]+/[^/]+)\?.*?\bid=(p\d+)', r'data-product="([^"]+)">', r'id=["\']player-(p\d+)"', - r'playerId\s*:\s*["\']player-(p\d+)'), + r'playerId\s*:\s*["\']player-(p\d+)', + r'\bvideos\s*=\s*["\'](p\d+)'), webpage, 'real id') playerpage = self._download_webpage( diff --git a/youtube_dlc/extractor/mtv.py b/youtube_dlc/extractor/mtv.py index fedd5f46bba..e545a9ef3bd 100644 --- a/youtube_dlc/extractor/mtv.py +++ b/youtube_dlc/extractor/mtv.py @@ -7,6 +7,7 @@ from ..compat import ( compat_str, compat_xpath, + compat_urlparse, ) from ..utils import ( ExtractorError, @@ -22,6 +23,7 @@ unescapeHTML, update_url_query, url_basename, + get_domain, xpath_text, ) @@ -253,7 +255,42 @@ def _extract_triforce_mgid(self, webpage, 
data_zone=None, video_id=None): return 
try_get(feed, lambda x: x['result']['data']['id'], compat_str) - def _extract_mgid(self, webpage): + def _extract_new_triforce_mgid(self, webpage, url='', video_id=None): + # print(compat_urlparse.urlparse(url).netloc) + if url == '': + return + domain = get_domain(url) + if domain is None: + raise ExtractorError( + '[%s] could not get domain' % self.IE_NAME, + expected=True) + url = url.replace("https://", "http://") + enc_url = compat_urlparse.quote(url, safe='') + _TRIFORCE_V8_TEMPLATE = 'https://%s/feeds/triforce/manifest/v8?url=%s' + triforce_manifest_url = _TRIFORCE_V8_TEMPLATE % (domain, enc_url) + + manifest = self._download_json(triforce_manifest_url, video_id, fatal=False) + if manifest: + if manifest.get('manifest').get('type') == 'redirect': + self.to_screen('Found a redirect. Downloading manifest from new location') + new_loc = manifest.get('manifest').get('newLocation') + new_loc = new_loc.replace("https://", "http://") + enc_new_loc = compat_urlparse.quote(new_loc, safe='') + triforce_manifest_new_loc = _TRIFORCE_V8_TEMPLATE % (domain, enc_new_loc) + manifest = self._download_json(triforce_manifest_new_loc, video_id, fatal=False) + + item_id = try_get(manifest, lambda x: x['manifest']['reporting']['itemId'], compat_str) + if not item_id: + self.to_screen('Found no id!') + return + + # 'episode' can be anything. 
'content' is used often as well + _MGID_TEMPLATE = 'mgid:arc:episode:%s:%s' + mgid = _MGID_TEMPLATE % (domain, item_id) + + return mgid + + def _extract_mgid(self, webpage, url, data_zone=None): try: # the url can be http://media.mtvnservices.com/fb/{mgid}.swf # or http://media.mtvnservices.com/{mgid} @@ -276,14 +313,17 @@ def _extract_mgid(self, webpage): r'embed/(mgid:.+?)["\'&?/]', sm4_embed, 'mgid', default=None) if not mgid: - mgid = self._extract_triforce_mgid(webpage) + mgid = self._extract_new_triforce_mgid(webpage, url) + + if not mgid: + mgid = self._extract_triforce_mgid(webpage, data_zone) return mgid def _real_extract(self, url): title = url_basename(url) webpage = self._download_webpage(url, title) - mgid = self._extract_mgid(webpage) + mgid = self._extract_mgid(webpage, url) videos_info = self._get_videos_info(mgid) return videos_info diff --git a/youtube_dlc/extractor/nick.py b/youtube_dlc/extractor/nick.py index 2e8b302ac85..04b98f7bde5 100644 --- a/youtube_dlc/extractor/nick.py +++ b/youtube_dlc/extractor/nick.py @@ -245,5 +245,5 @@ class NickRuIE(MTVServicesInfoExtractor): def _real_extract(self, url): video_id = self._match_id(url) webpage = self._download_webpage(url, video_id) - mgid = self._extract_mgid(webpage) + mgid = self._extract_mgid(webpage, url) return self.url_result('http://media.mtvnservices.com/embed/%s' % mgid) diff --git a/youtube_dlc/extractor/spike.py b/youtube_dlc/extractor/spike.py index aabff7a3ce7..3cee331f6a7 100644 --- a/youtube_dlc/extractor/spike.py +++ b/youtube_dlc/extractor/spike.py @@ -20,8 +20,18 @@ class BellatorIE(MTVServicesInfoExtractor): _FEED_URL = 'http://www.bellator.com/feeds/mrss/' _GEO_COUNTRIES = ['US'] - def _extract_mgid(self, webpage): - return self._extract_triforce_mgid(webpage) + def _extract_mgid(self, webpage, url): + mgid = None + + if not mgid: + mgid = self._extract_triforce_mgid(webpage) + + if not mgid: + mgid = self._extract_new_triforce_mgid(webpage, url) + + return mgid + +# TODO Remove 
- Reason: Outdated Site class ParamountNetworkIE(MTVServicesInfoExtractor): @@ -43,7 +53,7 @@ class ParamountNetworkIE(MTVServicesInfoExtractor): _FEED_URL = 'http://www.paramountnetwork.com/feeds/mrss/' _GEO_COUNTRIES = ['US'] - def _extract_mgid(self, webpage): + def _extract_mgid(self, webpage, url): root_data = self._parse_json(self._search_regex( r'window\.__DATA__\s*=\s*({.+})', webpage, 'data'), None) diff --git a/youtube_dlc/extractor/vh1.py b/youtube_dlc/extractor/vh1.py index dff94a2b845..ea576dc6ba6 100644 --- a/youtube_dlc/extractor/vh1.py +++ b/youtube_dlc/extractor/vh1.py @@ -3,6 +3,8 @@ from .mtv import MTVServicesInfoExtractor +# TODO Remove - Reason: Outdated Site + class VH1IE(MTVServicesInfoExtractor): IE_NAME = 'vh1.com' diff --git a/youtube_dlc/extractor/youtube.py b/youtube_dlc/extractor/youtube.py index d781c35b5f6..fbfc11563c2 100644 --- a/youtube_dlc/extractor/youtube.py +++ b/youtube_dlc/extractor/youtube.py @@ -39,6 +39,7 @@ mimetype2ext, orderedSet, parse_codecs, + parse_count, parse_duration, remove_quotes, remove_start, @@ -1861,31 +1862,65 @@ def extract_player_response(player_response, video_id): embed_webpage = None if (self._og_search_property('restrictions:age', video_webpage, default=None) == '18+' or re.search(r'player-age-gate-content">', video_webpage) is not None): + cookie_keys = self._get_cookies('https://www.youtube.com').keys() age_gate = True # We simulate the access to the video from www.youtube.com/v/{video_id} # this can be viewed without login into Youtube url = proto + '://www.youtube.com/embed/%s' % video_id embed_webpage = self._download_webpage(url, video_id, 'Downloading embed webpage') - data = compat_urllib_parse_urlencode({ - 'video_id': video_id, - 'eurl': 'https://youtube.googleapis.com/v/' + video_id, - 'sts': self._search_regex( - r'"sts"\s*:\s*(\d+)', embed_webpage, 'sts', default=''), - }) - video_info_url = proto + '://www.youtube.com/get_video_info?' 
+ data - try: - video_info_webpage = self._download_webpage( - video_info_url, video_id, - note='Refetching age-gated info webpage', - errnote='unable to download video info webpage') - except ExtractorError: - video_info_webpage = None - if video_info_webpage: - video_info = compat_parse_qs(video_info_webpage) - pl_response = video_info.get('player_response', [None])[0] - player_response = extract_player_response(pl_response, video_id) - add_dash_mpd(video_info) - view_count = extract_view_count(video_info) + # check if video is only playable on youtube - if so it requires auth (cookies) + if re.search(r'player-unavailable">', embed_webpage) is not None: + ''' + # TODO apply this patch when Support for Python 2.6(!) and above drops + if ({'VISITOR_INFO1_LIVE', 'HSID', 'SSID', 'SID'} <= cookie_keys + or {'VISITOR_INFO1_LIVE', '__Secure-3PSID', 'LOGIN_INFO'} <= cookie_keys): + ''' + if (set(('VISITOR_INFO1_LIVE', 'HSID', 'SSID', 'SID')) <= set(cookie_keys) + or set(('VISITOR_INFO1_LIVE', '__Secure-3PSID', 'LOGIN_INFO')) <= set(cookie_keys)): + age_gate = False + # Try looking directly into the video webpage + ytplayer_config = self._get_ytplayer_config(video_id, video_webpage) + if ytplayer_config: + args = ytplayer_config['args'] + if args.get('url_encoded_fmt_stream_map') or args.get('hlsvp'): + # Convert to the same format returned by compat_parse_qs + video_info = dict((k, [v]) for k, v in args.items()) + add_dash_mpd(video_info) + # Rental video is not rented but preview is available (e.g. 
+ # https://www.youtube.com/watch?v=yYr8q0y5Jfg, + # https://github.com/ytdl-org/youtube-dl/issues/10532) + if not video_info and args.get('ypc_vid'): + return self.url_result( + args['ypc_vid'], YoutubeIE.ie_key(), video_id=args['ypc_vid']) + if args.get('livestream') == '1' or args.get('live_playback') == 1: + is_live = True + if not player_response: + player_response = extract_player_response(args.get('player_response'), video_id) + if not video_info or self._downloader.params.get('youtube_include_dash_manifest', True): + add_dash_mpd_pr(player_response) + else: + raise ExtractorError('Video is age restricted and only playable on Youtube. Requires cookies!', expected=True) + else: + data = compat_urllib_parse_urlencode({ + 'video_id': video_id, + 'eurl': 'https://youtube.googleapis.com/v/' + video_id, + 'sts': self._search_regex( + r'"sts"\s*:\s*(\d+)', embed_webpage, 'sts', default=''), + }) + video_info_url = proto + '://www.youtube.com/get_video_info?' + data + try: + video_info_webpage = self._download_webpage( + video_info_url, video_id, + note='Refetching age-gated info webpage', + errnote='unable to download video info webpage') + except ExtractorError: + video_info_webpage = None + if video_info_webpage: + video_info = compat_parse_qs(video_info_webpage) + pl_response = video_info.get('player_response', [None])[0] + player_response = extract_player_response(pl_response, video_id) + add_dash_mpd(video_info) + view_count = extract_view_count(video_info) else: age_gate = False # Try looking directly into the video webpage @@ -2421,7 +2456,7 @@ def extract_meta(field): def _extract_count(count_name): return str_to_int(self._search_regex( - r'-%s-button[^>]+><span[^>]+class="yt-uix-button-content"[^>]*>([\d,]+)</span>' + r'"accessibilityData":\{"label":"([\d,\w]+) %ss"\}' % re.escape(count_name), video_webpage, count_name, default=None)) @@ -2450,6 +2485,14 @@ def _extract_count(count_name): video_duration = parse_duration(self._html_search_meta( 'duration', video_webpage, 
'video duration')) + # Get Subscriber Count of channel + subscriber_count = parse_count(self._search_regex( + r'"text":"([\d\.]+\w?) subscribers"', + video_webpage, + 'subscriber count', + default=None + )) + # annotations video_annotations = None if self._downloader.params.get('writeannotations', False): @@ -2587,6 +2630,7 @@ def decrypt_sig(mobj): 'album': album, 'release_date': release_date, 'release_year': release_year, + 'subscriber_count': subscriber_count, } @@ -3264,6 +3308,7 @@ class YoutubeSearchURLIE(YoutubeSearchBaseInfoExtractor): IE_DESC = 'YouTube.com search URLs' IE_NAME = 'youtube:search_url' _VALID_URL = r'https?://(?:www\.)?youtube\.com/results\?(.*?&)?(?:search_query|q)=(?P<query>[^&]+)(?:[&]|$)' + _SEARCH_DATA = r'(?:window\["ytInitialData"\]|ytInitialData)\W?=\W?({.*?});' _TESTS = [{ 'url': 'https://www.youtube.com/results?baz=bar&search_query=youtube-dl+test+video&filters=video&lclk=video', 'playlist_mincount': 5, @@ -3275,6 +3320,58 @@ class YoutubeSearchURLIE(YoutubeSearchBaseInfoExtractor): 'only_matching': True, }] + def _find_videos_in_json(self, extracted): + videos = [] + + def _real_find(obj): + if obj is None or isinstance(obj, str): + return + + if type(obj) is list: + for elem in obj: + _real_find(elem) + + if type(obj) is dict: + if "videoId" in obj: + videos.append(obj) + return + + for _, o in obj.items(): + _real_find(o) + + _real_find(extracted) + + return videos + + def extract_videos_from_page_impl(self, page, ids_in_page, titles_in_page): + search_response = self._parse_json(self._search_regex(self._SEARCH_DATA, page, 'ytInitialData'), None) + + result_items = self._find_videos_in_json(search_response) + + for renderer in result_items: + video_id = try_get(renderer, lambda x: x['videoId']) + video_title = try_get(renderer, lambda x: x['title']['runs'][0]['text']) or try_get(renderer, lambda x: x['title']['simpleText']) + + if video_id is None or video_title is None: + # we do not have a videoRenderer or title extraction broke + 
continue + + video_title = video_title.strip() + + try: + idx = ids_in_page.index(video_id) + if video_title and not titles_in_page[idx]: + titles_in_page[idx] = video_title + except ValueError: + ids_in_page.append(video_id) + titles_in_page.append(video_title) + + def extract_videos_from_page(self, page): + ids_in_page = [] + titles_in_page = [] + self.extract_videos_from_page_impl(page, ids_in_page, titles_in_page) + return zip(ids_in_page, titles_in_page) + def _real_extract(self, url): mobj = re.match(self._VALID_URL, url) query = compat_urllib_parse_unquote_plus(mobj.group('query')) @@ -3307,6 +3404,8 @@ class YoutubeFeedsInfoExtractor(YoutubeBaseInfoExtractor): Subclasses must define the _FEED_NAME and _PLAYLIST_TITLE properties. """ _LOGIN_REQUIRED = True + _FEED_DATA = r'(?:window\["ytInitialData"\]|ytInitialData)\W?=\W?({.*?});' + _YTCFG_DATA = r"ytcfg.set\(({.*?})\)" @property def IE_NAME(self): @@ -3315,37 +3414,89 @@ def IE_NAME(self): def _real_initialize(self): self._login() + def _find_videos_in_json(self, extracted): + videos = [] + c = {} + + def _real_find(obj): + if obj is None or isinstance(obj, str): + return + + if type(obj) is list: + for elem in obj: + _real_find(elem) + + if type(obj) is dict: + if "videoId" in obj: + videos.append(obj) + return + + if "nextContinuationData" in obj: + c["continuation"] = obj["nextContinuationData"] + return + + for _, o in obj.items(): + _real_find(o) + + _real_find(extracted) + + return videos, try_get(c, lambda x: x["continuation"]) + def _entries(self, page): - # The extraction process is the same as for playlists, but the regex - # for the video ids doesn't contain an index - ids = [] - more_widget_html = content_html = page + info = [] + + yt_conf = self._parse_json(self._search_regex(self._YTCFG_DATA, page, 'ytcfg.set', default="null"), None, fatal=False) + + search_response = self._parse_json(self._search_regex(self._FEED_DATA, page, 'ytInitialData'), None) + for page_num in itertools.count(1): - 
matches = re.findall(r'href="\s*/watch\?v=([0-9A-Za-z_-]{11})', content_html) + video_info, continuation = self._find_videos_in_json(search_response) - # 'recommended' feed has infinite 'load more' and each new portion spins - # the same videos in (sometimes) slightly different order, so we'll check - # for unicity and break when portion has no new videos - new_ids = list(filter(lambda video_id: video_id not in ids, orderedSet(matches))) - if not new_ids: + new_info = [] + + for v in video_info: + v_id = try_get(v, lambda x: x['videoId']) + if not v_id: + continue + + have_video = False + for old in info: + if old['videoId'] == v_id: + have_video = True + break + + if not have_video: + new_info.append(v) + + if not new_info: break - ids.extend(new_ids) + info.extend(new_info) - for entry in self._ids_to_results(new_ids): - yield entry + for video in new_info: + yield self.url_result(try_get(video, lambda x: x['videoId']), YoutubeIE.ie_key(), video_title=try_get(video, lambda x: x['title']['runs'][0]['text']) or try_get(video, lambda x: x['title']['simpleText'])) - mobj = re.search(r'data-uix-load-more-href="/?(?P<more>[^"]+)"', more_widget_html) - if not mobj: + if not continuation or not yt_conf: break - more = self._download_json( - 'https://www.youtube.com/%s' % mobj.group('more'), self._PLAYLIST_TITLE, + search_response = self._download_json( + 'https://www.youtube.com/browse_ajax', self._PLAYLIST_TITLE, 'Downloading page #%s' % page_num, transform_source=uppercase_escape, - headers=self._YOUTUBE_CLIENT_HEADERS) - content_html = more['content_html'] - more_widget_html = more['load_more_widget_html'] + query={ + "ctoken": try_get(continuation, lambda x: x["continuation"]), + "continuation": try_get(continuation, lambda x: x["continuation"]), + "itct": try_get(continuation, lambda x: x["clickTrackingParams"]) + }, + headers={ + "X-YouTube-Client-Name": try_get(yt_conf, lambda x: x["INNERTUBE_CONTEXT_CLIENT_NAME"]), + "X-YouTube-Client-Version": try_get(yt_conf, lambda 
x: x["INNERTUBE_CONTEXT_CLIENT_VERSION"]), + "X-Youtube-Identity-Token": try_get(yt_conf, lambda x: x["ID_TOKEN"]), + "X-YouTube-Device": try_get(yt_conf, lambda x: x["DEVICE"]), + "X-YouTube-Page-CL": try_get(yt_conf, lambda x: x["PAGE_CL"]), + "X-YouTube-Page-Label": try_get(yt_conf, lambda x: x["PAGE_BUILD_LABEL"]), + "X-YouTube-Variants-Checksum": try_get(yt_conf, lambda x: x["VARIANTS_CHECKSUM"]), + }) def _real_extract(self, url): page = self._download_webpage( diff --git a/youtube_dlc/utils.py b/youtube_dlc/utils.py index 32b179c6fcb..54a4ea2aaca 100644 --- a/youtube_dlc/utils.py +++ b/youtube_dlc/utils.py @@ -1984,6 +1984,7 @@ def get_elements_by_attribute(attribute, value, html, escape_value=True): class HTMLAttributeParser(compat_HTMLParser): """Trivial HTML parser to gather the attributes for a single element""" + def __init__(self): self.attrs = {} compat_HTMLParser.__init__(self) @@ -2378,6 +2379,7 @@ class GeoRestrictedError(ExtractorError): This exception may be thrown when a video is not available from your geographic location due to geographic restrictions imposed by a website. """ + def __init__(self, msg, countries=None): super(GeoRestrictedError, self).__init__(msg, expected=True) self.msg = msg @@ -3558,6 +3560,11 @@ def remove_quotes(s): return s +def get_domain(url): + domain = re.match(r'(?:https?:\/\/)?(?:www\.)?(?P<domain>[^\n\/]+\.[^\n\/]+)(?:\/(.*))?', url) + return domain.group('domain') if domain else None + + def url_basename(url): path = compat_urlparse.urlparse(url).path return path.strip('/').split('/')[-1]
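
Review note on the `get_domain` helper added to `utils.py` in the hunk above: the following is a minimal standalone sketch of how it appears to behave, not the project's own test. It assumes the named group in the pattern is `domain`, which is what the `domain.group('domain')` call implies; the sample URLs are hypothetical and chosen only for illustration.

```python
import re


def get_domain(url):
    # Same pattern as the helper in the diff. The group name `domain` is an
    # assumption inferred from the `.group('domain')` call in that helper.
    match = re.match(
        r'(?:https?:\/\/)?(?:www\.)?(?P<domain>[^\n\/]+\.[^\n\/]+)(?:\/(.*))?',
        url)
    return match.group('domain') if match else None


if __name__ == '__main__':
    # Hypothetical inputs: with scheme and www, with a subdomain,
    # with a port, and a bare host name without any dot.
    samples = (
        'https://www.mtv.com/episodes/abc',
        'http://media.mtvnservices.com/embed/x',
        'https://example.com:8080/path',
        'localhost',
    )
    for sample in samples:
        print(sample, '->', get_domain(sample))
```

Two properties may matter to callers such as `_extract_new_triforce_mgid`: a host without a dot yields `None`, and a port, if present, stays in the returned string, so a caller comparing against bare host names would probably need to strip it first.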