Be able to extract all videos from a YouTube channel with more than 20,000 videos #255
```python
import requests

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
PLAYLIST_ID = 'UU' + CHANNEL_ID[2:]

URL = 'https://yt.lemnoslife.com/noKey/playlistItems'

params = {
    'part': ','.join(['snippet']),
    'playlistId': PLAYLIST_ID,
    'maxResults': 50,
    'fields': 'items/snippet/resourceId/videoId,nextPageToken',
}

videoIds = set()

# Follow `nextPageToken` until the API stops returning one.
while True:
    response = requests.get(URL, params = params).json()
    for item in response['items']:
        videoIds.add(item['snippet']['resourceId']['videoId'])
    if not 'nextPageToken' in response:
        break
    params['pageToken'] = response['nextPageToken']

print(len(videoIds)) # 19,997
```

Based on my Stack Overflow answer 74579030. Note that the OP mentions @NBA, but it only has 14,451 videos. @FRANCE24 only seems to have 5,873 videos. Neither Stack Overflow question mentions channels with more than 20,000 videos. @asianetnews seems to have more than 20,000 videos, as the YouTube UI shows.

Now that I have investigated and simplified:

```python
import requests
import blackboxprotobuf
import base64

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
PLAYLIST_ID = 'UU' + CHANNEL_ID[2:]

def getBase64Protobuf(message, typedef):
    data = blackboxprotobuf.encode_message(message, typedef)
    return base64.b64encode(data).decode('ascii')

# Forge the `pageToken` protobuf for a given result offset instead of relying
# on the `nextPageToken` returned by the API.
def getPageToken(index):
    message = {
        '1': index,
    }
    typedef = {
        '1': {
            'type': 'int'
        },
    }
    three = getBase64Protobuf(message, typedef)
    message = {
        '2': 0,
        '3': f'PT:{three}'
    }
    typedef = {
        '2': {
            'type': 'int'
        },
        '3': {
            'type': 'string'
        }
    }
    pageToken = getBase64Protobuf(message, typedef)
    return pageToken

URL = 'https://yt.lemnoslife.com/noKey/playlistItems'
MAX_RESULTS = 50

params = {
    'part': ','.join(['snippet']),
    'playlistId': PLAYLIST_ID,
    'maxResults': MAX_RESULTS,
    'fields': 'items/snippet/resourceId/videoId',
}

videoIds = set()
requestIndex = 0

while True:
    response = requests.get(URL, params = params).json()
    for item in response['items']:
        videoIds.add(item['snippet']['resourceId']['videoId'])
    print(len(videoIds))
    requestIndex += 1
    params['pageToken'] = getPageToken(requestIndex * MAX_RESULTS)
```
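When debugging the forged tokens, it may also help to go the other way and inspect a `pageToken` or `nextPageToken` actually returned by the API. This is a minimal sketch of my own (not part of the original comment), assuming the token is URL-safe base64 that may lack padding:

```python
# Hypothetical debugging helper: decode an existing page token to inspect its
# protobuf fields. The padding handling is an assumption.
import base64
import blackboxprotobuf

def decodePageToken(pageToken):
    data = base64.urlsafe_b64decode(pageToken + '=' * (-len(pageToken) % 4))
    # Returns the decoded message and the guessed type definition.
    return blackboxprotobuf.decode_message(data)

# Example: print(decodePageToken(response['nextPageToken']))
```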
```python
import requests

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
PLAYLIST_ID = 'UU' + CHANNEL_ID[2:]

URL = 'https://yt.lemnoslife.com/playlistItems'

params = {
    'part': ','.join(['snippet']),
    'playlistId': PLAYLIST_ID,
}

videoIds = set()

while True:
    response = requests.get(URL, params = params).json()
    for item in response['items']:
        videoIds.add(item['snippet']['resourceId']['videoId'])
    if not 'nextPageToken' in response:
        break
    params['pageToken'] = response['nextPageToken']

print(len(videoIds)) # 19,996
```

Let us check if I can bypass this limit with the low-level approach. Now that I have investigated and simplified, I used the following to debug the iteration algorithm:

```python
import requests
import json

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
PLAYLIST_ID = 'UU' + CHANNEL_ID[2:]

URL = 'http://localhost/YouTube-operational-API/playlistItems'
MAX_RESULTS = 100

params = {
    'part': ','.join(['snippet']),
    'playlistId': PLAYLIST_ID,
    'pageToken': '4qmFsgIqEhpWTFVVV2VnMlBrYXRlNjlORmRCZXVSRlRBdxoMZWdkUVZEcERSMUU5',
}

response = requests.get(URL, params = params).json()
print(response)
```

```diff
diff --git a/playlistItems.php b/playlistItems.php
index d35d5ed..7e19af8 100644
--- a/playlistItems.php
+++ b/playlistItems.php
@@ -67,7 +67,7 @@ function getAPI($playlistId, $continuationToken)
$result = json_decode($res, true);
$answerItems = [];
- $items = $continuationTokenProvided ? getContinuationItems($result) : getTabs($result)[0]['tabRenderer']['content']['sectionListRenderer']['contents'][0]['itemSectionRenderer']['contents'][0]['playlistVideoListRenderer']['contents'];
+ $items = $continuationTokenProvided ? $result['continuationContents']['playlistVideoListContinuation']['contents'] : getTabs($result)[0]['tabRenderer']['content']['sectionListRenderer']['contents'][0]['itemSectionRenderer']['contents'][0]['playlistVideoListRenderer']['contents'];
$itemsCount = count($items);
for ($itemsIndex = 0; $itemsIndex < $itemsCount - 1; $itemsIndex++) {
$item = $items[$itemsIndex];
```

```python
import requests
import blackboxprotobuf
import base64

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
PLAYLIST_ID = 'UU' + CHANNEL_ID[2:]

def getBase64Protobuf(message, typedef):
    data = blackboxprotobuf.encode_message(message, typedef)
    return base64.b64encode(data).decode('ascii')

def getPageToken(index):
    message = {
        '1': index,
    }
    typedef = {
        '1': {
            'type': 'int'
        },
    }
    fifteen = getBase64Protobuf(message, typedef)
    message = {
        '15': f'PT:{fifteen}'
    }
    typedef = {
        '15': {
            'type': 'string'
        }
    }
    three = getBase64Protobuf(message, typedef)
    message = {
        '80226972': {
            '2': f'VL{PLAYLIST_ID}',
            '3': three,
        }
    }
    typedef = {
        '80226972': {
            'type': 'message',
            'message_typedef': {
                '2': {
                    'type': 'string'
                },
                '3': {
                    'type': 'string'
                },
            },
            'field_order': [
                '2',
                '3',
            ]
        }
    }
    continuation = getBase64Protobuf(message, typedef)
    return continuation

URL = 'http://localhost/YouTube-operational-API/playlistItems'
MAX_RESULTS = 100

params = {
    'part': ','.join(['snippet']),
    'playlistId': PLAYLIST_ID,
}

videoIds = set()
requestIndex = 0

while True:
    response = requests.get(URL, params = params).json()
    for item in response['items']:
        videoIds.add(item['snippet']['resourceId']['videoId'])
    print(len(videoIds))
    requestIndex += 1
    params['pageToken'] = getPageToken(requestIndex * MAX_RESULTS)
```
The last response being:
```sh
$ yt-dlp --dump-json "https://www.youtube.com/channel/UCf8w5m0YsRa8MHQ5bwSGmbw/videos" -i | jq -r '[.id]|@csv' | wc -l
```
I guess that it loops forever when it reaches the 20,000 limit.
https://www.youtube.com/feeds/videos.xml?channel_id=UCWeg2Pkate69NFdBeuRFTAw lists only the first 15 entries, and the pagination does not seem clear, if there is any. I got the link from the Stack Overflow answer 31514238.
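To double-check that entry limit, here is a minimal sketch of my own (not from the original comment) that counts the `<entry>` elements of the feed, assuming the standard Atom namespace:

```python
# Count how many entries the channel's Atom feed exposes; expected to be ~15.
import requests
from lxml import etree

CHANNEL_ID = 'UCWeg2Pkate69NFdBeuRFTAw'
feed = requests.get(f'https://www.youtube.com/feeds/videos.xml?channel_id={CHANNEL_ID}').content
root = etree.fromstring(feed)
entries = root.findall('{http://www.w3.org/2005/Atom}entry')
print(len(entries))
```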
```python
import requests
from lxml import html
import json

CHANNEL_HANDLE = '@MLB'

text = requests.get(f'https://www.youtube.com/{CHANNEL_HANDLE}/videos').text
tree = html.fromstring(text)

# Extract the `ytInitialData` JSON embedded in one of the page's <script> tags.
ytVariableName = 'ytInitialData'
ytVariableDeclaration = ytVariableName + ' = '
for script in tree.xpath('//script'):
    scriptContent = script.text_content()
    if ytVariableDeclaration in scriptContent:
        ytVariableData = json.loads(scriptContent.split(ytVariableDeclaration)[1][:-1])
        break

contents = ytVariableData['contents']['twoColumnBrowseResultsRenderer']['tabs'][1]['tabRenderer']['content']['richGridRenderer']['contents']

videoIds = set()

def treatContents(contents):
    for content in contents:
        if not 'richItemRenderer' in content:
            break
        videoId = content['richItemRenderer']['content']['videoRenderer']['videoId']
        videoIds.add(videoId)
        print(len(videoIds))
    return getContinuationToken(contents)

def getContinuationToken(contents):
    # Sometimes have 29 actual results instead of 30.
    lastContent = contents[-1]
    if not 'continuationItemRenderer' in lastContent:
        exit(0)
    return lastContent['continuationItemRenderer']['continuationEndpoint']['continuationCommand']['token']

continuationToken = treatContents(contents)

url = 'https://www.youtube.com/youtubei/v1/browse'
headers = {
    'Content-Type': 'application/json'
}
requestData = {
    'context': {
        'client': {
            'clientName': 'WEB',
            'clientVersion': '2.20240313.05.00'
        }
    }
}

while True:
    requestData['continuation'] = continuationToken
    data = requests.post(url, headers = headers, json = requestData).json()
    # Happens not deterministically sometimes.
    if not 'onResponseReceivedActions' in data:
        print('Retrying')
        continue
    continuationItems = data['onResponseReceivedActions'][0]['appendContinuationItemsAction']['continuationItems']
    continuationToken = treatContents(continuationItems)
```

Got:
What about reversing the order to have 40,000 results instead of 20,000? The YouTube Data API v3 PlaylistItems: list endpoint does not have an `order` parameter. However, if shuffle worked correctly, it could possibly lead to something interesting, but I have doubts. I am not able to get an interesting shuffle, as it seems to only consider the first page of entries. I may not have written down the script I used to test the YouTube Data API v3 Search: list endpoint with `publishedAfter`; `grep -r 'publishedAfter' --include='*.py'` does not return interesting results.
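As a rough illustration of the date-based idea, here is a hypothetical sketch of my own (not the script mentioned above); the shrinking `publishedBefore` window, the use of the noKey search endpoint, and the termination check are all assumptions:

```python
# Page through a channel's videos by shrinking a publishedBefore window on the
# Search: list endpoint, instead of relying on pageToken pagination.
import requests

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
URL = 'https://yt.lemnoslife.com/noKey/search'

videoIds = set()
publishedBefore = None
while True:
    params = {
        'part': 'snippet',
        'channelId': CHANNEL_ID,
        'type': 'video',
        'order': 'date',
        'maxResults': 50,
    }
    if publishedBefore is not None:
        params['publishedBefore'] = publishedBefore
    items = requests.get(URL, params = params).json().get('items', [])
    if not items:
        break
    previousCount = len(videoIds)
    for item in items:
        videoIds.add(item['id']['videoId'])
    # Move the window to the oldest result of this batch.
    publishedBefore = items[-1]['snippet']['publishedAt']
    if len(videoIds) == previousCount:
        break

print(len(videoIds))
```

One caveat: Search: list costs far more quota than PlaylistItems: list and is not guaranteed to return every video of a channel, so this can only complement the approaches above.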
```python
import requests
from lxml import html
import json

showProgress = True

def treatContents(videoIds, contents):
    for content in contents:
        if not 'richItemRenderer' in content:
            break
        videoId = content['richItemRenderer']['content']['videoRenderer']['videoId']
        videoIds.add(videoId)
        if showProgress:
            print(len(videoIds))
    return getContinuationToken(videoIds, contents)

def getContinuationToken(videoIds, contents):
    # Sometimes have 29 actual results instead of 30.
    lastContent = contents[-1]
    if not 'continuationItemRenderer' in lastContent:
        return videoIds
    return lastContent['continuationItemRenderer']['continuationEndpoint']['continuationCommand']['token']

def getChannelVideoIds(channelHandle):
    text = requests.get(f'https://www.youtube.com/{channelHandle}/videos').text
    tree = html.fromstring(text)
    ytVariableName = 'ytInitialData'
    ytVariableDeclaration = ytVariableName + ' = '
    for script in tree.xpath('//script'):
        scriptContent = script.text_content()
        if ytVariableDeclaration in scriptContent:
            ytVariableData = json.loads(scriptContent.split(ytVariableDeclaration)[1][:-1])
            break
    contents = ytVariableData['contents']['twoColumnBrowseResultsRenderer']['tabs'][1]['tabRenderer']['content']['richGridRenderer']['contents']
    videoIds = set()
    continuationToken = treatContents(videoIds, contents)
    if type(continuationToken) is set:
        return continuationToken
    url = 'https://www.youtube.com/youtubei/v1/browse'
    headers = {
        'Content-Type': 'application/json'
    }
    requestData = {
        'context': {
            'client': {
                'clientName': 'WEB',
                'clientVersion': '2.20240313.05.00'
            }
        }
    }
    while True:
        requestData['continuation'] = continuationToken
        try:
            data = requests.post(url, headers = headers, json = requestData).json()
        except requests.exceptions.SSLError:
            print('SSL error, retrying')
            continue
        # Happens not deterministically sometimes.
        if not 'onResponseReceivedActions' in data:
            print('Missing onResponseReceivedActions, retrying')
            continue
        continuationItems = data['onResponseReceivedActions'][0]['appendContinuationItemsAction']['continuationItems']
        continuationToken = treatContents(videoIds, continuationItems)
        if type(continuationToken) is set:
            return continuationToken

# Source: https://youtube.fandom.com/wiki/List_of_YouTube_channels_with_the_most_video_uploads?oldid=1795583
CHANNEL_HANDLES = [
    '@RoelVandePaar',
    '@Doubtnut',
    '@KnowledgeBaseLibrary',
    '@betterbandai4163',
    '@Hey_Delphi',
    '@molecularmagexdshorts3706',
]

url = 'https://yt.lemnoslife.com/channels'
params = {
    'part': 'about',
}

for channelHandle in CHANNEL_HANDLES:
    params['handle'] = channelHandle
    claimedNumberOfVideos = requests.get(url, params = params).json()['items'][0]['about']['stats']['videoCount']
    print(f'{channelHandle} claims {claimedNumberOfVideos} videos.')
    foundVideoIds = getChannelVideoIds(channelHandle)
    print(f'Found {len(foundVideoIds)} videos.')
```
Benchmarking:

```python
import requests
from lxml import html
import json
from tqdm import tqdm

def treatContents(videoIds, contents):
    for content in contents:
        if not 'richItemRenderer' in content:
            break
        videoId = content['richItemRenderer']['content']['videoRenderer']['videoId']
        videoIds.add(videoId)
    return getContinuationToken(videoIds, contents)

def getContinuationToken(videoIds, contents):
    # Sometimes have 29 actual results instead of 30.
    lastContent = contents[-1]
    if not 'continuationItemRenderer' in lastContent:
        return videoIds
    return lastContent['continuationItemRenderer']['continuationEndpoint']['continuationCommand']['token']

def getChannelVideoIds(channelHandle, claimedNumberOfVideos):
    text = requests.get(f'https://www.youtube.com/{channelHandle}/videos').text
    tree = html.fromstring(text)
    ytVariableName = 'ytInitialData'
    ytVariableDeclaration = ytVariableName + ' = '
    for script in tree.xpath('//script'):
        scriptContent = script.text_content()
        if ytVariableDeclaration in scriptContent:
            ytVariableData = json.loads(scriptContent.split(ytVariableDeclaration)[1][:-1])
            break
    contents = ytVariableData['contents']['twoColumnBrowseResultsRenderer']['tabs'][1]['tabRenderer']['content']['richGridRenderer']['contents']
    videoIds = set()
    continuationToken = treatContents(videoIds, contents)
    if type(continuationToken) is set:
        return continuationToken
    url = 'https://www.youtube.com/youtubei/v1/browse'
    headers = {
        'Content-Type': 'application/json'
    }
    requestData = {
        'context': {
            'client': {
                'clientName': 'WEB',
                'clientVersion': '2.20240313.05.00'
            }
        }
    }
    with tqdm(total = claimedNumberOfVideos) as pbar:
        while True:
            requestData['continuation'] = continuationToken
            try:
                data = requests.post(url, headers = headers, json = requestData).json()
            except requests.exceptions.SSLError:
                print('SSL error, retrying')
                continue
            # Happens not deterministically sometimes.
            if not 'onResponseReceivedActions' in data:
                print('Missing onResponseReceivedActions, retrying')
                with open('error.json', 'w') as f:
                    json.dump(data, f, indent = 4)
                continue
            continuationItems = data['onResponseReceivedActions'][0]['appendContinuationItemsAction']['continuationItems']
            continuationToken = treatContents(videoIds, continuationItems)
            if type(continuationToken) is set:
                return continuationToken
            pbar.update(len(continuationItems))

# Source: https://youtube.fandom.com/wiki/List_of_YouTube_channels_with_the_most_video_uploads?oldid=1795583
CHANNEL_HANDLES = [
    '@RoelVandePaar',
    '@Doubtnut',
    '@KnowledgeBaseLibrary',
    '@betterbandai4163',
    '@Hey_Delphi',
]

url = 'https://yt.lemnoslife.com/channels'
params = {
    'part': 'about',
}

for channelHandle in CHANNEL_HANDLES[::-1]:
    params['handle'] = channelHandle
    claimedNumberOfVideos = requests.get(url, params = params).json()['items'][0]['about']['stats']['videoCount']
    print(f'{channelHandle} claims {claimedNumberOfVideos} videos.')
    foundVideoIds = getChannelVideoIds(channelHandle, claimedNumberOfVideos)
    print(f'Found {len(foundVideoIds)} videos.')
```

The last progress line shows a speed decrease after a few tens of thousands of videos. I should count how many temporary errors occur in comparison with how many successful requests. Note that here I am considering the channels with the most videos, so this is an extreme case. As the process takes a while, I may stop it before its completion; I think retrieving about half a million videos is enough to get an idea of the workload.
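As a follow-up to counting temporary errors, a hypothetical sketch of my own (the wrapper and counter names are assumptions, not part of the original script):

```python
# Wrap the InnerTube browse call so that temporary failures and successes are
# counted, to quantify the retry overhead.
import requests

def browseWithRetries(requestData, counters):
    url = 'https://www.youtube.com/youtubei/v1/browse'
    headers = {'Content-Type': 'application/json'}
    while True:
        try:
            data = requests.post(url, headers = headers, json = requestData).json()
        except requests.exceptions.SSLError:
            counters['temporaryErrors'] += 1
            continue
        if not 'onResponseReceivedActions' in data:
            counters['temporaryErrors'] += 1
            continue
        counters['successfulRequests'] += 1
        return data

counters = {'temporaryErrors': 0, 'successfulRequests': 0}
# After the run: print(counters) to compare error and success counts.
```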
error.json

Following an error not handled by my algorithm, the process stopped. I do not plan to restart this experiment, as the situation is quite clear.
Let us verify the correctness of the above algorithm for the channel @bwftv:

```python
import requests
import json

CHANNEL_ID = 'UChh-akEbUM8_6ghGVnJd6cQ'
PLAYLIST_ID = 'UU' + CHANNEL_ID[2:]

YOUTUBE_OPERATIONAL_API_URL = 'https://yt.lemnoslife.com'

# First list all uploads of the channel.
URL = f'{YOUTUBE_OPERATIONAL_API_URL}/noKey/playlistItems'
params = {
    'part': ','.join(['snippet']),
    'playlistId': PLAYLIST_ID,
    'maxResults': 50,
}

videoIds = set()

while True:
    response = requests.get(URL, params = params).json()
    for item in response['items']:
        videoIds.add(item['snippet']['resourceId']['videoId'])
    print(len(videoIds))
    nextPageToken = response.get('nextPageToken')
    if not nextPageToken:
        break
    params['pageToken'] = nextPageToken

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

# Then count how many of these uploads are livestreams.
URL = f'{YOUTUBE_OPERATIONAL_API_URL}/noKey/videos'
params = {
    'part': ','.join(['liveStreamingDetails']),
}

videoIdsSets = chunks(list(videoIds), 50)

liveIds = set()
for videoIdsSet in videoIdsSets:
    params['id'] = ','.join(videoIdsSet)
    data = requests.get(URL, params = params).json()
    for item in data['items']:
        if 'liveStreamingDetails' in item:
            liveIds.add(item['id'])

print(len(liveIds))
```

Returns 17,385 videos including 9,012 lives.

```python
import requests
import json

CHANNEL_ID = 'UChh-akEbUM8_6ghGVnJd6cQ'

# Count the channel's Shorts with the YouTube operational API channels endpoint.
URL = 'https://yt.lemnoslife.com/channels'
params = {
    'part': ','.join(['shorts']),
    'id': CHANNEL_ID,
}

videoIds = set()

while True:
    response = requests.get(URL, params = params).json()
    shorts = response['items'][0]
    for item in shorts['shorts']:
        videoIds.add(item['videoId'])
    print(len(videoIds))
    nextPageToken = shorts.get('nextPageToken')
    if not nextPageToken:
        break
    params['pageToken'] = nextPageToken
```

Returns 243, and I verified that on @bwftv/shorts. So the above algorithm should find 17,385 - 9,012 - 243 = 8,130 videos, and that is quite close to what we find with:

```sh
python3 test.py
```

So the above algorithm works well.
I faced this issue multiple times:
and this Stack Overflow question is asking for a solution.