Discrepancy in Expected vs. Actual Number of Video Clips Downloaded #44

pengzhiliang · 2024-04-11T05:45:39Z

Hello,

I’ve run into an odd problem: the number of clips that can actually be downloaded appears to be significantly lower than anticipated, potentially well below 70M.

Here are the details:

I downloaded the first 10,000 rows of a CSV file, which should contain approximately 187,111 video clips based on the following calculation: 70,723,513 / 3,779,764 * 10,000. These clips were to be downloaded using the files output_format, distributed into 100 subfolders, with each subfolder expected to contain about 1,871 video clips (calculated as 187,111 / 100).
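
The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the constant and function names are made up for illustration, and it assumes clips are spread evenly across CSV rows, which is not guaranteed.

```python
# Figures taken from the calculation above.
TOTAL_CLIPS = 70_723_513     # clips in the full dataset
TOTAL_ROWS = 3_779_764       # rows (long videos) in the full CSV
ROWS_DOWNLOADED = 10_000     # rows in the subset being downloaded
NUM_SUBFOLDERS = 100         # 10,000 rows / 100 samples per shard


def expected_clips(rows, total_clips=TOTAL_CLIPS, total_rows=TOTAL_ROWS):
    """Estimate the clip count for `rows` CSV rows, assuming a uniform clips-per-row average."""
    return total_clips / total_rows * rows


expected_total = expected_clips(ROWS_DOWNLOADED)
expected_per_folder = expected_total / NUM_SUBFOLDERS

print(f"expected clips in total:     {expected_total:.0f}")
print(f"expected clips per subfolder: {expected_per_folder:.0f}")
```

Under the same assumption, each row contributes roughly 18.7 clips on average, so a subfolder holding about 100 rows should end up with well over a thousand clips.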

However, upon counting the clips in each subfolder, I discovered only around 200 clips each, which is significantly lower than the expected 1,871. This discrepancy is puzzling, especially since the download success rate is notably high (>99%).

The Bash output is as follows:

xxx: /mnt/data/pandam70m/data/part_00001# ll 00000/*.mp4 |wc -l  
223  
  
xxx: /mnt/data/pandam70m/data/part_00001# ll 00000/*.json |wc -l  
225  # (success rate of clips is also high)
  
xxx: /mnt/data/pandam70m/data/part_00001# cat 00000_stats.json  
{  
    "count": 100,  
    "successes": 98,  
    "failed_to_download": 2,  
    "failed_to_subsample": 0,  
    "duration": 745.6420252323151,  
    "bytes_downloaded": 250422000,  
    "start_time": 1712764675.583739,  
    "end_time": 1712765421.2257643,  
    "status_dict": {  
        "success": 98,  
        "[Errno 2] No such file or directory: '/tmp/7ed953b1-50ca-4631-a2cf-5d388d2ad70a.mp4'": 1,  
        "[Errno 2] No such file or directory: '/tmp/578de3a9-0df1-4941-8aaf-1ccd29563093.mp4'": 1  
    }  
}
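
To check whether other shards show the same pattern, the per-shard stats files could be aggregated with something like the sketch below. It assumes every shard writes a `*_stats.json` file with the same fields as the one shown above; the function name `summarize_stats` is hypothetical and not part of the download tool.

```python
import json
from pathlib import Path


def summarize_stats(folder):
    """Sum the counters of every *_stats.json file found directly inside `folder`."""
    totals = {
        "count": 0,
        "successes": 0,
        "failed_to_download": 0,
        "failed_to_subsample": 0,
    }
    for stats_path in sorted(Path(folder).glob("*_stats.json")):
        with open(stats_path) as f:
            stats = json.load(f)
        for field in totals:
            # Fields missing from a stats file are treated as zero.
            totals[field] += stats.get(field, 0)
    return totals
```

Calling `summarize_stats("/mnt/data/pandam70m/data/part_00001")` would return the summed counters, which can then be compared against the number of videos that actually have clips on disk.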

Here is my configuration:

subsampling: {}  
  
reading:  
    yt_args:  
        download_size: 360  
        download_audio: True  
        yt_metadata_args:  
            writesubtitles:  True  
            subtitleslangs: ['en']  
            writeautomaticsub: True  
            get_info: True  
    timeout: 60  
    sampler: null  
  
storage:  
    number_sample_per_shard: 100  
    oom_shard_count: 5  
    captions_are_subtitles: False  
  
distribution:  
    processes_count: 1  
    thread_count: 2  
    subjob_size: 10000  
    distributor: "multiprocessing"

Would you be able to help me analyze what might be causing this issue? Your assistance would be greatly appreciated.

pengzhiliang · 2024-04-11T05:59:41Z

Upon further inspection of the 00000_stats.json file, I expected to find 98 unique long videos within the subdirectory 00000/, each with a distinct prefix (key). However, my observations contradict this expectation:

xxx: /mnt/data/pandam70m/data/part_00001# ls 00000/*.mp4   
0000009_00000.mp4  
0000009_00001.mp4  
0000013_00000.mp4  
0000013_00001.mp4  
...  
...  
...  
0000094_00003.mp4  
0000094_00004.mp4  
  
xxx: /mnt/data/pandam70m/data/part_00001# ls 00000/*.mp4 | cut -d'_' -f1 | sort | uniq | wc -l  
21 # To find out the number of unique prefixes for the MP4 files, the filenames are processed to extract the part before the underscore (_).

The actual count of unique prefixes is only 21, which is substantially lower than the expected 98. This discrepancy leads me to question whether the downloaded long videos were deleted before they could be segmented into shorter clips. Could there be an issue with the processing or storage logic that is causing the complete videos to be removed prematurely?
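
The same check can be done in Python, which also shows how many clips each video produced. This is a minimal sketch that assumes clip files are named `<key>_<clip index>.mp4`, as in the listing above; `count_clips_by_video` and `summarize` are illustrative helper names, not part of the download tool.

```python
from collections import Counter
from pathlib import Path


def count_clips_by_video(folder):
    """Map each video key (the part before the first underscore) to its number of .mp4 clips."""
    counts = Counter()
    for path in Path(folder).glob("*.mp4"):
        key, _, _ = path.stem.partition("_")
        counts[key] += 1
    return counts


def summarize(counts):
    """Return (number of unique video keys, total number of clips)."""
    return len(counts), sum(counts.values())
```

For example, `summarize(count_clips_by_video("00000"))` returns the number of unique prefixes together with the total clip count, and keys that appear in the `.json` metadata but not in the returned counter point to videos whose clips were never written.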

I would greatly appreciate any insights or suggestions you might have to resolve this matter.
Thank you for your time and assistance.

tsaishien-chen · 2024-04-13T20:49:04Z

Hi @pengzhiliang,
Thanks for your interest in our dataset!
Did you notice any errors or warning messages during downloading?
Error messages like "...private video ..." or "...Skipping player responses..." are fine, but no other messages should appear.
