Discrepancy in Expected vs. Actual Number of Video Clips Downloaded #44

Open
pengzhiliang opened this issue Apr 11, 2024 · 2 comments

Comments

@pengzhiliang

Hello,

I’ve run into an odd problem: the amount of data that can actually be downloaded appears to be far less than expected, potentially well below 70M clips.

Here are the details:

I downloaded the first 10,000 rows of a CSV file, which should contain approximately 187,111 video clips based on the following calculation: 70,723,513 / 3,779,764 * 10,000. These clips were to be downloaded using the files output_format, distributed into 100 subfolders, with each subfolder expected to contain about 1,871 video clips (calculated as 187,111 / 100).

However, upon counting the clips in each subfolder, I discovered only around 200 clips each, which is significantly lower than the expected 1,871. This discrepancy is puzzling, especially since the download success rate is notably high (>99%).
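
As a sanity check on the estimate above, here is a minimal Python sketch of the same arithmetic. It uses only figures quoted in this issue and assumes that 3,779,764 is the total number of rows in the CSV, as the calculation implies; the variable names are illustrative.

```python
# Figures quoted in this issue (names are illustrative).
TOTAL_CLIPS = 70_723_513    # clips in the full dataset
TOTAL_ROWS = 3_779_764      # assumed: rows (long videos) in the full CSV
ROWS_DOWNLOADED = 10_000    # rows actually processed
SAMPLES_PER_SHARD = 100     # number_sample_per_shard from the config
OBSERVED_CLIPS = 223        # .mp4 files counted in subfolder 00000

# Average number of clips cut from one long video.
clips_per_video = TOTAL_CLIPS / TOTAL_ROWS

# Expected totals for the 10,000-row subset.
expected_clips = clips_per_video * ROWS_DOWNLOADED
num_shards = ROWS_DOWNLOADED // SAMPLES_PER_SHARD
expected_per_shard = expected_clips / num_shards

# Fraction of the expected clips that actually showed up in one shard.
observed_fraction = OBSERVED_CLIPS / expected_per_shard

print(f"clips per video:    {clips_per_video:.2f}")
print(f"expected clips:     {expected_clips:.0f}")
print(f"shards:             {num_shards}")
print(f"expected per shard: {expected_per_shard:.0f}")
print(f"observed fraction:  {observed_fraction:.1%}")
```

With these numbers, the observed shard holds roughly an eighth of the clips the estimate predicts.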

The Bash output is as follows:

xxx: /mnt/data/pandam70m/data/part_00001# ll 00000/*.mp4 |wc -l  
223  
  
xxx: /mnt/data/pandam70m/data/part_00001# ll 00000/*.json |wc -l  
225  # (success rate of clips is also high)
  
xxx: /mnt/data/pandam70m/data/part_00001# cat 00000_stats.json  
{  
    "count": 100,  
    "successes": 98,  
    "failed_to_download": 2,  
    "failed_to_subsample": 0,  
    "duration": 745.6420252323151,  
    "bytes_downloaded": 250422000,  
    "start_time": 1712764675.583739,  
    "end_time": 1712765421.2257643,  
    "status_dict": {  
        "success": 98,  
        "[Errno 2] No such file or directory: '/tmp/7ed953b1-50ca-4631-a2cf-5d388d2ad70a.mp4'": 1,  
        "[Errno 2] No such file or directory: '/tmp/578de3a9-0df1-4941-8aaf-1ccd29563093.mp4'": 1  
    }  
}  
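
To make the stats file easier to read at a glance, here is a small Python sketch that summarizes it. The helper name `summarize_stats` is hypothetical, and the embedded JSON is copied from the output above with the timing fields omitted. Note that "count" equals number_sample_per_shard (100), which suggests these counters are per long video rather than per clip; that is an inference from the numbers, not something confirmed here.

```python
import json


def summarize_stats(stats: dict) -> dict:
    """Summarize one shard's *_stats.json content (hypothetical helper)."""
    count = stats["count"]
    successes = stats["successes"]
    # Every status other than "success" is treated as a failure reason.
    failures = {
        reason: n
        for reason, n in stats.get("status_dict", {}).items()
        if reason != "success"
    }
    return {
        "count": count,
        "success_rate": successes / count if count else 0.0,
        "failure_total": sum(failures.values()),
        "failure_reasons": failures,
    }


# Content of 00000_stats.json as posted above (timing fields omitted).
example = json.loads("""
{
    "count": 100,
    "successes": 98,
    "failed_to_download": 2,
    "failed_to_subsample": 0,
    "status_dict": {
        "success": 98,
        "[Errno 2] No such file or directory: '/tmp/7ed953b1-50ca-4631-a2cf-5d388d2ad70a.mp4'": 1,
        "[Errno 2] No such file or directory: '/tmp/578de3a9-0df1-4941-8aaf-1ccd29563093.mp4'": 1
    }
}
""")

summary = summarize_stats(example)
print(f"success rate: {summary['success_rate']:.0%}")
print(f"failures:     {summary['failure_total']}")
```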

Here is my configuration:

subsampling: {}  
  
reading:  
    yt_args:  
        download_size: 360  
        download_audio: True  
        yt_metadata_args:  
            writesubtitles:  True  
            subtitleslangs: ['en']  
            writeautomaticsub: True  
            get_info: True  
    timeout: 60  
    sampler: null  
  
storage:  
    number_sample_per_shard: 100  
    oom_shard_count: 5  
    captions_are_subtitles: False  
  
distribution:  
    processes_count: 1  
    thread_count: 2  
    subjob_size: 10000  
    distributor: "multiprocessing"  

Would you be able to help me analyze what might be causing this issue? Your assistance would be greatly appreciated.

@pengzhiliang
Author

After inspecting 00000_stats.json further, I expected to find 98 unique long videos in the subdirectory 00000/ (one per reported success), each with a distinct prefix (key). However, my observations contradict this expectation:

xxx: /mnt/data/pandam70m/data/part_00001# ls 00000/*.mp4   
0000009_00000.mp4  
0000009_00001.mp4  
0000013_00000.mp4  
0000013_00001.mp4  
...  
...  
...  
0000094_00003.mp4  
0000094_00004.mp4  
  
xxx: /mnt/data/pandam70m/data/part_00001# ls 00000/*.mp4 | cut -d'_' -f1 | sort | uniq | wc -l  
21  # number of unique prefixes, i.e. the part of each filename before the underscore (_)  

The actual count of unique prefixes is only 21, which is substantially lower than the expected 98. This discrepancy leads me to question whether the downloaded long videos were deleted before they could be segmented into shorter clips. Could there be an issue with the processing or storage logic that is causing the complete videos to be removed prematurely?
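
The same check can be scripted so that it is easy to repeat across shards. The following is a minimal Python sketch, assuming clips are named `<key>_<clip index>.mp4` as in the listing above and that the stats file has a "successes" field; the function names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def clips_per_key(shard_dir):
    """Count .mp4 clips per video key (the filename part before '_')."""
    counts = Counter()
    for path in Path(shard_dir).glob("*.mp4"):
        key = path.stem.split("_", 1)[0]
        counts[key] += 1
    return counts


def compare_with_stats(shard_dir, stats_path):
    """Compare the keys found on disk with the successes in the stats file."""
    counts = clips_per_key(shard_dir)
    stats = json.loads(Path(stats_path).read_text())
    return {
        "unique_keys": len(counts),
        "total_clips": sum(counts.values()),
        "reported_successes": stats["successes"],
        # Videos reported as successful that have no clip on disk.
        "successes_without_clips": stats["successes"] - len(counts),
    }
```

Run from the part_00001 directory, `compare_with_stats("00000", "00000_stats.json")` would report the gap between the 98 successes and the 21 keys that actually have clips.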

I would greatly appreciate any insights or suggestions you might have to resolve this matter.
Thank you for your time and assistance.

@tsaishien-chen
Contributor

Hi @pengzhiliang,
Thanks for your interest in our dataset!
Did you notice any error or warning messages while downloading?
Messages such as "...private video ..." or "...Skipping player responses..." are expected, but no other errors or warnings should appear.
