I’ve encountered an odd phenomenon: the amount of downloadable data appears to be significantly less than anticipated, potentially well below the 70M clips the dataset advertises.
Here are the details:
I downloaded the first 10,000 rows of the CSV file, which should correspond to approximately 187,111 video clips based on the following calculation: 70,723,513 / 3,779,764 * 10,000. These clips were to be downloaded using the `files` output_format and distributed into 100 subfolders, so each subfolder was expected to contain about 1,871 video clips (187,111 / 100).
However, upon counting the clips in each subfolder, I found only around 200 clips each, far fewer than the expected 1,871. This discrepancy is puzzling, especially since the reported download success rate is very high (>99%).
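For reference, here is the arithmetic behind these expectations, using the dataset-wide clip and source-video counts quoted above:

```python
total_clips = 70_723_513   # clips in the full dataset
total_videos = 3_779_764   # source (long) videos in the full dataset
rows = 10_000              # rows taken from the CSV
subfolders = 100           # subfolders produced by the downloader

clips_per_video = total_clips / total_videos           # ~18.7 clips per source video
expected_clips = clips_per_video * rows                # ~187,111 clips in total
expected_per_subfolder = expected_clips / subfolders   # ~1,871 clips per subfolder

print(round(expected_clips), round(expected_per_subfolder))  # 187111 1871
```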
The Bash output is as follows:
xxx: /mnt/data/pandam70m/data/part_00001# ll 00000/*.mp4 |wc -l
223
xxx: /mnt/data/pandam70m/data/part_00001# ll 00000/*.json |wc -l
225 # (success rate of clips is also high)
xxx: /mnt/data/pandam70m/data/part_00001# cat 00000_stats.json
{
  "count": 100,
  "successes": 98,
  "failed_to_download": 2,
  "failed_to_subsample": 0,
  "duration": 745.6420252323151,
  "bytes_downloaded": 250422000,
  "start_time": 1712764675.583739,
  "end_time": 1712765421.2257643,
  "status_dict": {
    "success": 98,
    "[Errno 2] No such file or directory: '/tmp/7ed953b1-50ca-4631-a2cf-5d388d2ad70a.mp4'": 1,
    "[Errno 2] No such file or directory: '/tmp/578de3a9-0df1-4941-8aaf-1ccd29563093.mp4'": 1
  }
}
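To compare what the stats files report against what actually lands on disk, a script along these lines can be run over every shard (a sketch, assuming the part_*/NNNNN_stats.json plus NNNNN/ layout shown above; the root path is just my local one):

```python
import json
from pathlib import Path

root = Path("/mnt/data/pandam70m/data")  # local download root; adjust as needed

total_successes = 0
total_clips = 0
for stats_file in sorted(root.glob("part_*/*_stats.json")):
    stats = json.loads(stats_file.read_text())
    total_successes += stats["successes"]
    # Clips for this shard live in a sibling folder: 00000_stats.json -> 00000/
    shard_dir = stats_file.with_name(stats_file.name.replace("_stats.json", ""))
    n_clips = len(list(shard_dir.glob("*.mp4")))
    total_clips += n_clips
    print(f"{shard_dir}: {stats['successes']} successes, {n_clips} clips on disk")

print(f"total: {total_successes} successful source videos, {total_clips} clips")
```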
Based on the 00000_stats.json file, I expected to find 98 unique long videos within the subdirectory 00000/, each with a distinct prefix (key). However, my observations contradict this expectation:
xxx: /mnt/data/pandam70m/data/part_00001# ls 00000/*.mp4
0000009_00000.mp4
0000009_00001.mp4
0000013_00000.mp4
0000013_00001.mp4
...
...
...
0000094_00003.mp4
0000094_00004.mp4
xxx: /mnt/data/pandam70m/data/part_00001# ls 00000/*.mp4 | cut -d'_' -f1 | sort | uniq | wc -l
21 # To count the unique prefixes of the MP4 files, each filename is cut at the underscore (_) and the part before it is kept.
The actual count of unique prefixes is only 21, substantially lower than the expected 98. This leads me to suspect that the downloaded long videos were deleted before they could be segmented into shorter clips. Could there be an issue with the processing or storage logic that removes the full videos prematurely?
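To make this concrete, a short script like the following (a sketch, assuming the <video_index>_<clip_index>.mp4 naming shown above) lists how many clips each source video actually produced:

```python
from collections import Counter
from pathlib import Path

shard = Path("/mnt/data/pandam70m/data/part_00001/00000")  # adjust as needed

# Group clip files by the prefix before the underscore, i.e. the source-video index.
clips_per_video = Counter(p.name.split("_")[0] for p in shard.glob("*.mp4"))

print(f"{len(clips_per_video)} unique source videos produced clips")
for prefix, n in sorted(clips_per_video.items()):
    print(f"{prefix}: {n} clips")
```

If most of the 98 reported successes never appear in this output at all, that would support the theory that the full videos were removed before being cut into clips.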
I would greatly appreciate any insights or suggestions you might have to resolve this matter.
Thank you for your time and assistance.
Hi @pengzhiliang,
Thanks for your interest in our dataset!
Did you notice any errors or warning messages during downloading?
Error messages like "...private video..." or "...Skipping player responses..." are harmless, but any messages other than those should not appear.
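If you captured the downloader's console output to a file, a quick filter like this can surface the unexpected messages (a sketch; download.log is a hypothetical capture, not a file our tooling produces):

```python
from pathlib import Path

# Messages that are known to be harmless, per the note above.
BENIGN = ("private video", "skipping player responses")

# download.log is a hypothetical capture of the downloader's console output.
for line in Path("download.log").read_text().splitlines():
    lowered = line.lower()
    if any(b in lowered for b in BENIGN):
        continue
    if "error" in lowered or "warning" in lowered:
        print(line)
```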