I'm currently uploading a large dataset (159 WARCs, just over 1 TiB) using `ia upload`. Because of the general stability of IA's S3 interface (#176 et al.), I was expecting from the beginning that I'd have to run `ia` multiple times. However, my expectation was also that the derive would only be queued after the last file was uploaded. That did not happen: while the second (first resumed) upload was still running, a derive task suddenly started.
A first `upload` process died with a `socket.timeout: The write operation timed out` / `Connection aborted` error after about three days, with a bit over a third of the data (62 files) uploaded. I restarted it with `--checksum` to skip the files that had already been uploaded. After checksumming everything again, it skipped the existing files and resumed with the 63rd file, the one during whose upload the first process had crashed. All fine so far: only `archive.php` and `book_op.php` tasks.
Yesterday (the 21st) around 13:00 UTC, while that second process was still running, a derive task (ID 1111473727) was suddenly queued and started. I don't know exactly how much data had been uploaded at that point, but it looks like the 96th file was in progress. Definitely still far from done. The derive was queued by an `archive.php` task which had `next_cmd=derive` set, unlike all the previous `archive.php` tasks. Since then, every completed upload has restarted the derive task...
I can't help but notice that 62 + 96 is very close to the total of 159 files. Is the condition for setting the `queue_derive` flag perhaps not taking skipped files into account? (The relevant condition is in internetarchive/item.py, lines 820 to 825 at d093cde.)
Actually, I missed a file previously; it looks like it was file 97 of the second leg when the derive was queued, which fits perfectly with 159 (62 + 97).
Looking at the code of `internetarchive.utils.recursive_file_count`, it seems that this is indeed the issue. Specifically, `total_files` is only incremented when a file's checksum does not match one of the item's existing files, so `total_files` is the number of files that haven't been uploaded yet. But in the `internetarchive.item.Item.upload` code mentioned above, it's treated as the total number of files.
For my specific case, the return value of `recursive_file_count` on the second run would be 97 (159 - 62). After skipping those 62 files and uploading 35 more, `file_index` reaches 97 (62 + 35), so `file_index >= 97` is true in `upload`, and the derive is queued.
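To make the failure mode concrete, here is a minimal, runnable sketch of the interaction as I understand it. The names and structure are simplified stand-ins (the checksum comparison is reduced to set membership), not the library's actual code:

```python
files = [f"file{i:03d}.warc.gz" for i in range(1, 160)]  # 159 files in total
uploaded = set(files[:62])                                # the 62 files from the first run

def recursive_file_count(files, uploaded):
    # Mirrors the suspected bug: counts only files whose checksum does NOT
    # match one of the item's existing files, i.e. the files still to be
    # uploaded -- not the total.
    return sum(1 for f in files if f not in uploaded)

def upload(files, uploaded):
    total_files = recursive_file_count(files, uploaded)   # 97, not 159
    for file_index, f in enumerate(files, start=1):       # index advances for skipped files too
        if f in uploaded:
            continue                                      # skipped thanks to --checksum
        if file_index >= total_files:                     # 62 skipped + 35 uploaded = 97
            print(f"derive queued while uploading file {file_index} of {len(files)}")

upload(files, uploaded)
# -> derive queued while uploading file 97 of 159
# ... and again for every later file, matching the repeatedly restarted derives.
```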
Either `recursive_file_count` has to return the total number of files, or `upload` needs to increment `file_index` only when a file isn't skipped.
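A sketch of the second option, reusing the stand-ins from the example above: a skipped file no longer advances the index, so the derive flag is set only on the last file that is actually transmitted:

```python
def upload_fixed(files, uploaded):
    total_files = recursive_file_count(files, uploaded)   # still 97 = files left to upload
    file_index = 0
    for f in files:
        if f in uploaded:
            continue                                      # skipped: index does not advance
        file_index += 1
        if file_index >= total_files:                     # now true only on upload 97 of 97
            print(f"derive queued while uploading {f}")

upload_fixed(files, uploaded)
# -> derive queued while uploading file159.warc.gz (the final file)
```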