
Derive queued before the last file (due to miscounting skipped files?) #288

Open
JustAnotherArchivist opened this issue Jan 22, 2019 · 1 comment

Comments

@JustAnotherArchivist (Contributor)

I'm currently uploading a large dataset (159 WARCs, just over 1 TiB) using ia upload. Given the general (in)stability of IA's S3 interface (#176 et al.), I expected from the beginning that I'd have to run ia multiple times. However, my expectation was also that the derive would only be queued after the last file was uploaded. That did not happen: while the second (first resumed) upload was still running, a derive task suddenly started.

The first upload process died with a socket.timeout: The write operation timed out/Connection aborted error after about three days, with a bit over a third of the data (62 files) uploaded. I restarted it with --checksum to skip the files that had already been uploaded. After checksumming everything again, it skipped the existing files and resumed the uploads starting with the 63rd file, the one during which the first process had crashed. All fine so far, only archive.php and book_op.php tasks.

Yesterday (21st) around 13:00 UTC, while that second process was still running, a derive task (ID 1111473727) was suddenly queued and started. I don't know exactly how much data was uploaded at the time, but it looks like it was uploading the 96th file. Definitely still far from done. The derive was queued from an archive.php task which had next_cmd=derive set, unlike all the previous archive.php tasks. Since then, every completed upload restarted the derive task...

I can't help but notice that 62 + 96 is very close to the total of 159 files. Is the condition for setting the queue_derive flag perhaps not taking into account skipped files?

```python
# Set derive header if queue_derive is True,
# and this is the last request being made.
if queue_derive is True and file_index >= total_files:
    _queue_derive = True
else:
    _queue_derive = False
```

@JustAnotherArchivist (Contributor, Author)

Actually, I missed a file previously, so it looks like it was file 97 on the second leg where the derive was queued, which fits perfectly with 159.

Looking at the code of internetarchive.utils.recursive_file_count, it seems that this is indeed the issue. Specifically, total_files is only incremented when a file's checksum does not match one of the item's existing files. So total_files is the number of files that haven't been uploaded yet, but in the internetarchive.item.Item.upload code quoted above, it's treated as the total number of files.

For my specific case, the return value of recursive_file_count on the second run would be 97 (159 - 62). After skipping those 62 files and uploading 35 more, file_index >= 97 is True in upload, and so the derive is queued.
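To illustrate the arithmetic, here is a minimal, hypothetical simulation of the miscount (not the library's actual code): upload increments file_index for skipped files too, while total_files only counts the files still to be uploaded, so the derive condition fires 62 files early.

```python
# Hypothetical simulation of the miscount described in this issue.
# recursive_file_count (per the report) counts only files NOT yet on the item.
total_local_files = 159
already_uploaded = 62  # files skipped via --checksum on the second run
total_files = total_local_files - already_uploaded  # 97, the buggy "total"

derive_queued_at = None
file_index = 0
for n in range(1, total_local_files + 1):
    file_index += 1  # upload() advances the index even for skipped files
    if file_index >= total_files and derive_queued_at is None:
        derive_queued_at = n  # the derive header is first set on this request

print(derive_queued_at)  # 97: the derive is queued 62 files before the end
```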

Either recursive_file_count has to return the total number of files, or upload needs to increment file_index only when a file isn't skipped.
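The second option could be sketched like this (a simplified, hypothetical upload loop, not the library's actual implementation; upload_all, files, and already_uploaded are invented names for illustration): the index only advances for files that are actually transmitted, so it reaches total_files exactly on the final real upload.

```python
# Sketch of the proposed fix: only count files that are actually uploaded.
def upload_all(files, already_uploaded, total_files, queue_derive=True):
    """files: iterable of file names; already_uploaded: set of skipped names.

    Returns one queue-derive flag per file actually uploaded.
    """
    file_index = 0
    derive_flags = []
    for name in files:
        if name in already_uploaded:
            continue  # skipped file: do not advance the index
        file_index += 1
        # With this counting, the condition is True only on the last upload.
        derive_flags.append(queue_derive and file_index >= total_files)
    return derive_flags

files = [f"file{i:03d}.warc.gz" for i in range(159)]
skipped = set(files[:62])  # the 62 files the --checksum pass skips
flags = upload_all(files, skipped, total_files=len(files) - len(skipped))
print(flags.count(True), flags[-1])  # exactly one derive, on the final upload
```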
