
Derive queued before the last file (due to miscounting skipped files?) #288

Open
JustAnotherArchivist opened this issue Jan 22, 2019 · 1 comment

Comments

@JustAnotherArchivist (Contributor)

I'm currently uploading a large dataset (159 WARCs, just over 1 TiB) using ia upload. Given the general (in)stability of IA's S3 interface (#176 et al.), I expected from the beginning that I'd have to run ia multiple times. However, my expectation was also that the derive would only be queued after the last file was uploaded. That did not happen: while the second (first resumed) upload was still running, a derive task suddenly started.

The first upload process died with a socket.timeout: The write operation timed out/Connection aborted error after about three days, with a bit over a third of the data (62 files) uploaded. I restarted it with --checksum to skip the files that had already been uploaded. After checksumming everything again, it skipped the existing files and resumed the uploads starting with the 63rd file, the one during which the first process had crashed. All fine so far, only archive.php and book_op.php tasks.

Yesterday (21st) around 13:00 UTC, while that second process was still running, a derive task (ID 1111473727) was suddenly queued and started. I don't know exactly how much data was uploaded at the time, but it looks like it was uploading the 96th file. Definitely still far from done. The derive was queued from an archive.php task which had next_cmd=derive set, unlike all the previous archive.php tasks. Since then, every completed upload restarted the derive task...

I can't help but notice that 62 + 96 is very close to the total of 159 files. Is the condition for setting the queue_derive flag perhaps not taking into account skipped files?

```python
# Set derive header if queue_derive is True,
# and this is the last request being made.
if queue_derive is True and file_index >= total_files:
    _queue_derive = True
else:
    _queue_derive = False
```

@JustAnotherArchivist (Contributor, Author)

Actually, I missed a file previously, so it looks like it was file 97 on the second leg where the derive was queued, which fits perfectly with 159.

Looking at the code of internetarchive.utils.recursive_file_count, it seems that this is indeed the issue. Specifically, total_files is only incremented when a file's checksum does not match one of the item's existing files. So total_files is the number of files that haven't been uploaded yet, but in the internetarchive.item.Item.upload code quoted above, it's treated as the total number of files.

For my specific case, the return value of recursive_file_count on the second run would be 97 (159 - 62). After skipping those 62 files and uploading 35 more, file_index >= 97 is True in upload, and so the derive is queued.
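To illustrate the arithmetic, here is a minimal, hypothetical simulation of the miscount (not the library's actual code): upload increments file_index for skipped files too, while total_files only counts the files still to be uploaded, so the derive condition fires 62 files early.

```python
# Hypothetical simulation of the miscount described in this issue.
# recursive_file_count (per the report) counts only files NOT yet on the item.
total_local_files = 159
already_uploaded = 62  # files skipped via --checksum on the second run
total_files = total_local_files - already_uploaded  # 97, the buggy "total"

derive_queued_at = None
file_index = 0
for n in range(1, total_local_files + 1):
    file_index += 1  # upload() advances the index even for skipped files
    if file_index >= total_files and derive_queued_at is None:
        derive_queued_at = n  # the derive header is first set on this request

print(derive_queued_at)  # 97: the derive is queued 62 files before the end
```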

Either recursive_file_count has to return the total number of files, or upload needs to increment file_index only when a file isn't skipped.
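The second option could be sketched like this (a simplified, hypothetical upload loop, not the library's actual implementation; upload_all, files, and already_uploaded are invented names for illustration): the index only advances for files that are actually transmitted, so it reaches total_files exactly on the final real upload.

```python
# Sketch of the proposed fix: only count files that are actually uploaded.
def upload_all(files, already_uploaded, total_files, queue_derive=True):
    """files: iterable of file names; already_uploaded: set of skipped names.

    Returns one queue-derive flag per file actually uploaded.
    """
    file_index = 0
    derive_flags = []
    for name in files:
        if name in already_uploaded:
            continue  # skipped file: do not advance the index
        file_index += 1
        # With this counting, the condition is True only on the last upload.
        derive_flags.append(queue_derive and file_index >= total_files)
    return derive_flags

files = [f"file{i:03d}.warc.gz" for i in range(159)]
skipped = set(files[:62])  # the 62 files the --checksum pass skips
flags = upload_all(files, skipped, total_files=len(files) - len(skipped))
print(flags.count(True), flags[-1])  # exactly one derive, on the final upload
```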
