Concurrent import file download #7521
…om concurrently trying to update file-specific update progress.
…t behavior and making a couple tests operate on a single file to match pre-threaded test behavior.
Marking as ready for review, as unit tests are fixed and changes have been made. Plus, I just had a really slow import locally for the GDL changes and I was wishing I had this already... ;-)
One question about memory usage from precomputing all the file transfer objects.
…own. Also, only open temp file when we start downloading.
Interesting - so if I am reading the code correctly, it's not creating the file transfer objects in memory that's the issue, it's the submitting them to the executor? So we batch the submission?
Yeah, that surprised me too. It's not the |
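For readers following the batching question above, here is a minimal sketch of what submitting work to an executor in batches could look like. It is not the code from this PR: `run_in_batches`, `worker`, and the `batch_size` default are hypothetical names and values chosen for illustration. The idea is only that at most one batch of futures is pending at any time, rather than one future per file for the whole import.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from itertools import islice


def run_in_batches(tasks, worker, max_workers=5, batch_size=100):
    """Run worker(task) for every task, submitting at most batch_size at a time.

    Hypothetical helper: the name, the signature, and the default batch_size
    are assumptions for illustration, not taken from the PR.
    """
    task_iter = iter(tasks)
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        while True:
            # Pull the next slice of tasks; only these become pending futures.
            batch = list(islice(task_iter, batch_size))
            if not batch:
                break
            futures = [executor.submit(worker, task) for task in batch]
            # Block until the whole batch has finished before submitting more.
            wait(futures)
            # result() re-raises any exception raised inside worker.
            results.extend(future.result() for future in futures)
    return results
```

Waiting for a full batch before submitting the next one is the simplest form of this; it leaves some threads idle at the end of each batch, which a sliding window of pending futures would avoid at the cost of more bookkeeping.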
Summary
Changes the importcontent command to download files concurrently, using 5 threads.
After our recent discussions about the slow download profiling, I spent a bit of time over the weekend seeing how hard it might be, and what gains we might get, if downloads were run concurrently instead of one-by-one. By switching to `ThreadPoolExecutor`, I was able to reduce African Storybooks download time from ~75min to ~21min, and based on my local connection speed test, I had estimated ~18min if the connection was fully utilized. I haven't tested this on Python 2, but I believe we already have the Python 2 backport of concurrent.futures in Kolibri.

Note that I'm still working on updating some tests, but can confirm the failures come from a change to the number and order of calls to `is_cancelled`, along with the removal of some `time.sleep` calls. It is taking time because I want to make sure we give the right number and order of values for each call.

Note that I haven't tested this on Python 2, and have not done any disk copy testing aside from running the unit tests. I am hoping to play around with this some more this coming weekend, but posting so that people are aware of the work and so that if anyone wants to play with this, they can.
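As a rough illustration of the approach described in this summary, the sketch below runs downloads on a `ThreadPoolExecutor` with 5 worker threads and checks a cancellation callback as each download finishes. It is a sketch under stated assumptions, not the importcontent implementation: `download_all` and `fetch` are hypothetical names, and `is_cancelled` here is a plain callable standing in for whatever the real command uses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(file_ids, fetch, is_cancelled=lambda: False, max_workers=5):
    """Call fetch(file_id) for each id on a thread pool.

    Hypothetical helper for illustration only. Returns (results, errors),
    two dicts keyed by file id.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_id = {executor.submit(fetch, fid): fid for fid in file_ids}
        for future in as_completed(future_to_id):
            if is_cancelled():
                # Drop downloads that have not started yet; ones already
                # running finish when the executor shuts down.
                for pending in future_to_id:
                    pending.cancel()
                break
            file_id = future_to_id[future]
            try:
                results[file_id] = future.result()
            except Exception as exc:
                # Record the failure and keep processing the other files.
                errors[file_id] = exc
    return results, errors
```

Because futures complete in arbitrary order, anything that counts or orders calls to the cancellation check (as the tests mentioned above do) will see different sequences than a one-by-one loop would produce.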
Reviewer guidance
…
References
…