Concurrent import file download #7521

kollivier · 2020-09-21T16:45:52Z

Summary

Changes the importcontent command to download files concurrently, using 5 threads.

After our recent discussions about the slow download profiling, I spent a bit of time over the weekend seeing how hard it would be, and what gains we might get, if downloads ran concurrently instead of one by one. By switching to ThreadPoolExecutor, I reduced the African Storybooks download time from ~75 min to ~21 min; based on my local connection speed test, I had estimated ~18 min with the connection fully utilized. I haven't tested this on Python 2, but I believe we already have the Python 2 backport of concurrent.futures in Kolibri.
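For readers unfamiliar with the pattern, here is a minimal sketch of running downloads on a five-thread pool. It is not the code in this PR: `fake_download` and `download_all` are hypothetical names, and the sleep stands in for real network I/O.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 5  # matches the thread count mentioned in the summary


def fake_download(name, delay=0.05):
    """Hypothetical stand-in for a file transfer; sleeps to mimic network I/O."""
    time.sleep(delay)
    return name, len(name)


def download_all(names, worker=fake_download, max_workers=MAX_WORKERS):
    """Run worker(name) for every name on a thread pool and collect the results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(worker, name) for name in names]
        # as_completed yields each future as soon as its worker finishes,
        # so completion order is not the same as submission order.
        for future in as_completed(futures):
            name, size = future.result()  # re-raises any exception from the worker
            results[name] = size
    return results
```

Because the workers spend most of their time waiting on I/O, threads can overlap that waiting even under the GIL, which is consistent with the speedup reported above.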

Note that I'm still working on updating some tests, but can confirm the failures come from a change to the number and order of calls to is_cancelled, along with the removal of some time.sleep calls. It is taking time because I want to make sure we give the right number and order of values for each call.
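To illustrate why the number and order of values matter, here is a small sketch using `Mock(side_effect=[...])`, where each call returns the next value in the list. The `import_files` loop is hypothetical and much simpler than the real command, and the sketch uses Python 3's `unittest.mock` rather than whatever mock library the Kolibri tests import.

```python
from unittest.mock import Mock


def import_files(files, is_cancelled):
    """Hypothetical loop that checks for cancellation before each file."""
    done = []
    for name in files:
        if is_cancelled():
            break
        done.append(name)
    return done


# Each call to the mock returns the next value from the list, in order.
# If the code under test calls it more often than the list is long,
# the mock raises StopIteration, so the list length has to match the
# number of calls the implementation actually makes.
is_cancelled = Mock(side_effect=[False, False, True])
imported = import_files(["a", "b", "c"], is_cancelled)
```

Changing how often the implementation checks for cancellation therefore means updating every such list in the tests.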

Note that I haven't tested this on Python 2, and have not done any disk copy testing aside from running the unit tests. I am hoping to play around with this some more this coming weekend, but I am posting now so that people are aware of the work and anyone who wants to try it can.

Reviewer guidance

…

References

…

Contributor Checklist

PR process:

PR has the correct target branch and milestone
PR has 'needs review' or 'work-in-progress' label
If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
If this is an important user-facing change, PR or related issue has a 'changelog' label
If this includes an internal dependency change, a link to the diff is provided

Testing:

Contributor has fully tested the PR manually
If there are any front-end changes, before/after screenshots are included
Critical user journeys are covered by Gherkin stories
Critical and brittle code paths are covered by unit tests

Reviewer Checklist

Automated test coverage is satisfactory
PR is fully functional
PR has been tested for accessibility regressions
External dependency files were updated if necessary (yarn and pip)
Documentation is updated
Contributor is in AUTHORS.md

codecov · 2020-10-04T02:53:44Z

Codecov Report

Merging #7521 into release-v0.14.x will increase coverage by 0.02%.
The diff coverage is 88.09%.

| Impacted Files | Coverage Δ | |
|---|---|---|
| kolibri/core/tasks/management/commands/base.py | `75.32% <33.33%> (-1.71%)` | ⬇️ |
| .../core/content/management/commands/importcontent.py | `76.83% <91.66%> (+1.83%)` | ⬆️ |
| kolibri/core/content/utils/transfer.py | `83.69% <100.00%> (+0.27%)` | ⬆️ |
| ...ility/assets/src/modules/facilityConfig/actions.js | `50.00% <0.00%> (-13.64%)` | ⬇️ |
| ...acility/assets/src/modules/facilityConfig/index.js | `43.75% <0.00%> (-9.20%)` | ⬇️ |
| kolibri/core/assets/src/views/AppBar.vue | `60.86% <0.00%> (-9.14%)` | ⬇️ |
| kolibri/core/content/test/sqlalchemytesting.py | `71.42% <0.00%> (-3.58%)` | ⬇️ |
| ...lity/assets/src/views/FacilityConfigPage/index.vue | `62.29% <0.00%> (-2.42%)` | ⬇️ |
| kolibri/plugins/facility/assets/src/constants.js | `80.00% <0.00%> (-1.82%)` | ⬇️ |
| kolibri/utils/version.py | `82.44% <0.00%> (-1.53%)` | ⬇️ |
| ... and 30 more | | |

kollivier · 2020-10-06T19:31:38Z

Marking as ready for review, as unit tests are fixed and changes have been made. Plus, I just had a really slow import locally for the GDL changes and I was wishing I had this already... ;-)

rtibbles

One question about memory usage from precomputing all the file transfer objects.

rtibbles

Interesting - so if I am reading the code correctly, the issue is not creating the file transfer objects in memory, but submitting them to the executor? So we batch the submission?

kollivier · 2020-10-19T21:13:59Z

Yeah, that surprised me too. It's not the submit call itself, though, it's actually as_completed where the huge memory jump happens. (That's why I initially suspected the downloads themselves, because the memory profiler doesn't show each run but just an aggregate.) What I suspect is that internally, that function is caching and retaining some significant amount of state for each task, leaving them all in memory until the entire as_completed with block completes.
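A rough sketch of the batching idea discussed here, under the suspicion stated above that futures held across one large `as_completed` loop keep their state alive. This is not the merged code: `chunks`, `run_in_batches`, and the `BATCH_SIZE` value are hypothetical, and the PR does not state the batch size it uses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

BATCH_SIZE = 100  # hypothetical value, chosen only for illustration


def chunks(items, size):
    """Yield consecutive slices of items, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(items, worker, max_workers=5, batch_size=BATCH_SIZE):
    """Submit work to the pool one batch at a time instead of all at once."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for batch in chunks(items, batch_size):
            # Only this batch's futures exist at any moment; the list is
            # rebound on the next iteration, so finished futures can be freed.
            futures = [executor.submit(worker, item) for item in batch]
            for future in as_completed(futures):
                results.append(future.result())
    return results
```

The trade-off is a brief stall at the end of each batch while the slowest task finishes, in exchange for a bound on how many pending futures are alive at once.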

kollivier added 2 commits September 19, 2020 11:12

Start on concurrent file downloading.

74fcf39

Ensure an unhandled exception cancels remaining tasks.

4e6e974

kollivier marked this pull request as draft September 21, 2020 16:46

rtibbles reviewed Sep 21, 2020

kollivier added 4 commits September 27, 2020 20:20

Remove file-specific progress updating to prevent multiple threads from concurrently trying to update file-specific update progress.

d286412

Fix test failures by adjusting the is_cancelled mocks to match current behavior and making a couple tests operate on a single file to match pre-threaded test behavior.

0f4774c

Linting fixes.

4ea47f5

Black

ae5cd36

kollivier changed the title ~~[WIP] Concurrent import file download~~ Concurrent import file download Oct 6, 2020

kollivier added the "TODO: needs review" (Waiting for review) label Oct 6, 2020

kollivier marked this pull request as ready for review October 6, 2020 19:30

rtibbles reviewed Oct 8, 2020

jonboiser added this to the upcoming patch milestone Oct 13, 2020

kollivier added 2 commits October 18, 2020 11:11

Add downloads to ThreadPoolExecutor in batches to keep memory usage down. Also, only open temp file when we start downloading.

433b44b

Fix broken tests now that len is called before downloading starts.

5530496

rtibbles approved these changes Oct 19, 2020

rtibbles merged commit ef27835 into learningequality:release-v0.14.x Oct 19, 2020

jonboiser added the "changelog" (Important user-facing changes) label and removed the "TODO: needs review" (Waiting for review) label Oct 19, 2020

jonboiser modified the milestones: upcoming patch, 0.14.4 Oct 19, 2020

kollivier mentioned this pull request Oct 20, 2020

Khan Academy (English) Import Metrics #6684

Closed

rtibbles mentioned this pull request Mar 22, 2021

Import from a virtual disk in an external SSD is slow at roughly 5-9MBps inside a VM. #6715

Closed

rtibbles mentioned this pull request Apr 1, 2022

Import content issue fixes #9242

Merged

9 tasks
