Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrent import file download #7521

Merged
merged 8 commits into from
Oct 19, 2020

Conversation

kollivier
Copy link
Contributor

Summary

Changes the importcontent command to download files concurrently, using 5 threads.

After our recent discussions about the slow download profiling, I spent a bit of a time over the weekend seeing how hard it might be, and what gains we might get, if downloads were run concurrently instead of one-by-one. By switching to ThreadPoolExecutor, I was able to reduce African Storybooks download time from ~75min to ~21min, and based on my local connection speed test, I had estimated ~18min if the connection was fully utilized. I haven't tested this on python 2, but I believe we already have the python 2 backport of concurrent.futures in Kolibri.

Note that I'm still working on updating some tests, but can confirm the failures come from a change to the number and order of calls to is_cancelled, along with the removal of some time.sleep calls. It is taking time because I want to make sure we give the right number and order of values for each call.

Note that I haven't tested this on Python 2, and have not done any disk copy testing aside from running the unit tests. I am hoping to play around with this some more this coming weekend, but posting so that people are aware of the work and so that if anyone wants to play with this, they can.

Reviewer guidance

References


Contributor Checklist

PR process:

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Testing:

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

Reviewer Checklist

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@kollivier kollivier marked this pull request as draft September 21, 2020 16:46
…om concurrently trying to update file-specific update progress.
…t behavior and making a couple tests operate on a single file to match pre-threaded test behavior.
@codecov
Copy link

codecov bot commented Oct 4, 2020

Codecov Report

Merging #7521 into release-v0.14.x will increase coverage by 0.02%.
The diff coverage is 88.09%.

Impacted Files Coverage Δ
kolibri/core/tasks/management/commands/base.py 75.32% <33.33%> (-1.71%) ⬇️
.../core/content/management/commands/importcontent.py 76.83% <91.66%> (+1.83%) ⬆️
kolibri/core/content/utils/transfer.py 83.69% <100.00%> (+0.27%) ⬆️
...ility/assets/src/modules/facilityConfig/actions.js 50.00% <0.00%> (-13.64%) ⬇️
...acility/assets/src/modules/facilityConfig/index.js 43.75% <0.00%> (-9.20%) ⬇️
kolibri/core/assets/src/views/AppBar.vue 60.86% <0.00%> (-9.14%) ⬇️
kolibri/core/content/test/sqlalchemytesting.py 71.42% <0.00%> (-3.58%) ⬇️
...lity/assets/src/views/FacilityConfigPage/index.vue 62.29% <0.00%> (-2.42%) ⬇️
kolibri/plugins/facility/assets/src/constants.js 80.00% <0.00%> (-1.82%) ⬇️
kolibri/utils/version.py 82.44% <0.00%> (-1.53%) ⬇️
... and 30 more

@kollivier kollivier changed the title [WIP] Concurrent import file download Concurrent import file download Oct 6, 2020
@kollivier kollivier added the TODO: needs review Waiting for review label Oct 6, 2020
@kollivier kollivier marked this pull request as ready for review October 6, 2020 19:30
@kollivier
Copy link
Contributor Author

Marking as ready for review, as unit tests are fixed and changes have been made. Plus, I just had a really slow import locally for the GDL changes and I was wishing I had this already... ;-)

Copy link
Member

@rtibbles rtibbles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question about memory usage from precomputing all the file transfer objects.

@jonboiser jonboiser added this to the upcoming patch milestone Oct 13, 2020
Copy link
Member

@rtibbles rtibbles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting - so if I am reading the code correctly, it's not creating the file transfer objects in memory that's the issue, it's the submitting them to the executor? So we batch the submission?

@kollivier
Copy link
Contributor Author

Yeah, that surprised me too. It's not the submit call itself, though, it's actually as_completed where the huge memory jump happens. (That's why I initially suspected the downloads themselves, because the memory profiler doesn't show each run but just an aggregate.) What I suspect is that internally, that function is caching and retaining some significant amount of state for each task, leaving them all in memory until the entire as_completed with block completes.

@rtibbles rtibbles merged commit ef27835 into learningequality:release-v0.14.x Oct 19, 2020
@jonboiser jonboiser added changelog Important user-facing changes and removed TODO: needs review Waiting for review labels Oct 19, 2020
@jonboiser jonboiser modified the milestones: upcoming patch, 0.14.4 Oct 19, 2020
@rtibbles rtibbles mentioned this pull request Apr 1, 2022
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog Important user-facing changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants