Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract subtitles and assessment items into the right directories #4827

Conversation

aronasorman
Copy link
Collaborator

WIP, but once merged, this fully deprecates languagepackdownload and unpack_assessment_zip.

TODO:

  • change the setup command to use retrievecontentpack instead of unpack_assessment_zip
  • use retrievecontentpack in the language pack UI
  • [ ] delete languagepackdownload and unpack_assessment_zip (unpack_assessment_zip removal will happen for 0.17.)

Changes:

  • extend retrievecontentpack to extract subtitles and assessment items.
  • setup command now uses retrievecontentpack.
  • deleted languagepackdownload command. Purged associated tests as well.

Bug:

  • It seems like because we use call_command_async rather than force_job for launching retrievecontentpack, the progress bar doesn't appear anymore.

Note to reviewers: It's easier to review this commit by commit.

assessment_zip_dir = "assessment_resources/"

try:
channel = zf.read("channel.name")

This comment was marked as spam.

This comment was marked as spam.

@aronasorman aronasorman force-pushed the extract-subtitles-and-assessment-items branch 2 times, most recently from d24118c to b55925d Compare February 10, 2016 23:29
@aronasorman aronasorman force-pushed the extract-subtitles-and-assessment-items branch from b55925d to fde7af9 Compare February 11, 2016 21:46
@@ -61,8 +61,7 @@ def update_content_availability(content_list, language="en", channel="khan"):

# Databases have been pre-filtered to only contain existing exercises
# Assume if the assessment items have been downloaded, then everything is hunky dory.
if os.path.exists(contentload_settings.KHAN_ASSESSMENT_ITEM_VERSION_PATH):
update["available"] = True
update["available"] = True

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

@aronasorman
Copy link
Collaborator Author

As mentioned by @benjaoming, there will be a significant probability of delay if we go full hog with retrievecontentpack. So I'm restoring unpack_assessment_zip, and for 0.16 we're going with a hybrid of:

  1. small content packs taking the place of language packs
  2. a new khan_assessment_zip extracted from the english content pack

TODO:

  • restore unpack_assessment_zip

@aronasorman
Copy link
Collaborator Author

@MCGallaspy @cpauya ^ take note.

@aronasorman aronasorman force-pushed the extract-subtitles-and-assessment-items branch from d414010 to f493ea7 Compare February 12, 2016 20:26
@aronasorman
Copy link
Collaborator Author

unpack_assessment_zip is back!

@aronasorman aronasorman force-pushed the extract-subtitles-and-assessment-items branch from f493ea7 to 7347aef Compare February 12, 2016 20:36
@benjaoming
Copy link
Contributor

Super, @aronasorman, thanks for taking quick action on this! :)

@cpauya
Copy link
Contributor

cpauya commented Feb 15, 2016

Thanks for the mention @aronasorman.

@benjaoming
Copy link
Contributor

@aronasorman may I take it from here or do you have further work you're planning to push?

@aronasorman
Copy link
Collaborator Author

@benjaoming tests are failing, but feel free to take it from here. Further changes I need to do are on the content-pack-maker side now.

@benjaoming
Copy link
Contributor

awesome, thanks for the response!

@aronasorman
Copy link
Collaborator Author

@benjaoming reopened this, with merge conflicts and test failures fixed.

@@ -5,7 +5,7 @@
- Run migrations
- Find and relocate obsolete user and data files
- if interactive:
- Download and unpack assessment items
- Download and english content pack, containing assessment items

This comment was marked as spam.

@aronasorman aronasorman added this to the 0.16.0 milestone Feb 16, 2016
else:
call_command("unpack_assessment_zip", ass_item_filename)
call_command("retrievecontentpack", retrieval_method, "en", ass_item_filename)

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

Since we now skip over exercise availability annotation (they've been
premarked in the content-pack-maker side), `update_content_availability`
will now never return any exercises. So these tests will always fail
unless we change the way we test `update_content_availability`.
@aronasorman
Copy link
Collaborator Author

Tests now pass, but I had to delete the previous failing tests, since they didn't capture the right behaviour anymore.

I'd rather put off restoring the tests into another PR, so that the commits in here can be merged.

@MCGallaspy
Copy link
Contributor

If the behavior of annotate_content_availability has changed so drastically as to be useless, then perhaps we should remove that function (and the associated test module) entirely? Practically, the likelihood of us writing a test drops way down if we don't do it now, and the result will be that we have both dead tests and untested code.

@aronasorman
Copy link
Collaborator Author

It's not that it's now useless, the test just wasn't up to spec.

Here's what happened:

  1. Exercises used to be marked as unavailable once made in the content-pack-maker. update_content_availability then looped over both videos and exercises, and if either a video or exercise node's availability was changed, it was returned (to be consumed by another function for the actual writing to the content db and bubbling up availability through topics.
  2. Now to make update_content_availability faster, I now premark exercises as available on the content-pack-maker side, and then on KA Lite, skip marking exercises, since we now assume that they've been marked appropriately.
  3. The deleted test, tested update_content_availability by marking an exercise as available or not. Since we now explicitly skip over exercises, update_content_availability will never return the right answer for said test.

Now, the proper solution is to refactor the test to use videos instead of exercises. However, that will take 1-2 more days just writing the test. I think the contents of the PR are critical enough such that we should merge it soon without waiting for 2 more days, since proper KA Lite behaviour depends on the fixes here.

assessment_item = get_assessment_item_data(channel=getattr(request, "channel", "khan"),
language=getattr(request, "language", "en"),
assessment_item_id=assessment_item)
return JsonResponse(assessment_item)

This comment was marked as spam.

@MCGallaspy
Copy link
Contributor

Profile

Top 20 items from profiling, sorted by total time:

         7037860 function calls (7016266 primitive calls) in 1043.907 seconds

   Ordered by: internal time
   List reduced from 781 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  378.022  378.022 1020.743 1020.743 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:180(extract_assessment_items)
    59514  323.828    0.005  323.828    0.005 {open}
   113105  212.975    0.002  212.975    0.002 {method 'read' of 'file' objects}
    83355   55.138    0.001   55.158    0.001 {method 'write' of 'file' objects}
    29756   35.008    0.001   35.008    0.001 {method 'close' of 'file' objects}
      328   12.257    0.037   12.257    0.037 {method 'execute' of 'sqlite3.Cursor' objects}
        7    7.243    1.035    7.243    1.035 {method 'commit' of 'sqlite3.Connection' objects}
    29755    3.196    0.000    3.196    0.000 {nt.mkdir}
    51323    1.487    0.000    1.487    0.000 {nt.stat}
    83346    0.785    0.000    0.785    0.000 {zlib.crc32}
    29753    0.666    0.000    3.516    0.000 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:362(_format_widgets)
   178927    0.582    0.000    0.998    0.000 C:\Users\Mike\Envs\develop-0-16\lib\abc.py:128(__instancecheck__)
29755/29753    0.580    0.000   34.126    0.001 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:428(update)
    59512    0.545    0.000    0.545    0.000 {built-in method now}
   113073    0.526    0.000  179.552    0.002 C:\Python27\Lib\zipfile.py:649(read1)
    29753    0.511    0.000    0.796    0.000 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:251(data)
    29755    0.500    0.000   38.146    0.001 C:\Python27\Lib\zipfile.py:937(open)
    29754    0.488    0.000    0.898    0.000 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:403(_needs_update)
   221885    0.451    0.000    0.622    0.000 C:\Users\Mike\Envs\develop-0-16\lib\ntpath.py:96(splitdrive)
    59553    0.439    0.000    0.721    0.000 C:\Users\Mike\Envs\develop-0-16\lib\ntpath.py:174(split)

Top 20 items sorted by cumulative time:

         7037860 function calls (7016266 primitive calls) in 1043.907 seconds

   Ordered by: cumulative time
   List reduced from 781 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000 1043.485 1043.485 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:79(local)
        1    0.000    0.000 1043.170 1043.170 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:89(process_content_pack)
        1  378.022  378.022 1020.743 1020.743 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:180(extract_assessment_items)
    59514  323.828    0.005  323.828    0.005 {open}
   113105  212.975    0.002  212.975    0.002 {method 'read' of 'file' objects}
    29755    0.159    0.000  206.805    0.007 C:\Python27\Lib\shutil.py:46(copyfileobj)
    83346    0.228    0.000  179.811    0.002 C:\Python27\Lib\zipfile.py:621(read)
   113073    0.526    0.000  179.552    0.002 C:\Python27\Lib\zipfile.py:649(read1)
    83355   55.138    0.001   55.158    0.001 {method 'write' of 'file' objects}
    29755    0.500    0.000   38.146    0.001 C:\Python27\Lib\zipfile.py:937(open)
    29755    0.151    0.000   35.239    0.001 C:\Python27\Lib\zipfile.py:701(close)
    29756   35.008    0.001   35.008    0.001 {method 'close' of 'file' objects}
    29753    0.288    0.000   34.546    0.001 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:335(__next__)
29755/29753    0.580    0.000   34.126    0.001 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:428(update)
    29753    0.176    0.000   32.397    0.001 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:92(update)
    29753    0.189    0.000   32.220    0.001 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:37(update)
        1    0.000    0.000   18.214   18.214 D:\ka-lite\python-packages\django\core\management\__init__.py:126(call_command)
        1    0.000    0.000   18.017   18.017 D:\ka-lite\python-packages\django\core\management\base.py:240(execute)
        1    0.000    0.000   18.006   18.006 D:\ka-lite\kalite\topic_tools\management\commands\annotate_content_items.py:33(handle)
        1    0.000    0.000   18.002   18.002 D:\ka-lite\kalite\topic_tools\content_models.py:103(wrapper)

Conclusion

The built-in open and read method of file objects are the culprits, and I'm assuming the majority of uses for each are occurring in extract_assessment_items -- how can we cut down the # of calls to those functions?

@MCGallaspy
Copy link
Contributor

@rtibbles I'll defer to you to merge this -- there are no obvious errors to me, though I am concerned about the above items. However, after discussing with @aronasorman I think it's important to merge this and tackle any related issues separately, so that testing KA Lite can proceed.

@rtibbles
Copy link
Member

Some open questions here - I think we may want to revisit the current client side behaviour if nothing is returned from the assessment items, for example. But let's merge this so KA Lite testing can continue.

rtibbles added a commit that referenced this pull request Feb 18, 2016
…sment-items

Extract subtitles and assessment items into the right directories
@rtibbles rtibbles merged commit 653cef9 into learningequality:0.16.x Feb 18, 2016
@rtibbles rtibbles removed the has PR label Feb 18, 2016
@aronasorman
Copy link
Collaborator Author

Thanks for benchmarking! I can make this faster by assuming certain things in the path. That would make it brittle (it would mean the path we extract to and the path of assessment items in the zip file wouldn't change), but it would mean we can just call zf.extractall rather than looping over items and extracting them one by one.

Made an issue: #4827 (comment)

@MCGallaspy
Copy link
Contributor

If the performance increase is very substantial, I would prefer that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants