Extract subtitles and assessment items into the right directories #4827

aronasorman · 2016-02-09T17:56:47Z

WIP, but once merged, this fully deprecates languagepackdownload and unpack_assessment_zip.

TODO:

change the setup command to use retrievecontentpack instead of unpack_assessment_zip
use retrievecontentpack in the language pack UI
~~[ ] delete languagepackdownload and unpack_assessment_zip~~ (unpack_assessment_zip removal will happen for 0.17.)

Changes:

extend retrievecontentpack to extract subtitles and assessment items.
setup command now uses retrievecontentpack.
deleted languagepackdownload command. Purged associated tests as well.

Bug:

It seems like because we use call_command_async rather than force_job for launching retrievecontentpack, the progress bar doesn't appear anymore.

Note to reviewers: It's easier to review this commit by commit.

kalite/distributed/management/commands/retrievecontentpack.py

+    assessment_zip_dir = "assessment_resources/"
+
+    try:
+        channel = zf.read("channel.name")


Instead of unpack_assessment_zip.

Move away from tastypie endpoint, since our needs are simple and few.

kalite/topic_tools/annotate.py

@@ -61,8 +61,7 @@ def update_content_availability(content_list, language="en", channel="khan"):

            # Databases have been pre-filtered to only contain existing exercises
            # Assume if the assessment items have been downloaded, then everything is hunky dory.
-            if os.path.exists(contentload_settings.KHAN_ASSESSMENT_ITEM_VERSION_PATH):
-                update["available"] = True
+            update["available"] = True


aronasorman · 2016-02-12T18:46:51Z

As mentioned by @benjaoming, there will be a significant probability of delay if we go full hog with retrievecontentpack. So I'm restoring unpack_assessment_zip, and for 0.16 we're going with a hybrid of:

small content packs taking the place of language packs
a new khan_assessment_zip extracted from the english content pack

TODO:

restore unpack_assessment_zip

aronasorman · 2016-02-12T19:37:04Z

@MCGallaspy @cpauya ^ take note.

aronasorman · 2016-02-12T20:26:29Z

unpack_assessment_zip is back!

benjaoming · 2016-02-12T21:58:02Z

Super, @aronasorman, thanks for taking quick action on this! :)

cpauya · 2016-02-15T02:43:52Z

Thanks for the mention @aronasorman.

benjaoming · 2016-02-15T19:05:04Z

@aronasorman may I take it from here or do you have further work you're planning to push?

aronasorman · 2016-02-15T21:44:53Z

@benjaoming tests are failing, but feel free to take it from here. Further changes I need to do are on the content-pack-maker side now.

benjaoming · 2016-02-15T22:31:20Z

awesome, thanks for the response!

…sessment-items

aronasorman · 2016-02-16T22:55:35Z

@benjaoming reopened this, with merge conflicts and test failures fixed.

kalite/distributed/management/commands/setup.py

@@ -5,7 +5,7 @@
 - Run migrations
 - Find and relocate obsolete user and data files
 - if interactive:
-     - Download and unpack assessment items
+     - Download and english content pack, containing assessment items


kalite/distributed/management/commands/setup.py

                else:
-                    call_command("unpack_assessment_zip", ass_item_filename)
+                    call_command("retrievecontentpack", retrieval_method, "en", ass_item_filename)


Since we now skip over exercise availability annotation (they've been premarked in the content-pack-maker side), `update_content_availability` will now never return any exercises. So these tests will always fail unless we change the way we test `update_content_availability`.

aronasorman · 2016-02-17T17:01:13Z

Tests now pass, but I had to delete the previous failing tests, since they didn't capture the right behaviour anymore.

I'd rather put off restoring the tests into another PR, so that the commits in here can be merged.

MCGallaspy · 2016-02-17T20:16:59Z

If the behavior of annotate_content_availability has changed so drastically as to be useless, then perhaps we should remove that function (and the associated test module) entirely? Practically, the likelihood of us writing a test drops way down if we don't do it now, and the result will be that we have both dead tests and untested code.

aronasorman · 2016-02-17T20:25:41Z

It's not that it's now useless, the test just wasn't up to spec.

Here's what happened:

Exercises used to be marked as unavailable once made in the content-pack-maker. update_content_availability then looped over both videos and exercises, and if either a video or exercise node's availability was changed, it was returned (to be consumed by another function for the actual writing to the content db and bubbling up availability through topics.
Now to make update_content_availability faster, I now premark exercises as available on the content-pack-maker side, and then on KA Lite, skip marking exercises, since we now assume that they've been marked appropriately.
The deleted test, tested update_content_availability by marking an exercise as available or not. Since we now explicitly skip over exercises, update_content_availability will never return the right answer for said test.

Now, the proper solution is to refactor the test to use videos instead of exercises. However, that will take 1-2 more days just writing the test. I think the contents of the PR are critical enough such that we should merge it soon without waiting for 2 more days, since proper KA Lite behaviour depends on the fixes here.

kalite/main/api_views.py

+    assessment_item = get_assessment_item_data(channel=getattr(request, "channel", "khan"),
+                                               language=getattr(request, "language", "en"),
+                                               assessment_item_id=assessment_item)
+    return JsonResponse(assessment_item)


MCGallaspy · 2016-02-18T01:33:26Z

Profile

Top 20 items from profiling, sorted by total time:

         7037860 function calls (7016266 primitive calls) in 1043.907 seconds

   Ordered by: internal time
   List reduced from 781 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  378.022  378.022 1020.743 1020.743 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:180(extract_assessment_items)
    59514  323.828    0.005  323.828    0.005 {open}
   113105  212.975    0.002  212.975    0.002 {method 'read' of 'file' objects}
    83355   55.138    0.001   55.158    0.001 {method 'write' of 'file' objects}
    29756   35.008    0.001   35.008    0.001 {method 'close' of 'file' objects}
      328   12.257    0.037   12.257    0.037 {method 'execute' of 'sqlite3.Cursor' objects}
        7    7.243    1.035    7.243    1.035 {method 'commit' of 'sqlite3.Connection' objects}
    29755    3.196    0.000    3.196    0.000 {nt.mkdir}
    51323    1.487    0.000    1.487    0.000 {nt.stat}
    83346    0.785    0.000    0.785    0.000 {zlib.crc32}
    29753    0.666    0.000    3.516    0.000 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:362(_format_widgets)
   178927    0.582    0.000    0.998    0.000 C:\Users\Mike\Envs\develop-0-16\lib\abc.py:128(__instancecheck__)
29755/29753    0.580    0.000   34.126    0.001 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:428(update)
    59512    0.545    0.000    0.545    0.000 {built-in method now}
   113073    0.526    0.000  179.552    0.002 C:\Python27\Lib\zipfile.py:649(read1)
    29753    0.511    0.000    0.796    0.000 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:251(data)
    29755    0.500    0.000   38.146    0.001 C:\Python27\Lib\zipfile.py:937(open)
    29754    0.488    0.000    0.898    0.000 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:403(_needs_update)
   221885    0.451    0.000    0.622    0.000 C:\Users\Mike\Envs\develop-0-16\lib\ntpath.py:96(splitdrive)
    59553    0.439    0.000    0.721    0.000 C:\Users\Mike\Envs\develop-0-16\lib\ntpath.py:174(split)

Top 20 items sorted by cumulative time:

         7037860 function calls (7016266 primitive calls) in 1043.907 seconds

   Ordered by: cumulative time
   List reduced from 781 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000 1043.485 1043.485 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:79(local)
        1    0.000    0.000 1043.170 1043.170 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:89(process_content_pack)
        1  378.022  378.022 1020.743 1020.743 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:180(extract_assessment_items)
    59514  323.828    0.005  323.828    0.005 {open}
   113105  212.975    0.002  212.975    0.002 {method 'read' of 'file' objects}
    29755    0.159    0.000  206.805    0.007 C:\Python27\Lib\shutil.py:46(copyfileobj)
    83346    0.228    0.000  179.811    0.002 C:\Python27\Lib\zipfile.py:621(read)
   113073    0.526    0.000  179.552    0.002 C:\Python27\Lib\zipfile.py:649(read1)
    83355   55.138    0.001   55.158    0.001 {method 'write' of 'file' objects}
    29755    0.500    0.000   38.146    0.001 C:\Python27\Lib\zipfile.py:937(open)
    29755    0.151    0.000   35.239    0.001 C:\Python27\Lib\zipfile.py:701(close)
    29756   35.008    0.001   35.008    0.001 {method 'close' of 'file' objects}
    29753    0.288    0.000   34.546    0.001 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:335(__next__)
29755/29753    0.580    0.000   34.126    0.001 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:428(update)
    29753    0.176    0.000   32.397    0.001 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:92(update)
    29753    0.189    0.000   32.220    0.001 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\progressbar\bar.py:37(update)
        1    0.000    0.000   18.214   18.214 D:\ka-lite\python-packages\django\core\management\__init__.py:126(call_command)
        1    0.000    0.000   18.017   18.017 D:\ka-lite\python-packages\django\core\management\base.py:240(execute)
        1    0.000    0.000   18.006   18.006 D:\ka-lite\kalite\topic_tools\management\commands\annotate_content_items.py:33(handle)
        1    0.000    0.000   18.002   18.002 D:\ka-lite\kalite\topic_tools\content_models.py:103(wrapper)

Conclusion

The built-in open and read method of file objects are the culprits, and I'm assuming the majority of uses for each are occurring in extract_assessment_items -- how can we cut down the # of calls to those functions?

MCGallaspy · 2016-02-18T01:52:33Z

@rtibbles I'll defer to you to merge this -- there are no obvious errors to me, though I am concerned about the above items. However, after discussing with @aronasorman I think it's important to merge this and tackle any related issues separately, so that testing KA Lite can proceed.

rtibbles · 2016-02-18T02:18:42Z

Some open questions here - I think we may want to revisit the current client side behaviour if nothing is returned from the assessment items, for example. But let's merge this so KA Lite testing can continue.

…sment-items Extract subtitles and assessment items into the right directories

aronasorman · 2016-02-18T03:43:05Z

Thanks for benchmarking! I can make this faster by assuming certain things in the path. That would make it brittle (it would mean the path we extract to and the path of assessment items in the zip file wouldn't change), but it would mean we can just call zf.extractall rather than looping over items and extracting them one by one.

Made an issue: #4827 (comment)

MCGallaspy · 2016-02-18T04:22:03Z

If the performance increase is very substantial, I would prefer that.

Extract subtitles from content packs into the right dir.

404a83b

aronasorman added the has PR label Feb 9, 2016

aronasorman reviewed Feb 9, 2016
View reviewed changes

kalite/distributed/management/commands/retrievecontentpack.py

assessment_zip_dir = "assessment_resources/"

try:

channel = zf.read("channel.name")

This comment was marked as spam.

Sign in to view

This comment was marked as spam.

Sign in to view

aronasorman force-pushed the extract-subtitles-and-assessment-items branch 2 times, most recently from d24118c to b55925d Compare February 10, 2016 23:29

aronasorman added 3 commits February 11, 2016 13:45

extract assessment resources from content packs into the right dir.

dce407e

Setup command now uses retrievecontentpack.

4e5637c

Instead of unpack_assessment_zip.

Migrate the assessment item endpoint to use a plain view.

fde7af9

Move away from tastypie endpoint, since our needs are simple and few.

aronasorman force-pushed the extract-subtitles-and-assessment-items branch from b55925d to fde7af9 Compare February 11, 2016 21:46

Always mark exercises as available.

a70b2a3

aronasorman reviewed Feb 12, 2016
View reviewed changes

aronasorman added 3 commits February 12, 2016 08:34

Query the pantry server for available lang packs.

4ade9fa

Use retrievecontentpack for downloading lang packs.

c3a14bc

Remove tests that depended on old language packs.

5885d71

aronasorman force-pushed the extract-subtitles-and-assessment-items branch from d414010 to f493ea7 Compare February 12, 2016 20:26

Purge the mentions of the languagepackdownload command.

7347aef

aronasorman force-pushed the extract-subtitles-and-assessment-items branch from f493ea7 to 7347aef Compare February 12, 2016 20:36

skip availability stamping for exercises.

fa3bd0c

benjaoming mentioned this pull request Feb 15, 2016

Extract subtitles and assessment items #4849

Merged

benjaoming closed this Feb 15, 2016

benjaoming removed the has PR label Feb 15, 2016

Change test to reflect new return value of update_content_availability.

2b48460

aronasorman reopened this Feb 16, 2016

aronasorman added the has PR label Feb 16, 2016

Merge remote-tracking branch 'u/0.16.x' into extract-subtitles-and-as…

a8489b5

…sessment-items

rtibbles reviewed Feb 16, 2016
View reviewed changes

aronasorman assigned rtibbles Feb 16, 2016

aronasorman added this to the 0.16.0 milestone Feb 16, 2016

aronasorman reviewed Feb 16, 2016
View reviewed changes

aronasorman mentioned this pull request Feb 17, 2016

Use of retrievecontentpack on post install script of OS X installer learningequality/ka-lite-installers#359

Merged

MCGallaspy reviewed Feb 18, 2016
View reviewed changes

kalite/main/api_views.py

assessment_item = get_assessment_item_data(channel=getattr(request, "channel", "khan"),

language=getattr(request, "language", "en"),

assessment_item_id=assessment_item)

return JsonResponse(assessment_item)

This comment was marked as spam.

Sign in to view

rtibbles added a commit that referenced this pull request Feb 18, 2016

Merge pull request #4827 from aronasorman/extract-subtitles-and-asses…

653cef9

…sment-items Extract subtitles and assessment items into the right directories

rtibbles merged commit 653cef9 into learningequality:0.16.x Feb 18, 2016

rtibbles removed the has PR label Feb 18, 2016

aronasorman mentioned this pull request Feb 18, 2016

Make retrievecontentpack faster #4863

Closed

benjaoming mentioned this pull request Feb 18, 2016

English content db in sdist #4864

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extract subtitles and assessment items into the right directories #4827

Extract subtitles and assessment items into the right directories #4827

aronasorman commented Feb 9, 2016

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

aronasorman commented Feb 12, 2016

aronasorman commented Feb 12, 2016

aronasorman commented Feb 12, 2016

benjaoming commented Feb 12, 2016

cpauya commented Feb 15, 2016

benjaoming commented Feb 15, 2016

aronasorman commented Feb 15, 2016

benjaoming commented Feb 15, 2016

aronasorman commented Feb 16, 2016

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

aronasorman commented Feb 17, 2016

MCGallaspy commented Feb 17, 2016

aronasorman commented Feb 17, 2016

This comment was marked as spam.

MCGallaspy commented Feb 18, 2016

MCGallaspy commented Feb 18, 2016

rtibbles commented Feb 18, 2016

aronasorman commented Feb 18, 2016

MCGallaspy commented Feb 18, 2016

Extract subtitles and assessment items into the right directories #4827

Extract subtitles and assessment items into the right directories #4827

Conversation

aronasorman commented Feb 9, 2016

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

aronasorman commented Feb 12, 2016

aronasorman commented Feb 12, 2016

aronasorman commented Feb 12, 2016

benjaoming commented Feb 12, 2016

cpauya commented Feb 15, 2016

benjaoming commented Feb 15, 2016

aronasorman commented Feb 15, 2016

benjaoming commented Feb 15, 2016

aronasorman commented Feb 16, 2016

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

aronasorman commented Feb 17, 2016

MCGallaspy commented Feb 17, 2016

aronasorman commented Feb 17, 2016

This comment was marked as spam.

MCGallaspy commented Feb 18, 2016

Profile

Conclusion

MCGallaspy commented Feb 18, 2016

rtibbles commented Feb 18, 2016

aronasorman commented Feb 18, 2016

MCGallaspy commented Feb 18, 2016