Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract all assessment resources at once. #4868

Merged

Conversation

aronasorman
Copy link
Collaborator

Rather than looping over them in python code then extracting them one by one.

Before: #4863 (comment)

After: #4863 (comment)

15 seconds vs. 26 seconds.

Might also solve the corrupt filenames getting served, as seen by @MCGallaspy.

@aronasorman
Copy link
Collaborator Author

@MCGallaspy can you test this branch out and see if it's fast enough for Windows?


zip_item_fileobj = zf.open(item)
shutil.move(temp_assessment_dir, assessment_dest_dir)

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

@MCGallaspy
Copy link
Contributor

Didn't profile, but it's down to about 10 mins. However I still think there's an easy improvement -- see my comment.

Rather than looping over them in python code then extracting them one by one.
@aronasorman aronasorman force-pushed the optimize-retrievecontentpack branch from d52afba to c6af3c0 Compare February 18, 2016 20:55
@aronasorman
Copy link
Collaborator Author

@MCGallaspy can you re-pull this branch and redownload en.zip? We now extract directly to the destination dir. However, said change made no noticeable improvement on my side:

*** PROFILER RESULTS ***
extract_assessment_items (/home/aron/src/github.com/aronasorman/ka-lite/kalite/distributed/management/commands/retrievecontentpack.py:171)
function called 1 times

         2990051 function calls in 15.809 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 49 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   15.809   15.809 retrievecontentpack.py:171(extract_assessment_items)
        1    0.000    0.000   15.809   15.809 retrievecontentpack.py:179(_extract_assessment_resources)
        1    0.099    0.099   15.790   15.790 zipfile.py:1030(extractall)
    29751    0.314    0.000   15.600    0.001 zipfile.py:1016(extract)
    29751    2.774    0.000   15.132    0.001 zipfile.py:1042(_extract_member)
    29751    0.272    0.000    5.861    0.000 shutil.py:46(copyfileobj)
    78358    0.460    0.000    4.663    0.000 zipfile.py:621(read)
   108081    0.991    0.000    4.114    0.000 zipfile.py:649(read1)
   108109    3.424    0.000    3.424    0.000 {method 'read' of 'file' objects}
    29751    0.562    0.000    3.128    0.000 zipfile.py:937(open)
    48607    0.926    0.000    0.926    0.000 {method 'write' of 'file' objects}
    48607    0.178    0.000    0.908    0.000 zipfile.py:639(_update_crc)
    29751    0.519    0.000    0.813    0.000 posixpath.py:336(normpath)
    29751    0.304    0.000    0.777    0.000 zipfile.py:701(close)
    78358    0.759    0.000    0.759    0.000 {zlib.crc32}
    33840    0.165    0.000    0.578    0.000 genericpath.py:15(exists)
    59502    0.261    0.000    0.455    0.000 {method 'join' of 'str' objects}
    33840    0.413    0.000    0.413    0.000 {posix.stat}
    29751    0.354    0.000    0.354    0.000 {open}
    29751    0.301    0.000    0.352    0.000 zipfile.py:515(__init__)
    29751    0.312    0.000    0.312    0.000 {method 'seek' of 'file' objects}
    29751    0.253    0.000    0.253    0.000 {method 'close' of 'file' objects}
     4089    0.023    0.000    0.240    0.000 os.py:136(makedirs)
    29751    0.162    0.000    0.235    0.000 posixpath.py:127(dirname)
   876782    0.230    0.000    0.230    0.000 {len}
    29751    0.220    0.000    0.220    0.000 {function close at 0x7f913c3135a0}
    29752    0.156    0.000    0.209    0.000 posixpath.py:68(join)
   119004    0.195    0.000    0.195    0.000 zipfile.py:1055(<genexpr>)
    59502    0.148    0.000    0.148    0.000 {method 'split' of 'str' objects}
     4089    0.139    0.000    0.139    0.000 {posix.mkdir}
    89253    0.109    0.000    0.109    0.000 {isinstance}
   357016    0.108    0.000    0.108    0.000 {method 'append' of 'list' objects}
    29751    0.070    0.000    0.095    0.000 zipfile.py:904(getinfo)
    29752    0.090    0.000    0.090    0.000 retrievecontentpack.py:180(<genexpr>)
    48607    0.079    0.000    0.079    0.000 {max}
    29751    0.069    0.000    0.069    0.000 {_struct.unpack}
    89253    0.065    0.000    0.065    0.000 {method 'startswith' of 'str' objects}
    29751    0.051    0.000    0.051    0.000 {method 'replace' of 'str' objects}
    48607    0.045    0.000    0.045    0.000 {min}
    33840    0.040    0.000    0.040    0.000 {method 'rfind' of 'str' objects}


It still takes 15 seconds for the whole process. Maybe it'll be different for Windows.

@MCGallaspy
Copy link
Contributor

Maybe it'll be different for Windows.

I don't expect it to be different because I'm on Windows, but because I have a conventional hard drive vs an SSD. Is that a bad expectation?

@MCGallaspy
Copy link
Contributor

Cumulative time before the latest change:

Thu Feb 18 12:31:18 2016    rcp_stats

         9571137 function calls (9537296 primitive calls) in 611.727 seconds

   Ordered by: cumulative time
   List reduced from 768 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  611.475  611.475 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:78(local)
        1    0.000    0.000  611.004  611.004 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:88(process_content_pack)
        1    0.000    0.000  596.894  596.894 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:178(extract_assessment_items)
        1    0.000    0.000  589.164  589.164 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:195(_extract_assessment_resources)
        1    0.000    0.000  311.428  311.428 C:\Python27\Lib\shutil.py:265(move)
   4090/1    0.296    0.000  279.189  279.189 C:\Python27\Lib\shutil.py:145(copytree)
        1    0.083    0.083  277.726  277.726 C:\Python27\Lib\zipfile.py:1030(extractall)
    29751    0.289    0.000  277.583    0.009 C:\Python27\Lib\zipfile.py:1016(extract)
    29751  255.459    0.009  277.180    0.009 C:\Python27\Lib\zipfile.py:1042(_extract_member)
    29751    0.379    0.000  269.660    0.009 C:\Python27\Lib\shutil.py:122(copy2)
    29751   69.414    0.002  262.032    0.009 C:\Python27\Lib\shutil.py:66(copyfile)
    89265  174.467    0.002  174.467    0.002 {open}
   8182/3    0.488    0.000   39.969   13.323 C:\Python27\Lib\shutil.py:210(rmtree)
    59502   30.144    0.001   30.144    0.001 {nt.remove}
    59506    0.352    0.000   24.032    0.000 C:\Python27\Lib\shutil.py:46(copyfileobj)
   102207   19.995    0.000   20.027    0.000 {method 'write' of 'file' objects}
        1    0.000    0.000   12.896   12.896 D:\ka-lite\python-packages\django\core\management\__init__.py:126(call_command)
        1    0.000    0.000   12.790   12.790 D:\ka-lite\python-packages\django\core\management\base.py:240(execute)
        1    0.000    0.000   12.777   12.777 D:\ka-lite\kalite\topic_tools\management\commands\annotate_content_items.py:33(handle)
        1    0.000    0.000   12.772   12.772 D:\ka-lite\kalite\topic_tools\content_models.py:106(wrapper)

And total time:

Thu Feb 18 12:31:18 2016    rcp_stats

         9571137 function calls (9537296 primitive calls) in 611.727 seconds

   Ordered by: internal time
   List reduced from 768 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    29751  255.459    0.009  277.180    0.009 C:\Python27\Lib\zipfile.py:1042(_extract_member)
    89265  174.467    0.002  174.467    0.002 {open}
    29751   69.414    0.002  262.032    0.009 C:\Python27\Lib\shutil.py:66(copyfile)
    59502   30.144    0.001   30.144    0.001 {nt.remove}
   102207   19.995    0.000   20.027    0.000 {method 'write' of 'file' objects}
      328   10.322    0.031   10.322    0.031 {method 'execute' of 'sqlite3.Cursor' objects}
   152845    9.730    0.000    9.730    0.000 {nt.stat}
     8184    6.720    0.001    6.720    0.001 {nt.mkdir}
    67681    4.806    0.000    4.806    0.000 {nt.lstat}
    63631    3.452    0.000    3.452    0.000 {nt._isdir}
    33841    3.018    0.000    3.018    0.000 {nt.utime}
     8182    2.924    0.000    2.924    0.000 {nt.rmdir}
   191463    2.267    0.000    2.267    0.000 {method 'read' of 'file' objects}
    33841    1.831    0.000    1.831    0.000 {nt.chmod}
   530546    1.210    0.000    1.627    0.000 C:\Users\Mike\Envs\develop-0-16\lib\ntpath.py:96(splitdrive)
        7    1.149    0.164    1.149    0.164 {method 'commit' of 'sqlite3.Connection' objects}
    12276    1.069    0.000    1.069    0.000 {nt.listdir}
    89264    0.846    0.000    1.567    0.000 C:\Users\Mike\Envs\develop-0-16\lib\ntpath.py:415(normpath)
    83346    0.813    0.000    0.813    0.000 {zlib.crc32}
   186733    0.697    0.000    1.872    0.000 C:\Users\Mike\Envs\develop-0-16\lib\ntpath.py:63(join)

Benchmarking the latest changes now.

@aronasorman
Copy link
Collaborator Author

29751 69.414 0.002 262.032 0.009 C:\Python27\Lib\shutil.py:66(copyfile)

Do you have KA Lite in a separate drive vs. your Program Files? Might explain why this takes so long.

@MCGallaspy
Copy link
Contributor

Cumulative time for latest changes:

Thu Feb 18 15:07:25 2016    rcp_stats

         5876887 function calls (5855315 primitive calls) in 729.756 seconds

   Ordered by: cumulative time
   List reduced from 728 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  729.164  729.164 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:78(local)
        1    0.000    0.000  728.849  728.849 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:88(process_content_pack)
        1    0.000    0.000  711.946  711.946 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:178(extract_assessment_items)
        1    0.000    0.000  711.946  711.946 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:185(_extract_assessment_resources)
        1    0.073    0.073  711.939  711.939 C:\Python27\Lib\zipfile.py:1030(extractall)
    29751    0.260    0.000  711.816    0.024 C:\Python27\Lib\zipfile.py:1016(extract)
    29751  698.962    0.023  711.448    0.024 C:\Python27\Lib\zipfile.py:1042(_extract_member)
        1    0.000    0.000   14.087   14.087 D:\ka-lite\python-packages\django\core\management\__init__.py:126(call_command)
        1    0.000    0.000   14.057   14.057 D:\ka-lite\python-packages\django\core\management\base.py:240(execute)
        1    0.000    0.000   14.047   14.047 D:\ka-lite\kalite\topic_tools\management\commands\annotate_content_items.py:33(handle)
        1    0.000    0.000   14.043   14.043 D:\ka-lite\kalite\topic_tools\content_models.py:106(wrapper)
        1    0.005    0.005   11.031   11.031 D:\ka-lite\kalite\topic_tools\content_models.py:605(annotate_content_models)
      325    0.001    0.000    9.180    0.028 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\peewee.py:2468(_execute)
      328    0.003    0.000    9.067    0.028 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\peewee.py:3186(execute_sql)
      328    9.059    0.028    9.059    0.028 {method 'execute' of 'sqlite3.Cursor' objects}
      253    0.002    0.000    8.881    0.035 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\peewee.py:2782(execute)
    72/12    0.004    0.000    8.115    0.676 D:\ka-lite\kalite\topic_tools\content_models.py:621(recurse_availability_up_tree)
      180    0.001    0.000    7.925    0.044 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\peewee.py:2712(aggregate)
      180    0.001    0.000    7.913    0.044 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\peewee.py:2475(scalar)
      180    0.001    0.000    7.903    0.044 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\peewee.py:2746(first)

Total time for latest changes:

Thu Feb 18 15:07:25 2016    rcp_stats

         5876887 function calls (5855315 primitive calls) in 729.756 seconds

   Ordered by: internal time
   List reduced from 728 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    29751  698.962    0.023  711.448    0.024 C:\Python27\Lib\zipfile.py:1042(_extract_member)
      328    9.059    0.028    9.059    0.028 {method 'execute' of 'sqlite3.Cursor' objects}
        7    6.041    0.863    6.041    0.863 {method 'commit' of 'sqlite3.Connection' objects}
    51322    3.194    0.000    3.194    0.000 {nt.stat}
    29763    1.761    0.000    1.761    0.000 {open}
    53598    1.215    0.000    1.215    0.000 {method 'write' of 'file' objects}
   113103    1.039    0.000    1.039    0.000 {method 'read' of 'file' objects}
    83344    0.765    0.000    0.765    0.000 {zlib.crc32}
    29756    0.528    0.000    0.528    0.000 {method 'close' of 'file' objects}
   113071    0.499    0.000    2.091    0.000 C:\Python27\Lib\zipfile.py:649(read1)
    29755    0.389    0.000    2.975    0.000 C:\Python27\Lib\zipfile.py:937(open)
   192132    0.386    0.000    0.488    0.000 C:\Users\Mike\Envs\develop-0-16\lib\ntpath.py:96(splitdrive)
    29801    0.344    0.000    0.411    0.000 C:\Users\Mike\Envs\develop-0-16\lib\ntpath.py:174(split)
    29759    0.300    0.000    0.524    0.000 C:\Users\Mike\Envs\develop-0-16\lib\ntpath.py:415(normpath)
    29755    0.292    0.000    1.017    0.000 C:\Python27\Lib\zipfile.py:701(close)
    94845    0.272    0.000    0.521    0.000 {method 'join' of 'str' objects}
    29751    0.260    0.000  711.816    0.024 C:\Python27\Lib\zipfile.py:1016(extract)
    16390    0.244    0.000    0.244    0.000 C:\Python27\Lib\json\decoder.py:372(raw_decode)
    16624    0.237    0.000    0.237    0.000 {method 'fetchone' of 'sqlite3.Cursor' objects}
    29755    0.223    0.000    0.257    0.000 C:\Python27\Lib\zipfile.py:515(__init__)

Conclusion

Still prefer this branch, even though it took longer. The difference is because of
ncalls tottime percall cumtime percall filename:lineno(function)
29751 698.962 0.023 711.448 0.024 C:\Python27\Lib\zipfile.py:1042(_extract_member)
which anomalously took 10x longer per call:
ncalls tottime percall cumtime percall filename:lineno(function)
29751 255.459 0.009 277.180 0.009 C:\Python27\Lib\zipfile.py:1042(_extract_member)

Will try one more benchmark to see if it really was an anomaly.

@MCGallaspy
Copy link
Contributor

Nope, not anomalous! The earlier fix was faster on my machine.

Thu Feb 18 15:30:05 2016    rcp_stats

         5876840 function calls (5855268 primitive calls) in 737.644 seconds

   Ordered by: cumulative time
   List reduced from 707 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  737.369  737.369 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:78(local)
        1    0.000    0.000  737.063  737.063 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:88(process_content_pack)
        1    0.000    0.000  720.553  720.553 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:178(extract_assessment_items)
        1    0.000    0.000  720.553  720.553 D:\ka-lite\kalite\distributed\management\commands\retrievecontentpack.py:185(_extract_assessment_resources)
        1    0.071    0.071  720.546  720.546 C:\Python27\Lib\zipfile.py:1030(extractall)
    29751    0.262    0.000  720.426    0.024 C:\Python27\Lib\zipfile.py:1016(extract)
    29751  707.615    0.024  720.060    0.024 C:\Python27\Lib\zipfile.py:1042(_extract_member)
        1    0.000    0.000   13.210   13.210 D:\ka-lite\python-packages\django\core\management\__init__.py:126(call_command)
        1    0.000    0.000   13.208   13.208 D:\ka-lite\python-packages\django\core\management\base.py:240(execute)
        1    0.000    0.000   13.198   13.198 D:\ka-lite\kalite\topic_tools\management\commands\annotate_content_items.py:33(handle)
        1    0.000    0.000   13.194   13.194 D:\ka-lite\kalite\topic_tools\content_models.py:106(wrapper)
        1    0.006    0.006   11.869   11.869 D:\ka-lite\kalite\topic_tools\content_models.py:605(annotate_content_models)
      325    0.001    0.000    9.809    0.030 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\peewee.py:2468(_execute)
      328    0.004    0.000    9.683    0.030 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\peewee.py:3186(execute_sql)
      328    9.674    0.029    9.674    0.029 {method 'execute' of 'sqlite3.Cursor' objects}
      253    0.002    0.000    9.583    0.038 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\peewee.py:2782(execute)
    72/12    0.005    0.000    8.732    0.728 D:\ka-lite\kalite\topic_tools\content_models.py:621(recurse_availability_up_tree)
      180    0.001    0.000    8.597    0.048 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\peewee.py:2712(aggregate)
      180    0.002    0.000    8.582    0.048 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\peewee.py:2475(scalar)
      180    0.001    0.000    8.571    0.048 C:\Users\Mike\Envs\develop-0-16\lib\site-packages\peewee.py:2746(first)

Do you have KA Lite in a separate drive vs. your Program Files? Might explain why this takes so long.

My ka-lite directory and the content pack (en.zip) are both on the same drive. My python interpreter is not -- I wouldn't expect it to make a difference if I'm not copying files from one drive to another, which as far as I can tell is not the case here. But I've been wrong before -- do you expect different behavior?

I'll happily merge this, but I'll wait for your reply.

@aronasorman
Copy link
Collaborator Author

I don't expect it to be different because I'm on Windows, but because I have a conventional hard drive vs an SSD

I mentioned this because it might be a difference in the copying algorithm. If you look at the shutil.move operation, it mentions that if the files are in the same filesystem, it does an os.rename. Otherwise, it does a shutil.copy2, and then deletes the old copy.

Anyway, copy2 is no longer present in your new benchmarks, and it's now just dominated by extractall. Same as my Linux benchmark. I think there are no longer any obvious low hanging fruit, so feel free to merge.

@aronasorman
Copy link
Collaborator Author

Btw, I tried out using czipfile, no marked improvements on my side.

MCGallaspy added a commit that referenced this pull request Feb 19, 2016
@MCGallaspy MCGallaspy merged commit 53b8d9d into learningequality:0.16.x Feb 19, 2016
@MCGallaspy MCGallaspy removed the has PR label Feb 19, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants