Reduce gensim distribution size #1783

Open
menshikh-iv opened this issue Dec 13, 2017 · 10 comments
Labels
difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature

Comments

@menshikh-iv
Contributor

Right now the gensim wheel/tar.gz is ~16MB. That is better than 50MB+, but still huge.
We need to cut the big files that are used for tests and rewrite the affected tests.

Previous issue #1698

@menshikh-iv menshikh-iv added feature Issue described a new feature difficulty medium Medium issue: required good gensim understanding & python skills labels Dec 13, 2017
@JensMadsen

It would be most convenient if we could build a minimal version of the distribution (including minimal scipy and numpy modules). I just managed to squeeze gensim onto an AWS Lambda using python3, but it was not easy :-)

@menshikh-iv
Contributor Author

@JensMadsen I think we can reduce the size of the gensim distribution from ~15MB to ~3-5MB. Can you describe what you did to shrink it?

@JensMadsen

JensMadsen commented Jan 25, 2018

Yes of course. I plan to write a blog post somewhere soon :-)

A rough outline of the procedure for squeezing gensim into AWS Lambda:

  1. virtualenv --no-site-packages
  2. strip the .so files, but not all of them, since some scipy ones break (stripping some manylinux-generated files produces broken shared objects, see pypa/manylinux#119)
  3. delete all tests in numpy, scipy, and gensim
  4. delete __pycache__, just to be sure

That way I get a sufficiently small zip file.
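
Steps 3 and 4 can be sketched in a few lines of Python. This is only a minimal illustration, not the script actually used in this thread: the function name `prune_tree` and the set `PRUNE_NAMES` are made up for the example, and it assumes nothing you need at runtime lives in a directory called `test` or `tests`.

```python
import shutil
from pathlib import Path

# Directory names treated as disposable (assumption: mirrors steps 3 and 4).
PRUNE_NAMES = {"tests", "test", "__pycache__"}


def prune_tree(root, names=PRUNE_NAMES):
    """Delete every directory under root whose name is in names.

    Returns the removed paths, relative to root, so they can be logged.
    """
    root = Path(root)
    removed = []
    # Materialise and sort first so parents are visited before children;
    # children of an already removed directory no longer exist and are skipped.
    for path in sorted(root.rglob("*")):
        if path.name in names and path.is_dir() and not path.is_symlink():
            removed.append(path.relative_to(root).as_posix())
            shutil.rmtree(path)
    return removed
```

Calling `prune_tree("path/to/site-packages")` on a copy of the environment is the safer way to try it, since some packages may import from their own test directories; it is worth importing gensim once afterwards to check.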

Actually, what matters most is reducing the size of scipy, which to my understanding has grown significantly lately.

@menshikh-iv
Contributor Author

@JensMadsen thanks for the information!
We have an old issue related to scipy - #557 (we want to develop a small tool for manipulating sparse matrices and drop scipy as a dependency).

@piskvorky
Owner

Can't wait to finally ditch scipy!

@JustinMoser

@JensMadsen Hi! Sorry to chime in, but did you ever write a blog post on getting gensim on AWS lambda? Trying to do that now, and gensim is quite...large, when creating a deployment package.

Thanks!

@menshikh-iv
Contributor Author

@JustinMoser sorry, no updates.

As an ad-hoc solution, you can manually remove the test data (gensim/test/test_data) from the wheel (a .whl is just an archive) and use the result on Lambda.
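
Since a wheel is a zip archive, one way to do this is with Python's standard `zipfile` module. The sketch below is an illustration under assumptions: `drop_from_wheel` is a hypothetical helper, not part of gensim or pip, and it leaves the wheel's RECORD file untouched, so the entries for the removed files stay listed there. Whether your installer tolerates that is something to verify.

```python
import zipfile


def drop_from_wheel(src, dst, prefix="gensim/test/test_data/"):
    """Copy the wheel at src to dst, skipping entries under prefix.

    Returns the number of entries skipped.
    """
    skipped = 0
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():
            if info.filename.startswith(prefix):
                skipped += 1
                continue
            # Passing the original ZipInfo keeps the name, timestamp and
            # compression method of each copied entry.
            zout.writestr(info, zin.read(info.filename))
    return skipped
```

Usage would look like `drop_from_wheel("gensim-3.6.0-cp27-cp27mu-manylinux1_x86_64.whl", "gensim-slim.whl")`, where the output file name is arbitrary.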

@JustinMoser

@menshikh-iv Thank you! Pardon me if I'm being dim, but when I install gensim to my deployment directory (using pip install gensim --target .), with the dependencies, it is near the 300MB mark.

@menshikh-iv
Contributor Author

@JustinMoser wow, that sounds impossible. For example, I made a clean installation on Python 2, and these are the downloaded wheels:

-rw-r--r-- 1 ivan ivan 26575351 Jan 16 20:47 scipy-1.2.0-cp27-cp27mu-manylinux1_x86_64.whl
-rw-r--r-- 1 ivan ivan 23630838 Jan 16 20:47 gensim-3.6.0-cp27-cp27mu-manylinux1_x86_64.whl
-rw-r--r-- 1 ivan ivan 16961961 Jan 16 20:47 numpy-1.16.0-cp27-cp27mu-manylinux1_x86_64.whl
-rw-r--r-- 1 ivan ivan  5213503 Jan 16 20:47 botocore-1.12.79-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan  1359202 Jan 16 20:47 boto-2.49.0-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan   543728 Jan 16 20:47 docutils-0.14-py2-none-any.whl
-rw-r--r-- 1 ivan ivan   225696 Jan 16 20:47 python_dateutil-2.7.5-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan   154154 Jan 16 20:47 certifi-2018.11.29-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan   133356 Jan 16 20:47 chardet-3.0.4-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan   128504 Jan 16 20:47 boto3-1.9.79-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan   118086 Jan 16 20:47 urllib3-1.24.1-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan    59642 Jan 16 20:47 s3transfer-0.1.13-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan    58594 Jan 16 20:47 idna-2.8-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan    57987 Jan 16 20:47 requests-2.21.0-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan    23497 Jan 16 20:47 jmespath-0.9.3-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan    15847 Jan 16 20:47 futures-3.2.0-py2-none-any.whl
-rw-r--r-- 1 ivan ivan    10586 Jan 16 20:47 six-1.12.0-py2.py3-none-any.whl

gensim with all deps takes around 72MB, so where does the 300MB come from? Can you please check what exactly was downloaded?
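
As a quick check of that figure, the byte sizes from the listing above can simply be summed. The dictionary keys are shortened package names chosen for readability; the numbers are copied from the listing.

```python
# Wheel sizes in bytes, copied from the ls listing above.
wheel_sizes = {
    "scipy": 26575351,
    "gensim": 23630838,
    "numpy": 16961961,
    "botocore": 5213503,
    "boto": 1359202,
    "docutils": 543728,
    "python_dateutil": 225696,
    "certifi": 154154,
    "chardet": 133356,
    "boto3": 128504,
    "urllib3": 118086,
    "s3transfer": 59642,
    "idna": 58594,
    "requests": 57987,
    "jmespath": 23497,
    "futures": 15847,
    "six": 10586,
}

total_bytes = sum(wheel_sizes.values())
total_mib = total_bytes / 2**20  # 1 MiB = 1,048,576 bytes

print(f"{total_bytes} bytes = {total_mib:.1f} MiB")
# prints "75270532 bytes = 71.8 MiB"
```

scipy, gensim and numpy alone account for most of that total.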

If you are talking about the installed size, then numpy & scipy are still the top 2 (more than 150MB together):

285756  bbbbbb/lib/python2.7
285368  bbbbbb/lib/python2.7/site-packages
98708   bbbbbb/lib/python2.7/site-packages/scipy
70672   bbbbbb/lib/python2.7/site-packages/numpy
41116   bbbbbb/lib/python2.7/site-packages/gensim
39532   bbbbbb/lib/python2.7/site-packages/scipy/.libs
39516   bbbbbb/lib/python2.7/site-packages/botocore
34944   bbbbbb/lib/python2.7/site-packages/botocore/data
32152   bbbbbb/lib/python2.7/site-packages/gensim/test
30704   bbbbbb/lib/python2.7/site-packages/gensim/test/test_data
30072   bbbbbb/lib/python2.7/site-packages/numpy/.libs
25736   bbbbbb/lib/python2.7/site-packages/numpy/core
11056   bbbbbb/lib/python2.7/site-packages/boto
9136    bbbbbb/lib/python2.7/site-packages/scipy/special

Unfortunately, I can't help with that.

@JensMadsen

@JustinMoser I dropped lambdas, too much hassle. I'm running a service in a Kubernetes cluster instead :-) This is the content of my Dockerfile from back then:

# Use an official Python runtime as a parent image
FROM amazonlinux:1

# install python 36
RUN yum -y install python36 python36-pip python36-setuptools python36-virtualenv

# install requirements for gensim
RUN yum -y install git
RUN yum -y install zip
RUN yum -y install gcc
RUN yum -y install gcc-gfortran 
RUN yum -y install gcc-c++ 
RUN yum -y install blas-devel 
RUN yum -y install lapack-devel 
RUN yum -y install atlas-devel

# create virtual env for lambda function
RUN python3 -m virtualenv d2v_env --no-site-packages --always-copy
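# note: this activation does not persist into later RUN steps (each RUN starts a new shell), so the install step below activates again
RUN source d2v_env/bin/activate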

# copy python files into docker
RUN mkdir d2v_infer
ADD *.py d2v_infer/
ADD requirements.txt .

# install gensim 
#RUN source d2v_env/bin/activate && pip install --use-wheel gensim
RUN source d2v_env/bin/activate && pip install -r requirements.txt

# strip to save space. The exclusions are necessary because stripping some manylinux-built numpy and scipy files breaks them https://github.com/pypa/manylinux/issues/119
RUN cd d2v_env/lib64/python3.6/site-packages/ && find . -name "*.so" | grep -v ufuncs | grep -v fblas | grep -v flapack | grep -v cython_blas | grep -v cython_lapack | grep -v ellip_harm | grep -v odepack | grep -v quadpack | grep -v vode | grep -v lsoda | grep -v iterative | grep -v superlu | grep -v arpack | grep -v trlib | grep -v lbfgs | grep -v qhull | xargs strip

# get lib files
RUN mkdir d2v_infer/lib
RUN find /usr/lib64 -name "libblas.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libgfortran.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "liblapack.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libopenblas.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libquadmath.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libf77blas.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libcblas.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libatlas.*" -exec cp -P {} d2v_infer/lib/ \;

# Copy dependencies 
RUN cp -r d2v_env/lib/python3.6/site-packages/six* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/bz2file* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/boto* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/idna* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/chardet* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/urllib3* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/certifi* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/requests* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/python_dateutil-2.7.3.dist-info/ /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/docutils* /d2v_infer/.
RUN cp -r d2v_env/lib/python3.6/site-packages/jmespath /d2v_infer/.
RUN cp -r d2v_env/lib/python3.6/site-packages/s3transfer* /d2v_infer/.
RUN cp -r d2v_env/lib/python3.6/site-packages/smart_open* /d2v_infer/.
RUN cp -r /d2v_env/lib/python3.6/site-packages/dateutil/ /d2v_infer/.
RUN cp -r d2v_env/lib64/python3.6/site-packages/* /d2v_infer/.

# delete __pycache__ if exists
RUN cd d2v_infer && find . -type d -name __pycache__ -exec rm -r {} \+

# Delete tests to reduce size
RUN cd d2v_infer && find . -type d -name tests -exec rm -r {} \+
RUN cd d2v_infer && find . -type d -name test -exec rm -r {} \+

# zip it up 
RUN cd /d2v_infer && zip -r -q /d2v.zip ./*

# aws s3 cp gensim_dist.zip s3://onlaw-d2v/deployment_packages.zip
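
For anyone who prefers a script over the long `grep -v` chain in the strip step, the selection logic could be written roughly like this in Python. It is a sketch, not a tested replacement for the Dockerfile: `strippable` is a hypothetical helper, the exclusion substrings are copied from the Dockerfile above, and like `grep -v` it matches them anywhere in the relative path.

```python
from pathlib import Path

# Substrings copied from the `grep -v` chain in the strip step above.
KEEP_UNSTRIPPED = (
    "ufuncs", "fblas", "flapack", "cython_blas", "cython_lapack",
    "ellip_harm", "odepack", "quadpack", "vode", "lsoda", "iterative",
    "superlu", "arpack", "trlib", "lbfgs", "qhull",
)


def strippable(root, keep=KEEP_UNSTRIPPED):
    """List *.so files under root whose relative path contains none of keep."""
    root = Path(root)
    result = []
    for path in sorted(root.rglob("*.so")):
        rel = path.relative_to(root).as_posix()
        # Same effect as piping the path through `grep -v token` for each token.
        if not any(token in rel for token in keep):
            result.append(rel)
    return result
```

The returned paths could then be handed to the `strip` binary, for example with `subprocess.run(["strip", *paths], cwd=root, check=True)`, assuming `strip` is installed in the image.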

4 participants