Merge remote-tracking branch 'upstream/develop' into online_nmf

piskvorky · Aug 14, 2018 · f71ad89
2 parents bbd3099 + 27c524d
commit f71ad89

Showing 79 changed files with 19,155 additions and 13,495 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
diff --git a/docs/notebooks/doc2vec-IMDB.ipynb b/docs/notebooks/doc2vec-IMDB.ipynb
diff --git a/docs/notebooks/doc2vec-lee.ipynb b/docs/notebooks/doc2vec-lee.ipynb
diff --git a/docs/src/Makefile b/docs/src/Makefile
@@ -40,7 +40,7 @@ html:
 	@echo
 	@echo "Build finished. The HTML pages are in ../"
 
-upload: 
+upload:
 	scp -r _build/html/* rr:public_html/gensim/
 
 dirhtml:

diff --git a/docs/src/_index.rst.unused b/docs/src/_index.rst.unused
@@ -0,0 +1,100 @@
+
+:github_url: https://github.com/RaRe-Technologies/gensim
+
+Gensim documentation
+===================================
+
+============
+Introduction
+============
+
+Gensim is a free Python library designed to automatically extract semantic
+topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.
+
+Gensim is designed to process raw, unstructured digital texts ("plain text").
+
+The algorithms in Gensim, such as **Word2Vec**, **FastText**, **Latent Semantic Analysis**, **Latent Dirichlet Allocation** and **Random Projections**, discover semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are **unsupervised**, which means no human input is necessary -- you only need a corpus of plain text documents.
+
+Once these statistical patterns are found, any plain text documents can be succinctly
+expressed in the new, semantic representation and queried for topical similarity
+against other documents, words or phrases.
+
+.. note::
+   If the previous paragraphs left you confused, you can read more about the `Vector
+   Space Model <http://en.wikipedia.org/wiki/Vector_space_model>`_ and `unsupervised
+   document analysis <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_ on Wikipedia.
+
+
+.. _design:
+
+Features
+--------
+
+* **Memory independence** -- there is no need for the whole training corpus to
+  reside fully in RAM at any one time (can process large, web-scale corpora).
+* **Memory sharing** -- trained models can be persisted to disk and loaded back via mmap. Multiple processes can share the same data, cutting down RAM footprint.
+* Efficient implementations for several popular vector space algorithms,
+  including Word2Vec, Doc2Vec, FastText, TF-IDF, Latent Semantic Analysis (LSI, LSA),
+  Latent Dirichlet Allocation (LDA) or Random Projection.
+* I/O wrappers and readers from several popular data formats.
+* Fast similarity queries for documents in their semantic representation.
+
+The **principal design objectives** behind Gensim are:
+
+1. Straightforward interfaces and low API learning curve for developers. Good for prototyping.
+2. Memory independence with respect to the size of the input corpus; all intermediate
+   steps and algorithms operate in a streaming fashion, accessing one document
+   at a time.
+
+.. seealso::
+
+    We built a high performance server for NLP, document analysis, indexing, search and clustering: https://scaletext.ai.
+    ScaleText is a commercial product, available both on-prem and as SaaS.
+    Reach out at [email protected] if you need an industry-grade tool with professional support.
+
+.. _availability:
+
+Availability
+------------
+
+Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license <http://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html>`_ and can be downloaded either from its `github repository <https://github.com/piskvorky/gensim/>`_ or from the `Python Package Index <http://pypi.python.org/pypi/gensim>`_.
+
+.. seealso::
+
+    See the :doc:`install <install>` page for more info on Gensim deployment.
+
+
+.. toctree::
+   :glob:
+   :maxdepth: 1
+   :caption: Getting started
+
+   install
+   intro
+   support
+   about
+   license
+   citing
+
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Tutorials
+
+   tutorial
+   tut1
+   tut2
+   tut3
+
+
+.. toctree::
+   :maxdepth: 1
+   :caption: API Reference
+
+   apiref
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
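Review note: the "memory independence" objective stated in the added `_index.rst.unused` (all steps operate in a streaming fashion, one document at a time) can be illustrated with a plain-Python sketch. The class name `StreamedCorpus`, the helper `write_demo_corpus`, and the one-document-per-line file layout are assumptions made for illustration; none of them is Gensim API.

```python
import os
import tempfile


class StreamedCorpus:
    """Hypothetical helper: yield one tokenized document per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly;
        # only the current line is held in memory at any one time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()


def write_demo_corpus():
    """Write a tiny two-document corpus to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write("Human machine interface\nGraph of trees\n")
    return path


demo_path = write_demo_corpus()
corpus = StreamedCorpus(demo_path)
documents = list(corpus)  # materialized only because the demo corpus is tiny
os.remove(demo_path)
```

Because `__iter__` restarts from the beginning of the file, the same object can be passed to an algorithm that needs several passes over the data without ever loading the full corpus.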
diff --git a/docs/src/_license.rst.unused b/docs/src/_license.rst.unused
@@ -0,0 +1,26 @@
+:orphan:
+
+.. _license:
+
+Licensing
+---------
+
+Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license <http://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html>`_.
+
+This means that it's free for both personal and commercial use, but if you make any
+modification to Gensim that you distribute to other people, you have to disclose
+the source code of these modifications.
+
+Apart from that, you are free to redistribute Gensim in any way you like, though you're
+not allowed to modify its license (doh!).
+
+My intent here is to **get more help and community involvement** with the development of Gensim.
+The legalese is therefore less important to me than your input and contributions.
+
+`Contact me <mailto:[email protected]>`_ if LGPL doesn't fit your bill but you'd like the LGPL restrictions lifted.
+
+.. seealso::
+
+    We built a high performance server for NLP, document analysis, indexing, search and clustering: https://scaletext.ai.
+    ScaleText is a commercial product, available both on-prem and as SaaS.
+    Reach out at [email protected] if you need an industry-grade tool with professional support.
diff --git a/docs/src/about.rst b/docs/src/about.rst
@@ -2,72 +2,72 @@
 
 .. _about:
 
-============
+=====
 About
-============
+=====
 
 History
---------
+-------
 
 Gensim started off as a collection of various Python scripts for the Czech Digital Mathematics Library `dml.cz <http://dml.cz/>`_ in 2008,
 where it served to generate a short list of the most similar articles to a given article (**gensim = "generate similar"**).
 I also wanted to try these fancy "Latent Semantic Methods", but the libraries that
 realized the necessary computation were `not much fun to work with <http://soi.stanford.edu/~rmunk/PROPACK/>`_.
 
 Naturally, I set out to reinvent the wheel. Our `2010 LREC publication <http://radimrehurek.com/gensim/lrec2010_final.pdf>`_
-describes the initial design decisions behind gensim (clarity, efficiency and scalability)
-and is fairly representative of how gensim works even today.
+describes the initial design decisions behind Gensim: clarity, efficiency and scalability. It is fairly representative of how Gensim works even today.
 
 Later versions of gensim improved this efficiency and scalability tremendously. In fact,
 I made algorithmic scalability of distributional semantics the topic of my `PhD thesis <http://radimrehurek.com/phd_rehurek.pdf>`_.
 
-By now, gensim is---to my knowledge---the most robust, efficient and hassle-free piece
+By now, Gensim is---to my knowledge---the most robust, efficient and hassle-free piece
 of software to realize unsupervised semantic modelling from plain text. It stands
 in contrast to brittle homework-assignment-implementations that do not scale on one hand,
 and robust java-esque projects that take forever just to run "hello world".
 
 In 2011, I started using `Github <https://github.com/piskvorky/gensim>`_ for source code hosting
-and the gensim website moved to its present domain. In 2013, gensim got its current logo and website design.
+and the Gensim website moved to its present domain. In 2013, Gensim got its current logo and website design.
 
 
 Licensing
 ----------
 
 Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license <http://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html>`_.
 This means that it's free for both personal and commercial use, but if you make any
-modification to gensim that you distribute to other people, you have to disclose
+modification to Gensim that you distribute to other people, you have to disclose
 the source code of these modifications.
 
-Apart from that, you are free to redistribute gensim in any way you like, though you're
+Apart from that, you are free to redistribute Gensim in any way you like, though you're
 not allowed to modify its license (doh!).
 
-My intent here is, of course, to **get more help and community involvement** with the development of gensim.
+My intent here is to **get more help and community involvement** with the development of Gensim.
 The legalese is therefore less important to me than your input and contributions.
-Contact me if LGPL doesn't fit your bill but you'd still like to use gensim -- we'll work something out.
+
+`Contact me <mailto:[email protected]>`_ if LGPL doesn't fit your bill and you'd like the open source restrictions lifted.
 
 .. seealso::
 
-    I also host a document similarity package `gensim.simserver`. This is a high-level
-    interface to `gensim` functionality, and offers transactional remote (web-based)
-    document similarity queries and indexing. It uses gensim to do the heavy lifting:
-    you don't need the `simserver` to use gensim, but you do need gensim to use the `simserver`.
-    Note that unlike gensim, `gensim.simserver` is licensed under `Affero GPL <http://www.gnu.org/licenses/agpl-3.0.html>`_,
-    which is much more restrictive for inclusion in commercial projects.
+    We also built a high performance commercial server for NLP, document analysis, indexing, search and clustering: https://scaletext.ai. ScaleText is available both on-prem and as SaaS.
+
+    Reach out at [email protected] if you need an industry-grade NLP tool with professional support.
+
 
 Contributors
---------------
+------------
 
-Credit goes to all the people who contributed to gensim, be it in `discussions <http://groups.google.com/group/gensim>`_,
+Credit goes to the many people who contributed to Gensim, be it in `discussions <http://groups.google.com/group/gensim>`_,
 ideas, `code contributions <https://github.com/piskvorky/gensim/pulls>`_ or `bug reports <https://github.com/piskvorky/gensim/issues>`_.
+
 It's really useful and motivating to get feedback, in any shape or form, so big thanks to you all!
 
 Some honorable mentions are included in the `CHANGELOG.txt <https://github.com/piskvorky/gensim/blob/develop/CHANGELOG.md>`_.
 
 Academic citing
-----------------
+---------------
 
-Gensim has been used in `many students' final theses as well as research papers <https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:NaGl4SEjCO4C>`_. When citing gensim,
-please use `this BibTeX entry <bibtex_gensim.bib>`_::
+Gensim has been used in `over a thousand research papers and student theses <https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:NaGl4SEjCO4C>`_.
+
+When citing Gensim, please use `this BibTeX entry <bibtex_gensim.bib>`_::
 
   @inproceedings{rehurek_lrec,
         title = {{Software Framework for Topic Modelling with Large Corpora}},
@@ -83,5 +83,3 @@ please use `this BibTeX entry <bibtex_gensim.bib>`_::
         note={\url{http://is.muni.cz/publication/884893/en}},
         language={English}
   }
-
-
diff --git a/docs/src/apiref.rst b/docs/src/apiref.rst
@@ -53,6 +53,9 @@ Modules:
     models/callbacks
     models/utils_any2vec
     models/_utils_any2vec
+    models/word2vec_inner
+    models/doc2vec_inner
+    models/fasttext_inner
     models/wrappers/ldamallet
     models/wrappers/dtmmodel
     models/wrappers/ldavowpalwabbit.rst
@@ -64,6 +67,7 @@ Modules:
     models/deprecated/word2vec
     models/deprecated/keyedvectors
     models/deprecated/fasttext_wrapper
+    models/base_any2vec
     similarities/docsim
     similarities/index
     sklearn_api/atmodel

diff --git a/docs/src/conf.py b/docs/src/conf.py
@@ -28,6 +28,8 @@
 extensions = ['sphinx.ext.autodoc', 'sphinxcontrib.napoleon', 'sphinx.ext.imgmath', 'sphinxcontrib.programoutput']
 autoclass_content = "both"
 
+napoleon_google_docstring = False  # Disable support for google-style docstring
+
 # Add any paths that contain templates here, relative to this directory.
 templates_path = ['_templates']
 
@@ -53,9 +55,9 @@
 # built documents.
 #
 # The short X.Y version.
-version = '3.4'
+version = '3.5'
 # The full version, including alpha/beta/rc tags.
-release = '3.4.0'
+release = '3.5.0'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.
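
Review note: setting `napoleon_google_docstring = False` turns off Napoleon's Google-style parsing, which leaves NumPy-style sections as the format it still recognizes (its `napoleon_numpy_docstring` option defaults to `True`). Below is a minimal sketch of a docstring in that style. The function `scale` is hypothetical and not part of Gensim.

```python
def scale(values, factor=2.0):
    """Multiply every value by a constant factor.

    Parameters
    ----------
    values : list of float
        Numbers to scale.
    factor : float, optional
        Multiplier applied to each value.

    Returns
    -------
    list of float
        The scaled numbers, in the original order.

    """
    return [value * factor for value in values]
```

Section headers underlined with dashes (`Parameters`, `Returns`) are what Napoleon converts into reStructuredText field lists at build time.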

diff --git a/docs/src/distributed.rst b/docs/src/distributed.rst
@@ -1,7 +1,7 @@
 .. _distributed:
 
 Distributed Computing
-===================================
+=====================
 
 Why distributed computing?
 ---------------------------
@@ -37,20 +37,20 @@ Prerequisites
 
 For communication between nodes, `gensim` uses `Pyro (PYthon Remote Objects)
 <http://pypi.python.org/pypi/Pyro4>`_, version >= 4.27. This is a library for low-level socket communication
-and remote procedure calls (RPC) in Python. `Pyro` is a pure-Python library, so its
+and remote procedure calls (RPC) in Python. `Pyro4` is a pure-Python library, so its
 installation is quite painless and only involves copying its `*.py` files somewhere onto your Python's import path::
 
-  sudo easy_install Pyro4
+  pip install Pyro4
 
-You don't have to install `Pyro` to run `gensim`, but if you don't, you won't be able
+You don't have to install Pyro to run Gensim, but if you don't, you won't be able
 to access the distributed features (i.e., everything will always run in serial mode,
 the examples on this page don't apply).
 
 
 Core concepts
------------------------------------
+-------------
 
-As always, `gensim` strives for a clear and straightforward API (see :ref:`design`).
+As always, Gensim strives for a clear and straightforward API (see :ref:`design`).
 To this end, *you do not need to make any changes in your code at all* in order to
 run it over a cluster of computers!
 

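Review note: the updated text says Gensim runs without Pyro, but then everything stays in serial mode. A small pre-flight check, sketched below, can tell a deployment script which case applies. The function name `pyro4_available` is an assumption for illustration, and this is not how Gensim itself detects Pyro4.

```python
import importlib.util


def pyro4_available():
    """Return True if the Pyro4 package can be imported in this environment."""
    # find_spec returns None for a top-level package that is not installed.
    return importlib.util.find_spec("Pyro4") is not None


mode = "distributed features usable" if pyro4_available() else "serial mode only"
print(mode)
```

If the check reports serial mode, `pip install Pyro4` (as in the changed line above) is the documented way to enable the distributed features.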
diff --git a/docs/src/gensim_theme/layout.html b/docs/src/gensim_theme/layout.html
@@ -174,7 +174,7 @@ <h3>Get Expert Help From The Gensim Authors</h3>
 
             <div class="tweetodsazeni">
               <div class="tweet">
-                <a href="https://twitter.com/radimrehurek" target="_blank" style="color: white">Tweet @RadimRehurek</a>
+                <a href="https://twitter.com/gensim_py" target="_blank" style="color: white">Tweet @Gensim_py</a>
               </div>
             </div>