Commit

clean up docs structure
piskvorky committed Sep 28, 2020
1 parent 782f7ff commit 00dc57f
Showing 13 changed files with 120 additions and 174 deletions.
82 changes: 0 additions & 82 deletions docs/src/about.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/src/conf.py
@@ -54,7 +54,7 @@

# General information about the project.
project = u'gensim'
copyright = u'2009-now Radim Řehůřek, https://radimrehurek.com.'
copyright = u'2009-now'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
2 changes: 2 additions & 0 deletions docs/src/distributed.rst
@@ -1,3 +1,5 @@
:orphan:

.. _distributed:

Distributed Computing
4 changes: 1 addition & 3 deletions docs/src/indextoc.rst
@@ -3,8 +3,6 @@
:maxdepth: 1

intro
distributed
auto_examples/index
support
wiki
apiref
support
119 changes: 85 additions & 34 deletions docs/src/intro.rst
@@ -1,17 +1,22 @@
.. _intro:

============
Introduction
============
===============
What is Gensim?
===============

Gensim is a :ref:`free <availability>` Python library designed to automatically extract semantic
topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.
Gensim is a free open-source Python library for representing
documents as semantic vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim is designed to process raw, unstructured digital texts ("*plain text*").
.. image:: _static/images/gensim_logo_positive_complete_tb.png
:width: 600
:alt: Gensim logo

Gensim is designed to process raw, unstructured digital texts ("*plain text*") using unsupervised machine learning algorithms.

The algorithms in Gensim, such as :class:`~gensim.models.word2vec.Word2Vec`, :class:`~gensim.models.fasttext.FastText`,
Latent Semantic Analysis (LSI, LSA, see :class:`~gensim.models.lsimodel.LsiModel`), Latent Dirichlet
Allocation (LDA, see :class:`~gensim.models.ldamodel.LdaModel`) etc, automatically discover the semantic structure of documents by examining statistical
Latent Semantic Indexing (LSI, LSA, :class:`~gensim.models.lsimodel.LsiModel`), Latent Dirichlet
Allocation (LDA, :class:`~gensim.models.ldamodel.LdaModel`) etc, automatically discover the semantic
structure of documents by examining statistical
co-occurrence patterns within a corpus of training documents. These algorithms are **unsupervised**,
which means no human input is necessary -- you only need a corpus of plain text documents.
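
To make "statistical co-occurrence patterns" concrete, here is a minimal sketch that counts how often two words appear in the same document. It is only an illustration of the raw signal, not how Gensim's models are implemented; the function name ``cooccurrence_counts`` and the toy corpus are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how often each unordered word pair shares a document.

    Hypothetical helper for illustration only; not part of Gensim.
    """
    counts = Counter()
    for tokens in documents:
        # Use distinct words, sorted, so each pair is counted once per
        # document and always appears in the same (alphabetical) order.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts


# A toy corpus of already-tokenized plain text documents.
corpus = [
    "human machine interface".split(),
    "machine learning interface".split(),
    "graph of trees".split(),
]

counts = cooccurrence_counts(corpus)
print(counts[("interface", "machine")])  # these two words share two documents
```

No labels or annotations are involved: the counts come from the plain text alone, which is what "unsupervised" refers to here.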

@@ -24,42 +29,88 @@ Once these statistical patterns are found, any plain text documents (sentence, p

.. _design:

Features
--------
Design principles
-----------------

We built Gensim from scratch for:

* **Practicality** -- as industry experts, we focus on proven, battle-hardened algorithms to solve real industry problems. More focus on engineering, less on academia.
* **Memory independence** -- there is no need for the whole training corpus to
reside fully in RAM at any one time (can process large, web-scale corpora).
* **Memory sharing** -- trained models can be persisted to disk and loaded back via `mmap <https://en.wikipedia.org/wiki/Mmap>`_. Multiple processes can share the same data, cutting down RAM footprint.
* Efficient implementations for several popular vector space algorithms,
including :class:`~gensim.models.word2vec.Word2Vec`, :class:`~gensim.models.doc2vec.Doc2Vec`, :class:`~gensim.models.fasttext.FastText`,
TF-IDF, Latent Semantic Analysis (LSI, LSA, see :class:`~gensim.models.lsimodel.LsiModel`),
Latent Dirichlet Allocation (LDA, see :class:`~gensim.models.ldamodel.LdaModel`) or Random Projection (see :class:`~gensim.models.rpmodel.RpModel`).
* I/O wrappers and readers from several popular data formats.
* Fast similarity queries for documents in their semantic representation.
reside fully in RAM at any one time. Can process large, web-scale corpora using data streaming.
* **Performance** -- highly optimized implementations of popular vector space algorithms using C, BLAS and memory-mapping.
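
The memory-independence idea can be sketched with a plain Python iterable that yields one tokenized document at a time, so the full corpus never has to sit in RAM. The class name ``StreamedCorpus`` and the in-memory ``RAW_TEXT`` sample are assumptions made for this example; in practice the source would more likely be a large file on disk, opened with something like ``open("corpus.txt")``.

```python
import io


class StreamedCorpus:
    """Yield one tokenized document per line, reading lazily.

    Hypothetical class for illustration; `open_source` is any callable
    that returns a fresh file-like object each time it is called.
    """

    def __init__(self, open_source):
        self.open_source = open_source

    def __iter__(self):
        # Re-open the source on every pass so the corpus can be iterated
        # several times (training algorithms often need multiple passes).
        with self.open_source() as stream:
            for line in stream:
                # Only the current line is held in memory.
                yield line.lower().split()


# Stand-in for a large text file with one document per line.
RAW_TEXT = "Human machine interface\nGraph of trees\n"

corpus = StreamedCorpus(lambda: io.StringIO(RAW_TEXT))
doc_lengths = [len(doc) for doc in corpus]
print(doc_lengths)
```

Because ``__iter__`` opens a new stream each time, the same object can be passed to code that iterates over the corpus more than once.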

The **principal design objectives** behind Gensim are:

1. Straightforward interfaces and low API learning curve for developers. Good for prototyping.
2. Memory independence with respect to the size of the input corpus; all intermediate
steps and algorithms operate in a streaming fashion, accessing one document
at a time.
Installation
------------

.. seealso::
Gensim is a Python library, so you need `Python <https://www.python.org/downloads/>`_. Gensim supports all Python versions that haven't reached their `end-of-life <https://devguide.python.org/#status-of-python-branches>`_.

We also built a high performance commercial server for NLP, document analysis, indexing, search and clustering: https://scaletext.ai. ScaleText is available both on-prem and as SaaS.
If you need to use an older Python (such as Python 2.7), you must install an older version of Gensim (such as `Gensim 3.8.3 <https://github.com/RaRe-Technologies/gensim/releases/tag/3.8.3>`_).

Reach out at [email protected] if you need an industry-grade NLP tool with professional support.
To install gensim, simply run::

.. _availability:
pip install --upgrade gensim

Availability
------------
Alternatively, you can download the source code from `Github <https://github.com/RARE-Technologies/gensim/>`__
or the `Python Package Index <http://pypi.python.org/pypi/gensim>`_.
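
One way to check whether the installation succeeded is to ask Python's packaging metadata for the installed version. This is a minimal sketch using only the standard library (``importlib.metadata``, available from Python 3.8); the helper name ``installed_version`` is an assumption made for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if it is not installed.

    Hypothetical helper for illustration only.
    """
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("gensim")
print(version if version else "gensim is not installed")
```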

After installation, learn how to use Gensim from its :ref:`sphx_glr_auto_examples_core_run_core_concepts.py` tutorials.


.. _Licensing:

Licensing
----------

Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license <http://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html>`_.
This means that it's free for both personal and commercial use, but if you make any
modification to Gensim that you distribute to other people, you have to disclose
the source code of these modifications.

Apart from that, you are free to redistribute Gensim in any way you like, though you're
not allowed to modify its license (doh!).

If LGPL doesn't fit your bill, you can ask for :ref:`Commercial support`.

.. _Academic citing:

Academic citing
---------------

Gensim has been used in `over two thousand research papers and student theses <https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:NaGl4SEjCO4C>`_.

When citing Gensim, please use `this BibTeX entry <bibtex_gensim.bib>`_::

@inproceedings{rehurek_lrec,
title = {{Software Framework for Topic Modelling with Large Corpora}},
author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
booktitle = {{Proceedings of the LREC 2010 Workshop on New
Challenges for NLP Frameworks}},
pages = {45--50},
year = 2010,
month = May,
day = 22,
publisher = {ELRA},
address = {Valletta, Malta},
note={\url{http://is.muni.cz/publication/884893/en}},
language={English}
}

Gensim = "Generate Similar"
---------------------------

Historically, Gensim started off as a collection of Python scripts for the Czech Digital Mathematics Library `dml.cz <http://dml.cz/>`_ project, back in 2008. The scripts served to generate a short list of the most similar math articles to a given article.

I (Radim) also wanted to try these fancy "Latent Semantic Methods", but the libraries that realized the necessary computation were `not much fun to work with <http://soi.stanford.edu/~rmunk/PROPACK/>`_.

Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license <http://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html>`_ and can be downloaded either from its `Github repository <https://github.com/RARE-Technologies/gensim/>`_
or from the `Python Package Index <http://pypi.python.org/pypi/gensim>`_.
Naturally, I set out to reinvent the wheel. Our `2010 LREC publication <http://radimrehurek.com/lrec2010_final.pdf>`_ describes the initial design decisions behind Gensim: **clarity, efficiency and scalability**. It is fairly representative of how Gensim works even today.

Later versions of Gensim improved this efficiency and scalability tremendously. In fact, I made algorithmic scalability of distributional semantics the topic of my `PhD thesis <http://radimrehurek.com/phd_rehurek.pdf>`_.

Core concepts
-------------
By now, Gensim is---to my knowledge---the most robust, efficient and hassle-free piece
of software to realize unsupervised semantic modelling from plain text. It stands
in contrast to brittle homework-assignment-implementations that do not scale on one hand,
and robust java-esque projects that take forever just to run "hello world".

See the :ref:`sphx_glr_auto_examples_core_run_core_concepts.py` tutorial.
In 2011, I moved Gensim's source code to `Github <https://github.com/piskvorky/gensim>`__
and created the Gensim website. In 2013 Gensim got its current logo, and in 2020 a website redesign.
1 change: 1 addition & 0 deletions docs/src/sphinx_rtd_theme/advertisement.html
@@ -2,4 +2,5 @@
<h3>Get Expert Help From The Gensim Authors</h3>
<p><a href="https://rare-technologies.com/">Consulting</a> in Machine Learning &amp; NLP</p>
<p><a href="https://rare-technologies.com/corporate-training/">Corporate trainings</a> in Data Science, NLP and Deep Learning</p>
<p><a href="https://pii-tools.com">PII Tools</a> automated discovery of personal and sensitive data</p>
</div>
9 changes: 2 additions & 7 deletions docs/src/sphinx_rtd_theme/layout.html
@@ -234,10 +234,10 @@
{%- if hasdoc('copyright') %}
{% set path = pathto('copyright') %}
{% set copyright = copyright|e %}
&copy; <a href="{{ path }}">{% trans %}Copyright{% endtrans %}</a> {{ copyright }}
&copy; <a href="{{ path }}">{% trans %}Copyright{% endtrans %}</a> {{ copyright }}, <a href="https://radimrehurek.com">Radim Řehůřek</a>.
{%- else %}
{% set copyright = copyright|e %}
&copy; {% trans %}Copyright{% endtrans %} {{ copyright }}
&copy; {% trans %}Copyright{% endtrans %} {{ copyright }}, <a href="https://radimrehurek.com">Radim Řehůřek</a>.
{%- endif %}
{%- endif %}

@@ -257,11 +257,6 @@
{% trans last_updated=last_updated|e %}Last updated on {{ last_updated }}.{% endtrans %}
</span>
{%- endif %}
<br>
<a target="_blank" rel="nofollow" href="https://radimrehurek.com/"> Radim Řehůřek – Machine learing and data
mining expert</a>
<br>
<a target="_blank" rel="nofollow" href="https://edgy.digital/"> Created by edgy.digital</a>
</div>
<nav id="social-menu" class="menu-footer-container">
</nav>
6 changes: 3 additions & 3 deletions docs/src/sphinx_rtd_theme/layouthome.html
@@ -100,7 +100,7 @@ <h5 class="themecolor">Do you have a question?</h5>
<h3>Gensim Support</h3>
<hr class="no_line" style="margin:0 auto 30px">
<!--<h4 style="display: inline-block; font-weight: 300; margin-right: 10px; color: #000;"><i class="icon-mobile themecolor" style></i> +61 (0) 383 766 284</h4>-->
<h4 style="display: inline-block; font-weight: 300; color: #000;">See the <a href="{{ pathto('support') }}">Gensim support page</a> for how to get open source and commercial support.</h4>
<h4 style="display: inline-block; font-weight: 300; color: #000;">See the <a href="{{ pathto('support') }}">Gensim support page</a> for how to ask for open source and commercial support.</h4>
</div>
</aside>
</div>
@@ -123,10 +123,10 @@ <h4 style="display: inline-block; font-weight: 300; color: #000;">See the <a hre
{%- if hasdoc('copyright') %}
{% set path = pathto('copyright') %}
{% set copyright = copyright|e %}
&copy; <a href="{{ path }}">{% trans %}Copyright{% endtrans %}</a> {{ copyright }}
&copy; <a href="{{ path }}">{% trans %}Copyright{% endtrans %}</a> {{ copyright }}, <a href="https://radimrehurek.com">Radim Řehůřek</a>.
{%- else %}
{% set copyright = copyright|e %}
&copy; {% trans %}Copyright{% endtrans %} {{ copyright }}
&copy; {% trans %}Copyright{% endtrans %} {{ copyright }}, <a href="https://radimrehurek.com">Radim Řehůřek</a>.
{%- endif %}
{%- endif %}

10 changes: 5 additions & 5 deletions docs/src/sphinx_rtd_theme/topbar.html
@@ -29,16 +29,16 @@
<a href="{{ pathto("index") }}"><span>Home</span></a>
</li>
<li{% if pagename == "auto_examples/index" %} class="current-menu-item"{% endif %}>
<a href="{{ pathto("auto_examples/index") }}"><span>Documentation</span></a>
<a href="{{ pathto("auto_examples/index") }}#documentation"><span>Documentation</span></a>
</li>
<li{% if pagename == "support" %} class="current-menu-item"{% endif %}>
<a href="{{ pathto("support") }}"><span>Support</span></a>
<a href="{{ pathto("support") }}#support"><span>Support</span></a>
</li>
<li{% if pagename == "apiref" %} class="current-menu-item"{% endif %}>
<a href="{{ pathto("apiref") }}"><span>API</span></a>
<a href="{{ pathto("apiref") }}#api-reference"><span>API</span></a>
</li>
<li{% if pagename == "about" %} class="current-menu-item"{% endif %}>
<a href="{{ pathto("about") }}"><span>About</span></a>
<li{% if pagename == "intro" %} class="current-menu-item"{% endif %}>
<a href="{{ pathto("intro") }}#what-is-gensim"><span>About</span></a>
</li>
</ul>
</nav>