Support pretrained word2vec model when train doc2vec #2703

Closed
wants to merge 27 commits into from
Changes from 25 commits
234 changes: 59 additions & 175 deletions README.md
@@ -1,177 +1,61 @@
gensim – Topic Modelling in Python
doc2vec in gensim – support pretrained word2vec
Collaborator

I think instead of replacing the top-level README.md file, you should put this documentation somewhere else. Ideally, it should be in a tutorial or a howto.

See https://radimrehurek.com/gensim/auto_examples/howtos/run_doc.html#sphx-glr-auto-examples-howtos-run-doc-py for more info.

Author

This is my first PR on GitHub. I have already followed your advice on word2vec.py and doc2vec.py. I also understand that I should not change the top-level README.md file, but I really don't know how or where to write the documentation. Thank you for any further advice!

==================================

[![Build Status](https://travis-ci.org/RaRe-Technologies/gensim.svg?branch=develop)](https://travis-ci.org/RaRe-Technologies/gensim)
[![GitHub release](https://img.shields.io/github/release/rare-technologies/gensim.svg?maxAge=3600)](https://github.com/RaRe-Technologies/gensim/releases)
[![Conda-forge Build](https://anaconda.org/conda-forge/gensim/badges/version.svg)](https://anaconda.org/conda-forge/gensim)
[![Wheel](https://img.shields.io/pypi/wheel/gensim.svg)](https://pypi.python.org/pypi/gensim)
[![DOI](https://zenodo.org/badge/DOI/10.13140/2.1.2393.1847.svg)](https://doi.org/10.13140/2.1.2393.1847)
[![Mailing List](https://img.shields.io/badge/-Mailing%20List-brightgreen.svg)](https://groups.google.com/forum/#!forum/gensim)
[![Gitter](https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg)](https://gitter.im/RaRe-Technologies/gensim)
[![Follow](https://img.shields.io/twitter/follow/gensim_py.svg?style=social&label=Follow)](https://twitter.com/gensim_py)

Gensim is a Python library for *topic modelling*, *document indexing*
and *similarity retrieval* with large corpora. Target audience is the
*natural language processing* (NLP) and *information retrieval* (IR)
community.

<!--
## :pizza: Hacktoberfest 2019 :beer:

We are accepting PRs for Hacktoberfest!
See [here](HACKTOBERFEST.md) for details.
-->

Features
--------

- All algorithms are **memory-independent** w.r.t. the corpus size
(can process input larger than RAM, streamed, out-of-core),
- **Intuitive interfaces**
- easy to plug in your own input corpus/datastream (trivial
streaming API)
- easy to extend with other Vector Space algorithms (trivial
transformation API)
- Efficient multicore implementations of popular algorithms, such as
online **Latent Semantic Analysis (LSA/LSI/SVD)**, **Latent
Dirichlet Allocation (LDA)**, **Random Projections (RP)**,
**Hierarchical Dirichlet Process (HDP)** or **word2vec deep
learning**.
- **Distributed computing**: can run *Latent Semantic Analysis* and
*Latent Dirichlet Allocation* on a cluster of computers.
- Extensive [documentation and Jupyter Notebook tutorials].

If this feature list left you scratching your head, you can first read
more about the [Vector Space Model] and [unsupervised document analysis]
on Wikipedia.

Support
------------

Ask open-ended or research questions on the [Gensim Mailing List](https://groups.google.com/forum/#!forum/gensim).

Raise bugs on [Github](https://github.com/RaRe-Technologies/gensim/blob/develop/CONTRIBUTING.md) but **make sure you follow the [issue template](https://github.com/RaRe-Technologies/gensim/blob/develop/ISSUE_TEMPLATE.md)**. Issues that are not bugs or fail to follow the issue template will be closed without inspection.

Installation
------------

This software depends on [NumPy and Scipy], two Python packages for
scientific computing. You must have them installed prior to installing
gensim.

It is also recommended you install a fast BLAS library before installing
NumPy. This is optional, but using an optimized BLAS such as [ATLAS] or
[OpenBLAS] is known to improve performance by as much as an order of
magnitude. On OS X, NumPy picks up the BLAS that comes with it
automatically, so you don’t need to do anything special.

The simple way to install gensim is:

pip install -U gensim

Or, if you have instead downloaded and unzipped the [source tar.gz]
package, you’d run:

python setup.py test
python setup.py install

For alternative modes of installation (without root privileges,
development installation, optional install features), see the
[documentation].

This version has been tested under Python 2.7, 3.5 and 3.6. Gensim’s github repo is hooked
against [Travis CI for automated testing] on every commit push and pull
request. Support for Python 2.6, 3.3 and 3.4 was dropped in gensim 1.0.0. Install gensim 0.13.4 if you *must* use Python 2.6, 3.3 or 3.4. Support for Python 2.5 was dropped in gensim 0.10.0; install gensim 0.9.1 if you *must* use Python 2.5).

How come gensim is so fast and memory efficient? Isn’t it pure Python, and isn’t Python slow and greedy?
--------------------------------------------------------------------------------------------------------

Many scientific algorithms can be expressed in terms of large matrix
operations (see the BLAS note above). Gensim taps into these low-level
BLAS libraries, by means of its dependency on NumPy. So while
gensim-the-top-level-code is pure Python, it actually executes highly
optimized Fortran/C under the hood, including multithreading (if your
BLAS is so configured).

Memory-wise, gensim makes heavy use of Python’s built-in generators and
iterators for streamed data processing. Memory efficiency was one of
gensim’s [design goals], and is a central feature of gensim, rather than
something bolted on as an afterthought.

Documentation
-------------

- [QuickStart]
- [Tutorials]
- [Official API Documentation]

[QuickStart]: https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html
[Tutorials]: https://radimrehurek.com/gensim/auto_examples/
[Official Documentation and Walkthrough]: http://radimrehurek.com/gensim/
[Official API Documentation]: http://radimrehurek.com/gensim/apiref.html

---------

Adopters
--------

| Company | Logo | Industry | Use of Gensim |
|---------|------|----------|---------------|
| [RARE Technologies](http://rare-technologies.com) | ![rare](docs/src/readme_images/rare.png) | ML & NLP consulting | Creators of Gensim – this is us! |
| [Amazon](http://www.amazon.com/) | ![amazon](docs/src/readme_images/amazon.png) | Retail | Document similarity. |
| [National Institutes of Health](https://github.com/NIHOPA/pipeline_word2vec) | ![nih](docs/src/readme_images/nih.png) | Health | Processing grants and publications with word2vec. |
| [Cisco Security](http://www.cisco.com/c/en/us/products/security/index.html) | ![cisco](docs/src/readme_images/cisco.png) | Security | Large-scale fraud detection. |
| [Mindseye](http://www.mindseyesolutions.com/) | ![mindseye](docs/src/readme_images/mindseye.png) | Legal | Similarities in legal documents. |
| [Channel 4](http://www.channel4.com/) | ![channel4](docs/src/readme_images/channel4.png) | Media | Recommendation engine. |
| [Talentpair](http://talentpair.com) | ![talent-pair](docs/src/readme_images/talent-pair.png) | HR | Candidate matching in high-touch recruiting. |
| [Juju](http://www.juju.com/) | ![juju](docs/src/readme_images/juju.png) | HR | Provide non-obvious related job suggestions. |
| [Tailwind](https://www.tailwindapp.com/) | ![tailwind](docs/src/readme_images/tailwind.png) | Media | Post interesting and relevant content to Pinterest. |
| [Issuu](https://issuu.com/) | ![issuu](docs/src/readme_images/issuu.png) | Media | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about. |
| [Search Metrics](http://www.searchmetrics.com/) | ![search-metrics](docs/src/readme_images/search-metrics.png) | Content Marketing | Gensim word2vec used for entity disambiguation in Search Engine Optimisation. |
| [12K Research](https://12k.co/) | ![12k](docs/src/readme_images/12k.png)| Media | Document similarity analysis on media articles. |
| [Stillwater Supercomputing](http://www.stillwater-sc.com/) | ![stillwater](docs/src/readme_images/stillwater.png) | Hardware | Document comprehension and association with word2vec. |
| [SiteGround](https://www.siteground.com/) | ![siteground](docs/src/readme_images/siteground.png) | Web hosting | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
| [Capital One](https://www.capitalone.com/) | ![capitalone](docs/src/readme_images/capitalone.png) | Finance | Topic modeling for customer complaints exploration. |

-------

Citing gensim
------------

When [citing gensim in academic papers and theses], please use this
BibTeX entry:

@inproceedings{rehurek_lrec,
title = {{Software Framework for Topic Modelling with Large Corpora}},
author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
booktitle = {{Proceedings of the LREC 2010 Workshop on New
Challenges for NLP Frameworks}},
pages = {45--50},
year = 2010,
month = May,
day = 22,
publisher = {ELRA},
address = {Valletta, Malta},
note={\url{http://is.muni.cz/publication/884893/en}},
language={English}
}

[citing gensim in academic papers and theses]: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:NaGl4SEjCO4C

[Travis CI for automated testing]: https://travis-ci.org/RaRe-Technologies/gensim
[design goals]: http://radimrehurek.com/gensim/about.html
[RaRe Technologies]: http://rare-technologies.com/wp-content/uploads/2016/02/rare_image_only.png%20=10x20
[rare\_tech]: //rare-technologies.com
[Talentpair]: https://avatars3.githubusercontent.com/u/8418395?v=3&s=100
[citing gensim in academic papers and theses]: https://scholar.google.cz/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:u-x6o8ySG0sC



[documentation and Jupyter Notebook tutorials]: https://github.com/RaRe-Technologies/gensim/#documentation
[Vector Space Model]: http://en.wikipedia.org/wiki/Vector_space_model
[unsupervised document analysis]: http://en.wikipedia.org/wiki/Latent_semantic_indexing
[NumPy and Scipy]: http://www.scipy.org/Download
[ATLAS]: http://math-atlas.sourceforge.net/
[OpenBLAS]: http://xianyi.github.io/OpenBLAS/
[source tar.gz]: http://pypi.python.org/pypi/gensim
[documentation]: http://radimrehurek.com/gensim/install.html
This is a forked gensim version that modifies the default doc2vec model to support pretrained word2vec embeddings when training doc2vec. It was forked from gensim 3.8.
Collaborator

Write your documentation so that it's useful from the point of view of the reader.

"This is a forked gensim version" is not relevant to the user. Furthermore, it becomes misleading the moment we actually merge this PR.


The default doc2vec model in gensim doesn't support a pretrained word2vec model. But according to Jey Han Lau and Timothy Baldwin's paper, [An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation (2016)](https://arxiv.org/abs/1607.05368), using a pretrained word2vec model usually gives better results in NLP tasks. The authors also released a [forked gensim version](https://github.com/jhlau/gensim) that supports pretrained embeddings, but it is based on a very old gensim version and can't be used with gensim 3.8 (the latest gensim version when I released this fork).
Collaborator

This is also irrelevant. This kind of information is good inside the PR, as motivation and background (it may already be there).







Features and notice
=============
* 1. Supports pretrained word2vec embeddings when training doc2vec.
* 2. Supports Python 3.
* 3. Supports gensim 3.8.
* 4. The pretrained word2vec model must be in C text format.
* 5. The pretrained word2vec vectors and the doc2vec model to be trained must have the same dimension.
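
The last two points above can be checked before training. In the C text format written by the original word2vec tool, the first line holds the vocabulary size and the vector dimension, and each following line holds a word followed by its vector components. The sketch below reads that header and compares it with the intended `vector_size`. The helper name `read_word2vec_text_header` and the demo file are illustrative assumptions, not part of gensim or of this PR.

```python
import os
import tempfile


def read_word2vec_text_header(path):
    """Return (vocab_size, vector_size) from the first line of a C text format file."""
    with open(path, encoding="utf-8") as handle:
        vocab_size, vector_size = handle.readline().split()
    return int(vocab_size), int(vector_size)


# Write a tiny two-word, three-dimensional file purely for demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "word2vec_pretrained.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("2 3\n")
    handle.write("cat 0.1 0.2 0.3\n")
    handle.write("dog 0.4 0.5 0.6\n")

vocab_size, vector_size = read_word2vec_text_header(demo_path)

# The doc2vec vector_size must equal the pretrained dimension.
doc2vec_vector_size = 3
dimensions_match = (vector_size == doc2vec_vector_size)
print(vocab_size, vector_size, dimensions_match)
```

Running a check like this first avoids discovering a dimension mismatch only after the vocabulary has been built.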






Use the model
=============

1.Install the forked gensim
---------------------------

* Clone gensim to your machine
> git clone https://github.com/maohbao/gensim.git

* Install gensim
> python setup.py install


2.Train your doc2vec model
---------------------------

> pretrained_emb = "word2vec_pretrained.txt" # Path to a pretrained word2vec model in C text format
>
> model = gensim.models.doc2vec.Doc2Vec(
corpus_train, # This is the documents corpus to be trained which should meet gensim's format
Collaborator

Hanging indent please.

vector_size=300,
min_count=1, epochs=20,
dm=0,
pretrained_emb=pretrained_emb)
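
With the hanging indent requested in the review, the call above could be laid out as in the sketch below. To keep the snippet runnable without the patched gensim, `Doc2Vec` here is a hypothetical stand-in class that only records its arguments. The `pretrained_emb` keyword exists only in this PR, not in released gensim, and `corpus_train` is a placeholder.

```python
class Doc2Vec:
    """Hypothetical stand-in for the patched gensim Doc2Vec; it only records its arguments."""

    def __init__(self, documents, **kwargs):
        self.documents = documents
        self.params = kwargs


# Placeholder corpus; the real call expects an iterable of TaggedDocument objects.
corpus_train = []

# Path to pretrained word2vec vectors in C text format.
pretrained_emb = "word2vec_pretrained.txt"

# Hanging indent: nothing after the opening parenthesis, one indented argument per line.
model = Doc2Vec(
    corpus_train,
    vector_size=300,
    min_count=1,
    epochs=20,
    dm=0,
    pretrained_emb=pretrained_emb,
)
```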






Publications
=============

* 1.Jey Han Lau and Timothy Baldwin. [An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, 2016.](https://arxiv.org/abs/1607.05368)

* 2.[The initial forked gensim version](https://github.com/jhlau/gensim)
19 changes: 15 additions & 4 deletions gensim/models/doc2vec.py
@@ -209,10 +209,13 @@ class Doc2Vec(BaseWordEmbeddingsModel):

"""
def __init__(self, documents=None, corpus_file=None, dm_mean=None, dm=1, dbow_words=0, dm_concat=0,
dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, callbacks=(),
**kwargs):
dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None,
pretrained_emb=None,
callbacks=(), **kwargs):
"""

`pretrained_emb` = takes in pre-trained embedding for word vectors; format = original C word2vec-tool non-binary format (i.e. one embedding per word)

Parameters
----------
documents : iterable of list of :class:`~gensim.models.doc2vec.TaggedDocument`, optional
@@ -319,8 +322,11 @@ def __init__(self, documents=None, corpus_file=None, dm_mean=None, dm=1, dbow_wo
sg=(1 + dm) % 2,
null_word=dm_concat,
callbacks=callbacks,
pretrained_emb=pretrained_emb,
**kwargs)

self.pretrained_emb=pretrained_emb

self.load = call_on_class_only

if dm_mean is not None:
@@ -337,7 +343,7 @@ def __init__(self, documents=None, corpus_file=None, dm_mean=None, dm=1, dbow_wo

trainables_keys = ['seed', 'hashfxn', 'window']
trainables_kwargs = dict((k, kwargs[k]) for k in trainables_keys if k in kwargs)
self.trainables = Doc2VecTrainables(
self.trainables = Doc2VecTrainables(self.pretrained_emb,
dm=dm, dm_concat=dm_concat, dm_tag_count=dm_tag_count,
vector_size=self.vector_size, **trainables_kwargs)

@@ -1171,13 +1177,18 @@ def _tag_seen(self, index, docvecs):

class Doc2VecTrainables(Word2VecTrainables):
"""Represents the inner shallow neural network used to train :class:`~gensim.models.doc2vec.Doc2Vec`."""
def __init__(self, dm=1, dm_concat=0, dm_tag_count=1, vector_size=100, seed=1, hashfxn=hash, window=5):
def __init__(self, pretrained_emb, dm=1, dm_concat=0, dm_tag_count=1, vector_size=100, seed=1, hashfxn=hash, window=5):

self.pretrained_emb=pretrained_emb
Collaborator

PEP8, here and everywhere

Author

How should I change this to follow PEP8? Thank you for the advice!
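
In this hunk, the PEP8 issues are most likely the missing spaces around `=` in the assignment, the very long `def` line, and the stray extra blank lines. A minimal sketch of the same constructor laid out to PEP8 follows. `Word2VecTrainablesStub` is a hypothetical stand-in for the patched base class so the snippet runs on its own; it is not the real gensim class.

```python
class Word2VecTrainablesStub:
    """Hypothetical stand-in for the patched Word2VecTrainables base class."""

    def __init__(self, pretrained_emb, vector_size=100, seed=1, hashfxn=hash):
        self.pretrained_emb = pretrained_emb
        self.vector_size = vector_size
        self.seed = seed
        self.hashfxn = hashfxn
        self.layer1_size = vector_size


class Doc2VecTrainables(Word2VecTrainablesStub):
    """The constructor from the diff, reformatted to follow PEP8."""

    def __init__(
        self, pretrained_emb, dm=1, dm_concat=0, dm_tag_count=1,
        vector_size=100, seed=1, hashfxn=hash, window=5,
    ):
        # Spaces around "=" in assignments; none around "=" in keyword arguments.
        self.pretrained_emb = pretrained_emb
        super(Doc2VecTrainables, self).__init__(
            self.pretrained_emb,
            vector_size=vector_size, seed=seed, hashfxn=hashfxn,
        )
        if dm and dm_concat:
            self.layer1_size = (dm_tag_count + (2 * window)) * vector_size


trainables = Doc2VecTrainables("word2vec_pretrained.txt", dm_concat=1)
```

A linter such as flake8 reports these issues automatically, so running it locally before pushing catches them "here and everywhere".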


super(Doc2VecTrainables, self).__init__(
self.pretrained_emb,
vector_size=vector_size, seed=seed, hashfxn=hashfxn)
if dm and dm_concat:
self.layer1_size = (dm_tag_count + (2 * window)) * vector_size
logger.info("using concatenative %d-dimensional layer1", self.layer1_size)


def prepare_weights(self, hs, negative, wv, docvecs, update=False):
"""Build tables and model weights based on final vocabulary settings."""
# set initial input/projection and hidden weights
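
The `layer1_size` expression in `Doc2VecTrainables.__init__` above applies only when both `dm` and `dm_concat` are set: the input layer then concatenates one vector per document tag with the vectors of `window` context words on each side. A small worked example of that arithmetic, using a hypothetical helper name `concat_layer1_size`:

```python
def concat_layer1_size(dm_tag_count, window, vector_size):
    """Width of the concatenated input layer in DM-concat mode."""
    # One slot per document tag plus `window` context words on each side.
    return (dm_tag_count + (2 * window)) * vector_size


# Defaults from the constructor signature: 1 tag, window of 5, 100 dimensions.
default_size = concat_layer1_size(dm_tag_count=1, window=5, vector_size=100)

# The README example's vector_size of 300, if dm and dm_concat were both enabled.
readme_size = concat_layer1_size(dm_tag_count=1, window=5, vector_size=300)

print(default_size, readme_size)
```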