Support pretrained word2vec model when train doc2vec #2703
Changes from 25 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,177 +1,61 @@ | ||
gensim – Topic Modelling in Python | ||
doc2vec in gensim – support pretrained word2vec | ||
================================== | ||
|
||
[![Build Status](https://travis-ci.org/RaRe-Technologies/gensim.svg?branch=develop)](https://travis-ci.org/RaRe-Technologies/gensim) | ||
[![GitHub release](https://img.shields.io/github/release/rare-technologies/gensim.svg?maxAge=3600)](https://github.com/RaRe-Technologies/gensim/releases) | ||
[![Conda-forge Build](https://anaconda.org/conda-forge/gensim/badges/version.svg)](https://anaconda.org/conda-forge/gensim) | ||
[![Wheel](https://img.shields.io/pypi/wheel/gensim.svg)](https://pypi.python.org/pypi/gensim) | ||
[![DOI](https://zenodo.org/badge/DOI/10.13140/2.1.2393.1847.svg)](https://doi.org/10.13140/2.1.2393.1847) | ||
[![Mailing List](https://img.shields.io/badge/-Mailing%20List-brightgreen.svg)](https://groups.google.com/forum/#!forum/gensim) | ||
[![Gitter](https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg)](https://gitter.im/RaRe-Technologies/gensim) | ||
[![Follow](https://img.shields.io/twitter/follow/gensim_py.svg?style=social&label=Follow)](https://twitter.com/gensim_py) | ||
|
||
Gensim is a Python library for *topic modelling*, *document indexing* | ||
and *similarity retrieval* with large corpora. Target audience is the | ||
*natural language processing* (NLP) and *information retrieval* (IR) | ||
community. | ||
|
||
<!-- | ||
## :pizza: Hacktoberfest 2019 :beer: | ||
|
||
We are accepting PRs for Hacktoberfest! | ||
See [here](HACKTOBERFEST.md) for details. | ||
--> | ||
|
||
Features | ||
-------- | ||
|
||
- All algorithms are **memory-independent** w.r.t. the corpus size | ||
(can process input larger than RAM, streamed, out-of-core), | ||
- **Intuitive interfaces** | ||
- easy to plug in your own input corpus/datastream (trivial | ||
streaming API) | ||
- easy to extend with other Vector Space algorithms (trivial | ||
transformation API) | ||
- Efficient multicore implementations of popular algorithms, such as | ||
online **Latent Semantic Analysis (LSA/LSI/SVD)**, **Latent | ||
Dirichlet Allocation (LDA)**, **Random Projections (RP)**, | ||
**Hierarchical Dirichlet Process (HDP)** or **word2vec deep | ||
learning**. | ||
- **Distributed computing**: can run *Latent Semantic Analysis* and | ||
*Latent Dirichlet Allocation* on a cluster of computers. | ||
- Extensive [documentation and Jupyter Notebook tutorials]. | ||
|
||
If this feature list left you scratching your head, you can first read | ||
more about the [Vector Space Model] and [unsupervised document analysis] | ||
on Wikipedia. | ||
|
||
Support | ||
------------ | ||
|
||
Ask open-ended or research questions on the [Gensim Mailing List](https://groups.google.com/forum/#!forum/gensim). | ||
|
||
Raise bugs on [Github](https://github.com/RaRe-Technologies/gensim/blob/develop/CONTRIBUTING.md) but **make sure you follow the [issue template](https://github.com/RaRe-Technologies/gensim/blob/develop/ISSUE_TEMPLATE.md)**. Issues that are not bugs or fail to follow the issue template will be closed without inspection. | ||
|
||
Installation | ||
------------ | ||
|
||
This software depends on [NumPy and Scipy], two Python packages for | ||
scientific computing. You must have them installed prior to installing | ||
gensim. | ||
|
||
It is also recommended you install a fast BLAS library before installing | ||
NumPy. This is optional, but using an optimized BLAS such as [ATLAS] or | ||
[OpenBLAS] is known to improve performance by as much as an order of | ||
magnitude. On OS X, NumPy picks up the BLAS that comes with it | ||
automatically, so you don’t need to do anything special. | ||
|
||
The simple way to install gensim is: | ||
|
||
pip install -U gensim | ||
|
||
Or, if you have instead downloaded and unzipped the [source tar.gz] | ||
package, you’d run: | ||
|
||
python setup.py test | ||
python setup.py install | ||
|
||
For alternative modes of installation (without root privileges, | ||
development installation, optional install features), see the | ||
[documentation]. | ||
|
||
This version has been tested under Python 2.7, 3.5 and 3.6. Gensim’s github repo is hooked | ||
against [Travis CI for automated testing] on every commit push and pull | ||
request. Support for Python 2.6, 3.3 and 3.4 was dropped in gensim 1.0.0. Install gensim 0.13.4 if you *must* use Python 2.6, 3.3 or 3.4. Support for Python 2.5 was dropped in gensim 0.10.0; install gensim 0.9.1 if you *must* use Python 2.5). | ||
|
||
How come gensim is so fast and memory efficient? Isn’t it pure Python, and isn’t Python slow and greedy? | ||
-------------------------------------------------------------------------------------------------------- | ||
|
||
Many scientific algorithms can be expressed in terms of large matrix | ||
operations (see the BLAS note above). Gensim taps into these low-level | ||
BLAS libraries, by means of its dependency on NumPy. So while | ||
gensim-the-top-level-code is pure Python, it actually executes highly | ||
optimized Fortran/C under the hood, including multithreading (if your | ||
BLAS is so configured). | ||
|
||
Memory-wise, gensim makes heavy use of Python’s built-in generators and | ||
iterators for streamed data processing. Memory efficiency was one of | ||
gensim’s [design goals], and is a central feature of gensim, rather than | ||
something bolted on as an afterthought. | ||
|
||
Documentation | ||
------------- | ||
|
||
- [QuickStart] | ||
- [Tutorials] | ||
- [Official API Documentation] | ||
|
||
[QuickStart]: https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html | ||
[Tutorials]: https://radimrehurek.com/gensim/auto_examples/ | ||
[Official Documentation and Walkthrough]: http://radimrehurek.com/gensim/ | ||
[Official API Documentation]: http://radimrehurek.com/gensim/apiref.html | ||
|
||
--------- | ||
|
||
Adopters | ||
-------- | ||
|
||
| Company | Logo | Industry | Use of Gensim | | ||
|---------|------|----------|---------------| | ||
| [RARE Technologies](http://rare-technologies.com) | ![rare](docs/src/readme_images/rare.png) | ML & NLP consulting | Creators of Gensim – this is us! | | ||
| [Amazon](http://www.amazon.com/) | ![amazon](docs/src/readme_images/amazon.png) | Retail | Document similarity. | | ||
| [National Institutes of Health](https://github.com/NIHOPA/pipeline_word2vec) | ![nih](docs/src/readme_images/nih.png) | Health | Processing grants and publications with word2vec. | | ||
| [Cisco Security](http://www.cisco.com/c/en/us/products/security/index.html) | ![cisco](docs/src/readme_images/cisco.png) | Security | Large-scale fraud detection. | | ||
| [Mindseye](http://www.mindseyesolutions.com/) | ![mindseye](docs/src/readme_images/mindseye.png) | Legal | Similarities in legal documents. | | ||
| [Channel 4](http://www.channel4.com/) | ![channel4](docs/src/readme_images/channel4.png) | Media | Recommendation engine. | | ||
| [Talentpair](http://talentpair.com) | ![talent-pair](docs/src/readme_images/talent-pair.png) | HR | Candidate matching in high-touch recruiting. | | ||
| [Juju](http://www.juju.com/) | ![juju](docs/src/readme_images/juju.png) | HR | Provide non-obvious related job suggestions. | | ||
| [Tailwind](https://www.tailwindapp.com/) | ![tailwind](docs/src/readme_images/tailwind.png) | Media | Post interesting and relevant content to Pinterest. | | ||
| [Issuu](https://issuu.com/) | ![issuu](docs/src/readme_images/issuu.png) | Media | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about. | | ||
| [Search Metrics](http://www.searchmetrics.com/) | ![search-metrics](docs/src/readme_images/search-metrics.png) | Content Marketing | Gensim word2vec used for entity disambiguation in Search Engine Optimisation. | | ||
| [12K Research](https://12k.co/) | ![12k](docs/src/readme_images/12k.png)| Media | Document similarity analysis on media articles. | | ||
| [Stillwater Supercomputing](http://www.stillwater-sc.com/) | ![stillwater](docs/src/readme_images/stillwater.png) | Hardware | Document comprehension and association with word2vec. | | ||
| [SiteGround](https://www.siteground.com/) | ![siteground](docs/src/readme_images/siteground.png) | Web hosting | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. | | ||
| [Capital One](https://www.capitalone.com/) | ![capitalone](docs/src/readme_images/capitalone.png) | Finance | Topic modeling for customer complaints exploration. | | ||
|
||
------- | ||
|
||
Citing gensim | ||
------------ | ||
|
||
When [citing gensim in academic papers and theses], please use this | ||
BibTeX entry: | ||
|
||
@inproceedings{rehurek_lrec, | ||
title = {{Software Framework for Topic Modelling with Large Corpora}}, | ||
author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka}, | ||
booktitle = {{Proceedings of the LREC 2010 Workshop on New | ||
Challenges for NLP Frameworks}}, | ||
pages = {45--50}, | ||
year = 2010, | ||
month = May, | ||
day = 22, | ||
publisher = {ELRA}, | ||
address = {Valletta, Malta}, | ||
note={\url{http://is.muni.cz/publication/884893/en}}, | ||
language={English} | ||
} | ||
|
||
[citing gensim in academic papers and theses]: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:NaGl4SEjCO4C | ||
|
||
[Travis CI for automated testing]: https://travis-ci.org/RaRe-Technologies/gensim | ||
[design goals]: http://radimrehurek.com/gensim/about.html | ||
[RaRe Technologies]: http://rare-technologies.com/wp-content/uploads/2016/02/rare_image_only.png%20=10x20 | ||
[rare\_tech]: //rare-technologies.com | ||
[Talentpair]: https://avatars3.githubusercontent.com/u/8418395?v=3&s=100 | ||
[citing gensim in academic papers and theses]: https://scholar.google.cz/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:u-x6o8ySG0sC | ||
|
||
|
||
|
||
[documentation and Jupyter Notebook tutorials]: https://github.com/RaRe-Technologies/gensim/#documentation | ||
[Vector Space Model]: http://en.wikipedia.org/wiki/Vector_space_model | ||
[unsupervised document analysis]: http://en.wikipedia.org/wiki/Latent_semantic_indexing | ||
[NumPy and Scipy]: http://www.scipy.org/Download | ||
[ATLAS]: http://math-atlas.sourceforge.net/ | ||
[OpenBLAS]: http://xianyi.github.io/OpenBLAS/ | ||
[source tar.gz]: http://pypi.python.org/pypi/gensim | ||
[documentation]: http://radimrehurek.com/gensim/install.html | ||
This is a fork of gensim that modifies the default doc2vec model to support pretrained word2vec embeddings when training doc2vec. It was forked from gensim 3.8. | |
Review comment: Write your documentation so that it's useful from the point of view of the reader. "This is a forked gensim version" is not relevant to the user. Furthermore, it becomes misleading the moment we actually merge this PR.
||
|
||
The default doc2vec model in gensim doesn't support a pretrained word2vec model. But according to Jey Han Lau and Timothy Baldwin's paper, [An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation (2016)](https://arxiv.org/abs/1607.05368), using a pretrained word2vec model usually gives better results in NLP tasks. The authors also released a [forked gensim version](https://github.com/jhlau/gensim) that supports pretrained embeddings, but it is based on a very old gensim version and can't be used with gensim 3.8 (the latest gensim version when I released this fork). | |
Review comment: This is also irrelevant. This kind of information is good inside the PR, as motivation and background (it may already be there).
||
|
||
|
||
|
||
|
||
|
||
|
||
Features and notice | ||
============= | ||
* 1. Supports pretrained word2vec when training doc2vec. | |
* 2. Supports Python 3. | |
* 3. Supports gensim 3.8. | |
* 4. The pretrained word2vec model must be in C text format. | |
* 5. The dimension of the pretrained word2vec vectors and of the doc2vec model to be trained must be the same. | |
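Items 4 and 5 above can be checked before training. The following is a minimal Python sketch, assuming the usual layout of the word2vec C tool's text output: a header line with the vocabulary size and the vector dimension, then one line per word with the word followed by its components. The helper name `read_word2vec_text` and the sample data are illustrative; they are not part of gensim or of this PR.

```python
import io


def read_word2vec_text(stream):
    """Parse word2vec C text format from a text stream (illustrative helper)."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != dim + 1:
            raise ValueError("expected %d values for %r" % (dim, parts[0]))
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError(
            "header promised %d words, found %d" % (vocab_size, len(vectors)))
    return dim, vectors


# Tiny made-up file: 2 words, 3 dimensions.
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
dim, vectors = read_word2vec_text(sample)

vector_size = 3  # the value that would be passed as Doc2Vec's vector_size
if dim != vector_size:
    raise ValueError(
        "pretrained dimension %d != vector_size %d" % (dim, vector_size))
```

Splitting on whitespace keeps the sketch short, but it would misread tokens that themselves contain spaces.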
|
||
|
||
|
||
|
||
|
||
|
||
Use the model | ||
============= | ||
|
||
1.Install the forked gensim | ||
--------------------------- | ||
|
||
* Clone gensim to your machine | ||
> git clone https://github.com/maohbao/gensim.git | ||
|
||
* Install gensim | |
> python setup.py install | ||
|
||
|
||
2.Train your doc2vec model | ||
--------------------------- | ||
|
||
> pretrained_emb = "word2vec_pretrained.txt" # This is a pretrained word2vec model of C text format | ||
> | ||
> model = gensim.models.doc2vec.Doc2Vec( | ||
corpus_train, # This is the documents corpus to be trained which should meet gensim's format | ||
Review comment: Hanging indent please.
||
vector_size=300, | ||
min_count=1, epochs=20, | ||
dm=0, | ||
pretrained_emb=pretrained_emb) | ||
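The hanging indent the reviewer asks for could look like the sketch below. To keep it runnable without the fork installed, the call goes to `make_doc2vec_kwargs`, a hypothetical stand-in that only returns its arguments; in the README the same layout would apply to the `Doc2Vec(...)` call, whose `pretrained_emb` parameter exists only in this PR.

```python
def make_doc2vec_kwargs(corpus, **params):
    """Hypothetical stand-in for the Doc2Vec constructor: echoes its arguments."""
    settings = {"corpus": corpus}
    settings.update(params)
    return settings


corpus_train = []  # placeholder for an iterable of TaggedDocument objects
pretrained_emb = "word2vec_pretrained.txt"  # word2vec model in C text format

# Hanging indent: nothing follows the opening parenthesis, and every
# argument sits on its own line, indented one level.
settings = make_doc2vec_kwargs(
    corpus_train,
    vector_size=300,
    min_count=1,
    epochs=20,
    dm=0,
    pretrained_emb=pretrained_emb,
)
```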
|
||
|
||
|
||
|
||
|
||
|
||
Publications | ||
============= | ||
|
||
* 1.Jey Han Lau and Timothy Baldwin. [An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, 2016.](https://arxiv.org/abs/1607.05368) | ||
|
||
* 2.[The initial forked gensim version](https://github.com/jhlau/gensim) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -209,10 +209,13 @@ class Doc2Vec(BaseWordEmbeddingsModel): | |
|
||
""" | ||
def __init__(self, documents=None, corpus_file=None, dm_mean=None, dm=1, dbow_words=0, dm_concat=0, | ||
dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, callbacks=(), | ||
**kwargs): | ||
dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, | ||
pretrained_emb=None, | ||
maohbao marked this conversation as resolved.
|
||
callbacks=(), **kwargs): | ||
""" | ||
|
||
`pretrained_emb` = takes in pre-trained embedding for word vectors; format = original C word2vec-tool non-binary format (i.e. one embedding per word) | ||
|
||
Parameters | ||
---------- | ||
documents : iterable of list of :class:`~gensim.models.doc2vec.TaggedDocument`, optional | ||
|
@@ -319,8 +322,11 @@ def __init__(self, documents=None, corpus_file=None, dm_mean=None, dm=1, dbow_wo | |
sg=(1 + dm) % 2, | ||
null_word=dm_concat, | ||
callbacks=callbacks, | ||
pretrained_emb=pretrained_emb, | ||
**kwargs) | ||
|
||
self.pretrained_emb=pretrained_emb | ||
|
||
self.load = call_on_class_only | ||
|
||
if dm_mean is not None: | ||
|
@@ -337,7 +343,7 @@ def __init__(self, documents=None, corpus_file=None, dm_mean=None, dm=1, dbow_wo | |
|
||
trainables_keys = ['seed', 'hashfxn', 'window'] | ||
trainables_kwargs = dict((k, kwargs[k]) for k in trainables_keys if k in kwargs) | ||
self.trainables = Doc2VecTrainables( | ||
self.trainables = Doc2VecTrainables(self.pretrained_emb, | ||
dm=dm, dm_concat=dm_concat, dm_tag_count=dm_tag_count, | ||
vector_size=self.vector_size, **trainables_kwargs) | ||
|
||
|
@@ -1171,13 +1177,18 @@ def _tag_seen(self, index, docvecs): | |
|
||
class Doc2VecTrainables(Word2VecTrainables): | ||
"""Represents the inner shallow neural network used to train :class:`~gensim.models.doc2vec.Doc2Vec`.""" | ||
def __init__(self, dm=1, dm_concat=0, dm_tag_count=1, vector_size=100, seed=1, hashfxn=hash, window=5): | ||
def __init__(self, pretrained_emb, dm=1, dm_concat=0, dm_tag_count=1, vector_size=100, seed=1, hashfxn=hash, window=5): | ||
maohbao marked this conversation as resolved.
|
||
|
||
self.pretrained_emb=pretrained_emb | ||
Review comment: PEP8, here and everywhere.
Author reply: How should I change this to follow PEP8? Thank you for the advice!
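As one possible answer to the PEP8 question, the sketch below shows the two conventions that apply to the changed lines: one space on each side of `=` in an assignment statement, and a long signature wrapped with a hanging indent instead of running past the line limit. `TrainablesSketch` is an illustrative stand-in, not gensim's `Doc2VecTrainables`, and the `None` default for `pretrained_emb` is an assumption. A style checker such as flake8 or pycodestyle reports violations of this kind.

```python
class TrainablesSketch:
    """Illustrative stand-in; not gensim's Doc2VecTrainables."""

    def __init__(
            self, pretrained_emb=None, dm=1, dm_concat=0, dm_tag_count=1,
            vector_size=100, seed=1, hashfxn=hash, window=5):
        # Assignment statement: one space on each side of `=`.
        self.pretrained_emb = pretrained_emb
        self.vector_size = vector_size
        # Layer-size rule from the diff; binary operators are spaced too.
        if dm and dm_concat:
            self.layer1_size = (dm_tag_count + (2 * window)) * vector_size
        else:
            self.layer1_size = vector_size


# Keyword arguments and defaults: no spaces around `=`.
trainables = TrainablesSketch(
    pretrained_emb="word2vec_pretrained.txt", dm=1, dm_concat=1)
```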
||
|
||
super(Doc2VecTrainables, self).__init__( | ||
self.pretrained_emb, | ||
vector_size=vector_size, seed=seed, hashfxn=hashfxn) | ||
if dm and dm_concat: | ||
self.layer1_size = (dm_tag_count + (2 * window)) * vector_size | ||
logger.info("using concatenative %d-dimensional layer1", self.layer1_size) | ||
|
||
|
||
def prepare_weights(self, hs, negative, wv, docvecs, update=False): | ||
"""Build tables and model weights based on final vocabulary settings.""" | ||
# set initial input/projection and hidden weights | ||
|
Review comment: I think instead of replacing the top-level README.md file, you should put this documentation somewhere else. Ideally, it should be in a tutorial or a howto.
See https://radimrehurek.com/gensim/auto_examples/howtos/run_doc.html#sphx-glr-auto-examples-howtos-run-doc-py for more info.
Author reply: This is my first PR on GitHub. I have already followed your advice on word2vec.py and doc2vec.py. I also understand that I should not change the top-level README.md file, but I really don't know how or where to write the documentation. Thank you for any further advice!