
Commit

Merge branch 'develop' into develop
menshikh-iv authored May 31, 2017
2 parents 0818976 + 55997f8 commit df4bd06
Showing 65 changed files with 4,972 additions and 1,042 deletions.
302 changes: 164 additions & 138 deletions CHANGELOG.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -135,6 +135,7 @@ Adopters
| Stillwater Supercomputing | <img src="http://www.stillwater-sc.com/img/stillwater-logo.png" width="100"> | [stillwater-sc.com](http://www.stillwater-sc.com/) | Document comprehension and association with word2vec |
| Channel 4 | <img src="http://www.channel4.com/static/info/images/lib/c4logo_2015_info_corporate.jpg" width="100"> | [channel4.com](http://www.channel4.com/) | Recommendation engine |
| Amazon | <img src="http://g-ec2.images-amazon.com/images/G/01/social/api-share/amazon_logo_500500._V323939215_.png" width="100"> | [amazon.com](http://www.amazon.com/) | Document similarity|
| SiteGround Hosting | <img src="https://www.siteground.com/img/knox/logos/siteground.png" width="100"> | [siteground.com](https://www.siteground.com/) | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |

-------

2 changes: 1 addition & 1 deletion continuous_integration/travis/flake8_diff.sh
@@ -133,6 +133,6 @@ check_files() {
if [[ "$MODIFIED_FILES" == "no_match" ]]; then
echo "No file has been modified"
else
check_files "$(echo "$MODIFIED_FILES" )" "--ignore=E501,E731,E12,W503 --exclude=*.sh,*.md,*.yml"
check_files "$(echo "$MODIFIED_FILES" )" "--ignore=E501,E731,E12,W503 --exclude=*.sh,*.md,*.yml,*.rst,*.ipynb,*.vec,Dockerfile*"
fi
echo -e "No problem detected by flake8\n"
52 changes: 34 additions & 18 deletions docs/notebooks/Corpora_and_Vector_Spaces.ipynb
@@ -17,7 +17,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {
"collapsed": true
},
@@ -27,6 +27,18 @@
"logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import tempfile\n",
"TEMP_FOLDER = tempfile.gettempdir()\n",
"print('Folder \"{}\" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))"
]
},
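For reference, the new cell added in this hunk can be sketched as plain Python outside the notebook; the `deerwester.*` filenames are taken from the cells further down, and `dict_path`/`corpus_path` are illustrative names, not part of the committed code.

```python
import os
import tempfile

# Cross-platform temporary directory: '/tmp' on Linux, something like
# 'C:\Users\<name>\AppData\Local\Temp' on Windows.
TEMP_FOLDER = tempfile.gettempdir()

# Build paths with os.path.join instead of hard-coding '/tmp/...',
# which is the point of this change.
dict_path = os.path.join(TEMP_FOLDER, 'deerwester.dict')
corpus_path = os.path.join(TEMP_FOLDER, 'deerwester.mm')
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))
```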
{
"cell_type": "markdown",
"metadata": {},
@@ -149,7 +161,7 @@
],
"source": [
"dictionary = corpora.Dictionary(texts)\n",
"dictionary.save('/tmp/deerwester.dict') # store the dictionary, for future reference\n",
"dictionary.save(os.path.join(TEMP_FOLDER, 'deerwester.dict')) # store the dictionary, for future reference\n",
"print(dictionary)"
]
},
@@ -211,7 +223,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The function `doc2bow()` simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector. The sparse vector `[(word_id, 1), (word_id, 1)]` therefore reads: in the document *“Human computer interaction”*, the words *\"computer\"* and *\"human\"*, identified by an integer id given by the built dictionary, appear once; the other ten dictionary words appear (implicitly) zero times. Check their id at the dictionary displayed in the previous cell and see that they match."
"The function `doc2bow()` simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a bag-of-words: a sparse vector of the form `[(word_id, word_count), ...]`.\n",
"\n",
"Since the token id is 0 for *\"human\"* and 2 for *\"computer\"*, the new document *“Human computer interaction”* is transformed to [(0, 1), (2, 1)]: the words *\"computer\"* and *\"human\"* exist in the dictionary and each appears once. The word *\"interaction\"* is not in the dictionary, so it does not show up in the sparse vector. Neither do the other ten dictionary words, which (implicitly) appear zero times; there will never be an element like (3, 0) in the sparse vector.\n",
"\n",
"For people familiar with scikit-learn, `doc2bow()` behaves like calling `transform()` on [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). It can behave like `fit_transform()` as well. For more details, please see the [gensim API documentation](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow)."
]
},
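The `doc2bow()` behavior described in the cell above can be mimicked in a few lines of plain Python. This is a sketch of the semantics, not gensim's actual implementation; `doc2bow_sketch` and the toy `token2id` mapping are illustrative (the ids 0 and 2 match the tutorial's "human" and "computer").

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Mimic gensim's Dictionary.doc2bow(): count tokens, map them to
    integer ids, and silently drop tokens absent from the dictionary."""
    counts = Counter(tokens)
    return sorted((token2id[tok], cnt) for tok, cnt in counts.items()
                  if tok in token2id)

# Toy dictionary matching the tutorial's ids for "human" and "computer".
token2id = {'human': 0, 'interface': 1, 'computer': 2}

print(doc2bow_sketch('human computer interaction'.lower().split(), token2id))
# → [(0, 1), (2, 1)]   ('interaction' is not in the dictionary, so it is dropped)
```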
{
@@ -239,7 +255,7 @@
],
"source": [
"corpus = [dictionary.doc2bow(text) for text in texts]\n",
"corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus) # store to disk, for later use\n",
"corpora.MmCorpus.serialize(os.path.join(TEMP_FOLDER, 'deerwester.mm'), corpus) # store to disk, for later use\n",
"for c in corpus:\n",
" print(c)"
]
@@ -338,7 +354,7 @@
"source": [
"Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.\n",
"\n",
"Similarly, to construct the dictionary without loading all texts into memory:"
"We are going to create the dictionary from the mycorpus.txt file without loading the entire file into memory. We will then collect the token ids to remove: the ids of the stop words, obtained by querying the dictionary, and the ids of tokens that appear in only one document, obtained from the document frequencies (`dictionary.dfs`). Finally, we will filter those ids out of the dictionary and call `dictionary.compactify()` to remove the gaps in the token id series."
]
},
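The streaming build described above (query stop-word ids, query `dictionary.dfs` for once-only ids, filter, compactify) can be sketched without gensim. This stand-in only shows the bookkeeping that `corpora.Dictionary` performs internally; `build_dictionary` and the two-document `docs` list are illustrative, not the committed code.

```python
stoplist = {'for', 'a', 'of', 'the', 'and', 'to', 'in'}

def build_dictionary(lines):
    token2id, dfs = {}, {}
    for line in lines:
        # Document frequency: count each token at most once per document.
        for token in set(line.lower().split()):
            if token not in token2id:
                token2id[token] = len(token2id)
            tid = token2id[token]
            dfs[tid] = dfs.get(tid, 0) + 1
    # Ids to remove: stop words, plus tokens appearing in only one document.
    bad_ids = {tid for tok, tid in token2id.items() if tok in stoplist}
    bad_ids |= {tid for tid, df in dfs.items() if df == 1}
    # "Compactify": reassign consecutive ids to the surviving tokens.
    kept = sorted(tok for tok, tid in token2id.items() if tid not in bad_ids)
    return {tok: new_id for new_id, tok in enumerate(kept)}

docs = ["Human computer interaction",
        "A survey of user opinion of computer systems"]
print(build_dictionary(docs))  # → {'computer': 0}
```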
{
@@ -399,14 +415,14 @@
"# create a toy corpus of 2 documents, as a plain Python list\n",
"corpus = [[(1, 0.5)], []] # make one document empty, for the heck of it\n",
"\n",
"corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)"
"corpora.MmCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.mm'), corpus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other formats include [Joachim’s SVMlight format](http://svmlight.joachims.org/), [Blei’s LDA-C format](http://www.cs.princeton.edu/~blei/lda-c/) and [GibbsLDA++ format](http://gibbslda.sourceforge.net/)."
"Other formats include [Joachim’s SVMlight format](http://svmlight.joachims.org/), [Blei’s LDA-C format](http://www.cs.columbia.edu/~blei/lda-c/) and [GibbsLDA++ format](http://gibbslda.sourceforge.net/)."
]
},
{
@@ -417,9 +433,9 @@
},
"outputs": [],
"source": [
"corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)\n",
"corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)\n",
"corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)"
"corpora.SvmLightCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.svmlight'), corpus)\n",
"corpora.BleiCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.lda-c'), corpus)\n",
"corpora.LowCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.low'), corpus)"
]
},
{
@@ -437,7 +453,7 @@
},
"outputs": [],
"source": [
"corpus = corpora.MmCorpus('/tmp/corpus.mm')"
"corpus = corpora.MmCorpus(os.path.join(TEMP_FOLDER, 'corpus.mm'))"
]
},
{
@@ -539,7 +555,7 @@
},
"outputs": [],
"source": [
"corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)"
"corpora.BleiCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.lda-c'), corpus)"
]
},
{
@@ -600,23 +616,23 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 2",
"language": "python",
"name": "python3"
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
"nbformat_minor": 1
}
123 changes: 101 additions & 22 deletions docs/notebooks/Similarity_Queries.ipynb
@@ -16,7 +16,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 5,
"metadata": {
"collapsed": true
},
@@ -26,6 +26,28 @@
"logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Folder \"C:\\Users\\chaor\\AppData\\Local\\Temp\" will be used to save temporary dictionary and corpus.\n"
]
}
],
"source": [
"import os\n",
"import tempfile\n",
"TEMP_FOLDER = tempfile.gettempdir()\n",
"print('Folder \"{}\" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -44,11 +66,22 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2017-05-22 14:27:18,911 : INFO : loading Dictionary object from C:\\Users\\chaor\\AppData\\Local\\Temp\\deerwester.dict\n",
"2017-05-22 14:27:18,911 : INFO : loaded C:\\Users\\chaor\\AppData\\Local\\Temp\\deerwester.dict\n",
"2017-05-22 14:27:18,921 : INFO : loaded corpus index from C:\\Users\\chaor\\AppData\\Local\\Temp\\deerwester.mm.index\n",
"2017-05-22 14:27:18,924 : INFO : initializing corpus reader from C:\\Users\\chaor\\AppData\\Local\\Temp\\deerwester.mm\n",
"2017-05-22 14:27:18,929 : INFO : accepted corpus with 9 documents, 12 features, 28 non-zero entries\n"
]
},
{
"name": "stdout",
"output_type": "stream",
@@ -60,8 +93,8 @@
"source": [
"from gensim import corpora, models, similarities\n",
"\n",
"dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')\n",
"corpus = corpora.MmCorpus('/tmp/deerwester.mm') # comes from the first tutorial, \"From strings to vectors\"\n",
"dictionary = corpora.Dictionary.load(os.path.join(TEMP_FOLDER, 'deerwester.dict'))\n",
"corpus = corpora.MmCorpus(os.path.join(TEMP_FOLDER, 'deerwester.mm')) # comes from the first tutorial, \"From strings to vectors\"\n",
"print(corpus)"
]
},
@@ -74,11 +107,30 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2017-05-22 14:27:19,048 : INFO : using serial LSI version on this node\n",
"2017-05-22 14:27:19,050 : INFO : updating model with new documents\n",
"2017-05-22 14:27:19,054 : INFO : preparing a new chunk of documents\n",
"2017-05-22 14:27:19,057 : INFO : using 100 extra samples and 2 power iterations\n",
"2017-05-22 14:27:19,060 : INFO : 1st phase: constructing (12, 102) action matrix\n",
"2017-05-22 14:27:19,064 : INFO : orthonormalizing (12, 102) action matrix\n",
"2017-05-22 14:27:19,068 : INFO : 2nd phase: running dense svd on (12, 9) matrix\n",
"2017-05-22 14:27:19,070 : INFO : computing the final decomposition\n",
"2017-05-22 14:27:19,073 : INFO : keeping 2 factors (discarding 43.156% of energy spectrum)\n",
"2017-05-22 14:27:19,076 : INFO : processed documents up to #9\n",
"2017-05-22 14:27:19,078 : INFO : topic #0(3.341): 0.644*\"system\" + 0.404*\"user\" + 0.301*\"eps\" + 0.265*\"response\" + 0.265*\"time\" + 0.240*\"computer\" + 0.221*\"human\" + 0.206*\"survey\" + 0.198*\"interface\" + 0.036*\"graph\"\n",
"2017-05-22 14:27:19,081 : INFO : topic #1(2.542): 0.623*\"graph\" + 0.490*\"trees\" + 0.451*\"minors\" + 0.274*\"survey\" + -0.167*\"system\" + -0.141*\"eps\" + -0.113*\"human\" + 0.107*\"response\" + 0.107*\"time\" + -0.072*\"interface\"\n"
]
}
],
"source": [
"lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)"
]
@@ -92,7 +144,7 @@
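Conceptually, the LSI transform applied in the next cell is a projection: each bag-of-words vector is mapped into the 2-dimensional latent space by taking its dot product with each topic vector. A minimal sketch, with made-up topic weights (the real values are learned by `models.LsiModel` from the corpus):

```python
# Illustrative topic vectors as {word_id: weight} dicts; NOT the values
# actually learned from the tutorial corpus.
topics = [
    {0: 0.22, 2: 0.24, 4: 0.64},    # topic 0
    {0: -0.11, 2: 0.05, 4: -0.17},  # topic 1
]

def lsi_transform(bow, topics):
    """Project a sparse bow vector [(word_id, count), ...] onto each topic."""
    return [(i, sum(weights.get(wid, 0.0) * cnt for wid, cnt in bow))
            for i, weights in enumerate(topics)]

bow = [(0, 1), (2, 1)]  # "Human computer interaction" as a bag-of-words
print(lsi_transform(bow, topics))  # roughly [(0, 0.46), (1, -0.06)]
```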
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 9,
"metadata": {
"collapsed": false
},
@@ -101,7 +153,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[(0, 0.46182100453271591), (1, 0.070027665279000534)]\n"
"[(0, 0.46182100453271535), (1, -0.070027665279000437)]\n"
]
}
],
@@ -135,11 +187,20 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2017-05-22 14:27:19,299 : WARNING : scanning corpus to determine the number of features (consider setting `num_features` explicitly)\n",
"2017-05-22 14:27:19,358 : INFO : creating matrix with 9 documents and 2 features\n"
]
}
],
"source": [
"index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it"
]
@@ -157,14 +218,23 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 12,
"metadata": {
"collapsed": true
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2017-05-22 14:27:52,760 : INFO : saving MatrixSimilarity object under C:\\Users\\chaor\\AppData\\Local\\Temp\\deerwester.index, separately None\n",
"2017-05-22 14:27:52,772 : INFO : saved C:\\Users\\chaor\\AppData\\Local\\Temp\\deerwester.index\n"
]
}
],
"source": [
"index.save('/tmp/deerwester.index')\n",
"index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')"
"index.save(os.path.join(TEMP_FOLDER, 'deerwester.index'))\n",
"#index = similarities.MatrixSimilarity.load(os.path.join(TEMP_FOLDER, 'deerwester.index'))"
]
},
{
@@ -190,7 +260,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 13,
"metadata": {
"collapsed": false
},
@@ -199,7 +269,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.10639259), (7, -0.098794632), (8, 0.050041769)]\n"
"[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.10639259), (7, -0.098794639), (8, 0.050041765)]\n"
]
}
],
@@ -246,25 +316,34 @@
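The similarities printed above are `(document_id, cosine similarity)` pairs in corpus order; ranking the corpus against the query is just a sort by descending similarity, as the tutorial does. A small sketch using those printed values:

```python
# Cosine similarities of the query against each corpus document,
# copied from the output above.
sims = [(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886),
        (4, 0.90755945), (5, -0.12416792), (6, -0.10639259),
        (7, -0.098794639), (8, 0.050041765)]

# Sort by descending similarity to rank documents against the query.
ranked = sorted(sims, key=lambda item: -item[1])
print(ranked[:3])  # → [(2, 0.99844527), (0, 0.99809301), (3, 0.9865886)]
```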
"* your **feedback is most welcome** and appreciated (and it’s not just the code!): [idea contributions](https://github.com/piskvorky/gensim/wiki/Ideas-&-Features-proposals), [bug reports](https://github.com/piskvorky/gensim/issues) or just consider contributing [user stories and general questions](http://groups.google.com/group/gensim/topics).\n",
"Gensim has no ambition to become an all-encompassing framework, across all NLP (or even Machine Learning) subfields. Its mission is to help NLP practitioners try out popular topic modelling algorithms on large datasets easily, and to facilitate prototyping of new algorithms for researchers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python 3",
"language": "python",
"name": "python2"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
