
Commit

Merge branch 'develop' into develop
menshikh-iv authored May 31, 2017
2 parents 0818976 + 55997f8 commit df4bd06
Showing 65 changed files with 4,972 additions and 1,042 deletions.
302 changes: 164 additions & 138 deletions CHANGELOG.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -135,6 +135,7 @@ Adopters
| Stillwater Supercomputing | <img src="http://www.stillwater-sc.com/img/stillwater-logo.png" width="100"> | [stillwater-sc.com](http://www.stillwater-sc.com/) | Document comprehension and association with word2vec |
| Channel 4 | <img src="http://www.channel4.com/static/info/images/lib/c4logo_2015_info_corporate.jpg" width="100"> | [channel4.com](http://www.channel4.com/) | Recommendation engine |
| Amazon | <img src="http://g-ec2.images-amazon.com/images/G/01/social/api-share/amazon_logo_500500._V323939215_.png" width="100"> | [amazon.com](http://www.amazon.com/) | Document similarity|
| SiteGround Hosting | <img src="https://www.siteground.com/img/knox/logos/siteground.png" width="100"> | [siteground.com](https://www.siteground.com/) | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |

-------

2 changes: 1 addition & 1 deletion continuous_integration/travis/flake8_diff.sh
@@ -133,6 +133,6 @@ check_files() {
if [[ "$MODIFIED_FILES" == "no_match" ]]; then
echo "No file has been modified"
else
check_files "$(echo "$MODIFIED_FILES" )" "--ignore=E501,E731,E12,W503 --exclude=*.sh,*.md,*.yml"
check_files "$(echo "$MODIFIED_FILES" )" "--ignore=E501,E731,E12,W503 --exclude=*.sh,*.md,*.yml,*.rst,*.ipynb,*.vec,Dockerfile*"
fi
echo -e "No problem detected by flake8\n"
52 changes: 34 additions & 18 deletions docs/notebooks/Corpora_and_Vector_Spaces.ipynb
@@ -17,7 +17,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {
"collapsed": true
},
@@ -27,6 +27,18 @@
"logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import tempfile\n",
"TEMP_FOLDER = tempfile.gettempdir()\n",
"print('Folder \"{}\" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))"
]
},
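For reference, the new cell added in this hunk can be sketched as plain Python outside the notebook; the `deerwester.*` filenames are taken from the cells further down, and `dict_path`/`corpus_path` are illustrative names, not part of the committed code.

```python
import os
import tempfile

# Cross-platform temporary directory: '/tmp' on Linux, something like
# 'C:\Users\<name>\AppData\Local\Temp' on Windows.
TEMP_FOLDER = tempfile.gettempdir()

# Build paths with os.path.join instead of hard-coding '/tmp/...',
# which is the point of this change.
dict_path = os.path.join(TEMP_FOLDER, 'deerwester.dict')
corpus_path = os.path.join(TEMP_FOLDER, 'deerwester.mm')
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))
```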
{
"cell_type": "markdown",
"metadata": {},
@@ -149,7 +161,7 @@
],
"source": [
"dictionary = corpora.Dictionary(texts)\n",
"dictionary.save('/tmp/deerwester.dict') # store the dictionary, for future reference\n",
"dictionary.save(os.path.join(TEMP_FOLDER, 'deerwester.dict')) # store the dictionary, for future reference\n",
"print(dictionary)"
]
},
@@ -211,7 +223,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The function `doc2bow()` simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector. The sparse vector `[(word_id, 1), (word_id, 1)]` therefore reads: in the document *“Human computer interaction”*, the words *\"computer\"* and *\"human\"*, identified by an integer id given by the built dictionary, appear once; the other ten dictionary words appear (implicitly) zero times. Check their id at the dictionary displayed in the previous cell and see that they match."
"The function `doc2bow()` simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a bag-of-words: a sparse vector of the form `[(word_id, word_count), ...]`.\n",
"\n",
"Since the token id is 0 for *\"human\"* and 2 for *\"computer\"*, the new document *“Human computer interaction”* is transformed to [(0, 1), (2, 1)]: the words *\"computer\"* and *\"human\"* exist in the dictionary and each appears once. The word *\"interaction\"* is not in the dictionary, so it does not show up in the sparse vector. Neither do the other ten dictionary words, which (implicitly) appear zero times; there will never be an element like (3, 0) in the sparse vector.\n",
"\n",
"For people familiar with scikit-learn, `doc2bow()` behaves like calling `transform()` on [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). It can behave like `fit_transform()` as well. For more details, please see the [gensim API documentation](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow)."
]
},
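The `doc2bow()` behavior described in the cell above can be mimicked in a few lines of plain Python. This is a sketch of the semantics, not gensim's actual implementation; `doc2bow_sketch` and the toy `token2id` mapping are illustrative (the ids 0 and 2 match the tutorial's "human" and "computer").

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Mimic gensim's Dictionary.doc2bow(): count tokens, map them to
    integer ids, and silently drop tokens absent from the dictionary."""
    counts = Counter(tokens)
    return sorted((token2id[tok], cnt) for tok, cnt in counts.items()
                  if tok in token2id)

# Toy dictionary matching the tutorial's ids for "human" and "computer".
token2id = {'human': 0, 'interface': 1, 'computer': 2}

print(doc2bow_sketch('human computer interaction'.lower().split(), token2id))
# → [(0, 1), (2, 1)]   ('interaction' is not in the dictionary, so it is dropped)
```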
{
@@ -239,7 +255,7 @@
],
"source": [
"corpus = [dictionary.doc2bow(text) for text in texts]\n",
"corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus) # store to disk, for later use\n",
"corpora.MmCorpus.serialize(os.path.join(TEMP_FOLDER, 'deerwester.mm'), corpus) # store to disk, for later use\n",
"for c in corpus:\n",
" print(c)"
]
@@ -338,7 +354,7 @@
"source": [
"Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.\n",
"\n",
"Similarly, to construct the dictionary without loading all texts into memory:"
"We are going to create the dictionary from the mycorpus.txt file without loading the entire file into memory. We will then collect the token ids to remove: the ids of the stop words, obtained by querying the dictionary, and the ids of tokens that appear in only one document, obtained from the document frequencies (`dictionary.dfs`). Finally, we will filter those ids out of the dictionary and call `dictionary.compactify()` to remove the gaps in the token id series."
]
},
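The streaming build described above (query stop-word ids, query `dictionary.dfs` for once-only ids, filter, compactify) can be sketched without gensim. This stand-in only shows the bookkeeping that `corpora.Dictionary` performs internally; `build_dictionary` and the two-document `docs` list are illustrative, not the committed code.

```python
stoplist = {'for', 'a', 'of', 'the', 'and', 'to', 'in'}

def build_dictionary(lines):
    token2id, dfs = {}, {}
    for line in lines:
        # Document frequency: count each token at most once per document.
        for token in set(line.lower().split()):
            if token not in token2id:
                token2id[token] = len(token2id)
            tid = token2id[token]
            dfs[tid] = dfs.get(tid, 0) + 1
    # Ids to remove: stop words, plus tokens appearing in only one document.
    bad_ids = {tid for tok, tid in token2id.items() if tok in stoplist}
    bad_ids |= {tid for tid, df in dfs.items() if df == 1}
    # "Compactify": reassign consecutive ids to the surviving tokens.
    kept = sorted(tok for tok, tid in token2id.items() if tid not in bad_ids)
    return {tok: new_id for new_id, tok in enumerate(kept)}

docs = ["Human computer interaction",
        "A survey of user opinion of computer systems"]
print(build_dictionary(docs))  # → {'computer': 0}
```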
{
@@ -399,14 +415,14 @@
"# create a toy corpus of 2 documents, as a plain Python list\n",
"corpus = [[(1, 0.5)], []] # make one document empty, for the heck of it\n",
"\n",
"corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)"
"corpora.MmCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.mm'), corpus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other formats include [Joachim’s SVMlight format](http://svmlight.joachims.org/), [Blei’s LDA-C format](http://www.cs.princeton.edu/~blei/lda-c/) and [GibbsLDA++ format](http://gibbslda.sourceforge.net/)."
"Other formats include [Joachim’s SVMlight format](http://svmlight.joachims.org/), [Blei’s LDA-C format](http://www.cs.columbia.edu/~blei/lda-c/) and [GibbsLDA++ format](http://gibbslda.sourceforge.net/)."
]
},
{
@@ -417,9 +433,9 @@
},
"outputs": [],
"source": [
"corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)\n",
"corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)\n",
"corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)"
"corpora.SvmLightCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.svmlight'), corpus)\n",
"corpora.BleiCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.lda-c'), corpus)\n",
"corpora.LowCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.low'), corpus)"
]
},
{
@@ -437,7 +453,7 @@
},
"outputs": [],
"source": [
"corpus = corpora.MmCorpus('/tmp/corpus.mm')"
"corpus = corpora.MmCorpus(os.path.join(TEMP_FOLDER, 'corpus.mm'))"
]
},
{
@@ -539,7 +555,7 @@
},
"outputs": [],
"source": [
"corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)"
"corpora.BleiCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.lda-c'), corpus)"
]
},
{
@@ -600,23 +616,23 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 2",
"language": "python",
"name": "python3"
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
"nbformat_minor": 1
}
123 changes: 101 additions & 22 deletions docs/notebooks/Similarity_Queries.ipynb
@@ -16,7 +16,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 5,
"metadata": {
"collapsed": true
},
@@ -26,6 +26,28 @@
"logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Folder \"C:\\Users\\chaor\\AppData\\Local\\Temp\" will be used to save temporary dictionary and corpus.\n"
]
}
],
"source": [
"import os\n",
"import tempfile\n",
"TEMP_FOLDER = tempfile.gettempdir()\n",
"print('Folder \"{}\" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -44,11 +66,22 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2017-05-22 14:27:18,911 : INFO : loading Dictionary object from C:\\Users\\chaor\\AppData\\Local\\Temp\\deerwester.dict\n",
"2017-05-22 14:27:18,911 : INFO : loaded C:\\Users\\chaor\\AppData\\Local\\Temp\\deerwester.dict\n",
"2017-05-22 14:27:18,921 : INFO : loaded corpus index from C:\\Users\\chaor\\AppData\\Local\\Temp\\deerwester.mm.index\n",
"2017-05-22 14:27:18,924 : INFO : initializing corpus reader from C:\\Users\\chaor\\AppData\\Local\\Temp\\deerwester.mm\n",
"2017-05-22 14:27:18,929 : INFO : accepted corpus with 9 documents, 12 features, 28 non-zero entries\n"
]
},
{
"name": "stdout",
"output_type": "stream",
@@ -60,8 +93,8 @@
"source": [
"from gensim import corpora, models, similarities\n",
"\n",
"dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')\n",
"corpus = corpora.MmCorpus('/tmp/deerwester.mm') # comes from the first tutorial, \"From strings to vectors\"\n",
"dictionary = corpora.Dictionary.load(os.path.join(TEMP_FOLDER, 'deerwester.dict'))\n",
"corpus = corpora.MmCorpus(os.path.join(TEMP_FOLDER, 'deerwester.mm')) # comes from the first tutorial, \"From strings to vectors\"\n",
"print(corpus)"
]
},
@@ -74,11 +107,30 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2017-05-22 14:27:19,048 : INFO : using serial LSI version on this node\n",
"2017-05-22 14:27:19,050 : INFO : updating model with new documents\n",
"2017-05-22 14:27:19,054 : INFO : preparing a new chunk of documents\n",
"2017-05-22 14:27:19,057 : INFO : using 100 extra samples and 2 power iterations\n",
"2017-05-22 14:27:19,060 : INFO : 1st phase: constructing (12, 102) action matrix\n",
"2017-05-22 14:27:19,064 : INFO : orthonormalizing (12, 102) action matrix\n",
"2017-05-22 14:27:19,068 : INFO : 2nd phase: running dense svd on (12, 9) matrix\n",
"2017-05-22 14:27:19,070 : INFO : computing the final decomposition\n",
"2017-05-22 14:27:19,073 : INFO : keeping 2 factors (discarding 43.156% of energy spectrum)\n",
"2017-05-22 14:27:19,076 : INFO : processed documents up to #9\n",
"2017-05-22 14:27:19,078 : INFO : topic #0(3.341): 0.644*\"system\" + 0.404*\"user\" + 0.301*\"eps\" + 0.265*\"response\" + 0.265*\"time\" + 0.240*\"computer\" + 0.221*\"human\" + 0.206*\"survey\" + 0.198*\"interface\" + 0.036*\"graph\"\n",
"2017-05-22 14:27:19,081 : INFO : topic #1(2.542): 0.623*\"graph\" + 0.490*\"trees\" + 0.451*\"minors\" + 0.274*\"survey\" + -0.167*\"system\" + -0.141*\"eps\" + -0.113*\"human\" + 0.107*\"response\" + 0.107*\"time\" + -0.072*\"interface\"\n"
]
}
],
"source": [
"lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)"
]
@@ -92,7 +144,7 @@
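Conceptually, the LSI transform applied in the next cell is a projection: each bag-of-words vector is mapped into the 2-dimensional latent space by taking its dot product with each topic vector. A minimal sketch, with made-up topic weights (the real values are learned by `models.LsiModel` from the corpus):

```python
# Illustrative topic vectors as {word_id: weight} dicts; NOT the values
# actually learned from the tutorial corpus.
topics = [
    {0: 0.22, 2: 0.24, 4: 0.64},    # topic 0
    {0: -0.11, 2: 0.05, 4: -0.17},  # topic 1
]

def lsi_transform(bow, topics):
    """Project a sparse bow vector [(word_id, count), ...] onto each topic."""
    return [(i, sum(weights.get(wid, 0.0) * cnt for wid, cnt in bow))
            for i, weights in enumerate(topics)]

bow = [(0, 1), (2, 1)]  # "Human computer interaction" as a bag-of-words
print(lsi_transform(bow, topics))  # roughly [(0, 0.46), (1, -0.06)]
```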
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 9,
"metadata": {
"collapsed": false
},
@@ -101,7 +153,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[(0, 0.46182100453271591), (1, 0.070027665279000534)]\n"
"[(0, 0.46182100453271535), (1, -0.070027665279000437)]\n"
]
}
],
@@ -135,11 +187,20 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2017-05-22 14:27:19,299 : WARNING : scanning corpus to determine the number of features (consider setting `num_features` explicitly)\n",
"2017-05-22 14:27:19,358 : INFO : creating matrix with 9 documents and 2 features\n"
]
}
],
"source": [
"index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it"
]
@@ -157,14 +218,23 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 12,
"metadata": {
"collapsed": true
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2017-05-22 14:27:52,760 : INFO : saving MatrixSimilarity object under C:\\Users\\chaor\\AppData\\Local\\Temp\\deerwester.index, separately None\n",
"2017-05-22 14:27:52,772 : INFO : saved C:\\Users\\chaor\\AppData\\Local\\Temp\\deerwester.index\n"
]
}
],
"source": [
"index.save('/tmp/deerwester.index')\n",
"index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')"
"index.save(os.path.join(TEMP_FOLDER, 'deerwester.index'))\n",
"#index = similarities.MatrixSimilarity.load(os.path.join(TEMP_FOLDER, 'deerwester.index'))"
]
},
{
@@ -190,7 +260,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 13,
"metadata": {
"collapsed": false
},
@@ -199,7 +269,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.10639259), (7, -0.098794632), (8, 0.050041769)]\n"
"[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.10639259), (7, -0.098794639), (8, 0.050041765)]\n"
]
}
],
@@ -246,25 +316,34 @@
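The similarities printed above are `(document_id, cosine similarity)` pairs in corpus order; ranking the corpus against the query is just a sort by descending similarity, as the tutorial does. A small sketch using those printed values:

```python
# Cosine similarities of the query against each corpus document,
# copied from the output above.
sims = [(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886),
        (4, 0.90755945), (5, -0.12416792), (6, -0.10639259),
        (7, -0.098794639), (8, 0.050041765)]

# Sort by descending similarity to rank documents against the query.
ranked = sorted(sims, key=lambda item: -item[1])
print(ranked[:3])  # → [(2, 0.99844527), (0, 0.99809301), (3, 0.9865886)]
```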
"* your **feedback is most welcome** and appreciated (and it’s not just the code!): [idea contributions](https://github.com/piskvorky/gensim/wiki/Ideas-&-Features-proposals), [bug reports](https://github.com/piskvorky/gensim/issues) or just consider contributing [user stories and general questions](http://groups.google.com/group/gensim/topics).\n",
"Gensim has no ambition to become an all-encompassing framework, across all NLP (or even Machine Learning) subfields. Its mission is to help NLP practitioners try out popular topic modelling algorithms on large datasets easily, and to facilitate prototyping of new algorithms for researchers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python 3",
"language": "python",
"name": "python2"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
