Commit

Add paragraph describing dictionary.dfs and dictionary.compactify()
Code snippet 13 introduces two new concepts that have not been
explained yet. In addition, the workflow used to create the dictionary
there is completely different from the workflow described in code
snippets 4 and 5. I've added a paragraph that explains the new
workflow and concepts.
oonska authored and tmcmurphy-cradlepoint committed May 22, 2017
1 parent 0635638 commit 1e835e7
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/notebooks/Corpora_and_Vector_Spaces.ipynb
@@ -381,6 +381,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The dictionary was first built from the complete mycorpus.txt file. Then the list of tokenids to remove was generated by querying the dictionary for the token ids of the stop words, and by querying the document frequencies dictionary (dictionary.dfs) for token ids that only appear once. Finally, dictionary.compactify() is called to remove the gaps in the token id series.\n",
"\n",
"And that is all there is to it! At least as far as bag-of-words representation is concerned. Of course, what we do with such corpus is another question; it is not at all clear how counting the frequency of distinct words could be useful. As it turns out, it isn’t, and we will need to apply a transformation on this simple representation first, before we can use it to compute any meaningful document vs. document similarities. Transformations are covered in the [next tutorial](https://radimrehurek.com/gensim/tut2.html), but before that, let’s briefly turn our attention to *corpus persistency*.\n",
"\n",
"## Corpus Formats\n",

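The workflow that the added paragraph describes can be modelled in plain Python. This is only a sketch of the idea, not gensim's implementation: the toy corpus, the stop list, and the plain `token2id` and `dfs` dicts are assumptions standing in for mycorpus.txt and for the dictionary object, and the final renumbering loop imitates what `dictionary.compactify()` is described as doing.

```python
from collections import defaultdict

# Hypothetical toy corpus standing in for mycorpus.txt (one document per line).
lines = [
    "human interface for computer",
    "a survey of user computer system",
    "the user interface system",
]
stoplist = set("for a of the and to in".split())

# Step 1: build the vocabulary from the complete corpus.
token2id = {}           # token -> integer id
dfs = defaultdict(int)  # token id -> number of documents containing the token
for line in lines:
    # dict.fromkeys keeps each token once per document, in first-seen order.
    for token in dict.fromkeys(line.lower().split()):
        token_id = token2id.setdefault(token, len(token2id))
        dfs[token_id] += 1

# Step 2: collect the ids to remove: stop words, plus tokens seen in one document only.
stop_ids = [token2id[word] for word in stoplist if word in token2id]
once_ids = [token_id for token_id, docfreq in dfs.items() if docfreq == 1]
bad_ids = set(stop_ids + once_ids)

# Step 3: filter those ids out. The surviving ids now have gaps (here 1, 3, 7, 8).
token2id = {tok: tid for tok, tid in token2id.items() if tid not in bad_ids}
dfs = {tid: df for tid, df in dfs.items() if tid not in bad_ids}

# Step 4: "compactify" by renumbering the surviving ids to 0..n-1, keeping their order.
idmap = {old: new for new, old in enumerate(sorted(token2id.values()))}
token2id = {tok: idmap[tid] for tok, tid in token2id.items()}
dfs = {idmap[tid]: df for tid, df in dfs.items()}

print(token2id)
```

Building the vocabulary from the raw file first and pruning afterwards means the corpus is tokenized only once; the cost is that the id sequence has holes until the renumbering step runs.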