diff --git a/docs/notebooks/atmodel_tutorial.ipynb b/docs/notebooks/atmodel_tutorial.ipynb new file mode 100644 index 0000000000..8d842b1a49 --- /dev/null +++ b/docs/notebooks/atmodel_tutorial.ipynb @@ -0,0 +1,1747 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# The author-topic model: LDA with metadata\n", + "\n", + "In this tutorial, you will learn how to use the author-topic model in Gensim. We will apply it to a corpus consisting of scientific papers, to gain insight into the authors of the papers.\n", + "\n", + "The author-topic model is an extension of Latent Dirichlet Allocation (LDA) that allows us to learn topic representations of authors in a corpus. The model can be applied to any kind of label on documents, such as tags on posts on the web. The model can be used as a novel way of data exploration, as features in machine learning pipelines, for author (or tag) prediction, or simply to enrich your topic model with existing metadata.\n", + "\n", + "To learn about the theoretical side of the author-topic model, see [Rosen-Zvi and co-authors 2004](https://mimno.infosci.cornell.edu/info6150/readings/398.pdf), for example. A report on the algorithm used in the Gensim implementation will be available soon.\n", + "\n", + "Naturally, familiarity with topic modelling, LDA and Gensim is assumed in this tutorial. If you are not familiar with either LDA or its Gensim implementation, I recommend starting there. 
Consider some of these resources:\n", + "* Gentle introduction to the LDA model: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/\n", + "* Gensim's LDA API documentation: https://radimrehurek.com/gensim/models/ldamodel.html\n", + "* Topic modelling in Gensim: http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html\n", + "* Pre-processing and training LDA: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/lda_training_tips.ipynb\n", + "\n", + "\n", + "> **NOTE:**\n", + ">\n", + "> To run this tutorial on your own, install Jupyter, Gensim, SpaCy, Scikit-Learn, Bokeh and Pandas, e.g. using pip:\n", + ">\n", + "> `pip install jupyter[all] gensim spacy scikit-learn bokeh pandas`\n", + ">\n", + "> Note that you need to download some data for SpaCy using `python -m spacy.en.download`.\n", + ">\n", + "> Download the notebook at https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks/atmodel_tutorial.ipynb.\n", + "\n", + "In this tutorial, we will learn how to prepare data for the model, how to train it, and how to explore the resulting representation in different ways. We will inspect the topic representation of some well-known authors like Geoffrey Hinton and Yann LeCun, and compare authors by plotting them in reduced dimensionality and performing similarity queries.\n", + "\n", + "## Analyzing scientific papers\n", + "\n", + "The data we will be using consists of scientific papers about machine learning, from the Neural Information Processing Systems conference (NIPS). It is the same dataset used in the [Pre-processing and training LDA](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/lda_training_tips.ipynb) tutorial, mentioned earlier.\n", + "\n", + "We will be performing qualitative analysis of the model, and at times this will require an understanding of the subject matter of the data. 
If you try running this tutorial on your own, consider applying it to a dataset with subject matter that you are familiar with. For example, try one of the [StackExchange datadump datasets](https://archive.org/details/stackexchange).\n", + "\n", + "You can download the data from Sam Roweis' website (http://www.cs.nyu.edu/~roweis/data.html). Or just run the cell below, and it will be downloaded and extracted into your `/tmp` folder." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--2017-01-16 12:29:12-- http://www.cs.nyu.edu/~roweis/data/nips12raw_str602.tgz\n", + "Resolving www.cs.nyu.edu (www.cs.nyu.edu)... 128.122.49.30\n", + "Connecting to www.cs.nyu.edu (www.cs.nyu.edu)|128.122.49.30|:80... connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 12851423 (12M) [application/x-gzip]\n", + "Saving to: ‘STDOUT’\n", + "\n", + "- 100%[===================>] 12.26M 3.33MB/s in 4.9s \n", + "\n", + "2017-01-16 12:29:18 (2.49 MB/s) - written to stdout [12851423/12851423]\n", + "\n" + ] + } + ], + "source": [ + "!wget -O - 'http://www.cs.nyu.edu/~roweis/data/nips12raw_str602.tgz' > /tmp/nips12raw_str602.tgz" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import tarfile\n", + "\n", + "filename = '/tmp/nips12raw_str602.tgz'\n", + "tar = tarfile.open(filename, 'r:gz')\n", + "for item in tar:\n", + " tar.extract(item, path='/tmp')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the following sections we will load the data, pre-process it, train the model, and explore the results using some of the implementation's functionality. 
Feel free to skip the loading and pre-processing for now, if you are familiar with the process.\n", + "\n", + "### Loading the data\n", + "\n", + "In the cell below, we crawl the folders and files in the dataset, and read the files into memory." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import os, re\n", + "\n", + "# Folder containing all NIPS papers.\n", + "data_dir = '/tmp/nipstxt/' # Set this path to the data on your machine.\n", + "\n", + "# Folders containing individual NIPS papers.\n", + "yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']\n", + "dirs = ['nips' + yr for yr in yrs]\n", + "\n", + "# Get all document texts and their corresponding IDs.\n", + "docs = []\n", + "doc_ids = []\n", + "for yr_dir in dirs:\n", + " files = os.listdir(data_dir + yr_dir) # List of filenames.\n", + " for filen in files:\n", + " # Get document ID.\n", + " (idx1, idx2) = re.search('[0-9]+', filen).span() # Indexes of the start and end of the ID.\n", + " doc_ids.append(yr_dir[4:] + '_' + str(int(filen[idx1:idx2])))\n", + " \n", + " # Read document text.\n", + " # Note: ignoring characters that cause encoding errors.\n", + " with open(data_dir + yr_dir + '/' + filen, errors='ignore', encoding='utf-8') as fid:\n", + " txt = fid.read()\n", + " \n", + " # Replace any whitespace (newline, tabs, etc.) by a single space.\n", + " txt = re.sub('\\s', ' ', txt)\n", + " \n", + " docs.append(txt)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Construct a mapping from author names to document IDs."
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "filenames = [data_dir + 'idx/a' + yr + '.txt' for yr in yrs] # Using the years defined in previous cell.\n", + "\n", + "# Get all author names and their corresponding document IDs.\n", + "author2doc = dict()\n", + "i = 0\n", + "for yr in yrs:\n", + " # The files \"a00.txt\" and so on contain the author-document mappings.\n", + " filename = data_dir + 'idx/a' + yr + '.txt'\n", + " for line in open(filename, errors='ignore', encoding='utf-8'):\n", + " # Each line corresponds to one author.\n", + " contents = re.split(',', line)\n", + " author_name = (contents[1] + contents[0]).strip()\n", + " # Remove any whitespace to reduce redundant author names.\n", + " author_name = re.sub('\\s', '', author_name)\n", + " # Get document IDs for author.\n", + " ids = [c.strip() for c in contents[2:]]\n", + " if not author2doc.get(author_name):\n", + " # This is a new author.\n", + " author2doc[author_name] = []\n", + " i += 1\n", + " \n", + " # Add document IDs to author.\n", + " author2doc[author_name].extend([yr + '_' + id for id in ids])\n", + "\n", + "# Use an integer ID in author2doc, instead of the IDs provided in the NIPS dataset.\n", + "# Mapping from ID of document in NIPS dataset, to an integer ID.\n", + "doc_id_dict = dict(zip(doc_ids, range(len(doc_ids))))\n", + "# Replace NIPS IDs by integer IDs.\n", + "for a, a_doc_ids in author2doc.items():\n", + " for i, doc_id in enumerate(a_doc_ids):\n", + " author2doc[a][i] = doc_id_dict[doc_id]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Pre-processing text\n", + "\n", + "The text will be pre-processed using the following steps:\n", + "* Tokenize text.\n", + "* Replace all whitespace by single spaces.\n", + "* Remove all punctuation and numbers.\n", + "* Remove stopwords.\n", + "* Lemmatize words.\n", + "* Add multi-word named entities.\n", + "* Add frequent 
bigrams.\n", + "* Remove frequent and rare words.\n", + "\n", + "A lot of the heavy lifting will be done by the great package Spacy. Spacy markets itself as \"industrial-strength natural language processing\", is fast, enables multiprocessing, and is easy to use. First, let's import it and load the English NLP pipeline." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import spacy\n", + "nlp = spacy.load('en')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the code below, Spacy takes care of tokenization, removing non-alphabetic characters, removal of stopwords, lemmatization and named entity recognition.\n", + "\n", + "Note that we only keep named entities that consist of more than one word, as single-word named entities are already present as tokens." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 9min 6s, sys: 276 ms, total: 9min 7s\n", + "Wall time: 2min 52s\n" + ] + } + ], + "source": [ + "%%time\n", + "processed_docs = [] \n", + "for doc in nlp.pipe(docs, n_threads=4, batch_size=100):\n", + " # Process document using Spacy NLP pipeline.\n", + " \n", + " ents = doc.ents # Named entities.\n", + "\n", + " # Keep only words (no numbers, no punctuation).\n", + " # Lemmatize tokens, remove punctuation and remove stopwords.\n", + " doc = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]\n", + "\n", + " # Remove common words from a stopword list.\n", + " #doc = [token for token in doc if token not in STOPWORDS]\n", + "\n", + " # Add named entities, but only if they are a compound of more than one word.\n", + " doc.extend([str(entity) for entity in ents if len(entity) > 1])\n", + " \n", + " processed_docs.append(doc)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + 
"metadata": { 
+ "collapsed": true + }, + "outputs": [], + "source": [ + "docs = processed_docs\n", + "del processed_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Below, we use a Gensim model to add bigrams. Note that this serves a similar goal to named entity recognition, that is, finding adjacent words that have some particular significance." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "collapsed": false, + "scrolled": true + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/olavur/Dropbox/my_folder/workstuff/DTU/thesis/code/gensim/gensim/models/phrases.py:248: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class\n", + " warnings.warn(\"For a faster implementation, use the gensim.models.phrases.Phraser class\")\n" + ] + } + ], + "source": [ + "# Compute bigrams.\n", + "from gensim.models import Phrases\n", + "# Add bigrams to docs (only ones that appear 20 times or more).\n", + "bigram = Phrases(docs, min_count=20)\n", + "for idx in range(len(docs)):\n", + " for token in bigram[docs[idx]]:\n", + " if '_' in token:\n", + " # Token is a bigram, add to document.\n", + " docs[idx].append(token)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we are ready to construct a dictionary, as our vocabulary is finalized. We then remove common words (appearing in more than $50\\%$ of the documents), and rare words (appearing in fewer than $20$ documents)."
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Create a dictionary representation of the documents, and filter out frequent and rare words.\n", + "\n", + "from gensim.corpora import Dictionary\n", + "dictionary = Dictionary(docs)\n", + "\n", + "# Remove rare and common tokens.\n", + "# Filter out words that occur too frequently or too rarely.\n", + "max_freq = 0.5\n", + "min_wordcount = 20\n", + "dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)\n", + "\n", + "_ = dictionary[0] # This sort of \"initializes\" dictionary.id2token." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We produce the vectorized representation of the documents, to supply the author-topic model with, by computing the bag-of-words." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# Vectorize data.\n", + "\n", + "# Bag-of-words representation of the documents.\n", + "corpus = [dictionary.doc2bow(doc) for doc in docs]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's inspect the dimensionality of our data." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of authors: 2479\n", + "Number of unique tokens: 6996\n", + "Number of documents: 1740\n" + ] + } + ], + "source": [ + "print('Number of authors: %d' % len(author2doc))\n", + "print('Number of unique tokens: %d' % len(dictionary))\n", + "print('Number of documents: %d' % len(corpus))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Train and use model\n", + "\n", + "We train the author-topic model on the data prepared in the previous sections. 
\n", + "\n", + "The interface to the author-topic model is very similar to that of LDA in Gensim. In addition to a corpus, ID to word mapping (`id2word`) and number of topics (`num_topics`), the author-topic model requires either an author to document ID mapping (`author2doc`), or the reverse (`doc2author`).\n", + "\n", + "Below, we have also (this can be skipped for now):\n", + "* Increased the number of `passes` over the dataset (to improve the convergence of the optimization problem).\n", + "* Decreased the number of `iterations` over each document (related to the above).\n", + "* Specified the mini-batch size (`chunksize`) (primarily to speed up training).\n", + "* Turned off bound evaluation (`eval_every`) (as it takes a long time to compute).\n", + "* Turned on automatic learning of the `alpha` and `eta` priors (to improve the convergence of the optimization problem).\n", + "* Set the random state (`random_state`) of the random number generator (to make these experiments reproducible).\n", + "\n", + "We load the model, and train it." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 3.56 s, sys: 316 ms, total: 3.87 s\n", + "Wall time: 3.65 s\n" + ] + } + ], + "source": [ + "from gensim.models import AuthorTopicModel\n", + "%time model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \\\n", + " author2doc=author2doc, chunksize=2000, passes=1, eval_every=0, \\\n", + " iterations=1, random_state=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you believe your model hasn't converged, you can continue training using `model.update()`. If you have additional documents and/or authors call `model.update(corpus, author2doc)`.\n", + "\n", + "Before we explore the model, let's try to improve upon it. 
To do this, we will train several models with different random initializations, by giving different seeds for the random number generator (`random_state`). We evaluate the topic coherence of the model using the [top_topics](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.top_topics) method, and pick the model with the highest topic coherence.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 11min 59s, sys: 2min 14s, total: 14min 13s\n", + "Wall time: 11min 41s\n" + ] + } + ], + "source": [ + "%%time\n", + "model_list = []\n", + "for i in range(5):\n", + " model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \\\n", + " author2doc=author2doc, chunksize=2000, passes=100, gamma_threshold=1e-10, \\\n", + " eval_every=0, iterations=1, random_state=i)\n", + " top_topics = model.top_topics(corpus)\n", + " tc = sum([t[1] for t in top_topics])\n", + " model_list.append((model, tc))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Choose the model with the highest topic coherence." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Topic coherence: -1.847e+03\n" + ] + } + ], + "source": [ + "model, tc = max(model_list, key=lambda x: x[1])\n", + "print('Topic coherence: %.3e' %tc)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We save the model, to avoid having to train it again, and also show how to load it again." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Save model.\n", + "model.save('/tmp/model.atmodel')" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Load model.\n", + "model = AuthorTopicModel.load('/tmp/model.atmodel')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Explore author-topic representation\n", + "\n", + "Now that we have trained a model, we can start exploring the authors and the topics.\n", + "\n", + "First, let's simply print the most important words in the topics. Below we have printed topic 0. As we can see, each topic is associated with a set of words, and each word has a probability of being expressed under that topic." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[('chip', 0.014645100754555081),\n", + " ('circuit', 0.011967493386263996),\n", + " ('analog', 0.011466032752399413),\n", + " ('control', 0.010067258628938444),\n", + " ('implementation', 0.0078096719430403956),\n", + " ('design', 0.0072620826472022419),\n", + " ('implement', 0.0063648695668359189),\n", + " ('signal', 0.0063389759280913392),\n", + " ('vlsi', 0.0059415519461153785),\n", + " ('processor', 0.0056545823226162124)]" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.show_topic(0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Below, we have given each topic a label based on what each topic seems to be about intuitively. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "topic_labels = ['Circuits', 'Neuroscience', 'Numerical optimization', 'Object recognition', \\\n", + " 'Math/general', 'Robotics', 'Character recognition', \\\n", + " 'Reinforcement learning', 'Speech recognition', 'Bayesian modelling']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Rather than just calling `model.show_topics(num_topics=10)`, we format the output a bit so it is easier to get an overview." + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Label: Circuits\n", + "Words: chip circuit analog control implementation design implement signal vlsi processor \n", + "\n", + "Label: Neuroscience\n", + "Words: neuron cell spike response synaptic activity frequency stimulus synapse signal \n", + "\n", + "Label: Numerical optimization\n", + "Words: gradient noise prediction w optimal nonlinear matrix approximation series variance \n", + "\n", + "Label: Object recognition\n", + "Words: image visual object motion field direction representation map position orientation \n", + "\n", + "Label: Math/general\n", + "Words: bound f generalization class let w p theorem y threshold \n", + "\n", + "Label: Robotics\n", + "Words: dynamic control field trajectory neuron motor net forward l movement \n", + "\n", + "Label: Character recognition\n", + "Words: node distance character layer recognition matrix image sequence p code \n", + "\n", + "Label: Reinforcement learning\n", + "Words: action policy q reinforcement rule control optimal representation environment sequence \n", + "\n", + "Label: Speech recognition\n", + "Words: recognition speech word layer classifier net classification hidden class context \n", + "\n", + "Label: Bayesian modelling\n", + "Words: mixture gaussian 
likelihood prior data bayesian density sample cluster posterior \n", + "\n" + ] + } + ], + "source": [ + "for topic in model.show_topics(num_topics=10):\n", + " print('Label: ' + topic_labels[topic[0]])\n", + " words = ''\n", + " for word, prob in model.show_topic(topic[0]):\n", + " words += word + ' '\n", + " print('Words: ' + words)\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "These topics are by no means perfect. They have problems such as *chained topics*, *intruded words*, *random topics*, and *unbalanced topics* (see [Mimno and co-authors 2011](https://people.cs.umass.edu/~wallach/publications/mimno11optimizing.pdf)). They will do for the purposes of this tutorial, however.\n", + "\n", + "Below, we use the `model[name]` syntax to retrieve the topic distribution for an author. Each topic has a probability of being expressed given the particular author, but only the ones above a certain threshold are shown." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[(6, 0.99976720177983869)]" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model['YannLeCun']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's print the top topics of some authors. First, we make a function to help us do this more easily." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from pprint import pprint\n", + "\n", + "def show_author(name):\n", + " print('\\n%s' % name)\n", + " print('Docs:', model.author2doc[name])\n", + " print('Topics:')\n", + " pprint([(topic_labels[topic[0]], topic[1]) for topic in model[name]])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Below, we print some high profile researchers and inspect them. 
Three of these, Yann LeCun, Geoffrey E. Hinton and Christof Koch, are spot on. \n", + "\n", + "Terrence J. Sejnowski's results are surprising, however. He is a neuroscientist, so we would expect him to get the \"neuroscience\" label. This may indicate that Sejnowski works with the neuroscience aspects of visual perception, or perhaps that we have labeled the topic incorrectly, or perhaps that this topic simply is not very informative." + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "YannLeCun\n", + "Docs: [143, 406, 370, 495, 456, 449, 595, 616, 760, 752, 1532]\n", + "Topics:\n", + "[('Character recognition', 0.99976720177983869)]\n" + ] + } + ], + "source": [ + "show_author('YannLeCun')" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "GeoffreyE.Hinton\n", + "Docs: [56, 143, 284, 230, 197, 462, 463, 430, 688, 784, 826, 848, 869, 1387, 1684, 1728]\n", + "Topics:\n", + "[('Object recognition', 0.42128917017624745),\n", + " ('Math/general', 0.043249835412857811),\n", + " ('Robotics', 0.11149925993091593),\n", + " ('Bayesian modelling', 0.42388500261455564)]\n" + ] + } + ], + "source": [ + "show_author('GeoffreyE.Hinton')" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "TerrenceJ.Sejnowski\n", + "Docs: [513, 530, 539, 468, 611, 581, 600, 594, 703, 711, 849, 981, 944, 865, 850, 883, 881, 1221, 1137, 1224, 1146, 1282, 1248, 1179, 1424, 1359, 1528, 1484, 1571, 1727, 1732]\n", + "Topics:\n", + "[('Object recognition', 0.99992379088787087)]\n" + ] + } + ], + "source": [ + "show_author('TerrenceJ.Sejnowski')" + ] + }, + { + "cell_type": "code", + 
"execution_count": 53, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "ChristofKoch\n", + "Docs: [9, 221, 266, 272, 349, 411, 337, 371, 450, 483, 653, 663, 754, 712, 778, 921, 1212, 1285, 1254, 1533, 1489, 1580, 1441, 1657]\n", + "Topics:\n", + "[('Neuroscience', 0.99989393011046035)]\n" + ] + } + ], + "source": [ + "show_author('ChristofKoch')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Simple model evaluation methods\n", + "\n", + "We can compute the per-word bound, which is a measure of the model's predictive performance (you could also say that it is the reconstruction error).\n", + "\n", + "To do that, we need the `doc2author` dictionary, which we can build automatically." + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from gensim.models import atmodel\n", + "doc2author = atmodel.construct_doc2author(model.corpus, model.author2doc)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's evaluate the per-word bound." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "-6.9955968712\n" + ] + } + ], + "source": [ + "# Compute the per-word bound.\n", + "# Number of words in corpus.\n", + "corpus_words = sum(cnt for document in model.corpus for _, cnt in document)\n", + "\n", + "# Compute bound and divide by number of words.\n", + "perwordbound = model.bound(model.corpus, author2doc=model.author2doc, \\\n", + " doc2author=model.doc2author) / corpus_words\n", + "print(perwordbound)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can evaluate the quality of the topics by computing the topic coherence, as in the LDA class. Use this to e.g. 
find out which of the topics are of poor quality, or as a metric for model selection." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 15.6 s, sys: 4 ms, total: 15.6 s\n", + "Wall time: 15.6 s\n" + ] + } + ], + "source": [ + "%time top_topics = model.top_topics(model.corpus)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Plotting the authors\n", + "\n", + "Now we're going to produce the kind of Pacific-archipelago-looking plot shown below. The goal of this plot is to give you a way to explore the author-topic representation in an intuitive manner.\n", + "\n", + "We take all the author-topic distributions (stored in `model.state.gamma`) and embed them in a 2D space. To do this, we reduce the dimensionality of this data using t-SNE. \n", + "\n", + "t-SNE is a method that attempts to reduce the dimensionality of a dataset while preserving the relative distances between nearby points. That means that if two authors are close together in the plot below, then their topic distributions are similar.\n", + "\n", + "In the cell below, we transform the author-topic representation into the t-SNE space. You can increase the `smallest_author` value if you do not want to view all the authors with few documents."
+ ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 35.4 s, sys: 1.16 s, total: 36.5 s\n", + "Wall time: 36.4 s\n" + ] + } + ], + "source": [ + "%%time\n", + "from sklearn.manifold import TSNE\n", + "tsne = TSNE(n_components=2, random_state=0)\n", + "smallest_author = 0 # Ignore authors with documents less than this.\n", + "authors = [model.author2id[a] for a in model.author2id.keys() if len(model.author2doc[a]) >= smallest_author]\n", + "_ = tsne.fit_transform(model.state.gamma[authors, :]) # Result stored in tsne.embedding_" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We are now ready to make the plot.\n", + "\n", + "Note that if you run this notebook yourself, you will see a different graph. The random initialization of the model will be different, and the result will thus be different to some degree. You may find an entirely different representation of the data, or it may show the same interpretation slightly differently.\n", + "\n", + "If you can't see the plot, you are probably viewing this tutorial in a Jupyter Notebook. View it in an nbviewer instead at http://nbviewer.jupyter.org/github/rare-technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "collapsed": false, + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + " \n", + " Loading BokehJS ...\n", + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/javascript": [ + "\n", + "(function(global) {\n", + " function now() {\n", + " return new Date();\n", + " }\n", + "\n", + " var force = \"1\";\n", + "\n", + " if (typeof (window._bokeh_onload_callbacks) === \"undefined\" || force !== \"\") {\n", + " window._bokeh_onload_callbacks = [];\n", + " window._bokeh_is_loading = undefined;\n", + " }\n", + "\n", + "\n", + " \n", + " if (typeof (window._bokeh_timeout) === \"undefined\" || force !== \"\") {\n", + " window._bokeh_timeout = Date.now() + 5000;\n", + " window._bokeh_failed_load = false;\n", + " }\n", + "\n", + " var NB_LOAD_WARNING = {'data': {'text/html':\n", + " \"
\\n\"+\n", + " \"

\\n\"+\n", + " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", + " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", + " \"

\\n\"+\n", + " \"\\n\"+\n", + " \"\\n\"+\n", + " \"from bokeh.resources import INLINE\\n\"+\n", + " \"output_notebook(resources=INLINE)\\n\"+\n", + " \"\\n\"+\n", + " \"
\"}};\n", + "\n", + " function display_loaded() {\n", + " if (window.Bokeh !== undefined) {\n", + " Bokeh.$(\"#c8922b96-b8ff-4ac3-b6c6-882014f91988\").text(\"BokehJS successfully loaded.\");\n", + " } else if (Date.now() < window._bokeh_timeout) {\n", + " setTimeout(display_loaded, 100)\n", + " }\n", + " }\n", + "\n", + " function run_callbacks() {\n", + " window._bokeh_onload_callbacks.forEach(function(callback) { callback() });\n", + " delete window._bokeh_onload_callbacks\n", + " console.info(\"Bokeh: all callbacks have finished\");\n", + " }\n", + "\n", + " function load_libs(js_urls, callback) {\n", + " window._bokeh_onload_callbacks.push(callback);\n", + " if (window._bokeh_is_loading > 0) {\n", + " console.log(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", + " return null;\n", + " }\n", + " if (js_urls == null || js_urls.length === 0) {\n", + " run_callbacks();\n", + " return null;\n", + " }\n", + " console.log(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", + " window._bokeh_is_loading = js_urls.length;\n", + " for (var i = 0; i < js_urls.length; i++) {\n", + " var url = js_urls[i];\n", + " var s = document.createElement('script');\n", + " s.src = url;\n", + " s.async = false;\n", + " s.onreadystatechange = s.onload = function() {\n", + " window._bokeh_is_loading--;\n", + " if (window._bokeh_is_loading === 0) {\n", + " console.log(\"Bokeh: all BokehJS libraries loaded\");\n", + " run_callbacks()\n", + " }\n", + " };\n", + " s.onerror = function() {\n", + " console.warn(\"failed to load library \" + url);\n", + " };\n", + " console.log(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", + " document.getElementsByTagName(\"head\")[0].appendChild(s);\n", + " }\n", + " };var element = document.getElementById(\"c8922b96-b8ff-4ac3-b6c6-882014f91988\");\n", + " if (element == null) {\n", + " console.log(\"Bokeh: ERROR: autoload.js configured with elementid 'c8922b96-b8ff-4ac3-b6c6-882014f91988' but 
no matching script tag was found. \")\n", + " return false;\n", + " }\n", + "\n", + " var js_urls = ['https://cdn.pydata.org/bokeh/release/bokeh-0.12.3.min.js', 'https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.3.min.js'];\n", + "\n", + " var inline_js = [\n", + " function(Bokeh) {\n", + " Bokeh.set_log_level(\"info\");\n", + " },\n", + " \n", + " function(Bokeh) {\n", + " \n", + " Bokeh.$(\"#c8922b96-b8ff-4ac3-b6c6-882014f91988\").text(\"BokehJS is loading...\");\n", + " },\n", + " function(Bokeh) {\n", + " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-0.12.3.min.css\");\n", + " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-0.12.3.min.css\");\n", + " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.3.min.css\");\n", + " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.3.min.css\");\n", + " }\n", + " ];\n", + "\n", + " function run_inline_js() {\n", + " \n", + " if ((window.Bokeh !== undefined) || (force === \"1\")) {\n", + " for (var i = 0; i < inline_js.length; i++) {\n", + " inline_js[i](window.Bokeh);\n", + " }if (force === \"1\") {\n", + " display_loaded();\n", + " }} else if (Date.now() < window._bokeh_timeout) {\n", + " setTimeout(run_inline_js, 100);\n", + " } else if (!window._bokeh_failed_load) {\n", + " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", + " window._bokeh_failed_load = true;\n", + " } else if (!force) {\n", + " var cell = $(\"#c8922b96-b8ff-4ac3-b6c6-882014f91988\").parents('.cell').data().cell;\n", + " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", + " }\n", + "\n", + " }\n", + "\n", + " if (window._bokeh_is_loading === 0) {\n", + " console.log(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", + " run_inline_js();\n", + " } else {\n", + " load_libs(js_urls, function() {\n", + " console.log(\"Bokeh: BokehJS plotting callback run at\", now());\n", + " 
run_inline_js();\n", + " });\n", + " }\n", + "}(this));" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Tell Bokeh to display plots inside the notebook.\n", + "from bokeh.io import output_notebook\n", + "output_notebook()" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "
\n", + "
\n", + "
\n", + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from bokeh.models import HoverTool\n", + "from bokeh.plotting import figure, show, ColumnDataSource\n", + "\n", + "x = tsne.embedding_[:, 0]\n", + "y = tsne.embedding_[:, 1]\n", + "author_names = [model.id2author[a] for a in authors]\n", + "\n", + "# Radius of each point corresponds to the number of documents attributed to that author.\n", + "scale = 0.1\n", + "author_sizes = [len(model.author2doc[a]) for a in author_names]\n", + "radii = [size * scale for size in author_sizes]\n", + "\n", + "source = ColumnDataSource(\n", + " data=dict(\n", + " x=x,\n", + " y=y,\n", + " author_names=author_names,\n", + " author_sizes=author_sizes,\n", + " radii=radii,\n", + " )\n", + " )\n", + "\n", + "# Add author names and sizes to mouse-over info.\n", + "hover = HoverTool(\n", + " tooltips=[\n", + " (\"author\", \"@author_names\"),\n", + " (\"size\", \"@author_sizes\"),\n", + " ]\n", + " )\n", + "\n", + "p = figure(tools=[hover, 'crosshair,pan,wheel_zoom,box_zoom,reset,save,lasso_select'])\n", + "p.scatter('x', 'y', radius='radii', source=source, fill_alpha=0.6, line_color=None)\n", + "show(p)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The circles in the plot above are individual authors, and their sizes represent the number of documents attributed to the corresponding author. Hovering your mouse over the circles will tell you the names of the authors and their sizes. Large clusters of authors tend to reflect some overlap in interest. \n", + "\n", + "We see that the model tends to put duplicate authors close together. For example, Terrence J. Sejnowski and T. J. Sejnowski are the same person, and their vectors end up in the same place (see about $(-10, -10)$ in the plot).\n", + "\n", + "At about $(-15, -10)$ we have a cluster of neuroscientists like Christof Koch and James M. Bower. 
\n", + "\n", + "As discussed earlier, the \"object recognition\" topic was assigned to Sejnowski. If we get the topics of the other authors in Sejnowski's neighborhood, like Peter Dayan, we also get this same topic. Furthermore, we see that this cluster is close to the \"neuroscience\" cluster discussed above, which is a further indication that this topic is about visual perception in the brain.\n", + "\n", + "Other clusters include a reinforcement learning cluster at about $(-5, 8)$, and a Bayesian modelling cluster at about $(8, -12)$.\n", + "\n", + "#### Similarity queries\n", + "\n", + "In this section, we are going to set up a system that takes the name of an author and yields the authors that are most similar. This functionality can be used as a component in an information retrieval system (e.g. a search engine of some kind), or in an author prediction system, i.e. a system that takes an unlabelled document and predicts the author(s) that wrote it.\n", + "\n", + "We simply need to search for the closest vector in the author-topic space. In this sense, the approach is similar to the t-SNE plot above.\n", + "\n", + "Below we illustrate a similarity query using a built-in similarity framework in Gensim." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from gensim.similarities import MatrixSimilarity\n", + "\n", + "# Generate a similarity object for the transformed corpus.\n", + "index = MatrixSimilarity(model[list(model.id2author.values())])\n", + "\n", + "# Get similarities to some author.\n", + "author_name = 'YannLeCun'\n", + "sims = index[model[author_name]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "However, this framework uses cosine similarity, whereas we want to use the Hellinger distance. The Hellinger distance is a natural way of measuring the distance (i.e. dissimilarity) between two probability distributions. 
Its discrete version is defined as\n", + "$$\n", + "H(p, q) = \\frac{1}{\\sqrt{2}} \\sqrt{\\sum_{i=1}^K (\\sqrt{p_i} - \\sqrt{q_i})^2},\n", + "$$\n", + "\n", + "where $p$ and $q$ are both topic distributions for two different authors. We define the similarity as\n", + "$$\n", + "S(p, q) = \\frac{1}{1 + H(p, q)}.\n", + "$$\n", + "\n", + "In the cell below, we prepare everything we need to perform similarity queries based on the Hellinger distance." + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# Make a function that returns similarities based on the Hellinger distance.\n", + "\n", + "from gensim import matutils\n", + "import pandas as pd\n", + "\n", + "# Make a list of all the author-topic distributions.\n", + "author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]\n", + "\n", + "def similarity(vec1, vec2):\n", + " '''Get similarity between two vectors'''\n", + " dist = matutils.hellinger(matutils.sparse2full(vec1, model.num_topics), \\\n", + " matutils.sparse2full(vec2, model.num_topics))\n", + " sim = 1.0 / (1.0 + dist)\n", + " return sim\n", + "\n", + "def get_sims(vec):\n", + " '''Get similarity of vector to all authors.'''\n", + " sims = [similarity(vec, vec2) for vec2 in author_vecs]\n", + " return sims\n", + "\n", + "def get_table(name, top_n=10, smallest_author=1):\n", + " '''\n", + " Get table with similarities, author names, and author sizes.\n", + " Return `top_n` authors as a dataframe.\n", + " \n", + " '''\n", + " \n", + " # Get similarities.\n", + " sims = get_sims(model.get_author_topics(name))\n", + "\n", + " # Arrange author names, similarities, and author sizes in a list of tuples.\n", + " table = []\n", + " for elem in enumerate(sims):\n", + " author_name = model.id2author[elem[0]]\n", + " sim = elem[1]\n", + " author_size = len(model.author2doc[author_name])\n", + " if author_size >= smallest_author:\n", + " 
table.append((author_name, sim, author_size))\n", + " \n", + " # Make dataframe and retrieve top authors.\n", + " df = pd.DataFrame(table, columns=['Author', 'Score', 'Size'])\n", + " df = df.sort_values('Score', ascending=False)[:top_n]\n", + " \n", + " return df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can find the most similar authors to some particular author. We use the Pandas library to print the results in nice-looking tables." + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
      Author            Score     Size
2422  YannLeCun         1.000000  11
1717  PatriceSimard     0.999977  8
986   J.S.Denker        0.999581  3
2425  YaserAbu-Mostafa  0.998040  5
1160  JohnS.Denker      0.903560  6
187   AntoninaStarita   0.901699  1
1718  PatriceY.Simard   0.899005  4
560   DiegoSona         0.876237  1
612   EduardSackinger   0.870400  3
2413  Y.LeCun           0.868843  2
\n", + "
" + ], + "text/plain": [ + " Author Score Size\n", + "2422 YannLeCun 1.000000 11\n", + "1717 PatriceSimard 0.999977 8\n", + "986 J.S.Denker 0.999581 3\n", + "2425 YaserAbu-Mostafa 0.998040 5\n", + "1160 JohnS.Denker 0.903560 6\n", + "187 AntoninaStarita 0.901699 1\n", + "1718 PatriceY.Simard 0.899005 4\n", + "560 DiegoSona 0.876237 1\n", + "612 EduardSackinger 0.870400 3\n", + "2413 Y.LeCun 0.868843 2" + ] + }, + "execution_count": 64, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "get_table('YannLeCun')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As before, we can specify the minimum author size." + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
     Author             Score     Size
118  JamesM.Bower       1.000000  10
44   ChristofKoch       0.999967  24
182  MatthewA.Wilson    0.999879  3
157  L.F.Abbott         0.999872  4
256  StephenP.DeWeerth  0.999869  5
82   EveMarder          0.999828  3
96   GirishN.Patel      0.856874  3
43   ChdstofKoch        0.788195  3
291  WilliamBialek      0.786987  4
247  Shih-ChiiLiu       0.781643  3
\n", + "
" + ], + "text/plain": [ + " Author Score Size\n", + "118 JamesM.Bower 1.000000 10\n", + "44 ChristofKoch 0.999967 24\n", + "182 MatthewA.Wilson 0.999879 3\n", + "157 L.F.Abbott 0.999872 4\n", + "256 StephenP.DeWeerth 0.999869 5\n", + "82 EveMarder 0.999828 3\n", + "96 GirishN.Patel 0.856874 3\n", + "43 ChdstofKoch 0.788195 3\n", + "291 WilliamBialek 0.786987 4\n", + "247 Shih-ChiiLiu 0.781643 3" + ] + }, + "execution_count": 72, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "get_table('JamesM.Bower', smallest_author=3)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": true + }, + "source": [ + "### Serialized corpora\n", + "\n", + "The `AuthorTopicModel` class accepts serialized corpora, that is, corpora that are stored on the hard-drive rather than in memory. This is usually done when the corpus is too big to fit in memory. There are, however, some caveats to this functionality, which we will discuss here. As these caveats make this functionality less than ideal, it may be improved in the future.\n", + "\n", + "It is not necessary to read this section if you don't intend to use serialized corpora.\n", + "\n", + "In the following, an explanation, followed by an example and a summarization will be given.\n", + "\n", + "If the corpus is serialized, the user must specify `serialized=True`. Any input corpus can then be any type of iterable or generator.\n", + "\n", + "The model will then take the input corpus and serialize it in the `MmCorpus` format, which is [supported in Gensim](https://radimrehurek.com/gensim/corpora/mmcorpus.html).\n", + "\n", + "The user must specify the path where the model should serialize all input documents, for example `serialization_path='/tmp/model_serializer.mm'`. 
To avoid accidentally overwriting some important data, the model will raise an error if there already exists a file at `serialization_path`; in this case, either choose another path, or delete the old file.\n", + "\n", + "When you want to train on new data, and call `model.update(corpus, author2doc)`, all the old data and the new data have to be re-serialized. This can of course be quite computationally demanding, so it is recommended that you do this *only* when necessary; that is, wait until you have as much new data as possible to update, rather than updating the model for every new document." + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 17.6 s, sys: 540 ms, total: 18.1 s\n", + "Wall time: 17.7 s\n" + ] + } + ], + "source": [ + "%time model_ser = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \\\n", + " author2doc=author2doc, random_state=1, serialized=True, \\\n", + " serialization_path='/tmp/model_serialization.mm')" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Delete the file, once you're done using it.\n", + "import os\n", + "os.remove('/tmp/model_serialization.mm')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In summary, when using serialized corpora:\n", + "* Set `serialized=True`.\n", + "* Set `serialization_path` to a path that doesn't already contain a file.\n", + "* Wait until you have lots of data before you call `model.update(corpus, author2doc)`.\n", + "* When done, delete the file at `serialization_path` if it's not needed anymore." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What to try next\n", + "\n", + "Try the model on one of the datasets in the [StackExchange data dump](https://archive.org/details/stackexchange). You can treat the tags on the posts as authors and train a \"tag-topic\" model. There are many different categories, from statistics to cooking to philosophy, so you can pick one that you like. You can even try your hand at a [Kaggle competition](https://www.kaggle.com/c/transfer-learning-on-stack-exchange-tags) that uses tags in this dataset.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.2" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/docs/src/apiref.rst b/docs/src/apiref.rst index c1f2ee183f..7f212a3748 100644 --- a/docs/src/apiref.rst +++ b/docs/src/apiref.rst @@ -37,6 +37,7 @@ Modules: models/lsi_worker models/lda_dispatcher models/lda_worker + models/atmodel models/word2vec models/doc2vec models/phrases diff --git a/docs/src/models/atmodel.rst b/docs/src/models/atmodel.rst new file mode 100644 index 0000000000..5cf943d2f7 --- /dev/null +++ b/docs/src/models/atmodel.rst @@ -0,0 +1,9 @@ +:mod:`models.atmodel` -- Author-topic models +====================================================== + +.. 
automodule:: gensim.models.atmodel + :synopsis: Author-topic model + :members: + :inherited-members: + :undoc-members: + :show-inheritance: diff --git a/gensim/models/__init__.py b/gensim/models/__init__.py index 79ab1cca9b..082c65ca3c 100644 --- a/gensim/models/__init__.py +++ b/gensim/models/__init__.py @@ -16,6 +16,7 @@ from .ldamulticore import LdaMulticore from .phrases import Phrases from .normmodel import NormModel +from .atmodel import AuthorTopicModel from .ldaseqmodel import LdaSeqModel from . import wrappers diff --git a/gensim/models/atmodel.py b/gensim/models/atmodel.py new file mode 100755 index 0000000000..adad0191a3 --- /dev/null +++ b/gensim/models/atmodel.py @@ -0,0 +1,921 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +# +# Copyright (C) 2016 Radim Rehurek +# Copyright (C) 2016 Olavur Mortensen +# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html + + +""" +Author-topic model in Python. + +This module trains the author-topic model on documents and corresponding author-document +dictionaries. The training is online and is constant in memory w.r.t. the number of +documents. The model is *not* constant in memory w.r.t. the number of authors. + +The model can be updated with additional documents after training has been completed. It is +also possible to continue training on the existing data. + +The model is closely related to Latent Dirichlet Allocation. The AuthorTopicModel class +inherits the LdaModel class, and its usage is thus similar. + +Distributed computation and multiprocessing are not implemented at the moment, but may be +coming in the future. + +The model was introduced by Rosen-Zvi and co-authors in 2004 (https://mimno.infosci.cornell.edu/info6150/readings/398.pdf). + +A tutorial can be found at https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks/atmodel_tutorial.ipynb. + +""" + +# TODO: this class inherits LdaModel and overwrites some methods. 
There is some code +# duplication still, and a refactor could be made to avoid this. Comments with "TODOs" +# are included in the code where this is the case, for example in the log_perplexity +# and do_estep methods. + +import pdb +from pdb import set_trace as st +from pprint import pprint + +import logging +import numpy as np # for arrays, array broadcasting etc. +import numbers +from copy import deepcopy +from shutil import copyfile +from os.path import isfile +from os import remove + +from gensim import utils +from gensim.models import LdaModel +from gensim.models.ldamodel import LdaState +from gensim.matutils import dirichlet_expectation +from gensim.corpora import MmCorpus +from itertools import chain +from scipy.special import gammaln # gamma function utils +from six.moves import xrange +import six + +logger = logging.getLogger('gensim.models.atmodel') + + +class AuthorTopicState(LdaState): + """ + NOTE: distributed mode not available yet in the author-topic model. This AuthorTopicState + object is kept so that when the time comes to implement it, it will be easier. + + Encapsulate information for distributed computation of AuthorTopicModel objects. + + Objects of this class are sent over the network, so try to keep them lean to + reduce traffic. + + """ + def __init__(self, eta, lambda_shape, gamma_shape): + self.eta = eta + self.sstats = np.zeros(lambda_shape) + self.gamma = np.zeros(gamma_shape) + self.numdocs = 0 + + +def construct_doc2author(corpus, author2doc): + """Make a mapping from document IDs to author IDs.""" + doc2author = {} + for d, _ in enumerate(corpus): + author_ids = [] + for a, a_doc_ids in author2doc.items(): + if d in a_doc_ids: + author_ids.append(a) + doc2author[d] = author_ids + return doc2author + + +def construct_author2doc(corpus, doc2author): + """Make a mapping from author IDs to document IDs.""" + + # First get a set of all authors. 
+ authors_ids = set() + for d, a_doc_ids in doc2author.items(): + for a in a_doc_ids: + authors_ids.add(a) + + # Now construct the dictionary. + author2doc = {} + for a in authors_ids: + author2doc[a] = [] + for d, a_ids in doc2author.items(): + if a in a_ids: + author2doc[a].append(d) + return author2doc + + +class AuthorTopicModel(LdaModel): + """ + The constructor estimates the author-topic model parameters based + on a training corpus: + + >>> model = AuthorTopicModel(corpus, num_topics=10, author2doc=author2doc, id2word=id2word) + + The model can be updated (trained) with new documents via + + >>> model.update(other_corpus, other_author2doc) + + Model persistency is achieved through its `load`/`save` methods. + """ + + def __init__(self, corpus=None, num_topics=100, id2word=None, author2doc=None, doc2author=None, + chunksize=2000, passes=1, iterations=50, decay=0.5, offset=1.0, + alpha='symmetric', eta='symmetric', update_every=1, eval_every=10, + gamma_threshold=0.001, serialized=False, serialization_path=None, + minimum_probability=0.01, random_state=None): + """ + If the iterable corpus and one of author2doc/doc2author dictionaries are given, + start training straight away. If not given, the model is left untrained + (presumably because you want to call the `update` method manually). + + `num_topics` is the number of requested latent topics to be extracted from + the training corpus. + + `id2word` is a mapping from word ids (integers) to words (strings). It is + used to determine the vocabulary size, as well as for debugging and topic + printing. + + `author2doc` is a dictionary where the keys are the names of authors, and the + values are lists of documents that the author contributes to. + + `doc2author` is a dictionary where the keys are document IDs (indexes to corpus) + and the values are lists of author names. I.e. this is the reverse mapping of + `author2doc`. Only one of the two, `author2doc` and `doc2author` have to be + supplied. 
+ + `passes` is the number of times the model makes a pass over the entire training + data. + + `iterations` is the maximum number of times the model loops over each document + (E-step). The iterations stop when convergence is reached. + + `chunksize` controls the size of the mini-batches. + + `alpha` and `eta` are hyperparameters that affect sparsity of the author-topic + (theta) and topic-word (lambda) distributions. Both default to a symmetric + 1.0/num_topics prior. + + `alpha` can be set to an explicit array = prior of your choice. It also + supports special values of 'asymmetric' and 'auto': the former uses a fixed + normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric + prior directly from your data. + + `eta` can be a scalar for a symmetric prior over topic/word + distributions, or a vector of shape num_words, which can be used to + impose (user defined) asymmetric priors over the word distribution. + It also supports the special value 'auto', which learns an asymmetric + prior over words directly from your data. `eta` can also be a matrix + of shape num_topics x num_words, which can be used to impose + asymmetric priors over the word distribution on a per-topic basis + (cannot be learned from data). + + Calculate and log perplexity estimate from the latest mini-batch every + `eval_every` model updates. Set to None to disable perplexity estimation. + + `decay` and `offset` parameters are the same as Kappa and Tau_0 in + Hoffman et al, respectively. `decay` controls how quickly old documents are + forgotten, while `offset` down-weights early iterations. + + `minimum_probability` controls filtering the topics returned for a document (bow). + + `random_state` can be an integer or a numpy.random.RandomState object. Set the + state of the random number generator inside the author-topic model, to ensure + reproducibility of your experiments, for example. 
+ + `serialized` indicates whether the input corpora to the model are simple + in-memory lists (`serialized = False`) or saved to the hard-drive + (`serialized = True`). Note that this behaviour is quite different from + other Gensim models. If your data is too large to fit into memory, use + this functionality. Note that calling `AuthorTopicModel.update` with new + data may be cumbersome as it requires all the existing data to be + re-serialized. + + `serialization_path` must be set to a filepath, if `serialized = True` is + used. Use, for example, `serialization_path = '/tmp/serialized_model.mm'` or use your + working directory by setting `serialization_path = 'serialized_model.mm'`. An existing + file *cannot* be overwritten; either delete the old file or choose a different + name. + + Example: + + >>> model = AuthorTopicModel(corpus, num_topics=100, author2doc=author2doc, id2word=id2word) # train model + >>> model.update(corpus2, author2doc2) # update the author-topic model with additional documents + + >>> model = AuthorTopicModel(corpus, num_topics=50, author2doc=author2doc, id2word=id2word, alpha='auto', eval_every=5) # train asymmetric alpha from data + + """ + + # NOTE: as a distributed version of this model is not implemented, "distributed" is set to False. Some of the + # infrastructure to implement a distributed author-topic model is already in place, such as the AuthorTopicState. 
+ distributed = False + self.dispatcher = None + self.numworkers = 1 + + self.id2word = id2word + if corpus is None and self.id2word is None: + raise ValueError('at least one of corpus/id2word must be specified, to establish input space dimensionality') + + if self.id2word is None: + logger.warning("no word id mapping provided; initializing from corpus, assuming identity") + self.id2word = utils.dict_from_corpus(corpus) + self.num_terms = len(self.id2word) + elif len(self.id2word) > 0: + self.num_terms = 1 + max(self.id2word.keys()) + else: + self.num_terms = 0 + + if self.num_terms == 0: + raise ValueError("cannot compute the author-topic model over an empty collection (no terms)") + + logger.info('Vocabulary consists of %d words.', self.num_terms) + + self.author2doc = {} + self.doc2author = {} + + self.distributed = distributed + self.num_topics = num_topics + self.num_authors = 0 + self.chunksize = chunksize + self.decay = decay + self.offset = offset + self.minimum_probability = minimum_probability + self.num_updates = 0 + self.total_docs = 0 + + self.passes = passes + self.update_every = update_every + self.eval_every = eval_every + + self.author2id = {} + self.id2author = {} + + self.serialized = serialized + if serialized and not serialization_path: + raise ValueError("If serialized corpora are used, the path to a file where the corpus should be saved must be provided (serialization_path).") + if serialized and serialization_path: + assert not isfile(serialization_path), "A file already exists at the serialization_path path; choose a different serialization_path, or delete the file." + self.serialization_path = serialization_path + + # Initialize an empty self.corpus. + self.init_empty_corpus() + + self.alpha, self.optimize_alpha = self.init_dir_prior(alpha, 'alpha') + + assert self.alpha.shape == (self.num_topics,), "Invalid alpha shape. 
Got shape %s, but expected (%d, )" % (str(self.alpha.shape), self.num_topics) + + if isinstance(eta, six.string_types): + if eta == 'asymmetric': + raise ValueError("The 'asymmetric' option cannot be used for eta") + + self.eta, self.optimize_eta = self.init_dir_prior(eta, 'eta') + + self.random_state = utils.get_random_state(random_state) + + assert (self.eta.shape == (self.num_terms,) or self.eta.shape == (self.num_topics, self.num_terms)), ( + "Invalid eta shape. Got shape %s, but expected (%d, ) or (%d, %d)" % + (str(self.eta.shape), self.num_terms, self.num_topics, self.num_terms)) + + # VB constants + self.iterations = iterations + self.gamma_threshold = gamma_threshold + + # Initialize the variational distributions q(beta|lambda) and q(theta|gamma) + self.state = AuthorTopicState(self.eta, (self.num_topics, self.num_terms), (self.num_authors, self.num_topics)) + self.state.sstats = self.random_state.gamma(100., 1. / 100., (self.num_topics, self.num_terms)) + self.expElogbeta = np.exp(dirichlet_expectation(self.state.sstats)) + + # if a training corpus was provided, start estimating the model right away + if corpus is not None and (author2doc is not None or doc2author is not None): + use_numpy = self.dispatcher is not None + self.update(corpus, author2doc, doc2author, chunks_as_numpy=use_numpy) + + def __str__(self): + return "AuthorTopicModel(num_terms=%s, num_topics=%s, num_authors=%s, decay=%s, chunksize=%s)" % \ + (self.num_terms, self.num_topics, self.num_authors, self.decay, self.chunksize) + + def init_empty_corpus(self): + """ + Initialize an empty corpus. If the corpora are to be treated as lists, simply + initialize an empty list. If serialization is used, initialize an empty corpus + of the class `gensim.corpora.MmCorpus`. + + """ + if self.serialized: + # Initialize the corpus as a serialized empty list. + # This corpus will be extended in self.update. + MmCorpus.serialize(self.serialization_path, []) # Serialize empty corpus. 
+ self.corpus = MmCorpus(self.serialization_path) # Store serialized corpus object in self.corpus. + else: + # All input corpora are assumed to just be lists. + self.corpus = [] + + def extend_corpus(self, corpus): + """ + Add new documents in `corpus` to `self.corpus`. If serialization is used, + then the entire corpus (`self.corpus`) is re-serialized and the new documents + are added in the process. If serialization is not used, the corpus, as a list + of documents, is simply extended. + + """ + if self.serialized: + # Re-serialize the entire corpus while appending the new documents. + if isinstance(corpus, MmCorpus): + # Check that we are not attempting to overwrite the serialized corpus. + assert self.corpus.input != corpus.input, 'Input corpus cannot have the same file path as the model corpus (serialization_path).' + corpus_chain = chain(self.corpus, corpus) # A generator with the old and new documents. + copyfile(self.serialization_path, self.serialization_path + '.tmp') # Make a temporary copy of the file where the corpus is serialized. + self.corpus.input = self.serialization_path + '.tmp' # Point the old corpus at this temporary file. + MmCorpus.serialize(self.serialization_path, corpus_chain) # Re-serialize the old corpus, and extend it with the new corpus. + self.corpus = MmCorpus(self.serialization_path) # Store the new serialized corpus object in self.corpus. + remove(self.serialization_path + '.tmp') # Remove the temporary file again. + else: + # self.corpus and corpus are just lists, just extend the list. + # First check that corpus is actually a list. + assert isinstance(corpus, list), "If serialized == False, all input corpora must be lists." 
+ self.corpus.extend(corpus) + + def compute_phinorm(self, ids, authors_d, expElogthetad, expElogbetad): + """Efficiently computes the normalizing factor in phi.""" + phinorm = np.zeros(len(ids)) + expElogtheta_sum = expElogthetad.sum(axis=0) + phinorm = expElogtheta_sum.dot(expElogbetad) + 1e-100 + + return phinorm + + def inference(self, chunk, author2doc, doc2author, rhot, collect_sstats=False, chunk_doc_idx=None): + """ + Given a chunk of sparse document vectors, update gamma (parameters + controlling the topic weights) for each author corresponding to the + documents in the chunk. + + The whole input chunk of documents is assumed to fit in RAM; chunking of + a large corpus must be done earlier in the pipeline. + + If `collect_sstats` is True, also collect sufficient statistics needed + to update the model's topic-word distributions, and return a 2-tuple + `(gamma_chunk, sstats)`. Otherwise, return `(gamma_chunk, None)`. + `gamma_chunk` is of shape `len(chunk_authors) x self.num_topics`, where + `chunk_authors` is the number of authors in the documents in the + current chunk. + + Avoids computing the `phi` variational parameter directly using the + optimization presented in **Lee, Seung: Algorithms for non-negative matrix factorization, NIPS 2001**. + + """ + try: + _ = len(chunk) + except TypeError: + # convert iterators/generators to plain list, so we have len() etc. + chunk = list(chunk) + if len(chunk) > 1: + logger.debug("performing inference on a chunk of %i documents", len(chunk)) + + # Initialize the variational distribution q(theta|gamma) for the chunk + if collect_sstats: + sstats = np.zeros_like(self.expElogbeta) + else: + sstats = None + converged = 0 + + # Stack all the computed gammas into this output array. + gamma_chunk = np.zeros((0, self.num_topics)) + + # Now, for each document d update gamma and phi w.r.t. all authors in those documents. 
+ for d, doc in enumerate(chunk): + if chunk_doc_idx is not None: + doc_no = chunk_doc_idx[d] + else: + doc_no = d + # Get the IDs and counts of all the words in the current document. + # TODO: this is duplication of code in LdaModel. Refactor. + if doc and not isinstance(doc[0][0], six.integer_types): + # make sure the term IDs are ints, otherwise np will get upset + ids = [int(id) for id, _ in doc] + else: + ids = [id for id, _ in doc] + cts = np.array([cnt for _, cnt in doc]) + + # Get all authors in current document, and convert the author names to integer IDs. + authors_d = [self.author2id[a] for a in self.doc2author[doc_no]] + + gammad = self.state.gamma[authors_d, :] # gamma of document d before update. + tilde_gamma = gammad.copy() # gamma that will be updated. + + # Compute the expectation of the log of the Dirichlet parameters theta and beta. + Elogthetad = dirichlet_expectation(tilde_gamma) + expElogthetad = np.exp(Elogthetad) + expElogbetad = self.expElogbeta[:, ids] + + # Compute the normalizing constant of phi for the current document. + phinorm = self.compute_phinorm(ids, authors_d, expElogthetad, expElogbetad) + + # Iterate between gamma and phi until convergence + for iteration in xrange(self.iterations): + + lastgamma = tilde_gamma.copy() + + # Update gamma. + # phi is computed implicitly below, + for ai, a in enumerate(authors_d): + tilde_gamma[ai, :] = self.alpha + len(self.author2doc[self.id2author[a]]) * expElogthetad[ai, :] * np.dot(cts / phinorm, expElogbetad.T) + + # Update gamma. + # Interpolation between document d's "local" gamma (tilde_gamma), + # and "global" gamma (gammad). + tilde_gamma = (1 - rhot) * gammad + rhot * tilde_gamma + + # Update Elogtheta and Elogbeta, since gamma and lambda have been updated. + Elogthetad = dirichlet_expectation(tilde_gamma) + expElogthetad = np.exp(Elogthetad) + + # Update the normalizing constant in phi. 
+ phinorm = self.compute_phinorm(ids, authors_d, expElogthetad, expElogbetad) + + # Check for convergence. + # Criterion is mean change in "local" gamma. + meanchange_gamma = np.mean(abs(tilde_gamma - lastgamma)) + gamma_condition = meanchange_gamma < self.gamma_threshold + if gamma_condition: + converged += 1 + break + # End of iterations loop. + + # Store the updated gammas in the model state. + self.state.gamma[authors_d, :] = tilde_gamma + + # Stack the new gammas into the output array. + gamma_chunk = np.vstack([gamma_chunk, tilde_gamma]) + + if collect_sstats: + # Contribution of document d to the expected sufficient + # statistics for the M step. + expElogtheta_sum_a = expElogthetad.sum(axis=0) + sstats[:, ids] += np.outer(expElogtheta_sum_a.T, cts / phinorm) + + if len(chunk) > 1: + logger.debug("%i/%i documents converged within %i iterations", + converged, len(chunk), self.iterations) + + if collect_sstats: + # This step finishes computing the sufficient statistics for the + # M step, so that + # sstats[k, w] = \sum_d n_{dw} * \sum_a phi_{dwak} + # = \sum_d n_{dw} * \sum_a exp{Elogtheta_{ak} + Elogbeta_{kw}} / phinorm_{dw}. + sstats *= self.expElogbeta + return gamma_chunk, sstats + + def do_estep(self, chunk, author2doc, doc2author, rhot, state=None, chunk_doc_idx=None): + """ + Perform inference on a chunk of documents, and accumulate the collected + sufficient statistics in `state` (or `self.state` if None). + + """ + + # TODO: this method is somewhat similar to the one in LdaModel. Refactor if possible. + if state is None: + state = self.state + gamma, sstats = self.inference(chunk, author2doc, doc2author, rhot, collect_sstats=True, chunk_doc_idx=chunk_doc_idx) + state.sstats += sstats + state.numdocs += len(chunk) + return gamma + + def log_perplexity(self, chunk, chunk_doc_idx=None, total_docs=None): + """ + Calculate and return the per-word likelihood bound, using the `chunk` of + documents as evaluation corpus. Also output the calculated statistics, incl.
+ perplexity=2^(-bound), to log at INFO level. + + """ + + # TODO: This method is very similar to the one in LdaModel. Refactor. + if total_docs is None: + total_docs = len(chunk) + corpus_words = sum(cnt for document in chunk for _, cnt in document) + subsample_ratio = 1.0 * total_docs / len(chunk) + perwordbound = self.bound(chunk, chunk_doc_idx, subsample_ratio=subsample_ratio) / (subsample_ratio * corpus_words) + logger.info("%.3f per-word bound, %.1f perplexity estimate based on a corpus of %i documents with %i words" % + (perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words)) + return perwordbound + + def update(self, corpus=None, author2doc=None, doc2author=None, chunksize=None, decay=None, offset=None, + passes=None, update_every=None, eval_every=None, iterations=None, + gamma_threshold=None, chunks_as_numpy=False): + """ + Train the model with new documents, by EM-iterating over `corpus` until + the topics converge (or until the maximum number of allowed iterations + is reached). `corpus` must be an iterable (repeatable stream of documents). + + This update also supports updating an already trained model (`self`) + with new documents from `corpus`; the two models are then merged in + proportion to the number of old vs. new documents. This feature is still + experimental for non-stationary input streams. + + For stationary input (no topic drift in new documents), on the other hand, + this equals the online update of Hoffman et al. and is guaranteed to + converge for any `decay` in (0.5, 1.0]. Additionally, for smaller + `corpus` sizes, an increasing `offset` may be beneficial (see + Table 1 in Hoffman et al.). + + If update is called with authors that already exist in the model, it will + resume training not only on the new documents for that author, but also on the + previously seen documents. This is necessary for those authors' topic + distributions to converge.
+ + Every time `update(corpus, author2doc)` is called, the new documents are + appended to all the previously seen documents, and author2doc is + combined with the previously seen authors. + + To resume training on all the data seen by the model, simply call + `update()`. + + It is not possible to add new authors to existing documents, as all + documents in `corpus` are assumed to be new documents. + + Args: + corpus (gensim corpus): The corpus with which the author-topic model should be updated. + + author2doc (dictionary): author to document mapping corresponding to indexes in input + corpus. + + doc2author (dictionary): document to author mapping corresponding to indexes in input + corpus. + + chunks_as_numpy (bool): Whether each chunk passed to `.inference` should be a NumPy + array or not. NumPy can in some settings turn the term IDs + into floats; these will be converted back into integers in + inference, which incurs a performance hit. For distributed + computing it may be desirable to keep the chunks as NumPy + arrays. + + For other parameter settings, see :class:`AuthorTopicModel` constructor. + + """ + + # use parameters given in constructor, unless user explicitly overrode them + if decay is None: + decay = self.decay + if offset is None: + offset = self.offset + if passes is None: + passes = self.passes + if update_every is None: + update_every = self.update_every + if eval_every is None: + eval_every = self.eval_every + if iterations is None: + iterations = self.iterations + if gamma_threshold is None: + gamma_threshold = self.gamma_threshold + + # TODO: if deepcopy is not used here, something goes wrong. When unit tests are run (specifically "testPasses"), + # the process simply gets killed. + author2doc = deepcopy(author2doc) + doc2author = deepcopy(doc2author) + + # TODO: it is not possible to add new authors to an existing document (all input documents are treated + # as completely new documents). Perhaps this functionality could be implemented.
+ # If it's absolutely necessary, the user can delete the documents that have new authors, and call update + # on them with the new and old authors. + + if corpus is None: + # Just keep training on the already available data. + # Assumes self.update() has been called before with input documents and corresponding authors. + assert self.total_docs > 0, 'update() was called with no documents to train on.' + train_corpus_idx = [d for d in xrange(self.total_docs)] + num_input_authors = len(self.author2doc) + else: + if doc2author is None and author2doc is None: + raise ValueError('at least one of author2doc/doc2author must be specified, to establish input space dimensionality') + + # If either doc2author or author2doc is missing, construct it from the other. + if doc2author is None: + doc2author = construct_doc2author(corpus, author2doc) + elif author2doc is None: + author2doc = construct_author2doc(corpus, doc2author) + + # Number of authors that need to be updated. + num_input_authors = len(author2doc) + + try: + len_input_corpus = len(corpus) + except: + logger.warning("input corpus stream has no len(); counting documents") + len_input_corpus = sum(1 for _ in corpus) + if len_input_corpus == 0: + logger.warning("AuthorTopicModel.update() called with an empty corpus") + return + + self.total_docs += len_input_corpus + + # Add new documents in corpus to self.corpus. + self.extend_corpus(corpus) + + # Obtain a list of new authors. + new_authors = [] + # Sorting the author names makes the model more reproducible. + for a in sorted(author2doc.keys()): + if not self.author2doc.get(a): + new_authors.append(a) + + num_new_authors = len(new_authors) + + # Add new authors to author2id/id2author dictionaries. + for a_id, a_name in enumerate(new_authors): + self.author2id[a_name] = a_id + self.num_authors + self.id2author[a_id + self.num_authors] = a_name + + # Increment the number of total authors seen.
+ self.num_authors += num_new_authors + + # Initialize the variational distributions q(theta|gamma) + gamma_new = self.random_state.gamma(100., 1. / 100., (num_new_authors, self.num_topics)) + self.state.gamma = np.vstack([self.state.gamma, gamma_new]) + + # Combine author2doc with self.author2doc. + # First, increment the document IDs by the number of previously seen documents. + for a, doc_ids in author2doc.items(): + doc_ids = [d + self.total_docs - len_input_corpus for d in doc_ids] + + # For all authors in the input corpus, add the new documents. + for a, doc_ids in author2doc.items(): + if self.author2doc.get(a): + # This is not a new author, append new documents. + self.author2doc[a].extend(doc_ids) + else: + # This is a new author, create index. + self.author2doc[a] = doc_ids + + # Add all new documents to self.doc2author. + for d, a_list in doc2author.items(): + self.doc2author[d] = a_list + + # Train on all documents of authors in input_corpus. + train_corpus_idx = [] + for a in author2doc.keys(): # For all authors in input corpus. + for doc_ids in self.author2doc.values(): # For all documents in total corpus. + train_corpus_idx.extend(doc_ids) + + # Make the list of training documents unique. + train_corpus_idx = list(set(train_corpus_idx)) + + # train_corpus_idx is only a list of indexes, so "len" is valid. 
+ lencorpus = len(train_corpus_idx) + + if chunksize is None: + chunksize = min(lencorpus, self.chunksize) + + self.state.numdocs += lencorpus + + if update_every: + updatetype = "online" + updateafter = min(lencorpus, update_every * self.numworkers * chunksize) + else: + updatetype = "batch" + updateafter = lencorpus + evalafter = min(lencorpus, (eval_every or 0) * self.numworkers * chunksize) + + updates_per_pass = max(1, lencorpus / updateafter) + logger.info("running %s author-topic training, %s topics, %s authors, %i passes over " + "the supplied corpus of %i documents, updating model once " + "every %i documents, evaluating perplexity every %i documents, " + "iterating %ix with a convergence threshold of %f", + updatetype, self.num_topics, num_input_authors, passes, lencorpus, + updateafter, evalafter, iterations, + gamma_threshold) + + if updates_per_pass * passes < 10: + logger.warning("too few updates, training might not converge; consider " + "increasing the number of passes or iterations to improve accuracy") + + # rho is the "speed" of updating; TODO try other fncs + # pass_ + num_updates handles increasing the starting t for each pass, + # while allowing it to "reset" on the first pass of each update + def rho(): + return pow(offset + pass_ + (self.num_updates / chunksize), -decay) + + for pass_ in xrange(passes): + if self.dispatcher: + logger.info('initializing %s workers' % self.numworkers) + self.dispatcher.reset(self.state) + else: + # gamma is not needed in "other", thus its shape is (0, 0). 
+ other = AuthorTopicState(self.eta, self.state.sstats.shape, (0, 0)) + dirty = False + + reallen = 0 + for chunk_no, chunk_doc_idx in enumerate(utils.grouper(train_corpus_idx, chunksize, as_numpy=chunks_as_numpy)): + chunk = [self.corpus[d] for d in chunk_doc_idx] + reallen += len(chunk) # keep track of how many documents we've processed so far + + if eval_every and ((reallen == lencorpus) or ((chunk_no + 1) % (eval_every * self.numworkers) == 0)): + # log_perplexity requires the indexes of the documents being evaluated, to know what authors + # correspond to the documents. + self.log_perplexity(chunk, chunk_doc_idx, total_docs=lencorpus) + + if self.dispatcher: + # add the chunk to dispatcher's job queue, so workers can munch on it + logger.info('PROGRESS: pass %i, dispatching documents up to #%i/%i', + pass_, chunk_no * chunksize + len(chunk), lencorpus) + # this will eventually block until some jobs finish, because the queue has a small finite length + self.dispatcher.putjob(chunk) + else: + logger.info('PROGRESS: pass %i, at document #%i/%i', + pass_, chunk_no * chunksize + len(chunk), lencorpus) + # do_estep requires the indexes of the documents being trained on, to know what authors + # correspond to the documents. + gammat = self.do_estep(chunk, self.author2doc, self.doc2author, rho(), other, chunk_doc_idx) + + if self.optimize_alpha: + self.update_alpha(gammat, rho()) + + dirty = True + del chunk + + # perform an M step. 
determine when based on update_every, don't do this after every chunk + if update_every and (chunk_no + 1) % (update_every * self.numworkers) == 0: + if self.dispatcher: + # distributed mode: wait for all workers to finish + logger.info("reached the end of input; now waiting for all remaining jobs to finish") + other = self.dispatcher.getstate() + self.do_mstep(rho(), other, pass_ > 0) + del other # frees up memory + + if self.dispatcher: + logger.info('initializing workers') + self.dispatcher.reset(self.state) + else: + other = AuthorTopicState(self.eta, self.state.sstats.shape, (0, 0)) + dirty = False + # endfor single corpus iteration + if reallen != lencorpus: + raise RuntimeError("input corpus size changed during training (don't use generators as input)") + + if dirty: + # finish any remaining updates + if self.dispatcher: + # distributed mode: wait for all workers to finish + logger.info("reached the end of input; now waiting for all remaining jobs to finish") + other = self.dispatcher.getstate() + self.do_mstep(rho(), other, pass_ > 0) + del other + dirty = False + # endfor entire corpus update + + def bound(self, chunk, chunk_doc_idx=None, subsample_ratio=1.0, author2doc=None, doc2author=None): + """ + Estimate the variational bound of documents from `chunk`: + E_q[log p(chunk)] - E_q[log q(chunk)] + + There are basically two use cases of this method: + 1. `chunk` is a subset of the training corpus, and `chunk_doc_idx` is provided, + indicating the indexes of the documents in the training corpus. + 2. `chunk` is a test set (held-out data), and author2doc and doc2author + corresponding to this test set are provided. There must not be any new authors + passed to this method. `chunk_doc_idx` is not needed in this case.
+ + To obtain the per-word bound, compute: + + >>> corpus_words = sum(cnt for document in corpus for _, cnt in document) + >>> model.bound(corpus, author2doc=author2doc, doc2author=doc2author) / corpus_words + + """ + + # TODO: enable evaluation of documents with new authors. One could, for example, make it + # possible to pass a list of documents to self.inference with no author dictionaries, + # assuming all the documents correspond to one (unseen) author, learn the author's + # gamma, and return gamma (without adding it to self.state.gamma). Of course, + # collect_sstats should be set to false, so that the model is not updated w.r.t. these + # new documents. + + _lambda = self.state.get_lambda() + Elogbeta = dirichlet_expectation(_lambda) + expElogbeta = np.exp(Elogbeta) + + gamma = self.state.gamma + + if author2doc is None and doc2author is None: + # Evaluating on training documents (chunk of self.corpus). + author2doc = self.author2doc + doc2author = self.doc2author + + if not chunk_doc_idx: + # If author2doc and doc2author are not provided, chunk is assumed to be a subset of + # self.corpus, and chunk_doc_idx is thus required. + raise ValueError('Either author dictionaries or chunk_doc_idx must be provided. Consult documentation of bound method.') + elif author2doc is not None and doc2author is not None: + # Training on held-out documents (documents not seen during training). + # All authors in dictionaries must still be seen during training. + for a in author2doc.keys(): + if not self.author2doc.get(a): + raise ValueError('bound cannot be called with authors not seen during training.') + + if chunk_doc_idx: + raise ValueError('Either author dictionaries or chunk_doc_idx must be provided, not both. Consult documentation of bound method.') + else: + raise ValueError('Either both author2doc and doc2author should be provided, or neither. 
Consult documentation of bound method.') + + Elogtheta = dirichlet_expectation(gamma) + expElogtheta = np.exp(Elogtheta) + + word_score = 0.0 + theta_score = 0.0 + for d, doc in enumerate(chunk): + if chunk_doc_idx: + doc_no = chunk_doc_idx[d] + else: + doc_no = d + # Get all authors in current document, and convert the author names to integer IDs. + authors_d = [self.author2id[a] for a in self.doc2author[doc_no]] + ids = np.array([id for id, _ in doc]) # Word IDs in doc. + cts = np.array([cnt for _, cnt in doc]) # Word counts. + + if d % self.chunksize == 0: + logger.debug("bound: at document #%i in chunk", d) + + # Computing the bound requires summing over expElogtheta[a, k] * expElogbeta[k, v], which + # is the same computation as in normalizing phi. + phinorm = self.compute_phinorm(ids, authors_d, expElogtheta[authors_d, :], expElogbeta[:, ids]) + word_score += np.log(1.0 / len(authors_d)) + cts.dot(np.log(phinorm)) + + # Compensate likelihood for when `chunk` above is only a sample of the whole corpus. This ensures + # that the likelihood is always roughly on the same scale. + word_score *= subsample_ratio + + # E[log p(theta | alpha) - log q(theta | gamma)] + for a in author2doc.keys(): + a = self.author2id[a] + theta_score += np.sum((self.alpha - gamma[a, :]) * Elogtheta[a, :]) + theta_score += np.sum(gammaln(gamma[a, :]) - gammaln(self.alpha)) + theta_score += gammaln(np.sum(self.alpha)) - gammaln(np.sum(gamma[a, :])) + + # theta_score is rescaled in a similar fashion. + # TODO: treat this in a more general way, similar to how it is done with word_score.
+ theta_score *= self.num_authors / len(author2doc) + + # E[log p(beta | eta) - log q (beta | lambda)] + beta_score = 0.0 + beta_score += np.sum((self.eta - _lambda) * Elogbeta) + beta_score += np.sum(gammaln(_lambda) - gammaln(self.eta)) + sum_eta = np.sum(self.eta) + beta_score += np.sum(gammaln(sum_eta) - gammaln(np.sum(_lambda, 1))) + + total_score = word_score + theta_score + beta_score + + return total_score + + def get_document_topics(self, word_id, minimum_probability=None): + ''' + This method overrides `LdaModel.get_document_topics` and simply raises an + exception. `get_document_topics` is not valid for the author-topic model; + use `get_author_topics` instead. + + ''' + + raise NotImplementedError('Method "get_document_topics" is not valid for the author-topic model. Use the "get_author_topics" method.') + + def get_author_topics(self, author_name, minimum_probability=None): + """ + Return the topic distribution of the given author, as a list of + (topic_id, topic_probability) 2-tuples. + Ignore topics with very low probability (below `minimum_probability`). + + Obtaining topic probabilities of each word, as in LDA (via `per_word_topics`), + is not supported. + + """ + + author_id = self.author2id[author_name] + + if minimum_probability is None: + minimum_probability = self.minimum_probability + minimum_probability = max(minimum_probability, 1e-8) # never allow zero values in sparse output + + topic_dist = self.state.gamma[author_id, :] / sum(self.state.gamma[author_id, :]) + + author_topics = [(topicid, topicvalue) for topicid, topicvalue in enumerate(topic_dist) + if topicvalue >= minimum_probability] + + return author_topics + + def __getitem__(self, author_names, eps=None): + ''' + Return the topic distribution for the input author as a list of + (topic_id, topic_probability) 2-tuples. + + Ignores topics with probability less than `eps`. + + Do not call this method directly, instead use `model[author_names]`.
+ + ''' + if isinstance(author_names, list): + items = [] + for a in author_names: + items.append(self.get_author_topics(a, minimum_probability=eps)) + else: + items = self.get_author_topics(author_names, minimum_probability=eps) + + return items +# endclass AuthorTopicModel diff --git a/gensim/models/ldamodel.py b/gensim/models/ldamodel.py index 58451282ef..1bbe4a4bcd 100755 --- a/gensim/models/ldamodel.py +++ b/gensim/models/ldamodel.py @@ -737,7 +737,8 @@ def bound(self, corpus, gamma=None, subsample_ratio=1.0): score += np.sum(gammaln(gammad) - gammaln(self.alpha)) score += gammaln(np.sum(self.alpha)) - gammaln(np.sum(gammad)) - # compensate likelihood for when `corpus` above is only a sample of the whole corpus + # Compensate likelihood for when `corpus` above is only a sample of the whole corpus. This ensures + # that the likelihood is always roughly on the same scale. score *= subsample_ratio # E[log p(beta | eta) - log q (beta | lambda)]; assumes eta is a scalar diff --git a/gensim/test/test_atmodel.py b/gensim/test/test_atmodel.py new file mode 100644 index 0000000000..d2625f6ede --- /dev/null +++ b/gensim/test/test_atmodel.py @@ -0,0 +1,514 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +# +# Copyright (C) 2016 Radim Rehurek +# Copyright (C) 2016 Olavur Mortensen +# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html + +""" +Automated tests for the author-topic model (AuthorTopicModel class). These tests +are based on the unit tests of LDA; the classes are quite similar, and the tests +needed are thus quite similar.
+""" + + +import logging +import unittest +import os +import os.path +import tempfile +import numbers +from os import remove + +import six +import numpy as np +import scipy.linalg + +from gensim.corpora import mmcorpus, Dictionary +from gensim.models import atmodel +from gensim import matutils +from gensim.test import basetests + +# TODO: +# Test that computing the bound on new unseen documents works as expected (this is somewhat different +# in the author-topic model than in LDA). +# Perhaps test that the bound increases, in general (i.e. in several of the tests below where it makes +# sense). This is not tested in LDA either. Tests can also be made to check that automatic prior learning +# increases the bound. +# Test that models are compatible across versions, as done in LdaModel. + +module_path = os.path.dirname(__file__) # needed because sample data files are located in the same folder +datapath = lambda fname: os.path.join(module_path, 'test_data', fname) + +# set up vars used in testing ("Deerwester" from the web tutorial) +texts = [['human', 'interface', 'computer'], + ['survey', 'user', 'computer', 'system', 'response', 'time'], + ['eps', 'user', 'interface', 'system'], + ['system', 'human', 'system', 'eps'], + ['user', 'response', 'time'], + ['trees'], + ['graph', 'trees'], + ['graph', 'minors', 'trees'], + ['graph', 'minors', 'survey']] +dictionary = Dictionary(texts) +corpus = [dictionary.doc2bow(text) for text in texts] + +# Assign some authors randomly to the documents above. +author2doc = {'john': [0, 1, 2, 3, 4, 5, 6], 'jane': [2, 3, 4, 5, 6, 7, 8], 'jack': [0, 2, 4, 6, 8], 'jill': [1, 3, 5, 7]} +doc2author = {0: ['john', 'jack'], 1: ['john', 'jill'], 2: ['john', 'jane', 'jack'], 3: ['john', 'jane', 'jill'], + 4: ['john', 'jane', 'jack'], 5: ['john', 'jane', 'jill'], 6: ['john', 'jane', 'jack'], 7: ['jane', 'jill'], + 8: ['jane', 'jack']} + +# More data with new and old authors (to test update method).
+# Although the text is just a subset of the previous, the model +# just sees it as completely new data. +texts_new = texts[0:3] +author2doc_new = {'jill': [0], 'bob': [0, 1], 'sally': [1, 2]} +dictionary_new = Dictionary(texts_new) +corpus_new = [dictionary_new.doc2bow(text) for text in texts_new] + + +def testfile(test_fname=''): + # temporary data will be stored to this file + fname = 'gensim_models_' + test_fname + '.tst' + return os.path.join(tempfile.gettempdir(), fname) + + +class TestAuthorTopicModel(unittest.TestCase, basetests.TestBaseTopicModel): + def setUp(self): + self.corpus = mmcorpus.MmCorpus(datapath('testcorpus.mm')) + self.class_ = atmodel.AuthorTopicModel + self.model = self.class_(corpus, id2word=dictionary, author2doc=author2doc, num_topics=2, passes=100) + + def testTransform(self): + passed = False + # sometimes, training gets stuck at a local minimum + # in that case try re-training the model from scratch, hoping for a + # better random initialization + for i in range(25): # restart at most 25 times + # create the transformation model + model = self.class_(id2word=dictionary, num_topics=2, passes=100, random_state=0) + model.update(corpus, author2doc) + + jill_topics = model.get_author_topics('jill') + + # NOTE: this test may easily fail if the author-topic model is altered in any way. The model's + # output is sensitive to a lot of things, like the scheduling of the updates, or like the + # author2id (because the random initialization changes when author2id changes). If it does + # fail, simply be aware of whether we broke something, or if it just naturally changed the + # output of the model slightly.
+ vec = matutils.sparse2full(jill_topics, 2) # convert to dense vector, for easier equality tests + expected = [0.91, 0.08] + passed = np.allclose(sorted(vec), sorted(expected), atol=1e-1) # must contain the same values, up to re-ordering + if passed: + break + logging.warning("Author-topic model failed to converge on attempt %i (got %s, expected %s)" % + (i, sorted(vec), sorted(expected))) + self.assertTrue(passed) + + def testBasic(self): + # Check that training the model produces a positive topic vector for some author. + # Otherwise, many of the other tests are invalid. + + model = self.class_(corpus, author2doc=author2doc, id2word=dictionary, num_topics=2) + + jill_topics = model.get_author_topics('jill') + jill_topics = matutils.sparse2full(jill_topics, model.num_topics) + self.assertTrue(all(jill_topics > 0)) + + def testAuthor2docMissing(self): + # Check that the results are the same if author2doc is constructed automatically from doc2author. + model = self.class_(corpus, author2doc=author2doc, doc2author=doc2author, id2word=dictionary, num_topics=2, random_state=0) + model2 = self.class_(corpus, doc2author=doc2author, id2word=dictionary, num_topics=2, random_state=0) + + # Compare Jill's topics in both models. + jill_topics = model.get_author_topics('jill') + jill_topics2 = model2.get_author_topics('jill') + jill_topics = matutils.sparse2full(jill_topics, model.num_topics) + jill_topics2 = matutils.sparse2full(jill_topics2, model.num_topics) + self.assertTrue(np.allclose(jill_topics, jill_topics2)) + + def testDoc2authorMissing(self): + # Check that the results are the same if doc2author is constructed automatically from author2doc. + model = self.class_(corpus, author2doc=author2doc, doc2author=doc2author, id2word=dictionary, num_topics=2, random_state=0) + model2 = self.class_(corpus, author2doc=author2doc, id2word=dictionary, num_topics=2, random_state=0) + + # Compare Jill's topics in both models.
+ jill_topics = model.get_author_topics('jill') + jill_topics2 = model2.get_author_topics('jill') + jill_topics = matutils.sparse2full(jill_topics, model.num_topics) + jill_topics2 = matutils.sparse2full(jill_topics2, model.num_topics) + self.assertTrue(np.allclose(jill_topics, jill_topics2)) + + def testUpdate(self): + # Check that calling update after the model already has been trained works. + model = self.class_(corpus, author2doc=author2doc, id2word=dictionary, num_topics=2) + + jill_topics = model.get_author_topics('jill') + jill_topics = matutils.sparse2full(jill_topics, model.num_topics) + + model.update() + jill_topics2 = model.get_author_topics('jill') + jill_topics2 = matutils.sparse2full(jill_topics2, model.num_topics) + + # Did we learn something? + self.assertFalse(all(np.equal(jill_topics, jill_topics2))) + + def testUpdateNewDataOldAuthor(self): + # Check that calling update with new documents and/or authors after the model already has + # been trained works. + # Test an author that already existed in the old dataset. + model = self.class_(corpus, author2doc=author2doc, id2word=dictionary, num_topics=2) + + jill_topics = model.get_author_topics('jill') + jill_topics = matutils.sparse2full(jill_topics, model.num_topics) + + model.update(corpus_new, author2doc_new) + jill_topics2 = model.get_author_topics('jill') + jill_topics2 = matutils.sparse2full(jill_topics2, model.num_topics) + + # Did we learn more about Jill? + self.assertFalse(all(np.equal(jill_topics, jill_topics2))) + + def testUpdateNewDataNewAuthor(self): + # Check that calling update with new documents and/or authors after the model already has + # been trained works. + # Test a new author, that didn't exist in the old dataset. + model = self.class_(corpus, author2doc=author2doc, id2word=dictionary, num_topics=2) + + model.update(corpus_new, author2doc_new) + + # Did we learn something about Sally? 
+ sally_topics = model.get_author_topics('sally') + sally_topics = matutils.sparse2full(sally_topics, model.num_topics) + self.assertTrue(all(sally_topics > 0)) + + def testSerialized(self): + # Test the model using serialized corpora. Basic tests, plus test of update functionality. + + model = self.class_(self.corpus, author2doc=author2doc, id2word=dictionary, num_topics=2, serialized=True, serialization_path=datapath('testcorpus_serialization.mm')) + + jill_topics = model.get_author_topics('jill') + jill_topics = matutils.sparse2full(jill_topics, model.num_topics) + self.assertTrue(all(jill_topics > 0)) + + model.update() + jill_topics2 = model.get_author_topics('jill') + jill_topics2 = matutils.sparse2full(jill_topics2, model.num_topics) + + # Did we learn more about Jill? + self.assertFalse(all(np.equal(jill_topics, jill_topics2))) + + model.update(corpus_new, author2doc_new) + + # Did we learn something about Sally? + sally_topics = model.get_author_topics('sally') + sally_topics = matutils.sparse2full(sally_topics, model.num_topics) + self.assertTrue(all(sally_topics > 0)) + + # Delete the MmCorpus used for serialization inside the author-topic model. + remove(datapath('testcorpus_serialization.mm')) + + def testTransformSerialized(self): + # Same as testTransform, using serialized corpora. + passed = False + # sometimes, training gets stuck at a local minimum + # in that case try re-training the model from scratch, hoping for a + # better random initialization + for i in range(25): # restart at most 25 times + # create the transformation model + model = self.class_(id2word=dictionary, num_topics=2, passes=100, random_state=0, serialized=True, serialization_path=datapath('testcorpus_serialization.mm')) + model.update(self.corpus, author2doc) + + jill_topics = model.get_author_topics('jill') + + # NOTE: this test may easily fail if the author-topic model is altered in any way.
The model's + # output is sensitive to a lot of things, like the scheduling of the updates, or like the + # author2id (because the random initialization changes when author2id changes). If it does + # fail, simply be aware of whether we broke something, or if it just naturally changed the + # output of the model slightly. + vec = matutils.sparse2full(jill_topics, 2) # convert to dense vector, for easier equality tests + expected = [0.91, 0.08] + passed = np.allclose(sorted(vec), sorted(expected), atol=1e-1) # must contain the same values, up to re-ordering + + # Delete the MmCorpus used for serialization inside the author-topic model. + remove(datapath('testcorpus_serialization.mm')) + if passed: + break + logging.warning("Author-topic model failed to converge on attempt %i (got %s, expected %s)" % + (i, sorted(vec), sorted(expected))) + self.assertTrue(passed) + + def testAlphaAuto(self): + model1 = self.class_(corpus, author2doc=author2doc, id2word=dictionary, alpha='symmetric', passes=10, num_topics=2) + modelauto = self.class_(corpus, author2doc=author2doc, id2word=dictionary, alpha='auto', passes=10, num_topics=2) + + # did we learn something? 
+ self.assertFalse(all(np.equal(model1.alpha, modelauto.alpha))) + + def testAlpha(self): + kwargs = dict( + author2doc=author2doc, + id2word=dictionary, + num_topics=2, + alpha=None + ) + expected_shape = (2,) + + # should not raise anything + self.class_(**kwargs) + + kwargs['alpha'] = 'symmetric' + model = self.class_(**kwargs) + self.assertEqual(model.alpha.shape, expected_shape) + self.assertTrue(all(model.alpha == np.array([0.5, 0.5]))) + + kwargs['alpha'] = 'asymmetric' + model = self.class_(**kwargs) + self.assertEqual(model.alpha.shape, expected_shape) + self.assertTrue(np.allclose(model.alpha, [0.630602, 0.369398])) + + kwargs['alpha'] = 0.3 + model = self.class_(**kwargs) + self.assertEqual(model.alpha.shape, expected_shape) + self.assertTrue(all(model.alpha == np.array([0.3, 0.3]))) + + kwargs['alpha'] = 3 + model = self.class_(**kwargs) + self.assertEqual(model.alpha.shape, expected_shape) + self.assertTrue(all(model.alpha == np.array([3, 3]))) + + kwargs['alpha'] = [0.3, 0.3] + model = self.class_(**kwargs) + self.assertEqual(model.alpha.shape, expected_shape) + self.assertTrue(all(model.alpha == np.array([0.3, 0.3]))) + + kwargs['alpha'] = np.array([0.3, 0.3]) + model = self.class_(**kwargs) + self.assertEqual(model.alpha.shape, expected_shape) + self.assertTrue(all(model.alpha == np.array([0.3, 0.3]))) + + # all should raise an exception for being wrong shape + kwargs['alpha'] = [0.3, 0.3, 0.3] + self.assertRaises(AssertionError, self.class_, **kwargs) + + kwargs['alpha'] = [[0.3], [0.3]] + self.assertRaises(AssertionError, self.class_, **kwargs) + + kwargs['alpha'] = [0.3] + self.assertRaises(AssertionError, self.class_, **kwargs) + + kwargs['alpha'] = "gensim is cool" + self.assertRaises(ValueError, self.class_, **kwargs) + + def testEtaAuto(self): + model1 = self.class_(corpus, author2doc=author2doc, id2word=dictionary, eta='symmetric', passes=10, num_topics=2) + modelauto = self.class_(corpus, author2doc=author2doc, id2word=dictionary, 
eta='auto', passes=10, num_topics=2) + + # did we learn something? + self.assertFalse(all(np.equal(model1.eta, modelauto.eta))) + + def testEta(self): + kwargs = dict( + author2doc=author2doc, + id2word=dictionary, + num_topics=2, + eta=None + ) + num_terms = len(dictionary) + expected_shape = (num_terms,) + + # should not raise anything + model = self.class_(**kwargs) + self.assertEqual(model.eta.shape, expected_shape) + self.assertTrue(all(model.eta == np.array([0.5] * num_terms))) + + kwargs['eta'] = 'symmetric' + model = self.class_(**kwargs) + self.assertEqual(model.eta.shape, expected_shape) + self.assertTrue(all(model.eta == np.array([0.5] * num_terms))) + + kwargs['eta'] = 0.3 + model = self.class_(**kwargs) + self.assertEqual(model.eta.shape, expected_shape) + self.assertTrue(all(model.eta == np.array([0.3] * num_terms))) + + kwargs['eta'] = 3 + model = self.class_(**kwargs) + self.assertEqual(model.eta.shape, expected_shape) + self.assertTrue(all(model.eta == np.array([3] * num_terms))) + + kwargs['eta'] = [0.3] * num_terms + model = self.class_(**kwargs) + self.assertEqual(model.eta.shape, expected_shape) + self.assertTrue(all(model.eta == np.array([0.3] * num_terms))) + + kwargs['eta'] = np.array([0.3] * num_terms) + model = self.class_(**kwargs) + self.assertEqual(model.eta.shape, expected_shape) + self.assertTrue(all(model.eta == np.array([0.3] * num_terms))) + + # should be ok with num_topics x num_terms + testeta = np.array([[0.5] * len(dictionary)] * 2) + kwargs['eta'] = testeta + self.class_(**kwargs) + + # all should raise an exception for being wrong shape + kwargs['eta'] = testeta.reshape(tuple(reversed(testeta.shape))) + self.assertRaises(AssertionError, self.class_, **kwargs) + + kwargs['eta'] = [0.3] + self.assertRaises(AssertionError, self.class_, **kwargs) + + kwargs['eta'] = [0.3] * (num_terms + 1) + self.assertRaises(AssertionError, self.class_, **kwargs) + + kwargs['eta'] = "gensim is cool" + self.assertRaises(ValueError, self.class_, 
**kwargs) + + kwargs['eta'] = "asymmetric" + self.assertRaises(ValueError, self.class_, **kwargs) + + def testTopTopics(self): + top_topics = self.model.top_topics(corpus) + + for topic, score in top_topics: + self.assertTrue(isinstance(topic, list)) + self.assertTrue(isinstance(score, float)) + + for v, k in topic: + self.assertTrue(isinstance(k, six.string_types)) + self.assertTrue(isinstance(v, float)) + + def testGetTopicTerms(self): + topic_terms = self.model.get_topic_terms(1) + + for k, v in topic_terms: + self.assertTrue(isinstance(k, numbers.Integral)) + self.assertTrue(isinstance(v, float)) + + def testGetAuthorTopics(self): + + # Pass the seed itself: np.random.seed(0) returns None, so it cannot serve as random_state. + model = self.class_(corpus, author2doc=author2doc, id2word=dictionary, num_topics=2, passes=100, random_state=0) + + author_topics = [] + for a in model.id2author.values(): + author_topics.append(model.get_author_topics(a)) + + for topic in author_topics: + self.assertTrue(isinstance(topic, list)) + for k, v in topic: + self.assertTrue(isinstance(k, int)) + self.assertTrue(isinstance(v, float)) + + def testTermTopics(self): + + # Pass the seed itself: np.random.seed(0) returns None, so it cannot serve as random_state. + model = self.class_(corpus, author2doc=author2doc, id2word=dictionary, num_topics=2, passes=100, random_state=0) + + # check with word_type + result = model.get_term_topics(2) + for topic_no, probability in result: + self.assertTrue(isinstance(topic_no, int)) + self.assertTrue(isinstance(probability, float)) + + # if user has entered word instead, check with word + result = model.get_term_topics(str(model.id2word[2])) + for topic_no, probability in result: + self.assertTrue(isinstance(topic_no, int)) + self.assertTrue(isinstance(probability, float)) + + def testPasses(self): + # long message includes the original error message with a custom one + self.longMessage = True + # construct what we expect when passes aren't involved + test_rhots = list() + model = self.class_(id2word=dictionary, chunksize=1, num_topics=2) + final_rhot = lambda: pow(model.offset + (1 *
model.num_updates) / model.chunksize, -model.decay) + + # generate 5 updates to test rhot on + for x in range(5): + model.update(corpus, author2doc) + test_rhots.append(final_rhot()) + + for passes in [1, 5, 10, 50, 100]: + model = self.class_(id2word=dictionary, chunksize=1, num_topics=2, passes=passes) + self.assertEqual(final_rhot(), 1.0) + # make sure the rhot matches the test after each update + for test_rhot in test_rhots: + model.update(corpus, author2doc) + + msg = ", ".join(map(str, [passes, model.num_updates, model.state.numdocs])) + self.assertAlmostEqual(final_rhot(), test_rhot, msg=msg) + + self.assertEqual(model.state.numdocs, len(corpus) * len(test_rhots)) + self.assertEqual(model.num_updates, len(corpus) * len(test_rhots)) + + def testPersistence(self): + fname = testfile() + model = self.model + model.save(fname) + model2 = self.class_.load(fname) + self.assertEqual(model.num_topics, model2.num_topics) + self.assertTrue(np.allclose(model.expElogbeta, model2.expElogbeta)) + self.assertTrue(np.allclose(model.state.gamma, model2.state.gamma)) + + def testPersistenceIgnore(self): + fname = testfile('testPersistenceIgnore') + model = atmodel.AuthorTopicModel(corpus, author2doc=author2doc, num_topics=2) + model.save(fname, ignore='id2word') + model2 = atmodel.AuthorTopicModel.load(fname) + self.assertTrue(model2.id2word is None) + + model.save(fname, ignore=['id2word']) + model2 = atmodel.AuthorTopicModel.load(fname) + self.assertTrue(model2.id2word is None) + + def testPersistenceCompressed(self): + fname = testfile() + '.gz' + model = self.model + model.save(fname) + model2 = self.class_.load(fname, mmap=None) + self.assertEqual(model.num_topics, model2.num_topics) + self.assertTrue(np.allclose(model.expElogbeta, model2.expElogbeta)) + + # Compare Jill's topics before and after save/load. 
+ jill_topics = model.get_author_topics('jill') + jill_topics2 = model2.get_author_topics('jill') + jill_topics = matutils.sparse2full(jill_topics, model.num_topics) + jill_topics2 = matutils.sparse2full(jill_topics2, model.num_topics) + self.assertTrue(np.allclose(jill_topics, jill_topics2)) + + def testLargeMmap(self): + fname = testfile() + model = self.model + + # simulate storing large arrays separately + model.save(fname, sep_limit=0) + + # test loading the large model arrays with mmap + model2 = self.class_.load(fname, mmap='r') + self.assertEqual(model.num_topics, model2.num_topics) + self.assertTrue(isinstance(model2.expElogbeta, np.memmap)) + self.assertTrue(np.allclose(model.expElogbeta, model2.expElogbeta)) + + # Compare Jill's topics before and after save/load. + jill_topics = model.get_author_topics('jill') + jill_topics2 = model2.get_author_topics('jill') + jill_topics = matutils.sparse2full(jill_topics, model.num_topics) + jill_topics2 = matutils.sparse2full(jill_topics2, model.num_topics) + self.assertTrue(np.allclose(jill_topics, jill_topics2)) + + def testLargeMmapCompressed(self): + fname = testfile() + '.gz' + model = self.model + + # simulate storing large arrays separately + model.save(fname, sep_limit=0) + + # test loading the large model arrays with mmap + self.assertRaises(IOError, self.class_.load, fname, mmap='r') + + +if __name__ == '__main__': + logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG) + unittest.main()