This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

Latent Dirichlet Allocation #172

Open · wants to merge 17 commits into master

Conversation

@dyule commented Jan 24, 2017

I've created a basic implementation of linear dirichlet allocation. Unlike the other learning algorithms, it's unsupervised, so the training method is just empty. It doesn't support non-symmetric parameters, but that could fairly easily be added in. I've tried to document everything clearly.

However, I built this with v0.4 of rulinalg, so this pull request includes #167.

I'm happy to take any feedback on the approach here, and update as necessary.

@dyule (Author) commented Jan 24, 2017

Just realized I called it linear dirichlet allocation in the title. It's Latent Dirichlet Allocation. I constantly get that backwards, but it's right in the code.

@AtheMathmo (Owner)

Thank you for this! I've been hoping to make LDA a part of rusty-machine for a while now :).

Sadly I'm no expert on LDA and I am particularly busy at the moment so I'll need a little time to digest this PR. Can you poke me if I still haven't reviewed this by the end of the week?

@dyule (Author) commented Jan 25, 2017

If it helps, the only two files that are from LDA and not the rulinalg bump are examples/lda_gen.rs and src/learning/lda.rs. I don't believe I changed anything else of importance.

@dyule (Author) commented Jan 30, 2017

I've updated the code to be a little more efficient and documented it better.

@AtheMathmo (Owner)

Thanks for your patience. I finally have some time to look through this and leave some comments. I'm still not too familiar with LDA but hopefully can provide some meaningful comments anyway.

@AtheMathmo (Owner) left a review comment

In general this PR has been quite difficult to review because of the rulinalg 0.4 PR that has been incorporated. It's hard to disentangle the changes, especially as I haven't reviewed the other one yet! I think it might be a better idea to implement this PR for the current rusty-machine version and update it if/when rulinalg 0.4 lands. [Note that this will hopefully be soon, I finally have some free time back and want to try to get things moving here again!]

Before I give actual feedback, please remember that I don't know the LDA model very well and so my feedback should be taken with a large pinch of salt :). Please don't hesitate to inform me when I am being ill-informed.

With regards to the example - I think it's really great! But we should add a description of it in the examples/README.md file as exists for the others. This is especially important because this example is somewhat involved.

I'm happy to merge the current approach but I think we could make some improvements by using the online VB algorithm - see Algorithm 2 in this paper, it doesn't seem too intense. Let me know what you think, as I say, this is just a mild suggestion.

There is also some mismatch here in how this model is used compared to others in rusty-machine. This isn't really your fault, the current model traits need some alterations but I can't settle on what those should be. If you were to follow the rest of the library more closely then the train function would be used to compute the LDA results and these would be stored within the model. The predict function does not really have an obvious application here (in that setting). However, looking at the scikit learn implementation they model LDA as a transformer. I am not sure exactly what they do here but it might be worth looking into a little and determining if we can do the same with our Transformer trait.
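The trait mismatch described here can be sketched concretely. Below is a minimal, hypothetical illustration (std-only Rust; the names `Lda` and `LdaFit` are placeholders, not rusty-machine's actual API) of the fit-then-use shape a transformer-style LDA might take: `fit` consumes the corpus and returns a result object holding the learned distributions, instead of an empty `train` plus a do-everything `predict`.

```rust
// Hypothetical transformer-style shape for LDA. All names are
// illustrative; the real inference step is stubbed out.

struct Lda {
    topic_count: usize,
}

struct LdaFit {
    // rows = topics, cols = words: p(word | topic)
    phi: Vec<Vec<f64>>,
}

impl Lda {
    // `fit` takes per-document word counts and returns a fitted result.
    fn fit(&self, word_counts: &[Vec<usize>]) -> LdaFit {
        // Real code would run Gibbs sampling or variational inference here;
        // this stub just returns a uniform distribution per topic.
        let vocab = word_counts.first().map_or(0, |d| d.len());
        let uniform = vec![1.0 / vocab.max(1) as f64; vocab];
        LdaFit { phi: vec![uniform; self.topic_count] }
    }
}

fn main() {
    let docs = vec![vec![3usize, 0, 1, 2], vec![0, 4, 1, 0]];
    let fit = Lda { topic_count: 2 }.fit(&docs);
    // Each topic's word distribution sums to 1.
    for topic in &fit.phi {
        let total: f64 = topic.iter().sum();
        assert!((total - 1.0).abs() < 1e-12);
    }
    println!("fitted {} topics over {} words", fit.phi.len(), fit.phi[0].len());
}
```

The point of the sketch is only the signature: training produces a separate fitted value, which is the pattern the Transformer trait discussion below converges on.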

@@ -0,0 +1,191 @@
/// An example of how Latent Diriclhet Allocation (LDA) can be used. This example begins by
/// generating a distribution of words to categories. This distribution is creatred so that
AtheMathmo (Owner):
Typo here: distribution is CREATED.

/// Given `topic_count` topics, this function will create a distrbution of words for each
/// topic. For simplicity, this function assumes that the total number of words in the corpus
/// will be `(topic_count / 2)^2`.
fn generate_word_distribution(topic_count: usize) -> Matrix<f64> {
AtheMathmo (Owner):
Minor point: Potential issue here with assuming topic_count is divisible by 2. From what I can tell this shouldn't break anything but I wanted to draw your attention to it regardless.
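To make the divisibility caveat concrete, a tiny std-only snippet showing how integer division makes the implied vocabulary size round down for odd topic counts:

```rust
fn main() {
    // Integer division: for an odd topic_count, the implied vocabulary
    // size silently rounds down, e.g. 5 topics -> (5 / 2)^2 = 4 words.
    for topic_count in [4usize, 5, 6] {
        let words = (topic_count / 2).pow(2);
        println!("{} topics -> {} words", topic_count, words);
    }
    assert_eq!((5usize / 2).pow(2), 4);
    assert_eq!((6usize / 2).pow(2), 9);
}
```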

accuracy(outputs.iter_rows(), targets.iter_rows())
pub fn row_accuracy<T: PartialEq>(outputs: &Matrix<T>, targets: &Matrix<T>) -> f64 {

assert!(outputs.rows() == targets.rows());
AtheMathmo (Owner):
I'm a little confused by this change. I'm guessing it comes from the rulinalg-0.4 PR?

dyule (Author):
Indeed, I am pretty certain I took this one directly from the rulinalg bump PR.

AtheMathmo (Owner):
I've now merged a new linalg bump PR into master. I guess you'll want to rebase from that (or just C+P the relevant modules onto a new branch :) )

@@ -0,0 +1,250 @@
//! Latent Diriclhet Allocation Module
AtheMathmo (Owner):
Minor: Typo here, DIRICHLET.

@@ -0,0 +1,250 @@
//! Latent Diriclhet Allocation Module
//!
//! Contains an implementation of Latent Diriclhet Allocation (LDA) using
AtheMathmo (Owner):
And again here

//! Gibbs sampling is a Morkov Chain Monto Carlo algorithm that iteratively approximates
//! the above distributions.
//!
//! This module doesn't use any training. It uses unsupervised learning to estimate
AtheMathmo (Owner):
I find this statement a little confusing. I would argue that unsupervised learning algorithms still have a training phase. However, during this training stage they are not given any target data.

dyule (Author):
LDA (in its original form) is in a sort of strange place with respect to machine learning, in that it's designed to classify only the input data, and nothing more. In this sense, the training and prediction stage are actually identical. One could certainly use the distribution that was discovered to classify additional documents, but for whatever reason, that's not how it was used. I'll have more to say about this in a later comment.

beta: f64,
}

impl Default for LDA {
AtheMathmo (Owner):
I'm unfamiliar with the algorithm and related literature, so please let me know if I'm being silly.

This seems a little arbitrary to me (the topic count in particular). We should at least document these values in the implementation, e.g.

/// Creates a new LDA with topic count = 10, ... etc.
impl Default for LDA { }

dyule (Author):
As far as I can tell, there aren't any sensible constant defaults for LDA. I only added this default implementation because the other models seemed to have it. I could drop it if you like (or document it better, of course).

AtheMathmo (Owner):
I think it is good to have a default implementation so people can easily get off the ground. But yes it would be good to add documentation for this and explicitly state the parameter choices made.
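For illustration, a hedged sketch of what such a documented `Default` impl could look like. The field names mirror the PR's struct, but the concrete values (10 topics, alpha = beta = 0.1) are placeholders for this sketch, not necessarily the PR's actual choices:

```rust
// Illustrative only: explicit, documented defaults for the LDA model.
struct LDA {
    topic_count: usize,
    alpha: f64,
    beta: f64,
}

/// Creates a new LDA with `topic_count = 10`, `alpha = 0.1`, `beta = 0.1`.
/// These are generic starting points so users can get off the ground;
/// they should be tuned for any particular corpus.
impl Default for LDA {
    fn default() -> Self {
        LDA { topic_count: 10, alpha: 0.1, beta: 0.1 }
    }
}

fn main() {
    let lda = LDA::default();
    assert_eq!(lda.topic_count, 10);
    println!("alpha = {}, beta = {}", lda.alpha, lda.beta);
}
```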

/// Find the distribution of words over topics. This gives a matrix where the rows are
/// topics and the columns are words. Each entry (topic, word) gives the probability of
/// word given topic.
pub fn phi(&self) -> Matrix<f64> {
AtheMathmo (Owner):
This name seems somewhat dependent on related literature. I worry that unfamiliar users (like myself) would not know where to look for this function. Maybe it should be renamed distr_of_words?

dyule (Author):
I agree that this is a confusing name. My reason for keeping away from a simpler name was a) its use in the literature and b) I was a bit concerned it wouldn't be clear it was a conditional distribution (or the direction of the condition). But I think with proper documentation, this is fine.
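One way to keep the literature name while spelling out the conditional direction is to put it in the doc comment. This is an illustrative std-only sketch with a stubbed uniform distribution, not the PR's implementation:

```rust
// Illustrative sketch: the `phi` name from the LDA literature, with the
// conditional direction documented explicitly.
struct LdaResult {
    topic_count: usize,
    word_count: usize,
}

impl LdaResult {
    /// Word distribution per topic, called `phi` in the LDA literature:
    /// entry `(k, w)` holds `p(word = w | topic = k)`, so each row sums to 1.
    fn phi(&self) -> Vec<Vec<f64>> {
        // Stub: uniform rows; a real result would be built from sampled counts.
        vec![vec![1.0 / self.word_count as f64; self.word_count]; self.topic_count]
    }
}

fn main() {
    let r = LdaResult { topic_count: 3, word_count: 4 };
    let phi = r.phi();
    assert_eq!(phi.len(), 3);
    let row_sum: f64 = phi[0].iter().sum();
    assert!((row_sum - 1.0).abs() < 1e-12);
    println!("{} topics x {} words", phi.len(), phi[0].len());
}
```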


impl UnSupModel<(Matrix<usize>, usize), LDAResult> for LDA {
/// Predict categories from the input matrix.
fn predict(&self, inputs: &(Matrix<usize>, usize)) -> LearningResult<LDAResult> {
AtheMathmo (Owner):
I think that we need to be clearer about what these inputs are. I am assuming that the first tuple argument is the word counts per document, and the second is the number of iterations?

It would follow the rest of the library more closely if the iterations were specified at the model level, and this function simply took the &Matrix<usize>.

And one other small note, in order for docs to show up with rustdoc they must be placed on the outside of the impl block. e.g.

/// Doc here will be rendered
impl UnSupModel<..> ... {
    /// Docs here will not
    fn predict(...) -> ... { }
}

dyule (Author):
I've been very unhappy with this function signature since I started. It will certainly be more ergonomic to have them in the model, but my reasoning for keeping them out is that I don't think number of iterations is a property of the model. However, I believe that my conception of the way your library was organized was a bit off, since in my brain, LDA was a sort of factory object that could generate multiple classifications for different input. It seems pretty obvious now that this wasn't the case.

AtheMathmo (Owner):
This isn't your fault, I'm quite unhappy with the structure of the current traits. They worked well for the first few algorithms I implemented but are a bit of a stretch now for unsupervised models.

Here you have sort of emulated what I want the traits to look like. One trait should be used to train a model using input data and this should output a model that can be used predictively. Like how your LDA model produces an LDAResult. The new Transformer trait works in this way.

use super::{LDAResult, LDA};
use linalg::{Matrix, Vector};
#[test]
fn test_conditional_distribution() {
AtheMathmo (Owner):
This is probably a result of me not knowing the algorithm very well, but this test does not look entirely convincing. How do you know that the conditional distribution returned here is indeed correct?

I don't have any better suggestions right now for improving the tests but I would feel more comfortable with stronger coverage for this algorithm.

dyule (Author):
I can look into testing a bit more, and I agree about the coverage. This test is based on values I calculated essentially by hand and verified against a python implementation. The present implementation of the algorithm really only has three functions of note: predict and conditional_distribution in LDA, and new in LDAResult. predict is essentially the entire algorithm, and so is exercised by the example; it can't really be tested automatically, because it relies on random choice, and unless we want to make it possible to pass a source of randomness into a model, that can't be controlled. new has the same problem as predict, and conditional_distribution is already tested via these magic numbers.

I can certainly add in some tests of the same kind as some of the other learning algorithms, where I provide some basic input, and ensure that the algorithm completes without panicking. I will include this in my updates.

AtheMathmo (Owner):
This test is based on values I calculated basically by hand, and verified by a python implementation.

I think this is fine but you should note this somewhere near the test with a reference to any implementation you used if possible. It makes them a little more convincing to people reading over the code.

unless we want to make it possible to pass a source of randomness into a model

This is actually a good idea. See #138. The support isn't there in the library yet, but if you can get something working in the meantime it would be helpful for later. If we want to make changes later we'd like to be able to verify that the algorithm is still doing (close-to) the same thing.
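As a sketch of what threading a caller-supplied randomness source through the model could look like: a tiny hand-rolled LCG stands in for a real RNG abstraction here, and rusty-machine's eventual API for #138 may well differ. The payoff is that a Gibbs-style sampling step becomes reproducible in tests.

```rust
// A minimal seedable generator so sampling is deterministic under a fixed
// seed. Constants are the common 64-bit LCG multiplier/increment pair.
struct Lcg(u64);

impl Lcg {
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Use the top 53 bits to build a float in [0, 1).
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

// Draw an index from an (unnormalized) discrete distribution, as one
// Gibbs sampling step for a topic assignment would.
fn sample_topic(rng: &mut Lcg, probs: &[f64]) -> usize {
    let total: f64 = probs.iter().sum();
    let mut u = rng.next_f64() * total;
    for (i, p) in probs.iter().enumerate() {
        u -= p;
        if u <= 0.0 {
            return i;
        }
    }
    probs.len() - 1
}

fn main() {
    let probs = [0.2, 0.5, 0.3];
    // Same seed -> same draws, so a test over this is deterministic.
    let a: Vec<usize> = {
        let mut r = Lcg(42);
        (0..5).map(|_| sample_topic(&mut r, &probs)).collect()
    };
    let b: Vec<usize> = {
        let mut r = Lcg(42);
        (0..5).map(|_| sample_topic(&mut r, &probs)).collect()
    };
    assert_eq!(a, b);
    println!("{:?}", a);
}
```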

* Updating to rulinalg v0.4.2

* Slight tidy up for dbscan

* Improving GMM cov decomposition error msg

* Updating benchmarks with rulinalg0.4
This work was started by sinhrks. After breaking a rebase it was easier to move to a new branch...
@dyule (Author) commented Feb 21, 2017

Thanks for taking the time to review this. I very much appreciate all your comments. I've responded to each of them with my thought process, but I will incorporate them into my changes.

For the example, I suppose I missed the listing of examples in the README, I will make sure to include it. I've documented the file pretty extensively, since as you say, it's fairly involved.

The algorithm you've shown does seem to be an improvement, and one that I'd like to incorporate down the road. My reasoning for using Gibbs sampling was a) it was the original approach and b) I actually implemented it for my own research, where I'm obligated to use that particular approach. As I say, I'm happy to include it, but doing so will take some time that I possibly do not have at the moment.

As I mentioned in my inline comments, I misunderstood a few things about the organization of the crate. I plan to move the main code from predict to train and make predict be a no-op. However, looking at the paper you linked, and the scikit implementation, it seems like predict is very similar to scikit's transform. That is, passing input data into transform results in a distribution of categories over documents. I believe I could do this with the current algorithm, but it will be better once we're using online VB.

So, I will take some time to update the changes suggested, but leaving the central algorithm intact, then let you know. When I have time, I'll look into changing to Online VB, for a faster and more flexible approach.

@AtheMathmo (Owner)

I haven't read all your comments but from your summary I think your approach sounds great. Keep the core algorithm as it is and we can write up an issue for future work with VB once this has been merged.

it seems like predict is very similar to scikit's transform

Scikit uses both a predict and a transform function depending on the model. See their GMM docs. We also have both in rusty-machine (though note that there have been some changes to this trait on the current master branch). If you think this will provide a better fit you could switch to this trait instead. Right now these traits are a little disorganized but I'm hoping to address that in a future release.

@dyule (Author) left a review comment

I could not make git tell me where the changes were, and so now here is a useless commit :(

@dyule (Author) commented Feb 21, 2017

The Transform trait does in fact seem like an excellent fit. I have only one thought. Currently, the parameterization of TransformFitter uses T to indicate a Transformer whereas the parameterization of Transformer uses T to indicate some kind of input type. It's a small thing, but it did cause me some confusion when first reading the code.

Also, Transformer currently requires the input and output of the transform to be of the same type, but that may not necessarily be the case. For example, my input is integers (word counts), and it'll be transformed into f64 (probabilities).
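One way this restriction could be lifted is with separate associated input and output types on the trait. This is a hypothetical redesign sketch, not rusty-machine's actual Transformer trait:

```rust
// Hypothetical trait shape: distinct input and output associated types,
// so a transform can map word counts (usize) to probabilities (f64).
trait Transformer {
    type Input;
    type Output;
    fn transform(&mut self, inputs: Self::Input) -> Self::Output;
}

// Toy transformer: normalize counts into a probability vector.
struct CountsToProbs;

impl Transformer for CountsToProbs {
    type Input = Vec<usize>; // word counts
    type Output = Vec<f64>;  // probabilities
    fn transform(&mut self, counts: Self::Input) -> Self::Output {
        let total: usize = counts.iter().sum();
        counts
            .iter()
            .map(|&c| c as f64 / total.max(1) as f64)
            .collect()
    }
}

fn main() {
    let mut t = CountsToProbs;
    let probs = t.transform(vec![1, 3]);
    assert_eq!(probs, vec![0.25, 0.75]);
    println!("{:?}", probs);
}
```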

@AtheMathmo (Owner)

Also, Transformer currently requires the input and output of the transform to be of the same type

Ah yes, that is true. I'm not sure whether we want to lift this restriction as future changes to the UnSupModel should give the behaviour you want. I'll have to think about things a little more...

@dyule (Author) commented Feb 21, 2017

So, I've made the changes promised. I've implemented LDA as a Transformer, which involved allowing them to have different input and output types. I'm happy to switch it back to being a learning model if that's where you think it fits best.

What it comes down to is the semantics of a Transformer, and where LDA fits into that idea. Currently, LDA has much more in common with the other learning models than with the other transformers, which all have to do with pre-processing the input data, but I'm not sure what your long term intention is there.

@patrickmesana

Don't you mean Latent Dirichlet Allocation?

@dyule changed the title from Linear Dirichlet Allocation to Latent Dirichlet Allocation on Sep 24, 2017
@rohitjoshi

👍 waiting for merge

@zackmdavis (Collaborator)

waiting for merge

(Deputy acting co-maintainer here, but time-crunched at the moment; have made a note to take a look early next week)

@zackmdavis (Collaborator)

(or maybe this week)
