This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

Latent Dirichlet Allocation #172

Open · wants to merge 17 commits into master

Conversation

@dyule commented Jan 24, 2017

I've created a basic implementation of linear dirichlet allocation. Unlike the other learning algorithms, it's unsupervised, so the training method is just empty. It doesn't support non-symmetric parameters, but that could fairly easily be added in. I've tried to document everything clearly.

However, I built this with v0.4 of rulinalg, so this pull request includes #167.

I'm happy to take any feedback on the approach here, and update as necessary.

@dyule (Author) commented Jan 24, 2017

Just realized I called it linear dirichlet allocation in the title. It's Latent Dirichlet Allocation. I constantly get that backwards, but it's right in the code.

@AtheMathmo (Owner)

Thank you for this! I've been hoping to make LDA a part of rusty-machine for a while now :).

Sadly I'm no expert on LDA and I am particularly busy at the moment so I'll need a little time to digest this PR. Can you poke me if I still haven't reviewed this by the end of the week?

@dyule (Author) commented Jan 25, 2017

If it helps, the only two files that are from LDA and not the rulinalg bump are examples/lda_gen.rs and src/learning/lda.rs. I don't believe I changed anything else of importance.

@dyule (Author) commented Jan 30, 2017

I've updated the code to be a little more efficient and documented it better.

@AtheMathmo (Owner)

Thanks for your patience. I finally have some time to look through this and leave some comments. I'm still not too familiar with LDA but hopefully can provide some meaningful comments anyway.

@AtheMathmo (Owner) left a review comment

In general this PR has been quite difficult to review because of the rulinalg 0.4 PR that has been incorporated. It's hard to disentangle the changes, especially as I haven't reviewed the other one yet! I think it might be a better idea to implement this PR for the current rusty-machine version and update it if/when rulinalg 0.4 lands. [Note that this will hopefully be soon, I finally have some free time back and want to try to get things moving here again!]

Before I give actual feedback, please remember that I don't know the LDA model very well and so my feedback should be taken with a large pinch of salt :). Please don't hesitate to inform me when I am being ill-informed.

With regards to the example - I think it's really great! But we should add a description of it in the examples/README.md file as exists for the others. This is especially important because this example is somewhat involved.

I'm happy to merge the current approach but I think we could make some improvements by using the online VB algorithm - see Algorithm 2 in this paper, it doesn't seem too intense. Let me know what you think, as I say, this is just a mild suggestion.

There is also some mismatch here in how this model is used compared to others in rusty-machine. This isn't really your fault, the current model traits need some alterations but I can't settle on what those should be. If you were to follow the rest of the library more closely then the train function would be used to compute the LDA results and these would be stored within the model. The predict function does not really have an obvious application here (in that setting). However, looking at the scikit learn implementation they model LDA as a transformer. I am not sure exactly what they do here but it might be worth looking into a little and determining if we can do the same with our Transformer trait.
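The trait mismatch described here can be sketched concretely. Below is a minimal, hypothetical illustration (std-only Rust; the names `Lda` and `LdaFit` are placeholders, not rusty-machine's actual API) of the fit-then-use shape a transformer-style LDA might take: `fit` consumes the corpus and returns a result object holding the learned distributions, instead of an empty `train` plus a do-everything `predict`.

```rust
// Hypothetical transformer-style shape for LDA. All names are
// illustrative; the real inference step is stubbed out.

struct Lda {
    topic_count: usize,
}

struct LdaFit {
    // rows = topics, cols = words: p(word | topic)
    phi: Vec<Vec<f64>>,
}

impl Lda {
    // `fit` takes per-document word counts and returns a fitted result.
    fn fit(&self, word_counts: &[Vec<usize>]) -> LdaFit {
        // Real code would run Gibbs sampling or variational inference here;
        // this stub just returns a uniform distribution per topic.
        let vocab = word_counts.first().map_or(0, |d| d.len());
        let uniform = vec![1.0 / vocab.max(1) as f64; vocab];
        LdaFit { phi: vec![uniform; self.topic_count] }
    }
}

fn main() {
    let docs = vec![vec![3usize, 0, 1, 2], vec![0, 4, 1, 0]];
    let fit = Lda { topic_count: 2 }.fit(&docs);
    // Each topic's word distribution sums to 1.
    for topic in &fit.phi {
        let total: f64 = topic.iter().sum();
        assert!((total - 1.0).abs() < 1e-12);
    }
    println!("fitted {} topics over {} words", fit.phi.len(), fit.phi[0].len());
}
```

The point of the sketch is only the signature: training produces a separate fitted value, which is the pattern the Transformer trait discussion below converges on.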

@@ -0,0 +1,191 @@
/// An example of how Latent Diriclhet Allocation (LDA) can be used. This example begins by
/// generating a distribution of words to categories. This distribution is creatred so that
AtheMathmo (Owner):
Typo here: distribution is CREATED.

/// Given `topic_count` topics, this function will create a distrbution of words for each
/// topic. For simplicity, this function assumes that the total number of words in the corpus
/// will be `(topic_count / 2)^2`.
fn generate_word_distribution(topic_count: usize) -> Matrix<f64> {
AtheMathmo (Owner):
Minor point: Potential issue here with assuming topic_count is divisible by 2. From what I can tell this shouldn't break anything but I wanted to draw your attention to it regardless.
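To make the divisibility caveat concrete, a tiny std-only snippet showing how integer division makes the implied vocabulary size round down for odd topic counts:

```rust
fn main() {
    // Integer division: for an odd topic_count, the implied vocabulary
    // size silently rounds down, e.g. 5 topics -> (5 / 2)^2 = 4 words.
    for topic_count in [4usize, 5, 6] {
        let words = (topic_count / 2).pow(2);
        println!("{} topics -> {} words", topic_count, words);
    }
    assert_eq!((5usize / 2).pow(2), 4);
    assert_eq!((6usize / 2).pow(2), 9);
}
```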

accuracy(outputs.iter_rows(), targets.iter_rows())
pub fn row_accuracy<T: PartialEq>(outputs: &Matrix<T>, targets: &Matrix<T>) -> f64 {

assert!(outputs.rows() == targets.rows());
AtheMathmo (Owner):
I'm a little confused by this change. I'm guessing it comes from the rulinalg-0.4 PR?

dyule (Author):
Indeed, I am pretty certain I took this one directly from the rulinalg bump PR.

AtheMathmo (Owner):
I've now merged a new linalg bump PR into master. I guess you'll want to rebase from that (or just C+P the relevant modules onto a new branch :) )

@@ -0,0 +1,250 @@
//! Latent Diriclhet Allocation Module
AtheMathmo (Owner):
Minor: Typo here, DIRICHLET.

@@ -0,0 +1,250 @@
//! Latent Diriclhet Allocation Module
//!
//! Contains an implementation of Latent Diriclhet Allocation (LDA) using
AtheMathmo (Owner):
And again here

//! Gibbs sampling is a Morkov Chain Monto Carlo algorithm that iteratively approximates
//! the above distributions.
//!
//! This module doesn't use any training. It uses unsupervised learning to estimate
AtheMathmo (Owner):
I find this statement a little confusing. I would argue that unsupervised learning algorithms still have a training phase. However, during this training stage they are not given any target data.

dyule (Author):
LDA (in its original form) is in a sort of strange place with respect to machine learning, in that it's designed to classify only the input data, and nothing more. In this sense, the training and prediction stage are actually identical. One could certainly use the distribution that was discovered to classify additional documents, but for whatever reason, that's not how it was used. I'll have more to say about this in a later comment.

beta: f64,
}

impl Default for LDA {
AtheMathmo (Owner):
I'm unfamiliar with the algorithm and related literature, so please let me know if I'm being silly.

This seems a little arbitrary to me (the topic count in particular). We should at least document these values in the implementation, e.g.

/// Creates a new LDA with topic count = 10, ... etc.
impl Default for LDA { }

dyule (Author):
As far as I can tell, there aren't any sensible constant defaults for LDA. I only added this default implementation because the other models seemed to have it. I could drop it if you like (or document it better, of course).

AtheMathmo (Owner):
I think it is good to have a default implementation so people can easily get off the ground. But yes it would be good to add documentation for this and explicitly state the parameter choices made.
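For illustration, a hedged sketch of what such a documented `Default` impl could look like. The field names mirror the PR's struct, but the concrete values (10 topics, alpha = beta = 0.1) are placeholders for this sketch, not necessarily the PR's actual choices:

```rust
// Illustrative only: explicit, documented defaults for the LDA model.
struct LDA {
    topic_count: usize,
    alpha: f64,
    beta: f64,
}

/// Creates a new LDA with `topic_count = 10`, `alpha = 0.1`, `beta = 0.1`.
/// These are generic starting points so users can get off the ground;
/// they should be tuned for any particular corpus.
impl Default for LDA {
    fn default() -> Self {
        LDA { topic_count: 10, alpha: 0.1, beta: 0.1 }
    }
}

fn main() {
    let lda = LDA::default();
    assert_eq!(lda.topic_count, 10);
    println!("alpha = {}, beta = {}", lda.alpha, lda.beta);
}
```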

/// Find the distribution of words over topics. This gives a matrix where the rows are
/// topics and the columns are words. Each entry (topic, word) gives the probability of
/// word given topic.
pub fn phi(&self) -> Matrix<f64> {
AtheMathmo (Owner):
This name seems somewhat dependent on related literature. I worry that unfamiliar users (like myself) would not know where to look for this function. Maybe it should be renamed distr_of_words?

dyule (Author):
I agree that this is a confusing name. My reason for keeping away from a simpler name was a) its use in the literature and b) I was a bit concerned it wouldn't be clear it was a conditional distribution (or the direction of the condition). But I think with proper documentation, this is fine.
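One way to keep the literature name while spelling out the conditional direction is to put it in the doc comment. This is an illustrative std-only sketch with a stubbed uniform distribution, not the PR's implementation:

```rust
// Illustrative sketch: the `phi` name from the LDA literature, with the
// conditional direction documented explicitly.
struct LdaResult {
    topic_count: usize,
    word_count: usize,
}

impl LdaResult {
    /// Word distribution per topic, called `phi` in the LDA literature:
    /// entry `(k, w)` holds `p(word = w | topic = k)`, so each row sums to 1.
    fn phi(&self) -> Vec<Vec<f64>> {
        // Stub: uniform rows; a real result would be built from sampled counts.
        vec![vec![1.0 / self.word_count as f64; self.word_count]; self.topic_count]
    }
}

fn main() {
    let r = LdaResult { topic_count: 3, word_count: 4 };
    let phi = r.phi();
    assert_eq!(phi.len(), 3);
    let row_sum: f64 = phi[0].iter().sum();
    assert!((row_sum - 1.0).abs() < 1e-12);
    println!("{} topics x {} words", phi.len(), phi[0].len());
}
```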


impl UnSupModel<(Matrix<usize>, usize), LDAResult> for LDA {
/// Predict categories from the input matrix.
fn predict(&self, inputs: &(Matrix<usize>, usize)) -> LearningResult<LDAResult> {
AtheMathmo (Owner):
I think that we need to be clearer about what these inputs are. I am assuming that the first tuple argument is the word counts per document, and the second is the number of iterations?

It would follow the rest of the library more closely if the iterations were specified at the model level, and this function simply took the &Matrix<usize>.

And one other small note, in order for docs to show up with rustdoc they must be placed on the outside of the impl block. e.g.

/// Doc here will be rendered
impl UnSupModel<..> ... {
    /// Docs here will not
    fn predict(...) -> ... { }
}

dyule (Author):
I've been very unhappy with this function signature since I started. It will certainly be more ergonomic to have them in the model, but my reasoning for keeping them out is that I don't think number of iterations is a property of the model. However, I believe that my conception of the way your library was organized was a bit off, since in my brain, LDA was a sort of factory object that could generate multiple classifications for different input. It seems pretty obvious now that this wasn't the case.

AtheMathmo (Owner):
This isn't your fault, I'm quite unhappy with the structure of the current traits. They worked well for the first few algorithms I implemented but are a bit of a stretch now for unsupervised models.

Here you have sort of emulated what I want the traits to look like. One trait should be used to train a model using input data and this should output a model that can be used predictively. Like how your LDA model produces an LDAResult. The new Transformer trait works in this way.

use super::{LDAResult, LDA};
use linalg::{Matrix, Vector};
#[test]
fn test_conditional_distribution() {
AtheMathmo (Owner):
This is probably a result of me not knowing the algorithm very well, but this test does not look entirely convincing. How do you know that the conditional distribution returned here is indeed correct?

I don't have any better suggestions right now for improving the tests but I would feel more comfortable with stronger coverage for this algorithm.

dyule (Author):
I can look into testing a bit more, and I agree about the coverage. This test is based on values I calculated essentially by hand and verified against a python implementation. The present implementation of the algorithm really only has three functions of note: predict and conditional_distribution in LDA, and new in LDAResult. predict is essentially the entire algorithm, and so is exercised by the example; it can't really be tested automatically, because it relies on random choice, and unless we want to make it possible to pass a source of randomness into a model, that can't be controlled. new has the same problem as predict, and conditional_distribution is already tested via these magic numbers.

I can certainly add in some tests of the same kind as some of the other learning algorithms, where I provide some basic input, and ensure that the algorithm completes without panicking. I will include this in my updates.

AtheMathmo (Owner):
This test is based on values I calculated basically by hand, and verified by a python implementation.

I think this is fine but you should note this somewhere near the test with a reference to any implementation you used if possible. It makes them a little more convincing to people reading over the code.

unless we want to make it possible to pass a source of randomness into a model

This is actually a good idea. See #138. The support isn't there in the library yet, but if you can get something working in the meantime it would be helpful for later. If we want to make changes later we'd like to be able to verify that the algorithm is still doing (close-to) the same thing.
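As a sketch of what threading a caller-supplied randomness source through the model could look like: a tiny hand-rolled LCG stands in for a real RNG abstraction here, and rusty-machine's eventual API for #138 may well differ. The payoff is that a Gibbs-style sampling step becomes reproducible in tests.

```rust
// A minimal seedable generator so sampling is deterministic under a fixed
// seed. Constants are the common 64-bit LCG multiplier/increment pair.
struct Lcg(u64);

impl Lcg {
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Use the top 53 bits to build a float in [0, 1).
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

// Draw an index from an (unnormalized) discrete distribution, as one
// Gibbs sampling step for a topic assignment would.
fn sample_topic(rng: &mut Lcg, probs: &[f64]) -> usize {
    let total: f64 = probs.iter().sum();
    let mut u = rng.next_f64() * total;
    for (i, p) in probs.iter().enumerate() {
        u -= p;
        if u <= 0.0 {
            return i;
        }
    }
    probs.len() - 1
}

fn main() {
    let probs = [0.2, 0.5, 0.3];
    // Same seed -> same draws, so a test over this is deterministic.
    let a: Vec<usize> = {
        let mut r = Lcg(42);
        (0..5).map(|_| sample_topic(&mut r, &probs)).collect()
    };
    let b: Vec<usize> = {
        let mut r = Lcg(42);
        (0..5).map(|_| sample_topic(&mut r, &probs)).collect()
    };
    assert_eq!(a, b);
    println!("{:?}", a);
}
```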

* Updating to rulinalg v0.4.2

* Slight tidy up for dbscan

* Improving GMM cov decomposition error msg

* Updating benchmarks with rulinalg0.4
This work was started by sinhrks. After breaking a rebase it was easier to move to a new branch...
@dyule (Author) commented Feb 21, 2017

Thanks for taking the time to review this. I very much appreciate all your comments. I've responded to each of them with my thought process, but I will incorporate them into my changes.

For the example, I suppose I missed the listing of examples in the README, I will make sure to include it. I've documented the file pretty extensively, since as you say, it's fairly involved.

The algorithm you've shown does seem to be an improvement, and one that I'd like to incorporate down the road. My reasoning for using Gibbs sampling was a) it was the original approach and b) I actually implemented it for my own research, where I'm obligated to use that particular approach. As I say, I'm happy to include it, but doing so will take some time that I possibly do not have at the moment.

As I mentioned in my inline comments, I misunderstood a few things about the organization of the crate. I plan to move the main code from predict to train and make predict be a no-op. However, looking at the paper you linked, and the scikit implementation, it seems like predict is very similar to scikit's transform. That is, passing input data into transform results in a distribution of categories over documents. I believe I could do this with the current algorithm, but it will be better once we're using online VB.

So, I will take some time to update the changes suggested, but leaving the central algorithm intact, then let you know. When I have time, I'll look into changing to Online VB, for a faster and more flexible approach.

@AtheMathmo (Owner)

I haven't read all your comments but from your summary I think your approach sounds great. Keep the core algorithm as it is and we can write up an issue for future work with VB once this has been merged.

it seems like predict is very similar to scikit's transform

Scikit uses both a predict and a transform function depending on the model. See their GMM docs. We also have both in rusty-machine (though note that there have been some changes to this trait on the current master branch). If you think this will provide a better fit you could switch to this trait instead. Right now these traits are a little disorganized but I'm hoping to address that in a future release.

@dyule (Author) left a review comment

I could not make git tell me where the changes were, and so now here is a useless commit :(

@dyule (Author) commented Feb 21, 2017

The Transform trait does in fact seem like an excellent fit. I have only one thought. Currently, the parameterization of TransformFitter uses T to indicate a Transformer whereas the parameterization of Transformer uses T to indicate some kind of input type. It's a small thing, but it did cause me some confusion when first reading the code.

Also, Transformer currently requires the input and output of the transform to be of the same type, but that may not necessarily be the case. For example, my input is integers (word counts), and it'll be transformed into f64 (probabilities).
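One way this restriction could be lifted is with separate associated input and output types on the trait. This is a hypothetical redesign sketch, not rusty-machine's actual Transformer trait:

```rust
// Hypothetical trait shape: distinct input and output associated types,
// so a transform can map word counts (usize) to probabilities (f64).
trait Transformer {
    type Input;
    type Output;
    fn transform(&mut self, inputs: Self::Input) -> Self::Output;
}

// Toy transformer: normalize counts into a probability vector.
struct CountsToProbs;

impl Transformer for CountsToProbs {
    type Input = Vec<usize>; // word counts
    type Output = Vec<f64>;  // probabilities
    fn transform(&mut self, counts: Self::Input) -> Self::Output {
        let total: usize = counts.iter().sum();
        counts
            .iter()
            .map(|&c| c as f64 / total.max(1) as f64)
            .collect()
    }
}

fn main() {
    let mut t = CountsToProbs;
    let probs = t.transform(vec![1, 3]);
    assert_eq!(probs, vec![0.25, 0.75]);
    println!("{:?}", probs);
}
```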

@AtheMathmo (Owner)

Also, Transformer currently requires the input and output of the transform to be of the same type

Ah yes, that is true. I'm not sure whether we want to lift this restriction as future changes to the UnSupModel should give the behaviour you want. I'll have to think about things a little more...

@dyule (Author) commented Feb 21, 2017

So, I've made the changes promised. I've implemented LDA as a Transformer, which involved allowing them to have different input and output types. I'm happy to switch it back to being a learning model if that's where you think it fits best.

What it comes down to is the semantics of a Transformer, and where LDA fits into that idea. Currently, LDA has much more in common with the other learning models than with the other transformers, which all have to do with pre-processing the input data, but I'm not sure what your long term intention is there.

@patrickmesana

Don't you mean Latent Dirichlet Allocation?

@dyule changed the title from Linear Dirichlet Allocation to Latent Dirichlet Allocation on Sep 24, 2017
@rohitjoshi

👍 waiting for merge

@zackmdavis (Collaborator)

waiting for merge

(Deputy acting co-maintainer here, but time-crunched at the moment; have made a note to take a look early next week)

@zackmdavis (Collaborator)

(or maybe this week)
