Latent Dirichlet Allocation #172
base: master
Conversation
Just realized I called it linear dirichlet allocation in the title. It's Latent Dirichlet Allocation. I constantly get that backwards, but it's right in the code.
Thank you for this! I've been hoping to make LDA a part of rusty-machine for a while now :). Sadly I'm no expert on LDA and I am particularly busy at the moment so I'll need a little time to digest this PR. Can you poke me if I still haven't reviewed this by the end of the week?
If it helps, the only two files that are from LDA and not the rulinalg bump are src/learning/lda.rs and examples/lda_gen.rs.
I've updated the code to be a little more efficient and documented better.
Thanks for your patience. I finally have some time to look through this and leave some comments. I'm still not too familiar with LDA but hopefully can provide some meaningful comments anyway.
In general this PR has been quite difficult to review because of the rulinalg 0.4 PR that has been incorporated. It's hard to disentangle the changes, especially as I haven't reviewed the other one yet! I think it might be a better idea to implement this PR for the current rusty-machine version and update it if/when rulinalg 0.4 lands. [Note that this will hopefully be soon, I finally have some free time back and want to try to get things moving here again!]
Before I give actual feedback, please remember that I don't know the LDA model very well and so my feedback should be taken with a large pinch of salt :). Please don't hesitate to inform me when I am being ill-informed.
With regards to the example - I think it's really great! But we should add a description of it in the examples/README.md file, as exists for the others. This is especially important because this example is somewhat involved.
I'm happy to merge the current approach but I think we could make some improvements by using the online VB algorithm - see Algorithm 2 in this paper, it doesn't seem too intense. Let me know what you think, as I say, this is just a mild suggestion.
There is also some mismatch here in how this model is used compared to others in rusty-machine. This isn't really your fault; the current model traits need some alterations but I can't settle on what those should be. If you were to follow the rest of the library more closely then the train function would be used to compute the LDA results and these would be stored within the model. The predict function does not really have an obvious application here (in that setting). However, looking at the scikit-learn implementation they model LDA as a transformer. I am not sure exactly what they do here but it might be worth looking into a little and determining if we can do the same with our Transformer trait.
examples/lda_gen.rs
Outdated
@@ -0,0 +1,191 @@
/// An example of how Latent Diriclhet Allocation (LDA) can be used. This example begins by
/// generating a distribution of words to categories. This distribution is creatred so that
Typo here: distribution is CREATED.
/// Given `topic_count` topics, this function will create a distrbution of words for each
/// topic. For simplicity, this function assumes that the total number of words in the corpus
/// will be `(topic_count / 2)^2`.
fn generate_word_distribution(topic_count: usize) -> Matrix<f64> {
Minor point: Potential issue here with assuming topic_count is divisible by 2. From what I can tell this shouldn't break anything but I wanted to draw your attention to it regardless.
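For illustration, one way to make that assumption explicit would be something like the sketch below (the body is hypothetical, not the example's actual code):

use rusty_machine::linalg::Matrix;

/// Hypothetical variant of generate_word_distribution that fails loudly on an
/// odd topic count instead of silently shrinking the vocabulary.
fn generate_word_distribution(topic_count: usize) -> Matrix<f64> {
    assert!(topic_count % 2 == 0, "topic_count is assumed to be even");
    let width = topic_count / 2;
    let word_count = width * width;
    // Placeholder body: a uniform distribution over word_count words per topic.
    Matrix::new(topic_count, word_count,
                vec![1.0 / word_count as f64; topic_count * word_count])
}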
src/analysis/score.rs
Outdated
accuracy(outputs.iter_rows(), targets.iter_rows())
pub fn row_accuracy<T: PartialEq>(outputs: &Matrix<T>, targets: &Matrix<T>) -> f64 {

    assert!(outputs.rows() == targets.rows());
I'm a little confused by this change. I'm guessing it comes from the rulinalg-0.4 PR?
Indeed, I am pretty certain I took this one directly from the linalg bump PR
I've now merged a new linalg bump PR into master. I guess you'll want to rebase from that (or just C+P the relevant modules onto a new branch :) )
src/learning/lda.rs
Outdated
@@ -0,0 +1,250 @@
//! Latent Diriclhet Allocation Module
Minor: Typo here, DIRICHLET.
src/learning/lda.rs
Outdated
@@ -0,0 +1,250 @@
//! Latent Diriclhet Allocation Module
//!
//! Contains an implementation of Latent Diriclhet Allocation (LDA) using
And again here
src/learning/lda.rs
Outdated
//! Gibbs sampling is a Morkov Chain Monto Carlo algorithm that iteratively approximates
//! the above distributions.
//!
//! This module doesn't use any training. It uses unsupervised learning to estimate
I find this statement a little confusing. I would argue that unsupervised learning algorithms still have a training phase. However, during this training stage they are not given any target data.
LDA (in its original form) is in a sort of strange place with respect to machine learning, in that it's designed to classify only the input data, and nothing more. In this sense, the training and prediction stage are actually identical. One could certainly use the distribution that was discovered to classify additional documents, but for whatever reason, that's not how it was used. I'll have more to say about this in a later comment.
src/learning/lda.rs
Outdated
    beta: f64,
}

impl Default for LDA {
I'm unfamiliar with the algorithm and related literature, so please let me know if I'm being silly.
This seems a little arbitrary to me (the topic count in particular). We should at least document these values in the implementation, e.g.
/// Creates a new LDA with topic count = 10, ... etc.
impl Default for LDA { }
As far as I can tell, there aren't any sensible constant defaults for LDA. I only added this default implementation because the other models seemed to have it. I could drop it if you like (or document it better, of course).
I think it is good to have a default implementation so people can easily get off the ground. But yes it would be good to add documentation for this and explicitly state the parameter choices made.
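For example, something along these lines (the concrete field names and values are placeholders for illustration, not necessarily this PR's actual choices):

pub struct LDA {
    topic_count: usize,
    alpha: f64,
    beta: f64,
}

/// Creates a new LDA with topic_count = 10, alpha = 0.1 and beta = 0.1.
///
/// These are only convenient starting values; LDA has no universally sensible
/// constants, so they should be tuned for the corpus at hand.
impl Default for LDA {
    fn default() -> LDA {
        LDA {
            topic_count: 10,
            alpha: 0.1,
            beta: 0.1,
        }
    }
}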
src/learning/lda.rs
Outdated
/// Find the distribution of words over topics. This gives a matrix where the rows are
/// topics and the columns are words. Each entry (topic, word) gives the probability of
/// word given topic.
pub fn phi(&self) -> Matrix<f64> {
This name seems somewhat dependent on related literature. I worry that unfamiliar users (like myself) would not know where to look for this function. Maybe it should be renamed distr_of_words?
I agree that this is a confusing name. My reason for keeping away from a simpler name was a) its use in the literature and b) I was a bit concerned it wouldn't be clear it was a conditional distribution (or the direction of the condition). But I think with proper documentation, this is fine.
src/learning/lda.rs
Outdated

impl UnSupModel<(Matrix<usize>, usize), LDAResult> for LDA {
    /// Predict categories from the input matrix.
    fn predict(&self, inputs: &(Matrix<usize>, usize)) -> LearningResult<LDAResult> {
I think that we need to be clearer about what these inputs are. I am assuming that the first tuple argument is the word counts per document, and the second is the number of iterations?
It would follow the rest of the library more closely if the iterations were specified at the model level, and this function simply took the &Matrix<usize>.
And one other small note: in order for docs to show up with rustdoc, they must be placed on the outside of the impl block, e.g.
/// Doc here will be rendered
impl UnSupModel<..> ... {
/// Docs here will not
fn predict(...) -> ... { }
}
I've been very unhappy with this function signature since I started. It will certainly be more ergonomic to have them in the model, but my reasoning for keeping them out is that I don't think number of iterations is a property of the model. However, I believe that my conception of the way your library was organized was a bit off, since in my brain, LDA was a sort of factory object that could generate multiple classifications for different input. It seems pretty obvious now that this wasn't the case.
This isn't your fault, I'm quite unhappy with the structure of the current traits. They worked well for the first few algorithms I implemented but are a bit of a stretch now for unsupervised models.
Here you have sort of emulated what I want the traits to look like. One trait should be used to train a model using input data and this should output a model that can be used predictively. Like how your LDA model produces an LDAResult. The new Transformer trait works in this way.
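As a rough sketch of that shape (the names here are hypothetical, not rusty-machine's actual API), fitting consumes the inputs and returns a separate result type that is then queried:

/// Hypothetical trait: training produces a distinct fitted artefact,
/// much like LDA producing an LDAResult.
pub trait FitModel<I> {
    /// The trained result type, e.g. an LDAResult.
    type Fitted;

    /// Consume the inputs and return the fitted result.
    fn fit(&self, inputs: &I) -> Self::Fitted;
}

// For LDA this might read:
//
// impl FitModel<Matrix<usize>> for LDA {
//     type Fitted = LDAResult;
//     fn fit(&self, inputs: &Matrix<usize>) -> LDAResult { /* Gibbs sampling */ }
// }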
src/learning/lda.rs
Outdated
use super::{LDAResult, LDA}; | ||
use linalg::{Matrix, Vector}; | ||
#[test] | ||
fn test_conditional_distribution() { |
This is probably a result of me not knowing the algorithm very well, but this test does not look entirely convincing. How do you know that the conditional distribution returned here is indeed correct?
I don't have any better suggestions right now for improving the tests but I would feel more comfortable with stronger coverage for this algorithm.
I can look into testing a bit more, and I agree about the coverage. This test is based on values I calculated basically by hand, and verified by a python implementation. The present implementation of the algorithm really only has three functions of note: predict and conditional_distribution in LDA, and new in LDAResult. predict is basically the entire algorithm, and so is tested basically by the example. It can't really be tested automatically, because it relies on random choice, and unless we want to make it possible to pass a source of randomness into a model, that can't really be controlled. new has the same problem as predict, and conditional_distribution is already tested via these magic numbers.
I can certainly add in some tests of the same kind as some of the other learning algorithms, where I provide some basic input, and ensure that the algorithm completes without panicking. I will include this in my updates.
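A smoke test of that kind might look roughly like the sketch below (the module paths and the predict signature follow this PR's draft; the numbers are just illustrative):

#[cfg(test)]
mod smoke_tests {
    use super::LDA;
    use linalg::Matrix;
    use learning::UnSupModel;

    #[test]
    fn lda_runs_on_small_input() {
        // Two tiny "documents" over a four-word vocabulary.
        let word_counts = Matrix::new(2, 4, vec![1usize, 0, 2, 1,
                                                 0, 3, 1, 0]);
        let lda = LDA::default();

        // Ten Gibbs iterations; we only check that nothing panics and a result comes back.
        let result = lda.predict(&(word_counts, 10));
        assert!(result.is_ok());
    }
}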
This test is based on values I calculated basically by hand, and verified by a python implementation.
I think this is fine but you should note this somewhere near the test with a reference to any implementation you used if possible. It makes them a little more convincing to people reading over the code.
unless we want to make it possible to pass a source of randomness into a model
This is actually a good idea. See #138. The support isn't there in the library yet but if you can get something working in the meantime it would be helpful for later. If we want to make changes later we'd like to be able to verify that the algorithm is still doing (close-to) the same thing.
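As a sketch of what that could look like (the field and constructor names are hypothetical, assuming the rand 0.3-era API rusty-machine currently depends on):

extern crate rand;

use rand::{Rng, SeedableRng, StdRng};

/// LDA parameterised over its randomness source, so tests can inject a
/// seeded generator and get reproducible Gibbs sweeps.
pub struct LDA<R: Rng> {
    topic_count: usize,
    alpha: f64,
    beta: f64,
    rng: R,
}

impl LDA<StdRng> {
    /// Construct with a fixed seed for deterministic test runs.
    pub fn with_seed(topic_count: usize, alpha: f64, beta: f64, seed: &[usize]) -> Self {
        LDA {
            topic_count: topic_count,
            alpha: alpha,
            beta: beta,
            rng: StdRng::from_seed(seed),
        }
    }
}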
* Updating to rulinalg v0.4.2
* Slight tidy up for dbscan
* Improving GMM cov decomposition error msg
* Updating benchmarks with rulinalg 0.4
This work was started by sinhrks. After breaking a rebase it was easier to move to a new branch...
Thanks for taking the time to review this. I very much appreciate all your comments. I've responded to each of them with my thought process, but I will incorporate them into my changes.
For the example, I suppose I missed the listing of examples in the README; I will make sure to include it. I've documented the file pretty extensively, since as you say, it's fairly involved.
The algorithm you've shown does seem to be an improvement, and one that I'd like to incorporate down the road. My reasoning for using Gibbs sampling was a) it was the original approach and b) I actually implemented it for my own research, where I'm obligated to use that particular approach. As I say, I'm happy to include it, but doing so will take some time that I possibly do not have at the moment.
As I mentioned in my inline comments, I misunderstood a few things about the organization of the crate. I plan to move the main code out of predict, as suggested.
So, I will take some time to update the changes suggested, but leaving the central algorithm intact, then let you know. When I have time, I'll look into changing to Online VB, for a faster and more flexible approach.
I haven't read all your comments but from your summary I think your approach sounds great. Keep the core algorithm as it is and we can write up an issue for future work with VB once this has been merged.
Scikit uses both a fit and a transform method.
I could not make git tell me where the changes were, and so now here is a useless commit :(
Also, Transformer currently requires the input and output of the transform to be of the same type, but that may not necessarily be the case. For example, my input is an integer (word counts), and it'll be transformed into f64 (probabilities).
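One way to lift that restriction would be an associated output type on the trait, roughly like this (a hypothetical trait, not the current Transformer):

/// Hypothetical transformer trait whose output type may differ from its input.
pub trait Transform<I> {
    type Output;

    /// Consume the inputs and produce the transformed data.
    fn transform(&mut self, inputs: I) -> Self::Output;
}

// LDA could then map integer word counts to floating-point topic probabilities:
//
// impl Transform<Matrix<usize>> for LDA {
//     type Output = Matrix<f64>;
//     fn transform(&mut self, inputs: Matrix<usize>) -> Matrix<f64> { /* ... */ }
// }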
Ah yes, that is true. I'm not sure whether we want to lift this restriction as future changes to the
So, I've made the changes promised. I've implemented LDA as a Transformer, which involved allowing transformers to have different input and output types. I'm happy to switch it back to being a learning model if that's where you think it fits best. What it comes down to is the semantics of a Transformer.
Don't you mean Latent Dirichlet Allocation?
👍 waiting for merge
(Deputy acting co-maintainer here, but time-crunched at the moment; have made a note to take a look early next week)
(or maybe this week)
I've created a basic implementation of linear dirichlet allocation. Unlike the other learning algorithms, it's unsupervised, so the training method is just empty. It doesn't support non-symmetric parameters, but that could fairly easily be added in. I've tried to document everything clearly.
However, I built this with v0.4 of rulinalg, so this pull request includes #167.
I'm happy to take any feedback on the approach here, and update as necessary.