Binomial vs neg-binomial GLM regression #5
Comments
Hi Angelo,

Thank you for the interest; testing different observation models like this is always good practice.

The three classical count models (binomial, Poisson, negative binomial) each originated to solve a particular problem.

The binomial model (which I chose here) models the sum of a discrete number N of binary trials. Out of N total trials we might see K positive cases, and we learn about p = K / N, the probability of a positive case given the sampling population. This is very similar to contingency table statistics and closely related to Fisher's exact test.

The Poisson model was constructed to learn the number K of positive cases within a continuous interval. I think the famous example used it to model the number of people kicked by horses per year. But the main point is, it is [discrete number] / [continuous interval]. If you have data that is measured over different intervals, you need to put that into the model by using an offset.

The negative binomial model is also known as the Poisson-gamma model. Imagine you send your five students out to count red cars. When they return they give you a single number with no other information. You ask "for how long did you count cars?" and they reply "oh.. we didn't keep track". You have the counts, but you don't know how they relate to a "unit interval". To make use of this data you make a model along the lines of K ~ Poisson(λE), with an unknown exposure E ~ Gamma(α, β); marginalizing over E gives the negative binomial.

It should be mentioned that even if you have the values for "exposure" recorded, it is often still a good idea to use the negative binomial, because there might be additional unknown differences in exposure on top of the recorded exposure. However, in our case we know the total number of cells we saw as a discrete number, so a binomial model is very reasonable.

As a side note, if we have the K and N as described in the binomial case, and K / N is very small (e.g. < 0.01), then the properties of the binomial distribution essentially match those of the Poisson distribution with an offset. And the Poisson is easier to work with numerically, so it is often a decent idea to switch the binomial for a Poisson.

I mentioned that even when we know the offset, it is a good idea to use the NB distribution. However, the NB distribution is much more challenging computationally. It is more time consuming to fit, and when the observed counts are small it is very difficult to know whether the rate (mean) is small or the dispersion is large. This is known as an identifiability problem: both cases are equally likely given the data. So to be more certain about the rate / probability, we want to use the information we have, namely the total number of cells we counted. And this is why the binomial model has shorter confidence intervals than the NB model you have shared here. In your proposed model, I would bet that if you add an offset with the log of the total cell count, the confidence intervals would shrink.

There is another part to all this. Instead of using a binomial model per cell type, you could make a multinomial model. The difference is that a multinomial model will also consider correlation between cell type abundances in the different observed samples. I chose the binomial model under the simplifying assumption that I don't expect correlation between cell types. This is probably wrong: if one sub-class of T cells, like gamma-delta, is high, that is probably correlated with an increase in naive CD4 T cells. You can make a model that estimates that correlation from the data, and this will give you more statistical power. But I don't think the GLM package I used supports multinomial models.
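To make the comparison concrete, here is a minimal R sketch (not code from the post; it assumes a data frame `df` with `count`, `cluster`, and `time` columns as in your call, plus a hypothetical `total` column holding the total cells per sample):

```r
library(MASS)

# Binomial: K positive cells out of N total, logit link
m_binom <- glm(cbind(count, total - count) ~ cluster * time,
               family = binomial, data = df)

# Poisson with an offset: nearly equivalent when count / total is small
m_pois <- glm(count ~ cluster * time + offset(log(total)),
              family = poisson, data = df)

# Negative binomial with the same offset: allows extra unknown
# variation in exposure on top of the recorded totals
m_nb <- glm.nb(count ~ cluster * time + offset(log(total)), data = df)

# Rough comparison of average confidence interval width per model
sapply(list(binomial = m_binom, poisson = m_pois, negbin = m_nb),
       function(m) mean(apply(confint.default(m), 1, diff)))
```

If the reasoning above holds, the binomial and offset-Poisson intervals should nearly coincide, while the negative binomial intervals will typically be wider.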
Will Townes wrote a nice review about the connections between different count models: https://arxiv.org/abs/2001.04343v1

Sorry for the long response, but it was an excuse for me to write these things down!

Best,
Hej @vals!
Thanks for the really nice post about modeling single-cell compositional data (201127-cell-count-glm)!
I have a small question regarding the choice of model for this type of count data.
I used your standard binomial GLM with a logit link to compare cell type compositions between two conditions, and got good results that match what I see when simply plotting the relative cell type abundances in the two conditions.
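For reference, the call was along these lines (a sketch rather than the exact code, using the cbind formulation from the post and a hypothetical `total` column for the total cells per sample):

```r
# binomial GLM: `count` positive cells out of `total`, logit link
model1 <- glm(cbind(count, total - count) ~ cluster * time,
              family = binomial(link = "logit"), data = df)
```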
I then tried to repeat this with a negative binomial GLM from the MASS package, using the glm.nb() function with a log link.
The exact function call is

```r
library(MASS)

formula <- count ~ cluster * time
model2 <- glm.nb(formula = formula, link = "log", data = df)
```
where the df data frame has the same structure as the one posted in your example.
The calculated means for the odds ratios from both models are the same, but the negative binomial model has a much larger spread. I was wondering if you have any intuitive explanation for why the binomial GLM is the better choice, or vice versa?
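For completeness, the odds ratios and their spread were compared roughly like this (a reconstruction, not the exact code):

```r
# exponentiate coefficients and profile-likelihood CIs for both fits;
# the point estimates agree, but the NB intervals come out much wider
exp(cbind(estimate = coef(model1), confint(model1)))
exp(cbind(estimate = coef(model2), confint(model2)))
```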
Thanks a lot!
//Angelo