
Binomial vs neg-binomial GLM regression #5

Open
angelolimeta opened this issue Aug 30, 2021 · 1 comment
angelolimeta commented Aug 30, 2021

Hi @vals!
Thanks for the really nice post about modeling single cell compositional data (201127-cell-count-glm)!

I have a small question regarding the choice of model for this type of count data.

I used your standard binomial GLM with a logit link to compare cell type compositions between two conditions, and got good results that match what I see when simply plotting the relative cell type abundances in the two conditions:
[attached plot: odds ratio estimates from the binomial GLM]

I then tried to repeat it with a negative binomial GLM from the MASS package, using the glm.nb() function with a log link:
[attached plot: odds ratio estimates from the negative binomial GLM]

The exact function call is

```r
library(MASS)
formula <- count ~ cluster * time
model2 <- glm.nb(formula = formula, link = "log", data = df)
```
where the df data frame has the same structure as the one posted in your example.

The calculated mean odds ratios from the two models are the same, but the negative binomial model has a much larger spread. I was wondering if you have an intuitive explanation for why the binomial GLM is the better choice, or vice versa?

Thanks a lot!

//Angelo

vals commented Aug 31, 2021

Hi Angelo,

Thank you for your interest; testing different observation models like this is always good practice.

The three classical count models (binomial, Poisson, negative binomial) each originated as the solution to a particular problem.

The binomial model (which I chose here) models the sum of a discrete number N of binary trials. Out of N total trials we might see K positive cases. We learn about p = K / N, the probability of a positive case given the sampling population. This is very similar to contingency table statistics and closely related to Fisher's exact test.
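As an aside, this kind of binomial GLM can be fit in base R by handing `glm` the successes and failures together. A minimal sketch (my own, with hypothetical column names: `count` = K for one cell type, `total` = N cells per sample):

```r
# cbind(successes, failures) is the standard two-column response form
# for a binomial GLM in R.
model1 <- glm(cbind(count, total - count) ~ cluster * time,
              family = binomial(link = "logit"),
              data = df)
summary(model1)
```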

The Poisson model was constructed to learn the number K of positive cases within a continuous interval. I think Poisson used it to compare the number of people kicked by horses per year. But the main point is that it is [discrete number] / [continuous interval]. If your data are measured over different intervals, you need to put that into the model by using an offset.
For example, say you want to estimate the popularity of red cars. You send out five students to different street crossings to count cars. Two students count cars for 15 minutes and three students count cars for 30 minutes, and let's say all street crossings are equally busy. Then your model would need a formula like `num_red_cars ~ 1 + offset(log(observation_time))`. Roughly, the Poisson aims to model "positive cases per unit interval".
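To make that concrete, here is a small sketch of the red-car example (all names and numbers are invented for illustration):

```r
# Two students counted for 15 minutes, three for 30 minutes.
cars <- data.frame(
  num_red_cars     = c(3, 5, 9, 7, 11),
  observation_time = c(15, 15, 30, 30, 30)  # minutes
)

# The offset puts counts observed over different durations on a common
# "red cars per minute" scale.
fit_pois <- glm(num_red_cars ~ 1 + offset(log(observation_time)),
                family = poisson, data = cars)
exp(coef(fit_pois))  # estimated rate of red cars per minute
```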

The negative binomial model is also known as the Poisson-gamma model. Imagine you send your five students out to count red cars. When they return they give you a single number with no other information. You ask "for how long did you count cars?" and they reply "oh... we didn't keep track". You have the counts, but you don't know how they relate to a "unit interval". To make use of this data you make a model like `num_red_cars ~ 1 + offset(log(unknown_observation_time))` and put a prior on `unknown_observation_time`. A very flexible prior for this is the gamma distribution, since you know the value must be positive. Then you can find the maximum likelihood of both the rate of red cars and the distribution of observation times. This was the problem the NB distribution solved: Poisson with unknown variation in "exposure".
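A quick simulation (my own illustration, not from the post) shows this equivalence: a Poisson count whose rate is scaled by a gamma-distributed unknown exposure is exactly a negative binomial:

```r
set.seed(1)
n     <- 1e5
mu    <- 5  # true rate per unit interval
theta <- 2  # gamma shape; equals the NB size parameter

# Unknown observation times, gamma-distributed with mean 1
exposure   <- rgamma(n, shape = theta, rate = theta)
pois_gamma <- rpois(n, lambda = mu * exposure)
nb_direct  <- rnbinom(n, size = theta, mu = mu)

# Both should have mean ~mu and variance ~mu + mu^2 / theta
c(mean(pois_gamma), mean(nb_direct))
c(var(pois_gamma),  var(nb_direct))
```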

It should be mentioned that even if you have the values for "exposure" recorded, it is often still a good idea to use negative binomial, because there might be additional unknown differences in exposure on top of the recorded exposure.

However, in our case, we know how many cells total we saw as a discrete number. So a binomial model is very reasonable.

As a side note, if we have K and N as described in the binomial case, and K / N is very small (e.g. < 0.01), then the binomial distribution closely matches the Poisson distribution with an offset. And the Poisson is easier to work with numerically, so it is often a decent idea to swap the binomial for a Poisson.
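A quick numeric check (again a sketch of my own) of how close the two get when K / N is small:

```r
N <- 10000; p <- 0.001  # rare positive cases
k <- 0:20
# Binomial probabilities vs Poisson probabilities with matching rate N * p
max(abs(dbinom(k, size = N, prob = p) - dpois(k, lambda = N * p)))
# prints a very small number: the two models are practically
# interchangeable in this regime
```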

I mentioned that even when we know the offset, it is a good idea to use the NB distribution. However, the NB distribution is much more challenging computationally. It is more time consuming to fit, and when the observed counts are small it is very difficult to know whether the rate (mean) is small or the dispersion is large. This is known as an identifiability problem: both explanations fit the data about equally well. So to be more certain about the rate / probability, we want to use the information we have, namely the total number of cells we counted. And this is why the binomial model has shorter confidence intervals than the NB model you have shared here.

In your proposed model, I would bet that if you add an offset with `+ offset(log(total_cells))` you will get shorter confidence intervals, and the model will estimate essentially no extra dispersion (in glm.nb's parameterization the dispersion is 1/theta, so theta will be estimated as very large).
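That suggestion as code, assuming your `df` also carries the total cell count per sample in a (hypothetical) `total_cells` column:

```r
library(MASS)
model3 <- glm.nb(count ~ cluster * time + offset(log(total_cells)),
                 link = "log", data = df)
summary(model3)  # expect a very large theta, i.e. dispersion near zero
```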

There is another part to all this. Instead of using a binomial model per cell type, you could make a multinomial model. The difference is that a multinomial model will also consider correlation between cell type abundances in the different observed samples. I chose the binomial model making the simplifying assumption that I don't expect correlation between cell types. This is probably wrong. If one sub-class of T cells like Gamma-delta is high, I think that's probably correlated with an increase in naive CD4 T cells. You can make a model that estimates that correlation from the data, and this will give you more statistical power. But I don't think the GLM package I used supports multinomial models.
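If you want to try that direction in R, one starting point (my suggestion, untested for this use case) is `nnet::multinom`, which accepts a matrix of counts as the response. A sketch with hypothetical cell type columns in a wide per-sample data frame:

```r
library(nnet)
# `wide`: one row per sample; each cell type column holds its count.
# A matrix response is interpreted by multinom as counts per class.
fit_multi <- multinom(cbind(gd_T, naive_CD4, other) ~ time, data = wide)
summary(fit_multi)
```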

Will Townes wrote a nice review about the connections between different count models: https://arxiv.org/abs/2001.04343v1

Sorry for the long response, but it was an excuse for me to write these things down!

Best,
/Valentine
