-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added Method Regression Analysis to exercises.qmd #591
base: devel
Are you sure you want to change the base?
Conversation
One of the methods to visualize data in the OMA book chapters is regression charts, however there was no example or mention about it in the exercises chapter. I have added: 1. Regression Analysis under visualization heading after heatmaps (Line 1743) 2. Added description and steps for it. 3. Added an R code to do Regression analysis using package ggplot2 and lm() function.
Thanks! Regression is indeed a common statistical technique. The primary focus of OMA is to teach Bioconductor methods that support the modern multi-assay data containers, in particular the (Tree)SummarizedExperiment and MultiAssayExperiment but possibly others. OMA is not a book about general statistics (a topic which has more comprehensive treatments elsewhere). A key shortcoming in this example is that it does not show how to do regression on such data objects. Another gap is in the statistical assumptions; read counts or relative abundances in microbiome context usually violate assumptions of standard linear regresssion in multiple ways and that is pedagogically not ideal. Examples on GLMs would be better justified but for those we do have DA tests already available for individual taxa. If we keep linear regression example then I would implement following changes:
|
I am working on your suggested changes but I see the PR has been approved to merge, is it a technical error or should I continue with adding the changes? |
This PR has not been merged. If you check those "merge" announcements above you can see that they are instead synchronizing this PR with the other PRs that have been approved meanwhile. So the changes from other PRs in the devel branch are merged into your branch to keep it up-to-date, but your branch has not been merged yet to devel branch.. |
Oh I see, i get it now, thank you Professor! |
1. Used the GlobalPatterns dataset from mia & compared against Shannon Index instead of individual taxa method. 2. Checked assumptions such as linearity, normality, homoscedasticity, and independence of residuals. 3. Include sex as a secondary covariate in the regression model. 4. Used scater for PCA and visualization. 5. Added violin plot as well, for extra reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the comments.
THings to improve:
-
still often using base R rather than the TreeSE specific tools (I suggested some changes on this)
-
lm models can be used with both continuous data (like continuous age or bmi) and groups but the interpretation is different. It would be best to also include example with a continuous variable because that is a different case. The ggplot visualization with geom_smooth() is usually done with continuous variables, and linking it here with the discrete group variable example is potentially a bit misleading. Show boxplots (using miaViz) instead for that example, and scatterplots (geom_point & geom_smooth) with the continuous variable.
There is one major comment related to the use of lm:
|
1. Introduced two examples : one with discrete x and one with continuous x against shannon index. 2. Made changes to the steps and interpretations accordingly while adding gtsummary and jtools for summary and coefficient visualization.
Removing the unused libraries from line 1796
Added description to the variable, coverage. Edited the shannon method command.
@jkc9886 is this ready from your side? |
Yes Professor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set chunks to eval: true and check the other comments.
"Coverage" is a measure of how comprehensively the microbial diversity | ||
was sampled in each sample. This could refer to the sequencing depth (number of reads) | ||
or an index like Good’s Coverage that quantifies the completeness of sampling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the motivation for comparing Shannon and coverage? This is a technical comparison where some association would be expected. Do we have any biologically relevant comparison available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
including this comparison can highlight samples that need further sequencing, improving the quality of our data. We risk misinterpreting samples with high diversity but low coverage, where the sequencing depth might not have been sufficient to reveal the full diversity. It shows room to reduce bias but you are right this is an indirect relevance here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I am extremely sorry for the delay in my response, my classes have started and keep me occupied but I shall resolve these soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks nice but there are couple of problems:
-
It think the instructions in the exercises should be in a format "a) Do this. b) Do that."; with clear instructions what reader should do. This current version at some points explains the reasoning but does not give instructions to user; it just shows how to do it in code chunk.
-
Most of the explanations should be in main material of OMA. The info might be hard to find here.
-
The exercises should reflect the material in the book. In theory, user should be able to solve most of the exercises by copy-pasting code from OMA.
Hi @jkc9886 - this PR is still unmerged. Would you help us to finalize this task you initiated? |
Hey @TuomasBorman, thank you for your comments. This is a bonus section so I added the info with it too but do you suggest moving the reasoning/detailed information to any other chapter as a section or is it fine to completely remove the explanation and just provide instructions and code snippets ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to:
We can keep the explanation here. I am not sure where these would fit in chapters.
You should rephrase the instructions. Remember that these are exercises, not tutorial. User should be able to follow without checking the solution.
First, fit linear model using a discrete variable, say "SampleType", where Shannon diversity is the response variable, and SampleType is the predictor.
You could say that user can use base R functions to fit linear model.
--> ".... (see stats::lm()
)"
Now, let's check some linear modeling assumptions :
How user should do that? It is not said anywhere?
--> "You can plot the fitted model (hint: plot())
To make this exercise simpler, you could focus only one case, e.g., shannon vs sample type
You could use some other dataset or use different grouping (e.g., human vs environment). Now the groups have 2 samples per group.
Check that these examples run with the latest mia
#| code-summary: "Show solution" | ||
#| eval: true | ||
|
||
model <- lm(shannon ~ SampleType, data = colData(tse)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line does not work, since it does not have "_diversity" suffix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also modify the line 1853
tse <- addAlpha(tse, index = "shannon")
#| code-summary: "Show solution" | ||
#| eval: true | ||
|
||
plotColData(tse, x = "SampleType", y = "shannon", color_by = "SampleType") + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be
plotColData(tse, x = "SampleType", y = "shannon", color_by = "SampleType", show_boxplot = TRUE)
#| label: extra-regressionchart-step1 | ||
#| code-fold: true | ||
#| code-summary: "Show solution" | ||
#| eval: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exercise chunks are not run. So this should be eval: false
i. The boxplot generated shows the distribution of Shannon diversity across | ||
different sample types. If the boxes are well-separated and there are | ||
significant differences in medians, it indicates that SampleType has a strong | ||
effect on diversity. This would be supported by significant p-values in the | ||
regression summary. | ||
ii. The p-value associated with each coefficient tells us whether the effect | ||
of that sample type on the Shannon diversity index is statistically | ||
significant. A common threshold for significance is 0.05. | ||
iii. R-squared: This value indicates the proportion of the variance in the | ||
Shannon diversity index that is explained by the sample type. A higher R-squared | ||
value indicates a better fit of the model. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of i) ii) ..., use 1, 2, 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code does not show p-values, you could run
summary(model)
#| code-summary: "Show solution" | ||
#| eval: true | ||
|
||
model_continuous <- lm(shannon ~ coverage, data = colData(tse)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no "coverare" and "shannon" indices in colData
#| code-summary: "Show solution" | ||
#| eval: true | ||
|
||
gt_model <- tbl_regression(model) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the model for shannon vs coverage
library(scater) | ||
library(lmtest) | ||
library(gtsummary) | ||
library(jtools) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check which packages are really needed
#| eval: true | ||
|
||
library(mia) | ||
library(TreeSummarizedExperiment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TreeSE comes with mia
Shannon diversity index that is explained by the sample type. A higher R-squared | ||
value indicates a better fit of the model. | ||
|
||
Now, let's check some linear modeling assumptions : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove space: "modeling assumptions:"
distance) are potential outliers that may have a strong influence on the model's | ||
coefficients. | ||
|
||
ii. Durbin-Watson test checks for autocorrelation in the residuals. A value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this Durbin-Watson test can be removed
Here, | ||
|
||
i. Residual diagnostics (normality and homoscedasticity): | ||
|
||
a. The Residuals vs Fitted plot provides insight into the relationship | ||
between residuals and fitted values. Ideally, the residuals should be randomly | ||
scattered around the horizontal line (residuals = 0) without any discernible | ||
pattern. This indicates that the model’s assumption of linearity and homoscedasticity | ||
(constant variance of residuals) is likely met. If you observe any systematic patterns | ||
(such as a funnel shape), it may suggest heteroscedasticity, meaning the variance of | ||
residuals is not constant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could add question: "Do the assumptions hold?". And then add answer with folded text: https://quarto.org/docs/authoring/callouts.html
One of the methods to visualize microbiome data in the OMA book chapters is regression charts, however there was no example or mention about it in the exercises chapter.
I have added:
Please suggest if this could be added and if yes, what better could be done?