Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Method Regression Analysis to exercises.qmd #591

Open
wants to merge 20 commits into
base: devel
Choose a base branch
from

Conversation

jkc9886
Copy link
Contributor

@jkc9886 jkc9886 commented Aug 2, 2024

One of the methods to visualize microbiome data in the OMA book chapters is regression charts, however there was no example or mention about it in the exercises chapter.

I have added:

  1. Regression Analysis under visualization heading after heatmaps (Line 1743)
  2. Added description and steps for it.
  3. Created an R code exercise to do Regression analysis using package ggplot2 and lm() function.

Please suggest if this could be added and if yes, what better could be done?

One of the methods to visualize data in the OMA book chapters is regression charts, however there was no example or mention about it in the exercises chapter. 

I have added:

1. Regression Analysis under visualization heading after heatmaps (Line 1743)
2. Added description and steps for it.
3. Added an R code to do Regression analysis using package ggplot2 and lm() function.
@jkc9886 jkc9886 requested a review from antagomir August 2, 2024 12:53
@antagomir
Copy link
Member

Thanks!

Regression is indeed a common statistical technique.

The primary focus of OMA is to teach Bioconductor methods that support the modern multi-assay data containers, in particular the (Tree)SummarizedExperiment and MultiAssayExperiment but possibly others. OMA is not a book about general statistics (a topic which has more comprehensive treatments elsewhere). A key shortcoming in this example is that it does not show how to do regression on such data objects.

Another gap is in the statistical assumptions; read counts or relative abundances in microbiome context usually violate assumptions of standard linear regresssion in multiple ways and that is pedagogically not ideal. Examples on GLMs would be better justified but for those we do have DA tests already available for individual taxa.

If we keep linear regression example then I would implement following changes:

  • use one of the microbiome demo data sets
  • regress on alpha diversity (Shannon index?) instead of individual taxa (as we already have more dedicated DA methods available for taxa)
  • emphasize linear modeling assumptions and include some testing on how well the assumptions hold
  • show how to treat possible covariates
  • emphasize visualization and interpretation of the analysis outcomes
  • where possible, use visualization methods that are dedicated to (Tree)SE/SCE objects, e.g. from miaViz, scater or elsewhere

@jkc9886
Copy link
Contributor Author

jkc9886 commented Aug 7, 2024

I am working on your suggested changes but I see the PR has been approved to merge, is it a technical error or should I continue with adding the changes?

@antagomir
Copy link
Member

This PR has not been merged.

If you check those "merge" announcements above you can see that they are instead synchronizing this PR with the other PRs that have been approved meanwhile. So the changes from other PRs in the devel branch are merged into your branch to keep it up-to-date, but your branch has not been merged yet to devel branch..

@jkc9886
Copy link
Contributor Author

jkc9886 commented Aug 7, 2024

Oh I see, i get it now, thank you Professor!

jkc9886 and others added 4 commits August 7, 2024 14:12
1. Used the GlobalPatterns dataset from mia & compared against Shannon Index instead of individual taxa method. 
2. Checked assumptions such as linearity, normality, homoscedasticity, and independence of residuals.
3. Include sex as a secondary covariate in the regression model.
4. Used scater for PCA and visualization.
5. Added violin plot as well, for extra reference.
Copy link
Member

@antagomir antagomir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comments.

THings to improve:

  • still often using base R rather than the TreeSE specific tools (I suggested some changes on this)

  • lm models can be used with both continuous data (like continuous age or bmi) and groups but the interpretation is different. It would be best to also include example with a continuous variable because that is a different case. The ggplot visualization with geom_smooth() is usually done with continuous variables, and linking it here with the discrete group variable example is potentially a bit misleading. Show boxplots (using miaViz) instead for that example, and scatterplots (geom_point & geom_smooth) with the continuous variable.

inst/pages/exercises.qmd Outdated Show resolved Hide resolved
inst/pages/exercises.qmd Outdated Show resolved Hide resolved
inst/pages/exercises.qmd Outdated Show resolved Hide resolved
inst/pages/exercises.qmd Outdated Show resolved Hide resolved
inst/pages/exercises.qmd Outdated Show resolved Hide resolved
inst/pages/exercises.qmd Outdated Show resolved Hide resolved
inst/pages/exercises.qmd Outdated Show resolved Hide resolved
inst/pages/exercises.qmd Outdated Show resolved Hide resolved
inst/pages/exercises.qmd Outdated Show resolved Hide resolved
inst/pages/exercises.qmd Outdated Show resolved Hide resolved
@antagomir
Copy link
Member

There is one major comment related to the use of lm:

  1. Your current use case with discrete x, continuous y can be done but is possibly a bit less standard than simple x,y scatterplot with continuous x. For this, scatterplot and geom_smooth are not recommended visualizations as they are designed for continouous x. Use instead boxplots or violin plots to visualize this kind of data (with discrete x).

  2. Use with continuous variables (both x, y) is common use case and can be visualized with scatterplots & geom_smooth; this is a common use case and for clarity it would be good to also include example on this. Then the coefficients have different interpretation.

1. Introduced two examples : one with discrete x and one with continuous x against shannon index.
2. Made changes to the steps and interpretations accordingly while adding gtsummary and jtools for summary and coefficient visualization.
Removing the unused libraries from line 1796
inst/pages/exercises.qmd Outdated Show resolved Hide resolved
inst/pages/exercises.qmd Outdated Show resolved Hide resolved
inst/pages/exercises.qmd Outdated Show resolved Hide resolved
jkc9886 and others added 3 commits August 20, 2024 14:26
Added description to the variable, coverage.
Edited the shannon method command.
@antagomir
Copy link
Member

@jkc9886 is this ready from your side?

@jkc9886
Copy link
Contributor Author

jkc9886 commented Aug 29, 2024

Yes Professor.

Copy link
Member

@antagomir antagomir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set chunks to eval: true and check the other comments.

inst/pages/exercises.qmd Outdated Show resolved Hide resolved
inst/pages/exercises.qmd Outdated Show resolved Hide resolved
inst/pages/exercises.qmd Show resolved Hide resolved
Comment on lines +1889 to +1891
"Coverage" is a measure of how comprehensively the microbial diversity
was sampled in each sample. This could refer to the sequencing depth (number of reads)
or an index like Good’s Coverage that quantifies the completeness of sampling.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation for comparing Shannon and coverage? This is a technical comparison where some association would be expected. Do we have any biologically relevant comparison available?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

including this comparison can highlight samples that need further sequencing, improving the quality of our data. We risk misinterpreting samples with high diversity but low coverage, where the sequencing depth might not have been sufficient to reveal the full diversity. It shows room to reduce bias but you are right this is an indirect relevance here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I am extremely sorry for the delay in my response, my classes have started and keep me occupied but I shall resolve these soon.

Copy link
Contributor

@TuomasBorman TuomasBorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nice but there are couple of problems:

  1. It think the instructions in the exercises should be in a format "a) Do this. b) Do that."; with clear instructions what reader should do. This current version at some points explains the reasoning but does not give instructions to user; it just shows how to do it in code chunk.

  2. Most of the explanations should be in main material of OMA. The info might be hard to find here.

  3. The exercises should reflect the material in the book. In theory, user should be able to solve most of the exercises by copy-pasting code from OMA.

@antagomir
Copy link
Member

Hi @jkc9886 - this PR is still unmerged. Would you help us to finalize this task you initiated?

@jkc9886
Copy link
Contributor Author

jkc9886 commented Nov 24, 2024

Hey @TuomasBorman, thank you for your comments. This is a bonus section so I added the info with it too but do you suggest moving the reasoning/detailed information to any other chapter as a section or is it fine to completely remove the explanation and just provide instructions and code snippets ?

Copy link
Contributor

@TuomasBorman TuomasBorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to:

We can keep the explanation here. I am not sure where these would fit in chapters.

You should rephrase the instructions. Remember that these are exercises, not tutorial. User should be able to follow without checking the solution.

First, fit linear model using a discrete variable, say "SampleType", where Shannon diversity is the response variable, and SampleType is the predictor.

You could say that user can use base R functions to fit linear model.

--> ".... (see stats::lm())"

Now, let's check some linear modeling assumptions :

How user should do that? It is not said anywhere?

--> "You can plot the fitted model (hint: plot())

To make this exercise simpler, you could focus only one case, e.g., shannon vs sample type

You could use some other dataset or use different grouping (e.g., human vs environment). Now the groups have 2 samples per group.

Check that these examples run with the latest mia

#| code-summary: "Show solution"
#| eval: true

model <- lm(shannon ~ SampleType, data = colData(tse))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line does not work, since it does not have "_diversity" suffix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also modify the line 1853

tse <- addAlpha(tse, index = "shannon")

#| code-summary: "Show solution"
#| eval: true

plotColData(tse, x = "SampleType", y = "shannon", color_by = "SampleType") +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be

plotColData(tse, x = "SampleType", y = "shannon", color_by = "SampleType", show_boxplot = TRUE)

#| label: extra-regressionchart-step1
#| code-fold: true
#| code-summary: "Show solution"
#| eval: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exercise chunks are not run. So this should be eval: false

Comment on lines +1890 to +1900
i. The boxplot generated shows the distribution of Shannon diversity across
different sample types. If the boxes are well-separated and there are
significant differences in medians, it indicates that SampleType has a strong
effect on diversity. This would be supported by significant p-values in the
regression summary.
ii. The p-value associated with each coefficient tells us whether the effect
of that sample type on the Shannon diversity index is statistically
significant. A common threshold for significance is 0.05.
iii. R-squared: This value indicates the proportion of the variance in the
Shannon diversity index that is explained by the sample type. A higher R-squared
value indicates a better fit of the model.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of i) ii) ..., use 1, 2, 3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code does not show p-values, you could run

summary(model)

#| code-summary: "Show solution"
#| eval: true

model_continuous <- lm(shannon ~ coverage, data = colData(tse))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no "coverare" and "shannon" indices in colData

#| code-summary: "Show solution"
#| eval: true

gt_model <- tbl_regression(model)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the model for shannon vs coverage

Comment on lines +1836 to +1839
library(scater)
library(lmtest)
library(gtsummary)
library(jtools)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check which packages are really needed

#| eval: true

library(mia)
library(TreeSummarizedExperiment)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TreeSE comes with mia

Shannon diversity index that is explained by the sample type. A higher R-squared
value indicates a better fit of the model.

Now, let's check some linear modeling assumptions :
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove space: "modeling assumptions:"

distance) are potential outliers that may have a strong influence on the model's
coefficients.

ii. Durbin-Watson test checks for autocorrelation in the residuals. A value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this Durbin-Watson test can be removed

Comment on lines +1919 to +1929
Here,

i. Residual diagnostics (normality and homoscedasticity):

a. The Residuals vs Fitted plot provides insight into the relationship
between residuals and fitted values. Ideally, the residuals should be randomly
scattered around the horizontal line (residuals = 0) without any discernible
pattern. This indicates that the model’s assumption of linearity and homoscedasticity
(constant variance of residuals) is likely met. If you observe any systematic patterns
(such as a funnel shape), it may suggest heteroscedasticity, meaning the variance of
residuals is not constant.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add question: "Do the assumptions hold?". And then add answer with folded text: https://quarto.org/docs/authoring/callouts.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants