Added Method Regression Analysis to exercises.qmd #591

jkc9886 · 2024-08-02T12:53:31Z

One of the methods used to visualize microbiome data in the OMA book chapters is regression charts; however, there was no example or mention of it in the exercises chapter.

I have added:

1. Regression Analysis under the visualization heading, after heatmaps (line 1743).
2. A description and steps for it.
3. An R code exercise that performs regression analysis using the ggplot2 package and the lm() function.

Please let me know whether this could be added and, if so, what could be improved.

antagomir · 2024-08-05T19:38:23Z

Thanks!

Regression is indeed a common statistical technique.

The primary focus of OMA is to teach Bioconductor methods that support the modern multi-assay data containers, in particular the (Tree)SummarizedExperiment and MultiAssayExperiment but possibly others. OMA is not a book about general statistics (a topic which has more comprehensive treatments elsewhere). A key shortcoming in this example is that it does not show how to do regression on such data objects.

Another gap is in the statistical assumptions; read counts or relative abundances in the microbiome context usually violate the assumptions of standard linear regression in multiple ways, which is pedagogically not ideal. Examples on GLMs would be better justified, but for those we already have DA tests available for individual taxa.

If we keep the linear regression example, then I would implement the following changes:

- use one of the microbiome demo data sets
- regress on alpha diversity (Shannon index?) instead of individual taxa (as we already have more dedicated DA methods available for taxa)
- emphasize linear modeling assumptions and include some testing of how well the assumptions hold
- show how to treat possible covariates
- emphasize visualization and interpretation of the analysis outcomes
- where possible, use visualization methods that are dedicated to (Tree)SE/SCE objects, e.g. from miaViz, scater or elsewhere
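A minimal sketch of the covariate and assumption-testing points above, using base R only. The data frame is simulated and stands in for the sample metadata of a real data set; the column names `group`, `age` and `shannon` are hypothetical and not taken from the PR.

```r
# Simulated stand-in for sample metadata; all column names are hypothetical.
set.seed(1)
n <- 40
df <- data.frame(
  group = factor(rep(c("control", "treated"), each = n / 2)),
  age   = runif(n, 20, 70)
)
df$shannon <- 3 + 0.4 * (df$group == "treated") - 0.01 * df$age +
  rnorm(n, sd = 0.3)

# Unadjusted model: group effect only
fit_crude <- lm(shannon ~ group, data = df)

# Adjusted model: group effect with age included as a covariate
fit_adj <- lm(shannon ~ group + age, data = df)

# F-test comparing the nested models (does the covariate improve the fit?)
anova(fit_crude, fit_adj)

# Shapiro-Wilk test on the residuals as one check of the normality assumption
shapiro.test(residuals(fit_adj))
```

With a TreeSE object, the same formulas could presumably be fitted on `as.data.frame(colData(tse))` once the alpha diversity index has been added to the column data.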

jkc9886 · 2024-08-07T08:37:11Z

I am working on your suggested changes, but I see that the PR has been approved to merge. Is this a technical error, or should I continue adding the changes?

antagomir · 2024-08-07T08:43:19Z

This PR has not been merged.

If you check those "merge" announcements above, you can see that they are instead synchronizing this PR with the other PRs that have been approved in the meantime. So the changes from other PRs in the devel branch are merged into your branch to keep it up to date, but your branch has not yet been merged into the devel branch.

jkc9886 · 2024-08-07T08:46:26Z

Oh I see, I get it now. Thank you, Professor!

1. Used the GlobalPatterns dataset from mia and regressed on the Shannon index instead of individual taxa. 2. Checked assumptions such as linearity, normality, homoscedasticity, and independence of residuals. 3. Included sex as a secondary covariate in the regression model. 4. Used scater for PCA and visualization. 5. Added a violin plot as well, for extra reference.

antagomir

See the comments.

Things to improve:

- The code still often uses base R rather than the TreeSE-specific tools (I suggested some changes on this).
- lm models can be used with both continuous data (like continuous age or BMI) and groups, but the interpretation is different. It would be best to also include an example with a continuous variable because that is a different case. The ggplot visualization with geom_smooth() is usually done with continuous variables, and linking it here with the discrete group variable example is potentially a bit misleading. Show boxplots (using miaViz) for that example instead, and scatterplots (geom_point & geom_smooth) with the continuous variable.

inst/pages/exercises.qmd

antagomir · 2024-08-15T07:43:20Z

There is one major comment related to the use of lm:

- Your current use case with discrete x and continuous y can be done, but it is possibly a bit less standard than a simple x,y scatterplot with continuous x. For this case, scatterplots and geom_smooth are not recommended visualizations, as they are designed for continuous x. Use boxplots or violin plots instead to visualize this kind of data (with discrete x).
- Using lm with continuous variables (both x and y) is a common use case and can be visualized with scatterplots & geom_smooth; for clarity it would be good to also include an example of this. The coefficients then have a different interpretation.
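The difference in coefficient interpretation can be sketched as follows, assuming a simulated data frame with a hypothetical three-level `group` factor and a hypothetical continuous `depth` variable (neither name comes from the PR). Base R graphics are used here only to keep the sketch self-contained; they are not the miaViz or ggplot2 calls discussed in the review.

```r
# Simulated data; `group`, `depth` and `shannon` are hypothetical names.
set.seed(42)
n <- 30
df <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = n / 3)),
  depth = runif(n, 1, 10)
)
df$shannon <- 2 + 0.5 * (df$group == "B") + 0.1 * df$depth +
  rnorm(n, sd = 0.2)

# Discrete x: the intercept is the mean of the reference level ("A"), and
# each other coefficient is a difference in group means relative to "A".
m_discrete <- lm(shannon ~ group, data = df)
coef(m_discrete)

# Continuous x: the coefficient is a slope, i.e. the expected change in
# Shannon diversity per one-unit increase in depth.
m_continuous <- lm(shannon ~ depth, data = df)
coef(m_continuous)

# Matching visualizations: boxplot for discrete x, scatterplot with the
# fitted regression line for continuous x.
boxplot(shannon ~ group, data = df)
plot(shannon ~ depth, data = df)
abline(m_continuous)
```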

1. Introduced two examples : one with discrete x and one with continuous x against shannon index. 2. Made changes to the steps and interpretations accordingly while adding gtsummary and jtools for summary and coefficient visualization.

Removing the unused libraries from line 1796

inst/pages/exercises.qmd

Added description to the variable, coverage. Edited the shannon method command.

antagomir · 2024-08-27T22:07:02Z

@jkc9886 is this ready from your side?

jkc9886 · 2024-08-29T05:55:58Z

Yes Professor.

antagomir

Set chunks to eval: true and check the other comments.

inst/pages/exercises.qmd

antagomir · 2024-08-29T06:07:42Z

inst/pages/exercises.qmd

+"Coverage" is a measure of how comprehensively the microbial diversity
+was sampled in each sample. This could refer to the sequencing depth (number of reads)
+or an index like Good’s Coverage that quantifies the completeness of sampling.


What is the motivation for comparing Shannon and coverage? This is a technical comparison where some association would be expected. Do we have any biologically relevant comparison available?

Including this comparison can highlight samples that need further sequencing, improving the quality of our data. We risk misinterpreting samples with high diversity but low coverage, where the sequencing depth might not have been sufficient to reveal the full diversity. It shows room to reduce bias, but you are right that the relevance here is indirect.

Also, I am extremely sorry for the delay in my response, my classes have started and keep me occupied but I shall resolve these soon.

eval : true, minor edits

TuomasBorman

This looks nice, but there are a couple of problems:

- I think the instructions in the exercises should be in the format "a) Do this. b) Do that.", with clear instructions on what the reader should do. The current version at some points explains the reasoning but does not give instructions to the user; it just shows how to do it in a code chunk.
- Most of the explanations should be in the main material of OMA. The info might be hard to find here.
- The exercises should reflect the material in the book. In theory, the user should be able to solve most of the exercises by copy-pasting code from OMA.

antagomir · 2024-11-10T19:26:32Z

Hi @jkc9886 - this PR is still unmerged. Would you help us to finalize this task you initiated?

jkc9886 · 2024-11-12T07:54:16Z

Yes, sure Professor. I shall work on the comments suggested by Tuomas and send the edits.

jkc9886 · 2024-11-24T17:57:16Z

Hey @TuomasBorman, thank you for your comments. This is a bonus section, so I added the info with it too. Do you suggest moving the reasoning and detailed information to another chapter as a section, or is it fine to completely remove the explanation and just provide instructions and code snippets?

TuomasBorman

I suggest to:

We can keep the explanation here. I am not sure where these would fit in chapters.

You should rephrase the instructions. Remember that these are exercises, not a tutorial. The user should be able to follow them without checking the solution.

First, fit linear model using a discrete variable, say "SampleType", where Shannon diversity is the response variable, and SampleType is the predictor.

You could say that the user can use base R functions to fit the linear model.

--> ".... (see stats::lm())"

Now, let's check some linear modeling assumptions :

How should the user do that? It is not stated anywhere.

--> "You can plot the fitted model (hint: plot())"

To make this exercise simpler, you could focus on only one case, e.g., Shannon vs. sample type.

You could use some other dataset or a different grouping (e.g., human vs. environment). Currently the groups have only 2 samples each.

Check that these examples run with the latest mia
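The `plot()` hint above can be sketched as follows. The data are simulated, and the two-level grouping only mimics the "human vs environment" idea; it is not derived from any real data set.

```r
# Simulated two-group data; names and effect sizes are hypothetical.
set.seed(7)
df <- data.frame(group = factor(rep(c("human", "environment"), each = 15)))
df$shannon <- 3 + 0.6 * (df$group == "human") + rnorm(30, sd = 0.4)

model <- lm(shannon ~ group, data = df)

# plot() on an lm object draws diagnostic plots; `which` selects them.
op <- par(mfrow = c(1, 2))
plot(model, which = 1)  # Residuals vs Fitted: linearity, constant variance
plot(model, which = 2)  # Normal Q-Q: normality of residuals
par(op)
```

In a non-interactive session the plots are written to the default graphics device instead of being shown on screen.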

TuomasBorman · 2024-11-25T08:38:09Z

inst/pages/exercises.qmd

+#| code-summary: "Show solution"
+#| eval: true
+
+model <- lm(shannon ~ SampleType, data = colData(tse))


This line does not work, since the variable name does not have the "_diversity" suffix.

You can also modify the line 1853

tse <- addAlpha(tse, index = "shannon")
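A sketch of how the two suggestions fit together, assuming a recent mia release in which `addAlpha()` stores the result in a colData column named after the requested index (here `shannon`). The explicit conversion to a base data.frame is a cautious choice for this sketch, not something the review asked for.

```r
library(mia)

# Demo data set shipped with mia
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

# Add Shannon diversity to colData; the column is assumed to be "shannon"
tse <- addAlpha(tse, index = "shannon")

# Convert colData (a DataFrame) to a base data.frame before modeling
df <- as.data.frame(colData(tse))

# Shannon diversity as the response, SampleType as the discrete predictor
model <- lm(shannon ~ SampleType, data = df)
summary(model)
```

If the installed mia version names the column differently, the formula would need to use that name instead.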

TuomasBorman · 2024-11-25T08:39:32Z

inst/pages/exercises.qmd

+#| code-summary: "Show solution"
+#| eval: true
+
+plotColData(tse, x = "SampleType", y = "shannon", color_by = "SampleType") +


This should be

plotColData(tse, x = "SampleType", y = "shannon", color_by = "SampleType", show_boxplot = TRUE)

TuomasBorman · 2024-11-25T08:40:17Z

inst/pages/exercises.qmd

+#| label: extra-regressionchart-step1
+#| code-fold: true
+#| code-summary: "Show solution"
+#| eval: true


Exercise chunks are not run. So this should be eval: false

TuomasBorman · 2024-11-25T08:42:41Z

inst/pages/exercises.qmd

+i. The boxplot generated shows the distribution of Shannon diversity across
+different sample types. If the boxes are well-separated and there are 
+significant differences in medians, it indicates that SampleType has a strong
+effect on diversity. This would be supported by significant p-values in the
+regression summary.
+ii. The p-value associated with each coefficient tells us whether the effect
+of that sample type on the Shannon diversity index is statistically
+significant. A common threshold for significance is 0.05.
+iii. R-squared: This value indicates the proportion of the variance in the
+Shannon diversity index that is explained by the sample type. A higher R-squared
+value indicates a better fit of the model.


Instead of i) ii) ..., use 1, 2, 3

The code does not show p-values, you could run

summary(model)
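A minimal sketch of where the p-values and R-squared mentioned in the interpretation text live in the `summary()` output. The model is fitted on simulated data with hypothetical names; only base R is used.

```r
# Simulated two-group data; names are hypothetical.
set.seed(3)
df <- data.frame(group = factor(rep(c("A", "B"), each = 12)))
df$shannon <- 2.5 + 0.5 * (df$group == "B") + rnorm(24, sd = 0.3)

model <- lm(shannon ~ group, data = df)
s <- summary(model)

# Coefficient table: estimate, standard error, t value and p-value
s$coefficients

# p-value of each coefficient
p_values <- s$coefficients[, "Pr(>|t|)"]

# Proportion of variance in the response explained by the model
r_squared <- s$r.squared
```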

TuomasBorman · 2024-11-25T08:53:11Z

inst/pages/exercises.qmd

+#| code-summary: "Show solution"
+#| eval: true
+
+model_continuous <- lm(shannon ~ coverage, data = colData(tse))


There are no "coverage" and "shannon" indices in colData.

TuomasBorman · 2024-11-25T08:55:57Z

inst/pages/exercises.qmd

+#| code-summary: "Show solution"
+#| eval: true
+
+gt_model <- tbl_regression(model)


This is not the model for shannon vs coverage

TuomasBorman · 2024-11-25T09:01:42Z

inst/pages/exercises.qmd

+library(scater) 
+library(lmtest)
+library(gtsummary)   
+library(jtools)   


Check which packages are really needed

TuomasBorman · 2024-11-25T09:01:50Z

inst/pages/exercises.qmd

+#| eval: true
+
+library(mia)             
+library(TreeSummarizedExperiment) 


TreeSE comes with mia

TuomasBorman · 2024-11-25T09:02:37Z

inst/pages/exercises.qmd

+Shannon diversity index that is explained by the sample type. A higher R-squared
+value indicates a better fit of the model.
+
+Now, let's check some linear modeling assumptions :


Remove space: "modeling assumptions:"

TuomasBorman · 2024-11-25T09:14:00Z

inst/pages/exercises.qmd

+distance) are potential outliers that may have a strong influence on the model's 
+coefficients.
+
+ii. Durbin-Watson test checks for autocorrelation in the residuals. A value


I think this Durbin-Watson test can be removed

TuomasBorman · 2024-11-25T09:22:53Z

inst/pages/exercises.qmd

+Here, 
+
+i. Residual diagnostics (normality and homoscedasticity):
+
+a. The Residuals vs Fitted plot provides insight into the relationship 
+between residuals and fitted values. Ideally, the residuals should be randomly
+scattered around the horizontal line (residuals = 0) without any discernible 
+pattern. This indicates that the model’s assumption of linearity and homoscedasticity
+(constant variance of residuals) is likely met. If you observe any systematic patterns
+(such as a funnel shape), it may suggest heteroscedasticity, meaning the variance of
+residuals is not constant.


You could add a question: "Do the assumptions hold?". Then add the answer as folded text: https://quarto.org/docs/authoring/callouts.html

jkc9886 requested a review from antagomir August 2, 2024 12:53

Merge branch 'devel' into jkc9886-patch-8

98b3444

Merge branch 'devel' into jkc9886-patch-8

60c662a

jkc9886 and others added 4 commits August 7, 2024 14:12

Merge branch 'devel' into jkc9886-patch-8

b3a48cf

Merge branch 'devel' into jkc9886-patch-8

0474b32

Merge branch 'devel' into jkc9886-patch-8

8e71c17

antagomir requested changes Aug 15, 2024

jkc9886 added 2 commits August 16, 2024 16:35

Update exercises.qmd

2f8b345

1. Introduced two examples : one with discrete x and one with continuous x against shannon index. 2. Made changes to the steps and interpretations accordingly while adding gtsummary and jtools for summary and coefficient visualization.

Update exercises.qmd

5c11215

Removing the unused libraries from line 1796

antagomir requested changes Aug 17, 2024

jkc9886 and others added 3 commits August 20, 2024 14:26

Update exercises.qmd

8a26d6d

Added description to the variable, coverage. Edited the shannon method command.

Step by step description.qmd

8cc9e60

Merge branch 'devel' into jkc9886-patch-8

badfea4

Merge branch 'devel' into jkc9886-patch-8

28efee1

antagomir requested changes Aug 29, 2024

jkc9886 and others added 4 commits September 5, 2024 10:26

Update exercises.qmd

0705f8c

eval : true, minor edits

Merge branch 'devel' into jkc9886-patch-8

8781d67

Merge branch 'devel' into jkc9886-patch-8

61aa67b

Merge branch 'devel' into jkc9886-patch-8

bf27f90

TuomasBorman requested changes Oct 3, 2024

Merge branch 'devel' into jkc9886-patch-8

bc8c147

antagomir added 2 commits November 18, 2024 14:45

Merge branch 'devel' into jkc9886-patch-8

556ee18

Merge branch 'devel' into jkc9886-patch-8

83d80b0

TuomasBorman reviewed Nov 25, 2024
