Skip to content

An example for biomedical scientists of how to do statistics and graphing of groups of data.

Notifications You must be signed in to change notification settings

tim-peterson/R-graphing-statistics

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

R-graphing-statistics

(demo'd on data from mouse bone micro-computed tomography (uCT))

Download R: Windows | Mac

Download R studio: Windows | Mac

Import data

There are two ways to import data. (1) You can pick the data file from the R Studio File menu or (2) you can use the R Studio console. I prefer to use the console because it will help new users get more comfortable working with the R language.

Console (preferred): It's a good idea to start by setting the working directory. This is both so you know the relative path R Studio will be grabbing the data file from and because this is where ggsave() saves the graphs you'll be generating, see below).

Set working directory to some folder.

setwd("/Users/timrpeterson/r/R-graphing-statistics/graphs")

Read the data into a dataframe b.

bvtv <- read_csv("../bvtv.csv")

File menu: Click File -> Import Dataset -> From CSV -> click on the data file. This will import your data and store it locally as a data.frame. By default, the name of the data.frame will be the same as the file name.

Statistics

To get simple median, mean, quartile data run this:

tapply(bvtv$`BV/TV%`, bvtv$X2, summary)

$`1`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  6.250   6.858   7.855  10.010   9.465  25.270 

$`2`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.380   4.938   5.810   5.949   6.965   8.080 

$`3`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.76   20.60   32.70   27.76   35.43   38.04 

$`4`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.780   5.020   8.140   7.601   9.500  10.250 

$`5`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.930   5.055   5.690   6.846   7.810  13.160 

$`6`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   6.04    9.94   14.36   17.66   25.07   35.61 

Run ANOVA - analysis of variance - statistically test. More details. You'll notice I intentionally gave the header columns bad names so that you can see a more real world scenario. BV/TV% needs to be escaped using backticks ` and the groups column didn't have a header name so R filled in X2 for us. See Notes at the bottom to learn more about factor() and the tilde ~.

result=aov(`BV/TV%`~factor(X2),data=bvtv)

pairwise.t.test(bvtv$`BV/TV%`,bvtv$X2,p.adjust="holm")

This should output something like the following with p-values for the pairwise comparisons of the X number of groups you have.

  1       2       3       4      5     
2 0.8364  -       -       -      -     
3 1.4e-06 2.9e-09 -       -      -     
4 1.0000  1.0000  1.3e-07 -      -     
5 1.0000  1.0000  1.5e-08 1.0000 -     
6 0.0767  0.0011  0.0119  0.0119 0.0036

This means that there is a highly statistically significant difference when comparing groups 2 and 3, 2.9e-09 (assuming a p <= 0.05 threshold). Whereas, groups 1 and 2 aren't statistically different, 0.8364.

One can also try Tukey's method.

tt <- TukeyHSD(result,conf.level=0.95)

print(tt)

           diff        lwr        upr     p adj
2-1  -4.0608333 -12.058594   3.936928 0.6656398
3-1  17.7488889   9.166597  26.331180 0.0000017
4-1  -2.4088889 -10.991180   6.173403 0.9607489
5-1  -3.1636364 -11.324968   4.997695 0.8600899
6-1   7.6544444  -0.927847  16.236736 0.1062980
3-2  21.8097222  13.573170  30.046275 0.0000000
4-2   1.6519444  -6.584608   9.888497 0.9911219
5-2   0.8971970  -6.899752   8.694145 0.9993653
6-2  11.7152778   3.478725  19.951830 0.0013370
4-3 -20.1577778 -28.963023 -11.352533 0.0000001
5-3 -20.9125253 -29.307997 -12.517053 0.0000000
6-3 -10.0944444 -18.899690  -1.289199 0.0158559
5-4  -0.7547475  -9.150219   7.640724 0.9998102
6-4  10.0633333   1.258088  18.868579 0.0163264
6-5  10.8180808   2.422609  19.213553 0.0046312

Once again you can see a highly statistically significant difference when comparing groups 2 and 3 (3-2), 0.0000000 (assuming a p <= 0.05 threshold). Whereas, groups 1 and 2 (2-1) aren't statistically different, 0.6656398. You can see that's similiar to using the Holm's method. The p-values are only shown to 8 decimal points so to get the exact value you can print to whatever decimal places you need.

print(tt,digits=15)

Graphing

Load ggplot graphing library. If it's not installed, run this: install.packages(ggplot2). For some examples on how to use ggplot, go here. Interestingly enough, R comes installed with an example data.frame of data mpg that you can play with.

library(ggplot2)

Store the graph as a variable, e.g., p.

p <- ggplot(bvtv, aes(factor(X2), `BV/TV%`, fill=factor(X2)))

This is where all the magic happens. I.e., R Studio graphs your data. See Notes below.

p + geom_boxplot() + geom_point(size = 5,  fill="white", stroke=2, shape=21) + geom_jitter(width = 0.2, size = 5,  fill="white", stroke=2, shape=21) + theme_bw() + theme(panel.border = element_blank(), panel.grid.major = element_blank(),panel.grid.minor = element_blank(), axis.line = element_line(colour = "#979798")) + scale_fill_manual(values=c("#ffffff", "#d1d3d4", "#d1d3d4", "#ffffff", "#14426f", "#14426f"), name = "Groups", labels = c("WT-SHAM", "WT+OVX", "WT+OVX+ALN", "KO-SHAM", "KO+OVX", "KO+OVX+ALN")) + stat_summary(geom = "crossbar", width=0.75, fatten=0, color="#da1a5e", fun.data = function(x){ return(c(y=median(x), ymin=median(x), ymax=median(x))) }) + scale_x_discrete(name ="Group")

Save the file to your current working directory.

ggsave("bvtv.pdf", useDingbats=FALSE)

The graph looks like this. It looks distorted when viewed as a PDF, but looks good when placed into an Illustrator file. In Illustrator, the first thing I do is set stroke width for the whole graph to 0.5. That might be the only adjustment you need.

bvtv.pdf

Notes

`BV/TV%`~X2 the tilde ~ is to set X2 as the dependent variable and BV/TV% as the independent variable. This is like telling R, what's the X and Y axis.

factor() is only required if your groups are numbers, "continuous variable", but you want them to be treated as a label. In our data, we've numbered our groups: 1,2,3,4,5,6 and group 1 is not smaller than group 6. It's just a different group.

shape=21 was required to have hollow data points.

scale_fill_manual() is needed to hard code the boxplot fill colors because I wanted a specific pattern.

stat_summary() was used mainly to give the median line a different color than the other border colors.

If your data.frame, let's say its called uCT, has multiple parameters and na values because the row lengths are unequal, you can omit na values in the uCT data.frame and generate a BVTV dataframe that is a subset of uCT that includes just the BV/TV% data.

BVTV <- na.omit(uCT[,c("BV/TV%","X2")])

About

An example for biomedical scientists of how to do statistics and graphing of groups of data.

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published