-
Notifications
You must be signed in to change notification settings - Fork 16
/
Copy path07--diversity_stats.Rmd
527 lines (401 loc) · 24.5 KB
/
07--diversity_stats.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
---
output: html_document
bibliography: "bibtexlib.bib"
---
```{r setup, include=FALSE}
source("style.R")
knitr::opts_chunk$set(fig.height = 5, fig.width = 5)
```
# Diversity statistics
## Load example data
If you are starting the workshop at this section, or had problems running code in a previous section, use the following to load the data used in this section.
You can download the "clean_data.Rdata" file <a href="clean_data.Rdata" download="clean_data.Rdata">here</a>.
If `obj` and `sample_data` are already in your environment, you can ignore this and proceed.
```{r}
load("clean_data.Rdata")
```
## Measures of diversity
Diversity in the ecological sense is intuitively understood as the complexity of a community of organisms.
There are many ways to quantify this complexity so that we can compare communities objectively.
The two main categories of methods are known as **alpha diversity** and **beta diversity** [@whittaker1960vegetation].
Alpha diversity measures the diversity within a single sample and is generally based on the number and relative abundance of taxa at some rank (e.g. species or OTUs).
Beta diversity also uses the number of relative abundance of taxa at some rank, but measures variation between samples.
In other words, an alpha diversity statistic describes a single sample and a beta diversity statistic describes how two samples compare.
The `vegan` package is the main tool set used for calculating biological diversity statistics in R.
## Alpha (within sample) diversity
Common alpha diversity statistics include:
* **Shannon**: How difficult it is to predict the identity of a randomly chosen individual.
* **Simpson**: The probability that two randomly chosen individuals are the same species.
* **Inverse Simpson**: This is a bit confusing to think about. Assuming a theoretically community where all species were equally abundant, this would be the number of species needed to have the same Simpson index value for the community being analyzed.
There are also some diversity indexes that take into account the taxonomic similarity of the species called "taxonomic diversity" and "taxonomic distinctness", but we will not go into those.
The `diversity` function from the `vegan` package can be used to calculate the alpha diversity of a set of samples.
Like other vegan functions, it assumes that samples are in rows, but they are in columns in our data, so we need to use the `MARGIN = 2` option.
We also need to exclude the taxon ID column by subsetting the columns to only samples (i.e. all column besides the first one).
Since alpha diversity is a per-sample attribute, we can just add this as a column to the sample data table:
```{r }
library(vegan)
sample_data$alpha <- diversity(obj$data$otu_rarefied[, sample_data$SampleID],
MARGIN = 2,
index = "invsimpson")
hist(sample_data$alpha)
```
Adding this as a column to the sample data table makes it easy to graph using `ggplot2`.
```{r }
library(ggplot2)
ggplot(sample_data, aes(x = Site, y = alpha)) +
geom_boxplot()
```
We can use `r gloss$add('analysis of variance (ANOVA)')` to tell if at least one of the diversity means is different from the rest.
```{r}
anova_result <- aov(alpha ~ Site, sample_data)
summary(anova_result)
```
That tells us that there is a difference, but does not tell us which means are different.
A `r gloss$add("Tukey's Honest Significant Difference (HSD)")` test can do pairwise comparisons of the means to find this out.
We will use the `HSD.test` function from the `agricolae` package since it provides grouping codes that are useful for graphing.
```{r}
library(agricolae)
tukey_result <- HSD.test(anova_result, "Site", group = TRUE)
print(tukey_result)
```
Looking at the `tukey_result$groups` table it appears that the alpha diversity of sites "Sil" and "Mah" might not be different, but there is evidence that the diversity in site "Jam" is lower.
We can add this information to the graph using the `tukey_result$groups$groups` codes:
```{r }
group_data <- tukey_result$groups[order(rownames(tukey_result$groups)),]
ggplot(sample_data, aes(x = Site, y = alpha)) +
geom_text(data = data.frame(),
aes(x = rownames(group_data), y = max(sample_data$alpha) + 1, label = group_data$groups),
col = 'black',
size = 10) +
geom_boxplot() +
ggtitle("Alpha diversity") +
xlab("Site") +
ylab("Alpha diversity index")
```
So that takes care of comparing the alpha diversity of sites, but there are other interesting groupings we can compare, such as the genotype and the type of the sample (roots vs leaves).
We could do the above all over with minor modifications, but one of the benefits of using a programming language is that you can create your own functions to automate repeated tasks.
We can generalize what we did above and put it in a function like so:
```{r}
compare_alpha <- function(sample_data, grouping_var) {
# Calcuate alpha diversity
sample_data$alpha <- diversity(obj$data$otu_rarefied[, sample_data$SampleID],
MARGIN = 2,
index = "invsimpson")
# Do ANOVA
sample_data$grouping <- sample_data[[grouping_var]] # needed for how `aov` works
anova_result <- aov(alpha ~ grouping, sample_data)
# Do Tukey's HSD test
tukey_result <- HSD.test(anova_result, "grouping", group = TRUE)
# Plot result
group_data <- tukey_result$groups[order(rownames(tukey_result$groups)),]
my_plot <- ggplot(sample_data, aes(x = grouping, y = alpha)) +
geom_text(data = data.frame(),
aes(x = rownames(group_data),
y = max(sample_data$alpha) + 1,
label = group_data$groups),
col = 'black',
size = 10) +
geom_boxplot() +
ggtitle("Alpha diversity") +
xlab(grouping_var) +
ylab("Alpha diversity index")
# Return plot
return(my_plot)
}
```
Using this function, we can compare plot the alpha diversities by type of sample and genotype:
```{r }
compare_alpha(sample_data, "Type")
compare_alpha(sample_data, "Genotype")
```
Looks like there is no difference in the alpha diversity between genotypes, but a large difference between the diversity of roots and leaves.
The `phyloseq` package (@mcmurdie2013phyloseq) can be used to quickly plot a variety of alpha diversity indexes per sample using the `plot_richness` function.
First we need to convert the `taxmap` object to a `phyloseq` object, since all of the `phyloseq` functions expect `phyloseq` objects.
```{r fig.width=10, fig.height=5}
library(phyloseq)
ps_obj <- as_phyloseq(obj,
otu_table = "otu_rarefied",
otu_id_col = "OTU_ID",
sample_data = sample_data,
sample_id_col = "SampleID")
plot_richness(ps_obj, color = "Type", x = "Site")
```
Each dot is a sample and the error bars in some of the indexes are the standard error.
```{r include=FALSE}
save(ps_obj, file = "ps_obj.Rdata")
```
*NOTE: the `as_phyloseq` function is only available in the development version of metacoder. If you want to try it, you can install it by typing:*
```{r eval = FALSE}
devtools::install_github("ropensci/taxa")
devtools::install_github("grunwaldlab/metacoder")
```
*or you can use the current version of metacoder, download the file "ps_obj.Rdata" <a href="ps_obj.Rdata" download="ps_obj.Rdata">here</a>, and load the `ps_obj` object this way:*
```{r}
load("ps_obj.Rdata")
```
## Plotting taxon abundance with heat trees
Alpha diversity statistics capture the diversity of whole samples in a single number, but to see the abundance of each taxon in a group of samples (e.g., root samples), we need to use other techniques.
Stacked barcharts are typically used for this purpose, but we will be using heat trees.
First, we need to calculate the abundance of each taxon for a set of samples from our OTU abundance information.
We can use the `calc_taxon_abund` function to do this, grouping by a sample characteristic:
```{r}
obj$data$type_abund <- calc_taxon_abund(obj, "otu_rarefied",
cols = sample_data$SampleID,
groups = sample_data$Type)
print(obj$data$type_abund)
```
Now we can use these taxon counts to make heat trees of the primary taxa present in leafs and roots:
```{r fig.width = 10, fig.height = 10}
set.seed(2)
obj %>%
taxa::filter_taxa(leaf > 50) %>% # taxa:: needed because of phyloseq::filter_taxa
heat_tree(node_label = taxon_names,
node_size = leaf,
node_color = leaf,
layout = "da", initial_layout = "re",
title = "Taxa in leafs")
```
```{r fig.width = 10, fig.height = 10}
set.seed(3)
obj %>%
taxa::filter_taxa(root > 50) %>% # taxa:: needed because of phyloseq::filter_taxa
heat_tree(node_label = taxon_names,
node_size = root,
node_color = root,
layout = "da", initial_layout = "re",
title = "Taxa in roots")
```
Note that we needed to qualify `filter_taxa` with `taxa::`.
This is because `phyloseq` is loaded and it also has a function called `filter_taxa`.
## Beta (between sample) diversity
Beta diversity is a way to quantify the difference between two communities.
There are many metrics that are used for this, but we will only mention a few of the more popular ones.
A few also incorporate phylogenetic relatedness and require a phylogenetic tree of the organisms in either community to be calculated.
#### Examples of indexes used with presence/absence data:
* **Sørensen**: two times the number of species common to both communities divided by the sum of the number of species in each community.
* **Jaccard**: the number of species common to both communities divided by the number of species in either community.
* **Unifrac**: The fraction of the phylogenetic tree branch lengths shared by the two communities.
#### Examples of indexes used with count data:
* **Bray–Curtis**: The sum of lesser counts for species present in both communities divided by the sum of all counts in both communities. This can be thought of as a quantitative version of the Sørensen index.
* **Weighted Unifrac**: The fraction of the phylogenetic tree branch lengths shared by the two communities, weighted by the counts of organisms, so more abundant organisms have a greater influence.
The `vegan` function `vegdist` is used to calculate the pairwise beta diversity indexes for a set of samples.
Since this is a pairwise comparison, the output is a triangular matrix.
In R, a `matrix` is like a `data.frame`, but all of the same type (e.g. all numeric), and has some different behavior.
```{r fig.width=7, fig.height=5}
beta_dist <- vegdist(t(obj$data$otu_rarefied[, sample_data$SampleID]),
index = "bray")
```
Since `vegdist` does not have a `MARGIN` option like `diversity`, we need to `r gloss$add("transpose")` the matrix with the `t` function.
## Ordination
The typical way beta diversity is plotted is using ordination.
Ordination is a way to display "high dimensional" data in a viable number of dimensions (2 to 3).
Our data is "high dimensional" because we have many samples with many species and species can be considered a "dimension".
If we had only two species, we could make a scatter plot of their abundance in each sample and get an idea of how the samples differ.
With thousands of species, this is not possible.
Instead, ordination is used to try to capture the information in many dimensions by in a smaller number of new "artificial" dimensions.
```{r fig.width=7, fig.height=5}
mds <- metaMDS(beta_dist)
```
That transformed our beta diversity matrix into a set of coordinates in two dimensions, which attempt to capture the differences in the data.
However, it is in a format specific to `vegan`, so we will have to convert the data to a form that we can use for plotting.
```{r}
mds_data <- as.data.frame(mds$points)
```
To use the sample data in the plotting, we can combine the coordinate data with the sample data table:
```{r}
mds_data$SampleID <- rownames(mds_data)
mds_data <- dplyr::left_join(mds_data, sample_data)
```
Now that we have the data in a format ggplot2 likes, we can plot it.
Lets plot our two new dimensions and color them by sample type (i.e. leaves vs roots).
```{r fig.width=7, fig.height=5}
library(ggplot2)
ggplot(mds_data, aes(x = MDS1, y = MDS2, color = Type)) +
geom_point()
```
This shows that leaf and root samples are quite distinct, as we would expect.
We can also color them by Site:
```{r fig.width=7, fig.height=5}
ggplot(mds_data, aes(x = MDS1, y = MDS2, color = Site)) +
geom_point()
```
It appears that within the leaf and root clusters, we "sub-clusters" corresponding to site.
Finally, lets look at genotype:
```{r fig.width=7, fig.height=5}
ggplot(mds_data, aes(x = MDS1, y = MDS2, color = Genotype)) +
geom_point()
```
There is no discernible pattern there, suggesting plant genotype does not correspond to community structure.
We can also do the above quickly in `phyloseq` using the `ordinate` and `plot_ordination` functions.
Lets look at the differences between leaf and root samples again, but using a difference index this time.
```{r fig.width=7, fig.height=5}
ps_ord <- ordinate(ps_obj, method = "NMDS", distance = "jsd")
plot_ordination(ps_obj, ps_ord, type = "samples", color = "Type")
```
## Differential heat trees
Beta diversity summarizes the difference between two samples in a single number, but does not tell you why the samples are different.
We developed a plotting technique we call "differential heat trees" to display differences in abundance of each taxon in two samples or groups of samples.
These are like the heat trees shown in the plotting section, but colored with a diverging color scale the indicates which sample a each taxon is more abundant in.
In this way, it is similar to heat maps used for the same purpose and in gene expression studies.
### Comparing taxon abundance in two groups
Before we can make a differential heat tree, we need to calculate the difference in abundance for each taxon between groups of samples, such as root vs leaf samples.
There are many ways this can be done, ranging from simple differences in mean read counts, to outputs from specialized programs designed for microbiome data.
We will be using a function in `metacoder` called `compare_groups` to do the comparisons.
First however, we need to calculate the per-taxon abundance from our OTU counts.
We can use the `calc_taxon_abund` function to do this:
```{r}
obj$data$tax_abund <- calc_taxon_abund(obj, "otu_rarefied",
cols = sample_data$SampleID)
```
For each taxon, at every rank, the `compare_groups` function compares two groups of counts.
We have to define which sample belongs to which groups using `groups` option:
```{r}
obj$data$diff_table <- compare_groups(obj, dataset = "tax_abund",
cols = sample_data$SampleID,
groups = sample_data$Type)
print(obj$data$diff_table)
```
By default, `r gloss$add("Wilcoxon Rank Sum test")` is used to test for statistical significance of differences and various summary statistics are reported to measure the size of the difference.
Since we have done many independent tests (one for each taxon), we need to `r gloss$add('Multiple comparison corrections', shown = 'correct for multiple comparisions')`.
We will do that with a false discovery rate (FDR) correction, but other types can be used as well.
```{r}
obj <- mutate_obs(obj, "diff_table",
wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr"))
```
The most useful statistic for plotting is the log of ratio of median abundances in the two groups, since it is centered on 0 and is symmetric (e.g., a value of -4 is the same magnitude as 4).
Lets set any differences that are not significant to 0 so all differences shown on the plot are significant.
```{r}
obj$data$diff_table$log2_median_ratio[obj$data$diff_table$wilcox_p_value > 0.05] <- 0
```
Now we have all the info needed to make a differential heat tree.
The standard `heat_tree` function can be used, but a few things need to be done to make an effective differential heat tree:
* A diverging color scale should be used, preferably with a neutral color in the middle, like gray.
* The interval of values displayed must be symmetric around zero so the neutral middle color is centered on zero.
* The node color should be set to the log of ratio of median abundances in the two groups
```{r fig.width = 10, fig.height = 10}
set.seed(1)
heat_tree(obj,
node_label = taxon_names,
node_size = n_obs, # number of OTUs
node_color = log2_median_ratio, # difference between groups
node_color_interval = c(-10, 10), # symmetric interval
node_color_range = c("cyan", "gray", "magenta"), # diverging colors
node_size_axis_label = "OTU count",
node_color_axis_label = "Log 2 ratio of median counts")
```
That's not too bad looking, but we can tweak it a bit to make it better by changing the layout, adding a title, and filtering out some of the taxa with odd names (depending on what the plot is for, you might not want to remove these).
Here `mutate_obs` is used to add a temporary variable to the taxmap object that will contain the taxon names with special characters like `[` removed.
```{r fig.width = 10, fig.height = 10}
set.seed(1)
obj %>%
mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\\[|\\]", replacement = "")) %>%
taxa::filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$")) %>%
heat_tree(node_label = cleaned_names,
node_size = n_obs, # number of OTUs
node_color = log2_median_ratio, # difference between groups
node_color_interval = c(-10, 10), # symmetric interval
node_color_range = c("cyan", "gray", "magenta"), # diverging colors
node_size_axis_label = "OTU count",
node_color_axis_label = "Log 2 ratio of median counts",
layout = "da", initial_layout = "re", # good layout for large trees
title = "leaf vs root samples")
```
What color corresponds to each group depends on the order they were given in the `compare_groups` function.
Since "leaf" is "treatment_1" in the "diff_table", and "log2_median_ratio" is defined as "log2(treatment_1 / treatment_2)", when a taxon has more counts in leaf samples, the ratio is positive, therefore taxa more abundant in leafs are colored magenta in this case.
### Comparing taxon abundance with more than 2 groups
For pair-wise comparisons with more than two groups we have developed a graphing technique we call a heat tree matrix, in which trees are made for all pairs of sample groupings and arranged in a matrix with a larger key tree that has the taxon names and a legend.
The code to do this is similar to the code for making a single differential heat tree like we did above, but uses the `heat_tree_matrix` function.
First we need to use `compare_groups` to generate data for all pair-wise comparisons for a grouping with more than two treatments.
The code below compares all the sites used in this study to eachother:
```{r}
obj$data$diff_table <- compare_groups(obj, dataset = "tax_abund",
cols = sample_data$SampleID,
groups = sample_data$Site)
print(obj$data$diff_table)
```
We then need to correct for multiple comparisons and set non-significant differences to zero like we did before:
```{r}
obj <- mutate_obs(obj, "diff_table",
wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr"))
obj$data$diff_table$log2_median_ratio[obj$data$diff_table$wilcox_p_value > 0.05] <- 0
```
Finally we call the `heat_tree_matrix` command with the same options that would be used for a single tree.
```{r fig.width = 10, fig.height = 10}
obj %>%
taxa::filter_taxa(taxon_ranks == "o", supertaxa = TRUE) %>%
mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\\[|\\]", replacement = "")) %>%
taxa::filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$")) %>%
heat_tree_matrix(dataset = "diff_table",
node_label = cleaned_names,
node_size = n_obs, # number of OTUs
node_color = log2_median_ratio, # difference between groups
node_color_trans = "linear",
node_color_interval = c(-3, 3), # symmetric interval
edge_color_interval = c(-3, 3), # symmetric interval
node_color_range = diverging_palette(), # diverging colors
node_size_axis_label = "OTU count",
node_color_axis_label = "Log 2 ratio of median counts",
layout = "da", initial_layout = "re",
key_size = 0.67,
seed = 2)
```
We have only compared three groups here, due to the nature of this dataset, so this technique is not much better than 3 separate, full sized graphs in this case, but with more groups it can be a uniquely effective way to show lots of comparisons.
See the end of the [example analysis](02--quick_example.html) for a better example of this technique.
```{r include=FALSE}
save(ps_obj, obj, sample_data, file = "diversity_data.Rdata")
```
## Exercises
In these exercises, we will be using the `ps_obj` and `obj` from the analysis above.
If you did not run the code above or had problems, run the following code to get the objects used.
You can download the "diversity_data.Rdata" file <a href="diversity_data.Rdata" download="diversity_data.Rdata">here</a>.
```{r}
load("diversity_data.Rdata")
```
**1a)** Look at the documentation for the `plot_richness` function from `phyloseq`. Try to make a plot using only the Simpson and inverse Simpson indexes, colored by site and split up by sample type (leaf vs root).
```{r hide_button = TRUE}
plot_richness(ps_obj, color = "Site", x = "Genotype", measures = c("Simpson", "InvSimpson"))
```
**1b)** The Simpson and inverse Simpson indexes display the same information in different ways (if you know one, you can calculate the other). How do they differ?
```{asis hide_button = "Show Answer"}
The Simpson index is bound between 0 and 1, wheras the inverse Simpson index is greater than zero. Also, the inverse Simpson index tends to spread out higher diversity values more.
```
**2)** Rarefaction and converting counts to proportions are two ways of accounting for unequal sample depth. Although proportions are more intuitive and easier to understand, why might rarefaction be better when calculating diveristy indexes?
```{asis hide_button = "Show Answer"}
Some diversity indexes take into account the number of species. This especially important for those that use presence/absence instead of abundance, such as the Sørensen and Jaccard indexes. When read counts are rarefied, some low counts go to zero, effectively reducing the number of species in the sample, whereas converting to proportions will never make a count go to zero, so will not remove the effect of unequal sampling depth on these indexes.
```
**3a)** Using the techniques presented in the section on plotting the abundance of taxa, make a plot of taxon abundance for the site encoded "Jam".
```{r hide_button = TRUE}
obj$data$site_abund <- calc_taxon_abund(obj, "otu_rarefied",
cols = sample_data$SampleID,
groups = sample_data$Site)
obj %>%
taxa::filter_taxa(Jam > 50) %>% # taxa:: needed because of phyloseq::filter_taxa
heat_tree(node_label = taxon_names,
node_size = Jam,
node_color = Jam,
layout = "da", initial_layout = "re",
title = "Taxa in leafs")
```
**3b)** What are some of the most abundant families in the site "Jam"?
```{asis hide_button = "Show Answer"}
Sphingobacteriaceae, Chitinophagaceae, Sphingomonadaceae, Comamonadaceae, Microbacteriaceae, etc..
```
**3c)** What is the most abundant phylum in the site "Jam"?
```{asis hide_button = "Show Answer"}
Proteobacteria
```
**4a)** The ordination analysis colored by genotype showed no community differences corresponding to genotype. Try to use `compare_groups` to see if any taxa are significantly differentially abundant between genotypes to verify this result.
```{r hide_button = TRUE}
library(dplyr)
obj %>%
compare_groups(dataset = "tax_abund",
cols = sample_data$SampleID,
groups = sample_data$Genotype) %>%
mutate(wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr")) %>%
filter(wilcox_p_value < 0.05)
```
**4b)** How many taxa have significant differences between genotypes?
```{asis hide_button = "Show Answer"}
None
```
## References