diff --git a/_posts/2023-12-11-class-9-clustering/class-9-clustering.Rmd b/_posts/2023-12-11-class-9-clustering/class-9-clustering.Rmd index 586565b..80a50c4 100644 --- a/_posts/2023-12-11-class-9-clustering/class-9-clustering.Rmd +++ b/_posts/2023-12-11-class-9-clustering/class-9-clustering.Rmd @@ -162,6 +162,17 @@ Repeat the process above, but compute for 4 clusters ```{r} # TODO find k = 4 clusters +set.seed(0) +km.res4 <- kmeans(x = test_data, centers = 4, nstart = 25) + +cluster_k4 <- cbind(test_data, "cluster" = km.res4$cluster) + +cluster_k4$cluster <- km.res4$cluster + +cluster_k4$cluster <- factor(cluster_k4$cluster) + +ggplot(cluster_k4, aes(x = x_val, y = y_val, color = cluster)) + + geom_point(size = 2) ``` @@ -282,6 +293,18 @@ Repeat finding clusters using different points on the tree. For example, what if ```{r} # TODO repeat finding clusters with different numbers of clusters +hc_clusters5 <- cutree(tree = hc, k = 5) + +hc_clusters5 <- cbind(test_data, cluster = hc_clusters5) + +hc_clusters5$cluster <- factor(hc_clusters5$cluster) + +hc_clusters5 %>% + dplyr::mutate(sample = rownames(.)) %>% + ggplot(aes(x = x_val, y = y_val, label = sample, color = cluster)) + + geom_point(size = 2) + + geom_text(hjust=2, vjust=0) + ``` @@ -376,7 +399,7 @@ Let's run PCA and use matrix multiplication to visualize the first PC in our x-y ```{r} svda <- svd(as.matrix(normalized_dat)) -pc <- as.matrix(normalized_dat) %*% svda$v[, 1] %*% t(svda$v[, 1]) # Matrix multiplication +pv <- as.matrix(normalized_dat) %*% svda$v[, 1] %*% t(svda$v[, 1]) # Matrix multiplication bp <- svda$v[2, 1] / svda$v[1, 1] ap <- mean(pc[, 2]) - bp * mean(pc[, 1]) dim_plot + geom_segment(xend = pc[, 1], yend = pc[, 2]) + @@ -470,7 +493,7 @@ ggplot(pca_cluster, aes(x = PC1, y = PC2, color = cluster)) + ylab(paste0("PC2: ",round(percentVar[2] * 100),"% variance")) ``` -Let's try again with 3 clusters +Let's try again with 2 clusters ```{r} set.seed(123) @@ -490,7 +513,16 @@ Try with 4 clusters, what names cluster together? ```{r} # TODO Repeat with 4 clusters and repeat what names cluster together +set.seed(0) +km.res4 <- kmeans(normalized_dat, centers = 4, nstart = 25) + +pca_cluster <- cbind(pca_vals, "cluster" = km.res4$cluster) +pca_cluster$cluster <- factor(pca_cluster$cluster) +ggplot(pca_cluster, aes(x = PC1, y = PC2, color = cluster)) + + geom_point(size = 2) + + xlab(paste0("PC1: ",round(percentVar[1] * 100),"% variance")) + + ylab(paste0("PC2: ",round(percentVar[2] * 100),"% variance")) ``` @@ -586,17 +618,17 @@ plot(hc) In the plot above we see 3 clear clusters ```{r} -hc_clusters3 <- cutree(tree = hc, k =4) +hc_clusters3 <- cutree(tree = hc, k =3) hc_clusters3 <- cbind(pca_vals, "cluster" = hc_clusters3) hc_clusters3$cluster <- factor(hc_clusters3$cluster) -ggplot(hc_clusters3, aes(x = PC1, y = PC3, color = cluster)) + +ggplot(hc_clusters3, aes(x = PC1, y = PC2, color = cluster)) + geom_point(size = 2) + xlab(paste0("PC1: ",round(percentVar[1] * 100),"% variance")) + - ylab(paste0("PC3: ",round(percentVar[3] * 100),"% variance")) + ylab(paste0("PC2: ",round(percentVar[2] * 100),"% variance")) ``` **Exercise** diff --git a/_posts/2023-12-12-class-10-heatmap/class-10-heatmap.Rmd b/_posts/2023-12-12-class-10-heatmap/class-10-heatmap.Rmd index b4b003c..ae14cd1 100644 --- a/_posts/2023-12-12-class-10-heatmap/class-10-heatmap.Rmd +++ b/_posts/2023-12-12-class-10-heatmap/class-10-heatmap.Rmd @@ -67,7 +67,7 @@ Notice here we can only see the names for California ### Cleaning up by normalizing values -It's pretty hard to see much strucutre in the data just using the raw values. Let's try with our normalized values +It's pretty hard to see much structure in the data just using the raw values. Let's try with our normalized values ```{r, fig.height=10} normalized_mat <- t(t(names_mat) / colSums(names_mat)) @@ -260,7 +260,7 @@ Add your own colors to the row annotations We can also add annotations to the columns using the same approach. Let's load in some of the data I've compiled for the states ```{r} -state_info <- read.csv(here("class_7-9_data", "state_info.csv")) +state_info <- read.csv(here("class_8-10_data", "state_info.csv")) head(state_info) ``` diff --git a/_posts/2023-12-12-class-10-heatmap/class-10-heatmap.html b/_posts/2023-12-12-class-10-heatmap/class-10-heatmap.html new file mode 100644 index 0000000..ed09e22 --- /dev/null +++ b/_posts/2023-12-12-class-10-heatmap/class-10-heatmap.html @@ -0,0 +1,2204 @@ + + + + +
+ + + + + + + + + + + + + + + +The Rmarkdown for this class is on github
+pheatmap
Before we get started, let’s download all of the files you will need for the next three classes.
+# conditionally download all of the files used in rmarkdown from github
+source("https://raw.githubusercontent.com/rnabioco/bmsc-7810-pbda/main/_posts/2023-12-08-class-8-matricies/download_files.R")
+Today we are going to continue working with the same data we used to talk about matrices and clustering, the top 100 boy and girl names by state for 2020.
+We’ve loaded in both the boy and the girl names as separate matrices, we can combine these using rbind
to bind the rows
names_mat <- rbind(male_names_mat, female_names_mat)
+To make a heatmap, we will use the pheatmap package
+pheatmap(names_mat)
+Notice here we can only see the names for California
+It’s pretty hard to see much structure in the data just using the raw values. Let’s try with our normalized values
+ +You can see that when we try to normalize the values by dividing each number by the total number for the state, the trends in the data are much more clear and we don’t only see the data for California.
+Another popular way to normalize the data for a heatmap is called a z-score. This is also know a mean-centered scaled data. The z-score is calculated as follows
+Where x is each value \(\mu\) is the mean for the values and \(\sigma\) is the standard deviation for the values. Here, \(x - \mu\) is called “centering” and dividing by \(\sigma\) is called “scaling”
+In R, we can scale the data using the scale
function. Scaling is done on the columns, so here we can use the original matrix without any transformations
scale
takes 3 arguments
x - a numeric matrix(like object).
+
+center - either a logical value or numeric-alike vector of length equal to the
+number of columns of x, where ‘numeric-alike’ means that as.numeric(.) will be
+applied successfully if is.numeric(.) is not true.
+
+scale - either a logical value or a numeric-alike vector of length equal to the
+number of columns of x.
+The center
refers to \(x - \mu\) from the equation above and scale
refers to dividing by \(\sigma\). We will set both to be true to be a z-score.
scaled_mat <- scale(names_mat, scale = TRUE, center = TRUE)
+pheatmap(scaled_mat)
+Notice how we again can see more structure in the data.
+The scale shows both negative and positive values. Negative values had counts less than the mean, positive values had counts above the mean, and 0 means the value was equal to the mean. An important note here, values that are negative do not mean they were zero. For example, the scaled value for “Grayson” in California is negative:
+scaled_mat["Grayson", "California"]
+[1] -1.043828
+but there are still several hundred individuals named “Grayson” in California.
+names_mat["Grayson", "California"]
+[1] 453
+Note if you want a z-score of gene expression data, you will want to center and scale by the genes not the samples. Most gene expression data is stored as gene x sample so you will likely need to transform the matrix before scaling:
+scaled_mat <- t(scale(t(unscaled_mat), scale = TRUE, center = TRUE))
+Exercise +Looking at the three heatmaps, what is the most popular name? What heatmap is easiest to interpret?
+The default color palette for pheatmap isn’t always the color palette you want. One color palette I like is the magma color from the viridis
package. To use this package you can just call any of the color functions. and provide the number of colors you want in the palette. We will use the magma
function below.
You can also add custom color palettes using colorRampPalette
. This creates a function to generate a color palette with your colors for any number of values:
color_function <- colorRampPalette(c("navy", "white", "red"))
+color_function
+function (n)
+{
+ x <- ramp(seq.int(0, 1, length.out = n))
+ if (ncol(x) == 4L)
+ rgb(x[, 1L], x[, 2L], x[, 3L], x[, 4L], maxColorValue = 255)
+ else rgb(x[, 1L], x[, 2L], x[, 3L], maxColorValue = 255)
+}
+<bytecode: 0x7fdaba45bb70>
+<environment: 0x7fdab8051430>
+color_function(10)
+ [1] "#000080" "#38389C" "#7171B8" "#AAAAD4" "#E2E2F0" "#FFE2E2"
+ [7] "#FFAAAA" "#FF7171" "#FF3838" "#FF0000"
+pheatmap(scaled_mat, color = color_function(100))
+You can also run this in one line of code and not save the function
+pheatmap(scaled_mat, color = colorRampPalette(c("navy", "white", "red"))(100))
+You can even use custom color palettes with any HEX color codes you want. A personal favorite of mine is a blue/yellow palette from the ArchR
package.
blueYellow <- c("#352A86", "#343DAE", "#0262E0", "#1389D2", "#2DB7A3",
+ "#A5BE6A", "#F8BA43", "#F6DA23", "#F8FA0D")
+
+pheatmap(scaled_mat, color = blueYellow)
+Exercise +Make a heatmap with your own color palette
+# TODO make the heatmap with your unique color palette
+We can help interpretation by adding row and column annotations to the plot. Let’s first add some row annotations showing if the name is a male or a female name. We start by making a data frame of the names and indicating if the name was from the male or female database.
+names_annotation <- data.frame("sex" = c(rep("female", nrow(female_names_mat)),
+ rep("male", nrow(male_names_mat))),
+ row.names = c(rownames(female_names_mat),
+ rownames(male_names_mat)))
+
+head(names_annotation)
+ sex
+Ava female
+Olivia female
+Emma female
+Charlotte female
+Harper female
+Amelia female
+Note here the row names are the same as the row names of our matrix and we have one column named “sex”
+We can now add this to the plot using the annotation_row
. The key here is that the row names of your matrix must be the same as the row names of your data frame.
blueYellow <- c("#352A86", "#343DAE", "#0262E0", "#1389D2", "#2DB7A3",
+ "#A5BE6A", "#F8BA43", "#F6DA23", "#F8FA0D")
+
+pheatmap(scaled_mat,
+ color = blueYellow,
+ annotation_row = names_annotation)
+Now you can see very clearly what names were in the female database and what names were in the male database. Again, we can change the default colors by using a list
object. Here the names of the list
must match the column names in your data frame and the names of the colors in the list must match the levels of that column.
We can either use the name of colors:
+$sex
+ male female
+ "red" "yellow"
+pheatmap(scaled_mat,
+ color = blueYellow,
+ annotation_row = names_annotation,
+ annotation_colors = all_colors)
+Or a HEX code:
+$sex
+ male female
+"#B067A3" "#9C954D"
+pheatmap(scaled_mat,
+ color = blueYellow,
+ annotation_row = names_annotation,
+ annotation_colors = all_colors)
+Exercise +Add your own colors to the row annotations
+# TODO Add your own row annotaiton colors
+We can also add annotations to the columns using the same approach. Let’s load in some of the data I’ve compiled for the states
+ State Location Size Density
+1 Alabama South Middle 4
+2 Alaska West Small 6
+3 Arizona West Middle 4
+4 Arkansas South Middle 5
+5 California West Large 2
+6 Colorado West Middle 5
+This table has information about the location, the size (based on population as small middle or large) and the density (on a scale of 1-6).
+We first will need to reformat so that the row names are the sates. Now the row names must be the same as the column names in the matrix.
+state_info <- tibble::column_to_rownames(state_info, "State")
+
+
+head(state_info)
+ Location Size Density
+Alabama South Middle 4
+Alaska West Small 6
+Arizona West Middle 4
+Arkansas South Middle 5
+California West Large 2
+Colorado West Middle 5
+We can now pass this data frame to annotation_col
in pheatmap
pheatmap(scaled_mat,
+ color = blueYellow,
+ annotation_col = state_info)
+You can see that we now have three levels of annotation on our heatmap. One thing to notice is that our population density is given as a scale of colors because it is seen as a numeric score. If we instead use this as a factor, this scale will go away:
+state_info$Density <- factor(state_info$Density)
+
+pheatmap(scaled_mat,
+ color = blueYellow,
+ annotation_col = state_info)
+As before, we can add in our own color palettes by using a named list where the list names correspond to the column names of our data frame. Here, I’m going to use the package MetBrewer
to generate palettes for me. You can see all possible palette options here
location_colors <- met.brewer("Pissaro",
+ n = length(unique(state_info$Location))) %>%
+ as.character()
+
+names(location_colors) <- unique(state_info$Location)
+
+location_colors
+ South West New_England Midwest
+ "#4c825d" "#8cae9e" "#8dc7dc" "#0e2a4d"
+size_colors <- met.brewer("Navajo",
+ n = length(unique(state_info$Size))) %>%
+ as.character()
+
+names(size_colors) <- unique(state_info$Size)
+
+size_colors
+ Middle Small Large
+"#660d20" "#e59a52" "#edce79"
+density_colors <- met.brewer("Juarez",
+ n = length(levels(state_info$Density))) %>%
+ as.character()
+
+names(density_colors) <- levels(state_info$Density)
+
+density_colors
+ 1 2 3 4 5 6
+"#a82203" "#208cc0" "#f1af3a" "#cf5e4e" "#637b31" "#003967"
+all_colors <- list("Location" = location_colors,
+ "Size" = size_colors,
+ "Density" = density_colors)
+
+all_colors
+$Location
+ South West New_England Midwest
+ "#4c825d" "#8cae9e" "#8dc7dc" "#0e2a4d"
+
+$Size
+ Middle Small Large
+"#660d20" "#e59a52" "#edce79"
+
+$Density
+ 1 2 3 4 5 6
+"#a82203" "#208cc0" "#f1af3a" "#cf5e4e" "#637b31" "#003967"
+pheatmap(scaled_mat,
+ color = blueYellow,
+ annotation_col = state_info,
+ annotation_colors = all_colors)
+Finally, we can add in the annotation from the rows as well. First we can add in the colors we had before
+all_colors <- c(all_colors,
+ list("sex" = c("male" = "#B067A3", "female" = "#9C954D")))
+
+all_colors
+$Location
+ South West New_England Midwest
+ "#4c825d" "#8cae9e" "#8dc7dc" "#0e2a4d"
+
+$Size
+ Middle Small Large
+"#660d20" "#e59a52" "#edce79"
+
+$Density
+ 1 2 3 4 5 6
+"#a82203" "#208cc0" "#f1af3a" "#cf5e4e" "#637b31" "#003967"
+
+$sex
+ male female
+"#B067A3" "#9C954D"
+And then add the names_annotation
back into the argument for annotation_row
pheatmap(scaled_mat,
+ color = blueYellow,
+ annotation_col = state_info,
+ annotation_row = names_annotation,
+ annotation_colors = all_colors)
+Exercise
+What happens if all_colors
only has some of the colors in the annotation data frames?
# TODO try making the heatmap with a list that only contains color arguments for
+# some of the columns in the annotation data frames
+Notice how the annotation_colors
argument accepts the color arguments for both the rows and the columns. The important things about this argument to color the annotation is
annotation_row
or annotation_col
data frameThere are also a few important aspects of the annotation data frames
+1. The rownames of annotation_row
must match the rownames of the matrix (although the order does not need to be the same)
+2. The rownames of annotation_col
must match the column names of the matrix (although the order does not need to be the same)
+3. You can add as many annotations to the rows and columns that you want. You just need to include these as columns in either the annotation_row
or annotation_col
data frame.
One aspect of the plots that you’ve probably noticed is dendrograms for both the rows and the columns. The clustering is done using hclust
like we did in the clustering lecture.
Just like with our clustering methods we can cut the dendrogram based on an expected number of clusters to group either the states or the names. pheatmap
has a function that uses cutree
to identify clusters and physically separates these clusters with a white space using cutree_rows
and cutree_cols
. First, we can visualize three clusters of the names.
pheatmap(scaled_mat,
+ color = blueYellow,
+ annotation_col = state_info,
+ annotation_row = names_annotation,
+ annotation_colors = all_colors,
+ cutree_rows = 3)
+Or we can visualize three clusters of the states
+pheatmap(scaled_mat,
+ color = blueYellow,
+ annotation_col = state_info,
+ annotation_row = names_annotation,
+ annotation_colors = all_colors,
+ cutree_cols = 3)
+We could even visuzalize both at once
+pheatmap(scaled_mat,
+ color = blueYellow,
+ annotation_col = state_info,
+ annotation_row = names_annotation,
+ annotation_colors = all_colors,
+ cutree_rows = 3,
+ cutree_cols = 3)
+Exercise +Try making the heatmap with different numbers of clusters. What number of clusters makes the most sense?
+# TODO Make the heatmap cutting to make different numbers of clusters
+As with using cutree
ourselves, you can pick any number of clusters to visualize in this way.
In addition to visualizing these clusters, we can also pull the clustering information out of the heatmap object. First we can save the plot to a variable. Right now we are setting silent
to TRUE
so the heatmap isn’t drawn
heatmap <- pheatmap(scaled_mat,
+ color = blueYellow,
+ annotation_col = state_info,
+ annotation_row = names_annotation,
+ annotation_colors = all_colors,
+ cutree_rows = 3,
+ cutree_cols = 3,
+ silent = TRUE)
+
+names(heatmap)
+[1] "tree_row" "tree_col" "kmeans" "gtable"
+The hclust
object for the row are in tree_row
and the hclust
object for the column are in tree_col
We can plot just the dendrogram as we did previously
+plot(heatmap$tree_col)
+plot(heatmap$tree_row)
+We can also now use cutree
on these hclust
objects in the exact same way we did in the clustering lecture to pull out clusters
Columns:
+cutree(tree = heatmap$tree_col, k = 3)
+ Alabama Alaska Arizona Arkansas California
+ 1 2 3 1 3
+ Colorado Connecticut Delaware Florida Georgia
+ 2 3 3 3 1
+ Hawaii Idaho Illinois Indiana Iowa
+ 3 2 3 2 2
+ Kansas Kentucky Louisiana Maine Maryland
+ 2 1 1 2 3
+Rows:
+cutree(tree = heatmap$tree_row, k = 3)
+ William James John Elijah Noah Liam Mason
+ 1 1 2 1 1 1 2
+ Oliver Henry Jackson Samuel Jaxon Asher Grayson
+ 1 2 2 3 3 2 2
+ Levi Michael Carter Benjamin Charles Wyatt Thomas
+ 2 2 2 2 3 2 3
+ Aiden Luke David Owen Daniel Logan Joseph
+ 3 3 3 2 2 2 3
+ Lucas Joshua Jack Alexander Maverick Gabriel Ethan
+ 2 3 2 2 3 3 2
+ Eli Isaac Hunter Ezra Theodore Ava Olivia
+ 3 3 3 3 2 1 1
+ Emma Charlotte Harper Amelia Elizabeth Evelyn Isabella
+ 1 1 2 1 3 2 2
+ Ella Avery Abigail Sophia Layla Mia Madison
+ 3 3 3 2 3 2 3
+ Lily Ellie Nova Eleanor Zoey Brooklyn Riley
+ 3 3 3 3 3 3 3
+ Nora Aria Mila Stella Natalie Luna Penelope
+ 3 3 3 3 3 3 3
+ Aurora Claire
+ 3 3
+Exercise +What names are the most similar? What states are the most similar?
+Another option is to remove the clustering all together using cluster_rows
and cluster_cols
pheatmap(scaled_mat,
+ color = blueYellow,
+ annotation_col = state_info,
+ annotation_row = names_annotation,
+ annotation_colors = all_colors,
+ cluster_rows = FALSE,
+ cluster_cols = FALSE)
+If we don’t cluster, the order of the rows and columns is taken directly from the order of the matrix. We can change to alphabetical order as follows
+scaled_mat_ord <- scaled_mat[order(rownames(scaled_mat)), ]
+
+pheatmap(scaled_mat_ord,
+ color = blueYellow,
+ annotation_col = state_info,
+ annotation_row = names_annotation,
+ annotation_colors = all_colors,
+ cluster_rows = FALSE,
+ cluster_cols = FALSE)
+We could even order the states by the density
+state_info <- state_info %>%
+ dplyr::arrange(Density)
+
+scaled_mat_ord <- scaled_mat[ , order(match(colnames(scaled_mat),
+ rownames(state_info)))]
+
+pheatmap(scaled_mat_ord,
+ color = blueYellow,
+ annotation_col = state_info,
+ annotation_row = names_annotation,
+ annotation_colors = all_colors,
+ cluster_rows = FALSE,
+ cluster_cols = FALSE)
+Exercise +Order the heatmap by density, but cluster the names
+# TODO order the heatmap by density but cluster the names
+Exercise +Order the heatmap by location, but cluster the names
+# TODO order the heatmap by location and cluster the names
+One other helpful way to adjust your heatmap is to remove the row and column names. You can control if the row and column names are seen using show_rownames
and show_colnames
pheatmap(scaled_mat,
+ color = blueYellow,
+ annotation_col = state_info,
+ annotation_row = names_annotation,
+ annotation_colors = all_colors,
+ show_rownames = FALSE,
+ show_colnames = FALSE)
+There are many possible adjustments you can make, we’ve just gone through the adjustments I use the most often. To see all possible arguments see ?pheatmap
The content of this class borrows heavily from previous tutorials:
+ +