Gene-Expression-Unsupervised-Clusteing

Using consensus clustering approach to find pattern in expression data sets

Unsupervised class discovery is a data mining method to identify unknown possible groups (clusters) of items solely based on intrinsic features and no external variables. Basically clustering includes four steps:

[1] Data preparation,

[2] Dissimilarity matrix calculation,

[3] applying clustering algorithms,

[4] Assessing cluster assignment

[1] Data preparation

In this step we need to filter out incomplete cases and low expressed genes, then transforming/normalizing gene expression values. Usually analysis would start from a raw count matrix comming from an RNA-seq experiment. Because there are a large number of features (gene) in such matrix , a feature selection step should be done to limit the analysis to those genes that possibly explain variation between samples in the cohort.
To do so ;

For filteration: I keep those genes that have expression in 10% of samples with a count of 10 or higher.
For transformation/normalization : I use variance stabilizing transformation (VST). I use vst function from DESeq2 packages which at the same time will normalize the raw count also. Using other type of transformation like Z score, log transformation are also quiet commmon.
For feature selection: I select 2k, 4k and 6k top genes based on median absolute deviation (MAD) . A number of other methods like "feature selection based on the most variance", "feature dimension reduction and extraction based on Principal Component Analysis (PCA)", and "feature selection based on Cox regression model" could be applied. See bioconductor package CancerSubtypes manual.

Other approaches also could be used for example:

Using log2(RSEM) gene expression value and removing genes with NA values more than 10% across samples. Then selected top 25% most-varying genes by standard deviation of gene expression across samples (A. Gordon Robertson et al., Cell,2017).
Using FPKM matrix and keeping genes with (log2(FPKM+1)>2 in at least 10% of samples and selecting a subsets (2K, 4K, 6K ) of MAD ranked genes to identify stable classes (Jakob Hedegaard et al, Cancer Cell, 2016).

#_________________________________reading data and processing_________________________________#
# reading count data
rna <- read.table("Uromol1_CountData.v1.csv", header = T, sep = ",")
head(rna[1:5, 1:5], 5)
#                   U0001 U0002 U0006 U0007 U0010
#ENSG00000000003.13  1458   228  1800  3945   293
#ENSG00000000005.5      0     0     9     0     0
#ENSG00000000419.11   594    23   792  1378   139
#ENSG00000000457.12   548    22  1029   976   148
#ENSG00000000460.15    53     2   190   136    47

dim(rna)
# [1] 60483   476  this is a typical output from hts-seq count matrix with more than 60,000 genes

# dissecting dataset based on gene-type
library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- getBM(attributes= c("ensembl_gene_id","hgnc_symbol", "gene_biotype"), 
               mart= mart)

# see gene types returned by biomaRt
data.frame(table(genes$gene_biotype))[order(- data.frame(table(genes$gene_biotype))[,2]), ]

# we will continue with protein coding here:
rna <- rna[substr(rownames(rna),1,15) %in% genes$ensembl_gene_id[genes$gene_biotype == "protein_coding"],]

dim(rna)
#[1] 19581   476 These are protein coding genes

# reading associated clinical data
clinical.exp <- read.table("uromol_clinic.csv", sep = ",", header = T)
head(clinical.exp[1:5,1:5], 5)
#      UniqueID   CLASS  BASE47    CIS X12.gene.signature
#U0603    U0603 luminal luminal no-CIS          high_risk
#U0497    U0497 luminal   basal no-CIS           low_risk
#U0839    U0839 luminal luminal    CIS           low_risk
#U1043    U1043 luminal luminal no-CIS          high_risk
#U0566    U0566 luminal   basal no-CIS           low_risk

# making sure about sample order in rna and clincal.exp dataset
all(rownames(clinical.exp) %in% colnames(rna))
#[1] TRUE
all(rownames(clinical.exp) == colnames(rna))
#[1] FALSE
# reordering rna dataset
rna <- rna[, rownames(clinical.exp)]
#______________________ Data tranformation & Normalization ______________________________#
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = rna,
                              colData = clinical.exp,
                              design = ~ 1) # 1 passed to the function because of no model
# pre-filteration, however while using DESeq2 package it is not necessary, because the function automatically will filter out low count genes
# Keeping genes with expression in 10% of samples with a count of 10 or higher
keep <- rowSums(counts(dds) >= 10) >= round(ncol(rna)*0.1)
dds <- dds[keep,]

# vst tranformation
vsd <- vst(dds) # For a fully unsupervised transformation one can set blind = TRUE (which is the default).

# Inspecting deseq object
#head(assay(vsd), 3)
#colData(vsd)
#_________________________________# Feature Selection _________________________________#
# top 5K based on MAD 

mads=apply(assay(vsd),1,mad)
# check data distribution
hist(mads, breaks=nrow(assay(vsd))*0.1)
# selecting features
mad2k=assay(vsd)[rev(order(mads))[1:2000],]
#mad4k=assay(vsd)[rev(order(mads))[1:4000],]
#mad6k=assay(vsd)[rev(order(mads))[1:6000],]

[2&3] Dissimilarity matrix calculation & applying clustering algorithms:

Clustering is grouping similar samples into one cluster and keeping them far from disimilar samples based on distance measure. There are a diverse list of dissimilarity matrix calculation methods (distance measures). To see list of available distance measures please check ?stats::dist and ?vegan::vegdist(). The latter need to have the vegan package to be installed:

For log-transformed gene expression, Euclidean based measures can be applied.
For RNA-seq normalised counts, correlation based measures (Pearson, Spearman) or a Poisson-based distance can be used.

Clustering algorithms: To see a list of algorithms please check diceR package vignettes. Most widely used algorithms are partitional clustering and hierarchical clustering.

Partitional clustering are clustering methods used to classify samples into multiple clusters based on their similarity. The algorithms required to specify the number of clusters to be generated. The commonly used partitional clustering, including:

K-means clustering (KM): each cluster is represented by the center or means of the data points belonging to the cluster. This method is sensitive to outliers.
K-medoids clustering or PAM (Partitioning Around Medoids), in which, each cluster is represented by one of the objects in the cluster. PAM is less sensitive to outliers.

Hierarchical clustering (HC) in contrast to partitional clustering, this method does not require to pre-specify the number of clusters to be generated. HC can be grouped into two classes.

Agglomerative:"bottom-up" approach, each observation is initially considered as a cluster of its own (leaf), and pairs of clusters are merged as one moves up the hierarchy.
Divisive: "top-down" approach, This begins with the root so all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Some words on how to compute distances between clusters: Indeed there are diffrent way to do cluster agglomeration (i.e, linkage ). When one using stat package, possible methods include “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median” or “centroid”. Generally complete linkage and Ward’s method are preferred. More reading materials on this can be find here. As a result hierarchical clustering provides a tree-based representation of the objects, which is also known as dendrogram.

In practice HC, KM and PAM are commonly used for gene exoression data.

To quantitatively dtermine the number and membership of possible clusters within the dataset, I will use Consensus Clustering (CC) approach. Applying this method has proved to be effective in new cancer subclasses discoveries. For more information on the methodology please refere to the seminal paper by Monti et al. (2003) and ConsensusClusterPlus package manual.

To determine the number of cluster based on CC, there are several graphics which would help to this extent. (1) A Color-coded heatmap corresponding to the consensus matrix that represent consensus values from 0–1; white corresponds to 0 (never clustered togather) and dark blue to 1 (allways clustered togather). (2) Consensus Cumulative Distribution Function (CDF) plot, This graphic lets one to determine at what number of clusters,k, the CDF reaches an approximate maximum. So consensus and cluster confidence is at a maximum at this k. (3) Based on CDF plot, a chart for relative change in area under CDF curve can be plotted. This plot allows a user to determine the relative increase in consensus and determine k at which there is no appreciable increase.

#_________________________________# Clustering & Cluster assignmnet validation _________________________________#
# finding optimal clusters by CC
library(CancerSubtypes)
cc.res=ExecuteCC(clusterNum=3, ## Refer below for more details on this
                 d=mad2k,
                 maxK=10,# maximum cluster number for Consensus Clustering Algorithm to evaluate
                 clusterAlg="pam",
                 distance="spearman",
                 title="UROMOL",
                 plot= pdf)

By clusterNum argument I have to provide the number of cluster that I am intrested to get. This is needed by the package CancerSubtypes. The main package for performin CC in R is ConsensusClusterPlus and does not need to specify cluster number. From where I got the number 4? From inspecting the CDF plots in the result folder (here "my/wd/UROMOL").Also I performed same analysis for datasets mad4k and mad6k. So the number for clusters seems to be 4. This is not in agreement with the original paper, where the authors reported three subtypes. However in new report from the same group using a new pipeline they conculded that NMIBC to have four diffrent subtypes! ref.

The above command with return two plots which is helful to make decision about cluster number: consensus CDF and relative change in area under CDF curve.

[4] Assessing cluster assignment

To be added.

Refrences

1- Biostar posts: https://www.biostars.org/p/321773/ https://www.biostars.org/p/225315/ https://www.biostars.org/p/74223/ https://www.biostars.org/p/281161/ https://www.biostars.org/p/273107/

2- https://www.datanovia.com/en/courses/partitional-clustering-in-r-the-essentials/

3- https://2-bitbio.com/2017/10/clustering-rnaseq-data-using-k-means.html

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
README.md		README.md
consensus011.png		consensus011.png
consensus012.png		consensus012.png
uromol_clinic.csv		uromol_clinic.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Gene-Expression-Unsupervised-Clusteing

[1] Data preparation,

[2] Dissimilarity matrix calculation,

[3] applying clustering algorithms,

[4] Assessing cluster assignment

[1] Data preparation

[2&3] Dissimilarity matrix calculation & applying clustering algorithms:

[4] Assessing cluster assignment

Refrences

About

Releases

Packages

mosaddekhussain/Gene-Expression-Unsupervised-Clusteing

Folders and files

Latest commit

History

Repository files navigation

Gene-Expression-Unsupervised-Clusteing

[1] Data preparation,

[2] Dissimilarity matrix calculation,

[3] applying clustering algorithms,

[4] Assessing cluster assignment

[1] Data preparation

[2&3] Dissimilarity matrix calculation & applying clustering algorithms:

[4] Assessing cluster assignment

Refrences

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages