ejh243 · ejh243 · Dec 19, 2024 · Dec 19, 2024 · Dec 19, 2024 · Dec 19, 2024
diff --git a/array/DNAm/preprocessing/QC.rmd b/array/DNAm/preprocessing/QC.rmd
@@ -813,51 +813,64 @@ dev.off()
 
 ## Sex Prediction
 
-Using the intensity values from probes located on the X and Y chromosomes, we calculate a fold change relative to intensity values from the autosomes. In females you would expect the fold change on the X chromosome to be greater than 1 and the Y chromosome less than 1, while males we would expect the fold change on the X chromosome to be less than 1 and the fold change on the Y chromosome to be greater than 1. Based on these assumptions we can predict male or female from each sex chromosome. 
 
-Let's look at the distribution of these fold change statistics. If you have > 1 sex you should see two peaks. You sometimes also see a third peak around 0, these are poor quality samples. The grey area on the graph highlights were the data are two ambigous to make a sex prediction. 
+```{r sexBimodalTest, echo = FALSE, eval = sexCheck}
+	bimodP.y<-dip.test(QCmetrics$y.cp)$p.value
+```
 
-```{r, plotSexChromosomeHist, echo = FALSE, fig.width = 12, fig.height=6}
-par(mfrow = c(1,2))
-hist(QCmetrics$x.cp, breaks = 25, main = "X chromosome FC")
-polygon(c(0.995,1.005,1.005,0.995), c(0,0,1000,1000), col = "grey")
-abline(v = 1)
-abline(v = 0.995, col = "red")
-abline(v = 1.005, col = "red")
-mtext("Female", side = 3, adj = 0.9, line = 0)
-mtext("Male", side = 3, adj = 0.1, line = 0)
-
-hist(QCmetrics$y.cp, breaks = 25, main = "Y chromosome FC")
-polygon(c(0.9,1.1,1.1,0.9), c(0,0,1000,1000), col = "grey")
-abline(v = 1)
-abline(v = 0.9, col = "red")
-abline(v = 1.1, col = "red")
-mtext("Male", side = 3, adj = 0.9, line = 0)
-mtext("Female", side = 3, adj = 0.1, line = 0)
 
-```
+Using the intensity values from probes located on the X and Y chromosomes, we calculate a fold change relative to intensity values from the autosomes. In females you would expect the fold change on the X chromosome to be greater than 1 and the Y chromosome less than 1, while males we would expect the fold change on the X chromosome to be less than 1 and the fold change on the Y chromosome to be greater than 1. 
 
-In this dataset `r sum(is.na(QCmetrics$predSex))` samples did not have sex predicted.
+Let's look at the distribution of these fold change statistics. If you have > 1 sex you should see two peaks. You sometimes also see a third peak around 0, these are poor quality samples. Evidence of multiple sexes in the dataset is needed to enable sex predictions to be performed. This is checked intially with a dip test for multimodality in the distribution of standardized y chromosome intensities. In the density plots below we are looking for evidence of multiple distributions or a non-unimodal distribution. Hartigans dip test for unimodality / multimodality, indicates we can `r if(bimodP.y < 0.05){I(print("accept"))} else {I(print("reject"))} ` the alternative hypothesis of multiple modes (P = `r signif(bimodP.y,3)`).
 
-```{r, plotSexChromosomeFC, echo = FALSE, fig.width = 6, fig.height=6}
-plot(QCmetrics$x.cp, QCmetrics$y.cp, type = "n", xlab = "X chromosome FC", ylab = "Y chromosome FC")
-polygon(c(0.995,1.005,1.005,0.995), c(0.9, 0.9, 1.1, 1.1), col = "grey")
-abline(v = 1)
-abline(h = 1, pch = 16)
-points(QCmetrics$x.cp, QCmetrics$y.cp, col = as.factor(QCmetrics$Sex), pch = 16)
 
+```{r plotSexCheck, echo = FALSE, eval = sexCheck}
 
-#plot(QCmetrics$x.cp, QCmetrics$M.median, pch = 16, xlab = "X chromosome FC", ylab = "median M intenisty")
+    if(bimodP.y < 0.05){
+        library(mixtools, warn.conflicts = FALSE, quietly = TRUE)
+        mixmdl.x = normalmixEM(QCmetrics$x.cp, mu = c(0.99, 1.01), sigma = c(0.05, 0.05))
+        mixmdl.y = normalmixEM(QCmetrics$y.cp, mu = c(0.3, 1.02), sigma = c(0.2, 0.2))
 
-```
+        # plot density of standardized x and y chromosome intensities
+        par(mfrow = c(1,2))
+        plot(mixmdl.x,which=2, main2 = "X chr", xlab2 = "Standardized Mean Intensity")
+        lines(density(QCmetrics$x.cp), lty=2, lwd=2)
+        plot(mixmdl.y,which=2, main2 = "Y chr", xlab2 = "Standardized Mean Intensity")
+        lines(density(QCmetrics$y.cp), lty=2, lwd=2)
 
+    } else {
+        par(mfrow = c(1,2))
+        plot(density(QCmetrics$x.cp), lty=2, lwd=2, main = "X chr", xlab = "Standardized Mean Intensity")
+        plot(density(QCmetrics$y.cp), lty=2, lwd=2, main = "X chr", xlab = "Standardized Mean Intensity")
 
-We will compare the sex predictions from the X and Y chromosomes.
+    }
+```
+
+If predictions were made, we will compare the sex predictions between the X and Y chromosomes.
 
 ```{r tabSexpredictions, echo = FALSE}
-pander(table(QCmetrics$predSex.x, QCmetrics$predSex.y))
+if(bimodP.y < 0.05){
+  pander(table(QCmetrics$predSex.x, QCmetrics$predSex.y))
+} else {
+  print("Data did not have enough evidence of multimodality to make sex predictions.")
+}
 ``` 
 
+
+Sex is predicted using these standardized sex chromosome intensities, by first fitting two distributions to the data, where one distribution is expected to be the females and the second distribution is the males. We can then calculate for each sample the posterior probability that they belong to each distribution. Whichever distribution they have a higher posterior probability for they get predicted as. Ideally these posterior probabilities will be near the extremes to enable a clean prediction. This process is done separately for X and Y chromosomes. We can visulaise the distribution of posterior probabilities to check that these are distinct enough for making sex predictions. 
+
+```{r plotSexPosteriorProbabilities, echo = FALSE, eval = sexCheck}
+
+    if(bimodP.y < 0.05){
+        # plot posterior probabilities of sex predictions
+        par(mfrow = c(1,2))
+        plot(density(mixmdl.x$posterior[,1]), main = "X chr", xlab = "Posterior Probability Male")
+        abline(v = 0.5)
+        plot(density(mixmdl.y$posterior[,2]), main = "Y chr", xlab = "Posterior Probability Male")
+        abline(v = 0.5)
+    }
+```
+
 Perform sex check against phenotype reported sexes: `r sexCheck`. 
 
 

diff --git a/array/DNAm/preprocessing/calcQCMetrics.r b/array/DNAm/preprocessing/calcQCMetrics.r
@@ -21,12 +21,12 @@
 # DEFINE PARAMETERS
 #----------------------------------------------------------------------#
 args<-commandArgs(trailingOnly = TRUE)
 dataDir <- args[1]
 refDir <- args[2]

 gdsFile <-paste0(dataDir, "/2_gds/raw.gds")
 qcData <-paste0(dataDir, "/2_gds/QCmetrics/QCmetrics.rdata")
 genoFile <- paste0(dataDir, "/0_metadata/epicSNPs.raw")
 configFile <- paste0(dataDir, "/config.r")

 source(configFile)
@@ -60,8 +60,6 @@
 # ensure sample sheet is in same order as data
 sampleSheet<-sampleSheet[match(colnames(gfile), sampleSheet$Basename),]
 
-
-
 ## see if any QC data already exists
 if(file.exists(qcData)){
 	load(qcData)
@@ -301,39 +299,46 @@
 	ints.X<-methylated(gfile)[x.probes,]+unmethylated(gfile)[x.probes,]
 	ints.Y<-methylated(gfile)[y.probes,]+unmethylated(gfile)[y.probes,]
 
-
 	x.cp<-colMeans(ints.X, na.rm = TRUE)/colMeans(ints.auto, na.rm = TRUE)
 	y.cp<-colMeans(ints.Y, na.rm = TRUE)/colMeans(ints.auto, na.rm = TRUE)
 
+	# test for evidence of two groups (i.e. catch for if dataset only has one sex)
+	# result is identical for both chromosomes
+	# needs at least ~15 samples for this to work
+	library(diptest, warn.conflicts = FALSE, quietly = TRUE)
+	bimodP.y<-dip.test(y.cp)$p.value
 
-	if(arrayType == "V2"){
-		# base prediction on y chromosome
-		predSex.y<-rep(NA, length(y.cp))
-		predSex.y[which(y.cp > 1.0 & intensPASS == TRUE)]<-"M"
-		predSex.y[which(y.cp < 0.5 & intensPASS == TRUE)]<-"F"
+	if(bimodP.y < 0.05){
+		print("Intensities on X & Y chromosomes are multimodal. Sufficient evidence of more than one sex in data, proceeding to fit two distributions to the data to make predictions")
 
-		# base prediction on x chromosome
-		predSex.x<-rep(NA, length(x.cp))
-		predSex.x[which(x.cp < 1 & intensPASS == TRUE)]<-"M"
-		predSex.x[which(x.cp > 1.01 & intensPASS == TRUE)]<-"F"
-	} else {
+		library(mixtools, warn.conflicts = FALSE, quietly = TRUE)
+		mixmdl.x = normalmixEM(x.cp, mu = c(0.99, 1.01), sigma = c(0.05, 0.05))
+		mixmdl.y = normalmixEM(y.cp, mu = c(0.3, 1.02), sigma = c(0.2, 0.2))
 
-		# base prediction on y chromosome
+		posteriorProbs<-cbind(mixmdl.x$posterior, mixmdl.y$posterior)
+		colnames(posteriorProbs)<-c("PP.M.X", "PP.F.X", "PP.F.Y", "PP.M.Y")
+		# base prediction on y chromosome posterior probability
 		predSex.y<-rep(NA, length(y.cp))
-		predSex.y[which(y.cp > 1.1 & intensPASS == TRUE)]<-"M"
-		predSex.y[which(y.cp < 0.9 & intensPASS == TRUE)]<-"F"
+		predSex.y[which(posteriorProbs[,"PP.M.Y"] > 0.5 & intensPASS == TRUE)]<-"M"
+		predSex.y[which(posteriorProbs[,"PP.F.Y"] > 0.5 & intensPASS == TRUE)]<-"F"
 
-		# base prediction on x chromosome
+		# base prediction on x chromosome posterior probability
 		predSex.x<-rep(NA, length(x.cp))
-		predSex.x[which(x.cp < 0.996 & intensPASS == TRUE)]<-"M"
-		predSex.x[which(x.cp > 1.005 & intensPASS == TRUE)]<-"F"
+		predSex.x[which(posteriorProbs[,"PP.M.X"] > 0.5 & intensPASS == TRUE)]<-"M"
+		predSex.x[which(posteriorProbs[,"PP.F.X"] > 0.5 & intensPASS == TRUE)]<-"F"
+
+		# check for consistent prediction
+		predSex<-rep(NA, length(x.cp))
+		predSex[which(predSex.x == predSex.y)]<-predSex.x[which(predSex.x == predSex.y)]
+		QCmetrics<-cbind(QCmetrics,x.cp,y.cp,posteriorProbs,predSex.x, predSex.y, predSex)
+	} else{
+		print("Intensities on X & Y chromosomes are unimodal. This would suggest just one sex in the data, a sample sample size or very few of one sex. Bypassing prediction step")
+		predSex<-rep(NA, length(x.cp))
+		predSex.x<-rep(NA, length(x.cp))
+		predSex.y<-rep(NA, length(x.cp))
+		QCmetrics<-cbind(QCmetrics,x.cp,y.cp,predSex.x, predSex.y, predSex)
 
 	}
-
-	# check for consistent prediction
-	predSex<-rep(NA, length(x.cp))
-	predSex[which(predSex.x == predSex.y)]<-predSex.x[which(predSex.x == predSex.y)]
-	QCmetrics<-cbind(QCmetrics,x.cp,y.cp,predSex.x, predSex.y, predSex)
 }
 
 #----------------------------------------------------------------------#

diff --git a/array/DNAm/preprocessing/rmarkdownChild/sexCheck.rmd b/array/DNAm/preprocessing/rmarkdownChild/sexCheck.rmd
@@ -1,6 +1,6 @@
 
-`r sum(!is.na(QCmetrics$Sex))` (`r sum(!is.na(QCmetrics$Sex))/nrow(QCmetrics)*100`%) samples have sex provided in the phenotype file for comparison with predicted sex. We will compare the sex predictions to that provided in the phenotype file. 
 
+`r sum(!is.na(QCmetrics$Sex))` (`r sum(!is.na(QCmetrics$Sex))/nrow(QCmetrics)*100`%) samples have sex provided in the phenotype file for comparison with predicted sex. We will compare the sex predictions to that provided in the phenotype file. 
 
 ```{r tabSexReported, echo = FALSE, eval = sexCheck}
 pander(table(QCmetrics$predSex, QCmetrics$Sex))