Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

152 bug dnam qc pipeline sex check thresholds #284

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
75 changes: 44 additions & 31 deletions array/DNAm/preprocessing/QC.rmd
Original file line number Diff line number Diff line change
Expand Up @@ -813,51 +813,64 @@ dev.off()

## Sex Prediction

Using the intensity values from probes located on the X and Y chromosomes, we calculate a fold change relative to intensity values from the autosomes. In females you would expect the fold change on the X chromosome to be greater than 1 and the Y chromosome less than 1, while males we would expect the fold change on the X chromosome to be less than 1 and the fold change on the Y chromosome to be greater than 1. Based on these assumptions we can predict male or female from each sex chromosome.

Let's look at the distribution of these fold change statistics. If you have > 1 sex you should see two peaks. You sometimes also see a third peak around 0, these are poor quality samples. The grey area on the graph highlights were the data are two ambigous to make a sex prediction.
```{r sexBimodalTest, echo = FALSE, eval = sexCheck}
bimodP.y<-dip.test(QCmetrics$y.cp)$p.value
```

```{r, plotSexChromosomeHist, echo = FALSE, fig.width = 12, fig.height=6}
par(mfrow = c(1,2))
hist(QCmetrics$x.cp, breaks = 25, main = "X chromosome FC")
polygon(c(0.995,1.005,1.005,0.995), c(0,0,1000,1000), col = "grey")
abline(v = 1)
abline(v = 0.995, col = "red")
abline(v = 1.005, col = "red")
mtext("Female", side = 3, adj = 0.9, line = 0)
mtext("Male", side = 3, adj = 0.1, line = 0)

hist(QCmetrics$y.cp, breaks = 25, main = "Y chromosome FC")
polygon(c(0.9,1.1,1.1,0.9), c(0,0,1000,1000), col = "grey")
abline(v = 1)
abline(v = 0.9, col = "red")
abline(v = 1.1, col = "red")
mtext("Male", side = 3, adj = 0.9, line = 0)
mtext("Female", side = 3, adj = 0.1, line = 0)

```
Using the intensity values from probes located on the X and Y chromosomes, we calculate a fold change relative to intensity values from the autosomes. In females you would expect the fold change on the X chromosome to be greater than 1 and the Y chromosome less than 1, while males we would expect the fold change on the X chromosome to be less than 1 and the fold change on the Y chromosome to be greater than 1.

In this dataset `r sum(is.na(QCmetrics$predSex))` samples did not have sex predicted.
Let's look at the distribution of these fold change statistics. If you have > 1 sex you should see two peaks. You sometimes also see a third peak around 0, these are poor quality samples. Evidence of multiple sexes in the dataset is needed to enable sex predictions to be performed. This is checked intially with a dip test for multimodality in the distribution of standardized y chromosome intensities. In the density plots below we are looking for evidence of multiple distributions or a non-unimodal distribution. Hartigans dip test for unimodality / multimodality, indicates we can `r if(bimodP.y < 0.05){I(print("accept"))} else {I(print("reject"))} ` the alternative hypothesis of multiple modes (P = `r signif(bimodP.y,3)`).

```{r, plotSexChromosomeFC, echo = FALSE, fig.width = 6, fig.height=6}
plot(QCmetrics$x.cp, QCmetrics$y.cp, type = "n", xlab = "X chromosome FC", ylab = "Y chromosome FC")
polygon(c(0.995,1.005,1.005,0.995), c(0.9, 0.9, 1.1, 1.1), col = "grey")
abline(v = 1)
abline(h = 1, pch = 16)
points(QCmetrics$x.cp, QCmetrics$y.cp, col = as.factor(QCmetrics$Sex), pch = 16)

```{r plotSexCheck, echo = FALSE, eval = sexCheck}

#plot(QCmetrics$x.cp, QCmetrics$M.median, pch = 16, xlab = "X chromosome FC", ylab = "median M intenisty")
if(bimodP.y < 0.05){
library(mixtools, warn.conflicts = FALSE, quietly = TRUE)
mixmdl.x = normalmixEM(QCmetrics$x.cp, mu = c(0.99, 1.01), sigma = c(0.05, 0.05))
mixmdl.y = normalmixEM(QCmetrics$y.cp, mu = c(0.3, 1.02), sigma = c(0.2, 0.2))

```
# plot density of standardized x and y chromosome intensities
par(mfrow = c(1,2))
plot(mixmdl.x,which=2, main2 = "X chr", xlab2 = "Standardized Mean Intensity")
lines(density(QCmetrics$x.cp), lty=2, lwd=2)
plot(mixmdl.y,which=2, main2 = "Y chr", xlab2 = "Standardized Mean Intensity")
lines(density(QCmetrics$y.cp), lty=2, lwd=2)

} else {
par(mfrow = c(1,2))
plot(density(QCmetrics$x.cp), lty=2, lwd=2, main = "X chr", xlab = "Standardized Mean Intensity")
plot(density(QCmetrics$y.cp), lty=2, lwd=2, main = "X chr", xlab = "Standardized Mean Intensity")

We will compare the sex predictions from the X and Y chromosomes.
}
```

If predictions were made, we will compare the sex predictions between the X and Y chromosomes.

```{r tabSexpredictions, echo = FALSE}
pander(table(QCmetrics$predSex.x, QCmetrics$predSex.y))
if(bimodP.y < 0.05){
pander(table(QCmetrics$predSex.x, QCmetrics$predSex.y))
} else {
print("Data did not have enough evidence of multimodality to make sex predictions.")
}
```


Sex is predicted using these standardized sex chromosome intensities, by first fitting two distributions to the data, where one distribution is expected to be the females and the second distribution is the males. We can then calculate for each sample the posterior probability that they belong to each distribution. Whichever distribution they have a higher posterior probability for they get predicted as. Ideally these posterior probabilities will be near the extremes to enable a clean prediction. This process is done separately for X and Y chromosomes. We can visulaise the distribution of posterior probabilities to check that these are distinct enough for making sex predictions.

```{r plotSexPosteriorProbabilities, echo = FALSE, eval = sexCheck}

if(bimodP.y < 0.05){
# plot posterior probabilities of sex predictions
par(mfrow = c(1,2))
plot(density(mixmdl.x$posterior[,1]), main = "X chr", xlab = "Posterior Probability Male")
abline(v = 0.5)
plot(density(mixmdl.y$posterior[,2]), main = "Y chr", xlab = "Posterior Probability Male")
abline(v = 0.5)
}
```

Perform sex check against phenotype reported sexes: `r sexCheck`.


Expand Down
53 changes: 29 additions & 24 deletions array/DNAm/preprocessing/calcQCMetrics.r
Original file line number Diff line number Diff line change
Expand Up @@ -21,12 +21,12 @@
# DEFINE PARAMETERS
#----------------------------------------------------------------------#
args<-commandArgs(trailingOnly = TRUE)
dataDir <- args[1]

Check warning on line 24 in array/DNAm/preprocessing/calcQCMetrics.r

View workflow job for this annotation

GitHub Actions / Lint code base

file=/github/workspace/array/DNAm/preprocessing/calcQCMetrics.r,line=24,col=16,[extraction_operator_linter] Use `[[` instead of `[` to extract an element.
refDir <- args[2]

Check warning on line 25 in array/DNAm/preprocessing/calcQCMetrics.r

View workflow job for this annotation

GitHub Actions / Lint code base

file=/github/workspace/array/DNAm/preprocessing/calcQCMetrics.r,line=25,col=15,[extraction_operator_linter] Use `[[` instead of `[` to extract an element.

gdsFile <-paste0(dataDir, "/2_gds/raw.gds")

Check warning on line 27 in array/DNAm/preprocessing/calcQCMetrics.r

View workflow job for this annotation

GitHub Actions / Lint code base

file=/github/workspace/array/DNAm/preprocessing/calcQCMetrics.r,line=27,col=11,[paste_linter] Construct file paths with file.path(...) instead of paste0(x, "/", y, "/", z). Note that paste() converts empty inputs to "", whereas file.path() leaves it empty.

Check warning on line 27 in array/DNAm/preprocessing/calcQCMetrics.r

View workflow job for this annotation

GitHub Actions / Lint code base

file=/github/workspace/array/DNAm/preprocessing/calcQCMetrics.r,line=27,col=28,[absolute_path_linter] Do not use absolute paths.

Check warning on line 27 in array/DNAm/preprocessing/calcQCMetrics.r

View workflow job for this annotation

GitHub Actions / Lint code base

file=/github/workspace/array/DNAm/preprocessing/calcQCMetrics.r,line=27,col=28,[nonportable_path_linter] Use file.path() to construct portable file paths.
qcData <-paste0(dataDir, "/2_gds/QCmetrics/QCmetrics.rdata")

Check warning on line 28 in array/DNAm/preprocessing/calcQCMetrics.r

View workflow job for this annotation

GitHub Actions / Lint code base

file=/github/workspace/array/DNAm/preprocessing/calcQCMetrics.r,line=28,col=10,[paste_linter] Construct file paths with file.path(...) instead of paste0(x, "/", y, "/", z). Note that paste() converts empty inputs to "", whereas file.path() leaves it empty.

Check warning on line 28 in array/DNAm/preprocessing/calcQCMetrics.r

View workflow job for this annotation

GitHub Actions / Lint code base

file=/github/workspace/array/DNAm/preprocessing/calcQCMetrics.r,line=28,col=27,[absolute_path_linter] Do not use absolute paths.

Check warning on line 28 in array/DNAm/preprocessing/calcQCMetrics.r

View workflow job for this annotation

GitHub Actions / Lint code base

file=/github/workspace/array/DNAm/preprocessing/calcQCMetrics.r,line=28,col=27,[nonportable_path_linter] Use file.path() to construct portable file paths.
genoFile <- paste0(dataDir, "/0_metadata/epicSNPs.raw")

Check warning on line 29 in array/DNAm/preprocessing/calcQCMetrics.r

View workflow job for this annotation

GitHub Actions / Lint code base

file=/github/workspace/array/DNAm/preprocessing/calcQCMetrics.r,line=29,col=13,[paste_linter] Construct file paths with file.path(...) instead of paste0(x, "/", y, "/", z). Note that paste() converts empty inputs to "", whereas file.path() leaves it empty.

Check warning on line 29 in array/DNAm/preprocessing/calcQCMetrics.r

View workflow job for this annotation

GitHub Actions / Lint code base

file=/github/workspace/array/DNAm/preprocessing/calcQCMetrics.r,line=29,col=30,[absolute_path_linter] Do not use absolute paths.
configFile <- paste0(dataDir, "/config.r")

source(configFile)
Expand Down Expand Up @@ -60,8 +60,6 @@
# ensure sample sheet is in same order as data
sampleSheet<-sampleSheet[match(colnames(gfile), sampleSheet$Basename),]



## see if any QC data already exists
if(file.exists(qcData)){
load(qcData)
Expand Down Expand Up @@ -301,39 +299,46 @@
ints.X<-methylated(gfile)[x.probes,]+unmethylated(gfile)[x.probes,]
ints.Y<-methylated(gfile)[y.probes,]+unmethylated(gfile)[y.probes,]


x.cp<-colMeans(ints.X, na.rm = TRUE)/colMeans(ints.auto, na.rm = TRUE)
y.cp<-colMeans(ints.Y, na.rm = TRUE)/colMeans(ints.auto, na.rm = TRUE)

# test for evidence of two groups (i.e. catch for if dataset only has one sex)
# result is identical for both chromosomes
# needs at least ~15 samples for this to work
library(diptest, warn.conflicts = FALSE, quietly = TRUE)
bimodP.y<-dip.test(y.cp)$p.value

if(arrayType == "V2"){
# base prediction on y chromosome
predSex.y<-rep(NA, length(y.cp))
predSex.y[which(y.cp > 1.0 & intensPASS == TRUE)]<-"M"
predSex.y[which(y.cp < 0.5 & intensPASS == TRUE)]<-"F"
if(bimodP.y < 0.05){
print("Intensities on X & Y chromosomes are multimodal. Sufficient evidence of more than one sex in data, proceeding to fit two distributions to the data to make predictions")

# base prediction on x chromosome
predSex.x<-rep(NA, length(x.cp))
predSex.x[which(x.cp < 1 & intensPASS == TRUE)]<-"M"
predSex.x[which(x.cp > 1.01 & intensPASS == TRUE)]<-"F"
} else {
library(mixtools, warn.conflicts = FALSE, quietly = TRUE)
mixmdl.x = normalmixEM(x.cp, mu = c(0.99, 1.01), sigma = c(0.05, 0.05))
mixmdl.y = normalmixEM(y.cp, mu = c(0.3, 1.02), sigma = c(0.2, 0.2))

# base prediction on y chromosome
posteriorProbs<-cbind(mixmdl.x$posterior, mixmdl.y$posterior)
colnames(posteriorProbs)<-c("PP.M.X", "PP.F.X", "PP.F.Y", "PP.M.Y")
# base prediction on y chromosome posterior probability
predSex.y<-rep(NA, length(y.cp))
predSex.y[which(y.cp > 1.1 & intensPASS == TRUE)]<-"M"
predSex.y[which(y.cp < 0.9 & intensPASS == TRUE)]<-"F"
predSex.y[which(posteriorProbs[,"PP.M.Y"] > 0.5 & intensPASS == TRUE)]<-"M"
predSex.y[which(posteriorProbs[,"PP.F.Y"] > 0.5 & intensPASS == TRUE)]<-"F"

# base prediction on x chromosome
# base prediction on x chromosome posterior probability
predSex.x<-rep(NA, length(x.cp))
predSex.x[which(x.cp < 0.996 & intensPASS == TRUE)]<-"M"
predSex.x[which(x.cp > 1.005 & intensPASS == TRUE)]<-"F"
predSex.x[which(posteriorProbs[,"PP.M.X"] > 0.5 & intensPASS == TRUE)]<-"M"
predSex.x[which(posteriorProbs[,"PP.F.X"] > 0.5 & intensPASS == TRUE)]<-"F"

# check for consistent prediction
predSex<-rep(NA, length(x.cp))
predSex[which(predSex.x == predSex.y)]<-predSex.x[which(predSex.x == predSex.y)]
QCmetrics<-cbind(QCmetrics,x.cp,y.cp,posteriorProbs,predSex.x, predSex.y, predSex)
} else{
print("Intensities on X & Y chromosomes are unimodal. This would suggest just one sex in the data, a sample sample size or very few of one sex. Bypassing prediction step")
predSex<-rep(NA, length(x.cp))
predSex.x<-rep(NA, length(x.cp))
predSex.y<-rep(NA, length(x.cp))
QCmetrics<-cbind(QCmetrics,x.cp,y.cp,predSex.x, predSex.y, predSex)

}

# check for consistent prediction
predSex<-rep(NA, length(x.cp))
predSex[which(predSex.x == predSex.y)]<-predSex.x[which(predSex.x == predSex.y)]
QCmetrics<-cbind(QCmetrics,x.cp,y.cp,predSex.x, predSex.y, predSex)
}

#----------------------------------------------------------------------#
Expand Down
2 changes: 1 addition & 1 deletion array/DNAm/preprocessing/rmarkdownChild/sexCheck.rmd
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@

`r sum(!is.na(QCmetrics$Sex))` (`r sum(!is.na(QCmetrics$Sex))/nrow(QCmetrics)*100`%) samples have sex provided in the phenotype file for comparison with predicted sex. We will compare the sex predictions to that provided in the phenotype file.

`r sum(!is.na(QCmetrics$Sex))` (`r sum(!is.na(QCmetrics$Sex))/nrow(QCmetrics)*100`%) samples have sex provided in the phenotype file for comparison with predicted sex. We will compare the sex predictions to that provided in the phenotype file.

```{r tabSexReported, echo = FALSE, eval = sexCheck}
pander(table(QCmetrics$predSex, QCmetrics$Sex))
Expand Down
Loading