refactor: use consistent config file within DNAm QC pipeline #293

Open
wants to merge 2 commits into
base: master
4 changes: 2 additions & 2 deletions array/DNAm/preprocessing/CETYGOdeconvolution.r
@@ -21,10 +21,10 @@
#----------------------------------------------------------------------#

args<-commandArgs(trailingOnly = TRUE)
-dataDir <- args[1]
+dataDir <- args[[1]]
+configFile <- args[[2]]

gdsFile <-paste0(dataDir, "/2_gds/raw.gds")

Check warning on line 27 in array/DNAm/preprocessing/CETYGOdeconvolution.r (GitHub Actions / Lint code base):
col=11, [paste_linter] Construct file paths with file.path(...) instead of paste0(x, "/", y, "/", z). Note that paste() converts empty inputs to "", whereas file.path() leaves it empty.
col=28, [absolute_path_linter] Do not use absolute paths.
col=28, [nonportable_path_linter] Use file.path() to construct portable file paths.
-configFile <- paste0(dataDir, "/config.r")

source(configFile)

@@ -50,17 +50,17 @@
setwd(dataDir)

# load sample sheet
sampleSheet<-read.csv("0_metadata/sampleSheet.csv", na.strings = c("", "NA"), stringsAsFactors = FALSE)

Check warning on line 53 in array/DNAm/preprocessing/CETYGOdeconvolution.r (GitHub Actions / Lint code base):
col=24, [nonportable_path_linter] Use file.path() to construct portable file paths.
# if no column Basename, creates from columns Chip.ID and Chip.Location
if(!"Basename" %in% colnames(sampleSheet)){
sampleSheet$Basename<-paste(sampleSheet$Chip.ID, sampleSheet$Chip.Location, sep = "_")

Check warning on line 56 in array/DNAm/preprocessing/CETYGOdeconvolution.r (GitHub Actions / Lint code base):
col=13, [extraction_operator_linter] Use `[[` instead of `$` to extract an element.
col=41, [extraction_operator_linter] Use `[[` instead of `$` to extract an element.
col=62, [extraction_operator_linter] Use `[[` instead of `$` to extract an element.
}
sampleSheet$Cell_Type <- as.factor(sampleSheet$Cell_Type)

Check warning on line 58 in array/DNAm/preprocessing/CETYGOdeconvolution.r (GitHub Actions / Lint code base):
col=12, [extraction_operator_linter] Use `[[` instead of `$` to extract an element.
col=47, [extraction_operator_linter] Use `[[` instead of `$` to extract an element.


gfile<-openfn.gds(gdsFile, readonly = FALSE, allow.fork = TRUE)
# ensure sample sheet is in same order as data
sampleSheet<-sampleSheet[match(colnames(gfile), sampleSheet$Basename),]

Check warning on line 63 in array/DNAm/preprocessing/CETYGOdeconvolution.r (GitHub Actions / Lint code base):
col=60, [extraction_operator_linter] Use `[[` instead of `$` to extract an element.
# extract a few useful matrices
if(arrayType == "V2"){
rawbetas<-epicv2clean(betas(gfile)[])
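
The `paste_linter` and `nonportable_path_linter` warnings on line 27 concern how `gdsFile` is assembled. A minimal sketch of the difference, assuming a placeholder `dataDir` value that is not taken from the pipeline:

```r
# Hypothetical data directory, used only for illustration.
dataDir <- "project/data"

# Pattern flagged by the linter: separators are embedded in the string.
gdsFilePaste <- paste0(dataDir, "/2_gds/raw.gds")

# Pattern suggested by the linter: one argument per path component.
gdsFilePath <- file.path(dataDir, "2_gds", "raw.gds")

# Both forms give the same string when the separators are consistent.
stopifnot(identical(gdsFilePaste, gdsFilePath))

# A leading slash inside a component is kept verbatim by file.path().
doubled <- file.path(dataDir, "/2_gds/raw.gds")
print(doubled)
```

The last call mirrors the `file.path(dataDir, "/2_gds/raw.gds")` form that appears in normalisation.r in this diff: the leading slash is kept, so the result contains a doubled separator.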
15 changes: 7 additions & 8 deletions array/DNAm/preprocessing/QC.rmd
@@ -23,16 +23,15 @@ library(RColorBrewer, warn.conflicts = FALSE, quietly = TRUE)
library(pheatmap, warn.conflicts = FALSE, quietly = TRUE)
library(data.table, warn.conflicts = FALSE, quietly = TRUE)


-source(args[4]) ### change the content of this file to run QC with different thresholds
### prior to running this Rmarkdown which summarises the QC output, QC metrics must have been generated

-dataDir <- args[2]
-refDir <- args[3]
-most_recent_git_tag <- args[6]
-current_commit_hash <- args[7]
+dataDir <- args[[2]]
+refDir <- args[[3]]
+configFile <- args[[4]]
+most_recent_git_tag <- args[[6]]
+current_commit_hash <- args[[7]]
setwd(dataDir)

+source(configFile)

qcData <-paste0(dataDir, "/2_gds/QCmetrics/QCmetrics.rdata")

load(qcData)
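
Every script in this PR moves from `args[1]` to `args[[1]]`. A short sketch of how the two forms behave when a positional argument is missing; the `args` vector below is a hypothetical stand-in for `commandArgs(trailingOnly = TRUE)`:

```r
# Hypothetical stand-in for commandArgs(trailingOnly = TRUE).
args <- c("data", "ref", "config.r")

# Single brackets return NA for a position that was not supplied.
missingSingle <- args[4]
print(is.na(missingSingle))

# Double brackets raise an error for the same position; the message is
# captured here so the script keeps running.
missingDouble <- tryCatch(args[[4]], error = function(e) conditionMessage(e))
print(missingDouble)
```

With `[[`, a missing argument stops the script at the assignment rather than passing `NA` on to a later `source()` or `paste0()` call.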
6 changes: 3 additions & 3 deletions array/DNAm/preprocessing/calcQCMetrics.r
@@ -21,13 +21,13 @@
# DEFINE PARAMETERS
#----------------------------------------------------------------------#
args<-commandArgs(trailingOnly = TRUE)
-dataDir <- args[1]
-refDir <- args[2]
+dataDir <- args[[1]]
+refDir <- args[[2]]
+configFile <- args[[3]]

gdsFile <-paste0(dataDir, "/2_gds/raw.gds")
qcData <-paste0(dataDir, "/2_gds/QCmetrics/QCmetrics.rdata")
genoFile <- paste0(dataDir, "/0_metadata/epicSNPs.raw")
-configFile <- paste0(dataDir, "/config.r")

source(configFile)

88 changes: 43 additions & 45 deletions array/DNAm/preprocessing/checkColnamesSampleSheet.r
@@ -1,80 +1,78 @@


## This script checks that sample sheet columns are formatted correctly prior to DNAm QC ##


args <- commandArgs(trailingOnly = TRUE)
-dataDir <- args[1]
-configFile <- paste0(dataDir, "/config.r")
+configFile <- args[[1]]

# Load libraries
library(stringdist, warn.conflicts = FALSE, quietly = TRUE) # for amatch()
-'%ni%' <- Negate('%in%') # define '%ni%' (not in)
+"%ni%" <- Negate("%in%") # define '%ni%' (not in)

# Load sample sheet
sampleSheet <- read.csv(paste0(dataDir, "/0_metadata/sampleSheet.csv"), na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Column names to test
-req_cols <- c('Sample_ID','Individual_ID') # required column names
-bsnm_cols <- c('Basename','Chip_ID','Chip_Location','Sentrix_ID','Sentrix_Position') # required when Basename not present
-opt_cols <- c('Age') # optional column names
-cond_cols <- c('Sex','Genotype_IID','Cell_Type') # conditionally column names
+req_cols <- c("Sample_ID", "Individual_ID") # required column names
+bsnm_cols <- c("Basename", "Chip_ID", "Chip_Location", "Sentrix_ID", "Sentrix_Position") # required when Basename not present
+opt_cols <- c("Age") # optional column names
+cond_cols <- c("Sex", "Genotype_IID", "Cell_Type") # conditionally column names

# source checkColnames()
source("checkColnamesFunction.r")


-#1. Check required column names ------------------------------------------------------------------------------
+# 1. Check required column names ------------------------------------------------------------------------------
cat("1. Checking required column names: ")
-cat(c('Basename',req_cols,'\n'))
+cat(c("Basename", req_cols, "\n"))

# check Basename first
-b <- checkColnames(sampleSheet, bsnm_cols[1], type='Required', verbose=F)
-
-if(b$allPresent){
-# if Basename present, continue to check Basename alongside other required columns
-checkColnames(sampleSheet, c('Basename',req_cols), type='Required')
-}else{
-# if Basename not present, check Chip and Sentrix as alternatives
-chip <- checkColnames(sampleSheet, bsnm_cols[2:3], type='Required', verbose=F)
-sntrx <- checkColnames(sampleSheet, bsnm_cols[4:5], type='Required', verbose=F)
-# if either Chip or Sentrix present, continue to check other required columns
-if(any(chip$allPresent | sntrx$allPresent)){
-cat("Basename column not found, but at least 2 of the following alternative columns are present: ", bsnm_cols[2:5],'\n')
-cat("Checking remaining required columns",'\n')
-checkColnames(sampleSheet, req_cols, type='Required')
-}else{
-cat("Basename column not found, and neither set of alternative column names are present: ", bsnm_cols[2:5],'\n')
-cat("Checking remaining required columns",'\n')
-checkColnames(sampleSheet, req_cols, type='Required')
-}
+b <- checkColnames(sampleSheet, bsnm_cols[1], type = "Required", verbose = F)
+
+if (b$allPresent) {
+# if Basename present, continue to check Basename alongside other required columns
+checkColnames(sampleSheet, c("Basename", req_cols), type = "Required")
+} else {
+# if Basename not present, check Chip and Sentrix as alternatives
+chip <- checkColnames(sampleSheet, bsnm_cols[2:3], type = "Required", verbose = F)
+sntrx <- checkColnames(sampleSheet, bsnm_cols[4:5], type = "Required", verbose = F)
+
+# if either Chip or Sentrix present, continue to check other required columns
+if (any(chip$allPresent | sntrx$allPresent)) {
+cat("Basename column not found, but at least 2 of the following alternative columns are present: ", bsnm_cols[2:5], "\n")
+cat("Checking remaining required columns", "\n")
+checkColnames(sampleSheet, req_cols, type = "Required")
+} else {
+cat("Basename column not found, and neither set of alternative column names are present: ", bsnm_cols[2:5], "\n")
+cat("Checking remaining required columns", "\n")
+checkColnames(sampleSheet, req_cols, type = "Required")
+}
}




-#2. Check optional column names ------------------------------------------------------------------------------
+# 2. Check optional column names ------------------------------------------------------------------------------
cat("2. Checking optional column names: ")
-cat(opt_cols,'\n')
-checkColnames(sampleSheet, opt_cols, type='Optional')
+cat(opt_cols, "\n")
+checkColnames(sampleSheet, opt_cols, type = "Optional")


-#3. Check conditional column names ---------------------------------------------------------------------------
-cat("Sourcing conditional variables from config.r",'\n')
+# 3. Check conditional column names ---------------------------------------------------------------------------
+cat("Sourcing conditional variables from config.r", "\n")
source(configFile)
-cond_status <- c(sexCheck,snpCheck,ctCheck) # T/Fs from config file
-cat(paste0(c("sexCheck","snpCheck","ctCheck"),"=", cond_status),'\n')
+cond_status <- c(sexCheck, snpCheck, ctCheck) # T/Fs from config file
+cat(paste0(c("sexCheck", "snpCheck", "ctCheck"), "=", cond_status), "\n")

# subset conditional colnames to those TRUE in config
cond_cols.filtered <- cond_cols[cond_status]

-if(all(cond_status==F)){
-cat("No conditional variables to check")
-}else{
-cat("3. Checking conditional column names: ",'\n')
-cat(cond_cols.filtered,'\n')
-checkColnames(sampleSheet, cond_cols.filtered, type='Conditional')
+if (all(cond_status == F)) {
+cat("No conditional variables to check")
+} else {
+cat("3. Checking conditional column names: ", "\n")
+cat(cond_cols.filtered, "\n")
+checkColnames(sampleSheet, cond_cols.filtered, type = "Conditional")
}
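
The conditional check above relies on logical subsetting: `cond_cols[cond_status]` keeps only the columns whose flag is `TRUE`. A minimal sketch, assuming example flag values rather than those in any real config.r:

```r
cond_cols <- c("Sex", "Genotype_IID", "Cell_Type")

# Hypothetical flags of the kind config.r is expected to define.
sexCheck <- TRUE
snpCheck <- FALSE
ctCheck <- TRUE

# Flags are combined in the same order as cond_cols.
cond_status <- c(sexCheck, snpCheck, ctCheck)

# Logical subsetting drops the columns whose flag is FALSE.
cond_cols.filtered <- cond_cols[cond_status]
print(cond_cols.filtered)

# For a logical vector without NAs, !any(x) matches all(x == FALSE).
stopifnot(identical(!any(cond_status), all(cond_status == FALSE)))
```

The subsetting only lines up if the flags are listed in the same order as `cond_cols`, which is the case in the script.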

# ---------------------------------------------------------------------------------------------------------- #
# ---------------------------------------------------------------------------------------------------------- #
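
In this revision the script receives only the config path, yet the sample sheet is still read via `dataDir`, which is no longer assigned from `args`. One possible arrangement, purely as an assumption about how the two values could be passed together (the job script in this PR passes only `${RCONFIG}`), is sketched below with a hypothetical `args` vector:

```r
# Hypothetical argument layout: data directory first, config file second.
# `args` stands in for commandArgs(trailingOnly = TRUE).
args <- c("data", "config.r")

dataDir <- args[[1]]
configFile <- args[[2]]

# Sample sheet location built from dataDir, one component per argument.
sampleSheetPath <- file.path(dataDir, "0_metadata", "sampleSheet.csv")
print(sampleSheetPath)
```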

3 changes: 1 addition & 2 deletions array/DNAm/preprocessing/checkRconfigFile.r
@@ -22,8 +22,7 @@
print("checking config.r file parameters are present and correctly formatted...")

args <- commandArgs(trailingOnly = TRUE)
-dataDir <- args[1]
-configFile <- file.path(dataDir, "config.r")
+configFile <- args[[1]]

source(configFile)
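
Because the config path now arrives as a free-standing argument, it may be worth checking it before `source()` runs. A minimal sketch, assuming a hypothetical helper `getConfigPath()` that is not part of the pipeline, demonstrated with a temporary file standing in for config.r:

```r
# Hypothetical helper (not part of the pipeline): returns the config path
# found at `position` in `args`, stopping early if it is absent or missing.
getConfigPath <- function(args, position = 1) {
  if (length(args) < position) {
    stop("expected a config file path as argument ", position)
  }
  configFile <- args[[position]]
  if (!file.exists(configFile)) {
    stop("config file not found: ", configFile)
  }
  configFile
}

# Temporary file standing in for config.r; its content is illustrative.
tmpConfig <- tempfile(fileext = ".r")
writeLines("arrayType <- 'V2'", tmpConfig)

configFile <- getConfigPath(c(tmpConfig))
source(configFile)
print(arrayType)
```

An uncaught `stop()` makes `Rscript` exit with a non-zero status, which is what the `$?` checks in 1_runDNAmQC.sh test for.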

6 changes: 3 additions & 3 deletions array/DNAm/preprocessing/clusterCellTypes.r
@@ -18,14 +18,14 @@
#----------------------------------------------------------------------#

args<-commandArgs(trailingOnly = TRUE)
-dataDir <- args[1]
-refDir <- args[2]
+dataDir <- args[[1]]
+refDir <- args[[2]]
+configFile <- args[[3]]

gdsFile <-paste0(dataDir, "/2_gds/raw.gds")
qcOutFolder<-paste0(dataDir, "/2_gds/QCmetrics")
qcData <-paste0(dataDir, "/2_gds/QCmetrics/QCmetrics.rdata")
genoFile <- paste0(dataDir, "/0_metadata/epicSNPs.raw")
-configFile <- paste0(dataDir, "/config.r")

source(configFile)

14 changes: 7 additions & 7 deletions array/DNAm/preprocessing/jobSubmission/1_runDNAmQC.sh
@@ -52,7 +52,7 @@ module load $RVERS # load specified R version

cd ${SCRIPTSDIR}/array/DNAm/preprocessing/

-Rscript checkRconfigFile.r ${DATADIR}
+Rscript checkRconfigFile.r ${RCONFIG}
config_malformed=$?
if [[ "${config_malformed}" -ne 0 ]]; then
print_error_message \
@@ -71,7 +71,7 @@ if [[ "${library_did_not_install}" -ne 0 ]]; then
"Exiting..."
fi

-Rscript checkColnamesSampleSheet.r ${DATADIR}
+Rscript checkColnamesSampleSheet.r ${RCONFIG}
sample_sheet_malformed=$?
if [[ "${sample_sheet_malformed}" -ne 0 ]]; then
print_error_message \
@@ -81,7 +81,7 @@ fi

mkdir -p ${GDSDIR}/QCmetrics

-Rscript loadDataGDS.r ${DATADIR}
+Rscript loadDataGDS.r ${DATADIR} ${RCONFIG}
gds_problem_identified=$?
if [[ "${gds_problem_identified}" -ne 0 ]]; then
print_error_message \
@@ -91,9 +91,9 @@ fi

chmod 755 ${DATADIR}/2_gds/raw.gds

-Rscript calcQCMetrics.r ${DATADIR} ${REFDIR}
+Rscript calcQCMetrics.r ${DATADIR} ${REFDIR} ${RCONFIG}

-Rscript clusterCellTypes.r ${DATADIR} ${REFDIR}
+Rscript clusterCellTypes.r ${DATADIR} ${REFDIR} ${RCONFIG}

most_recent_git_tag=$(git describe --tags --abbrev=0)
current_commit_hash=$(git rev-parse HEAD)
@@ -106,12 +106,12 @@ mv QC.html ${GDSDIR}/QCmetrics/

mkdir -p ${DATADIR}/3_normalised

-Rscript normalisation.r ${DATADIR} ${REFDIR}
+Rscript normalisation.r ${DATADIR} ${REFDIR} ${RCONFIG}
chmod 755 ${DATADIR}/2_gds/rawNorm.gds

mkdir -p ${GDSDIR}/QCmetrics/CETYGO

-Rscript CETYGOdeconvolution.r ${DATADIR}
+Rscript CETYGOdeconvolution.r ${DATADIR} ${RCONFIG}

## print finish date and time
echo Job finished on:
4 changes: 2 additions & 2 deletions array/DNAm/preprocessing/loadDataGDS.r
@@ -17,11 +17,11 @@
# DEFINE PARAMETERS
#----------------------------------------------------------------------#
args <- commandArgs(trailingOnly = TRUE)
-dataDir <- args[1]
+dataDir <- args[[1]]
+configFile <- args[[2]]

gdsFile <- file.path(dataDir, "2_gds/raw.gds")

-configFile <- paste0(dataDir, "/config.r")
source(configFile)

arrayType <- toupper(arrayType)
7 changes: 4 additions & 3 deletions array/DNAm/preprocessing/normalisation.r
@@ -19,13 +19,14 @@
# DEFINE PARAMETERS
#----------------------------------------------------------------------#
args<-commandArgs(trailingOnly = TRUE)
-dataDir <- args[1]
-refDir <- args[2]
+dataDir <- args[[1]]
+refDir <- args[[2]]
+configFile <- args[[3]]

gdsFile <-file.path(dataDir, "/2_gds/raw.gds")
normgdsFile<-sub("\\.gds", "Norm.gds", gdsFile)
qcOutFolder<-file.path(dataDir, "/2_gds/QCmetrics")
normData<-file.path(dataDir, "/3_normalised/normalised.rdata")
-configFile <- paste0(dataDir, "/config.r")

source(configFile)
