PROJECT=~/sharedscratch/GSE110823
mkdir -p ${PROJECT}/fastqs
cd ${PROJECT}/fastqs
Get list of fastq files from ENA (https://www.ebi.ac.uk/ena/browser/view/PRJNA434658). Download All generates a script with wget calls. Remove human files from list, then remove wget -nc
from each line, leaving a file with 30 lines in format:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR675/000/SRR6750050/SRR6750050_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR675/000/SRR6750050/SRR6750050_2.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR675/001/SRR6750041/SRR6750041_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR675/001/SRR6750041/SRR6750041_2.fastq.gz
...
Read file list into an array and sbatch for download using ~/templates/wget.sh
.
FILES=($(cat GSE110823.files))
for FILE in ${FILES[@]}
do
sbatch ~/templates/wget.sh ${FILE}
done
The seqspec (v0.3) format for this publication is discussed here and the related yaml
and indices
are provided here.
We need to generate a seqspec (v0) file as the cellatlas
release currently lags behind seqspec
. The following generates a template based on the details availabile from the publication.
module load mamba
mamba activate cellatlas
mkdir -p ${PROJECT}/seqspec
cd ${PROJECT}/seqspec
#seqspec init -n SPLiTSeq -m 1 -o spec.v0.yaml "((P5:29,Spacer:8,Read_1_primer:33,cDNA:1098,RT_primer:15,Round_1_BC:8,linker_1:30,Round_2_BC:8,Linker_2:30,Round_3_BC:8,UMI:10,Read_2_primer:22,Round_4_BC:6,P7:24)rna)"
seqspec init -n SPLiTSeq -m 1 -o spec.v0.2.0.yaml "(((P5:29,Spacer:8,Read_1_primer:33,cDNA:1098)R1.fastq.gz,(RT_primer:15,Round_1_BC:8,linker_1:30,Round_2_BC:8,Linker_2:30,Round_3_BC:8,UMI:10,Read_2_primer:22,Round_4_BC:6,P7:24)R2.fastq.gz)rna)"
We can edit the template to amend the values of the various sections using the details from the seqspec v0.3.0 SPLiTSeq yaml.
To run cellatlas build
we need to provide the reference fasta
and GTF
. We use the 109 release of the mouse genome for compatability with AnnotationHub
downstream.
mkdir -p ${PROJECT}/ref
cd ${PROJECT}/ref
wget http://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
wget http://ftp.ensembl.org/pub/release-109/gtf/mus_musculus/Mus_musculus.GRCm39.109.gtf.gz
Next we run cellatlas build
to generate the commands to run via sbatch
on the fastq
files.
cd ${PROJECT}/seqspec
seqspec format -o fmt.yaml spec.v0.2.0.yaml
seqspec check fmt.yaml
FQ1=${PROJECT}/fastqs/SRR6750041_1.fastq.gz
FQ2=${PROJECT}/fastqs/SRR6750041_2.fastq.gz
cellatlas build -o ${PROJECT}/out -m rna -s fmt.yaml \
-fa ${PROJECT}/ref/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz \
-g ${PROJECT}/ref/Mus_musculus.GRCm39.109.gtf.gz ${FQ1} ${FQ2}
cp ${FQ1} ${PROJECT}/out/R1.fastq.gz
cp ${FQ2} ${PROJECT}/out/R2.fastq.gz
We next need to append these commands to a slurm
template script for running via sbatch
. Note that the fastq
files were missing from the end of the kb count
call, so these are added using sed
. In addition, the -x <str>
passed to kb count
appears to default to -1 values. The -x 1,10,18,1,48,56,1,78,86:1,0,10:0,0,140
values from the SPLiTSeq example appears to broadly match that reported for SPLIT-SEQ by kb list
. These values need to be manually amended to avoid an error, after which the job can be submitted via sbatch
.
cp ~/templates/cellatlas.sh ${PROJECT}/out
~/software/jq-linux64 -r '.commands[] | values[] | join("\n")' ${PROJECT}/out/cellatlas_info.json >> ${PROJECT}/out/cellatlas.sh
sed -i.bak '$s|$|'"${PROJECT}/out/R1.fastq.gz ${PROJECT}/out/R2.fastq.gz"'|' ${PROJECT}/out/cellatlas.sh
chmod +x ${PROJECT}/out/cellatlas.sh
sbatch ${PROJECT}/out/cellatlas.sh