Skip to content

Latest commit

 

History

History
78 lines (62 loc) · 4.11 KB

GSE110823.md

File metadata and controls

78 lines (62 loc) · 4.11 KB

Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding

Rosenberg et al.

PROJECT=~/sharedscratch/GSE110823
mkdir -p ${PROJECT}/fastqs
cd ${PROJECT}/fastqs

Get list of fastq files from ENA (https://www.ebi.ac.uk/ena/browser/view/PRJNA434658). Download All generates a script with wget calls. Remove human files from list, then remove wget -nc from each line, leaving a file with 30 lines in format:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR675/000/SRR6750050/SRR6750050_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR675/000/SRR6750050/SRR6750050_2.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR675/001/SRR6750041/SRR6750041_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR675/001/SRR6750041/SRR6750041_2.fastq.gz
...

Read file list into an array and sbatch for download using ~/templates/wget.sh.

FILES=($(cat GSE110823.files))
for FILE in ${FILES[@]}
do
    sbatch ~/templates/wget.sh ${FILE}
done

The seqspec (v0.3) format for this publication is discussed here and the related yaml and indices are provided here.

We need to generate a seqspec (v0) file as the cellatlas release currently lags behind seqspec. The following generates a template based on the details availabile from the publication.

module load mamba
mamba activate cellatlas
mkdir -p ${PROJECT}/seqspec
cd ${PROJECT}/seqspec
#seqspec init -n SPLiTSeq -m 1 -o spec.v0.yaml "((P5:29,Spacer:8,Read_1_primer:33,cDNA:1098,RT_primer:15,Round_1_BC:8,linker_1:30,Round_2_BC:8,Linker_2:30,Round_3_BC:8,UMI:10,Read_2_primer:22,Round_4_BC:6,P7:24)rna)"
seqspec init -n SPLiTSeq -m 1 -o spec.v0.2.0.yaml "(((P5:29,Spacer:8,Read_1_primer:33,cDNA:1098)R1.fastq.gz,(RT_primer:15,Round_1_BC:8,linker_1:30,Round_2_BC:8,Linker_2:30,Round_3_BC:8,UMI:10,Read_2_primer:22,Round_4_BC:6,P7:24)R2.fastq.gz)rna)"

We can edit the template to amend the values of the various sections using the details from the seqspec v0.3.0 SPLiTSeq yaml.

To run cellatlas build we need to provide the reference fasta and GTF. We use the 109 release of the mouse genome for compatability with AnnotationHub downstream.

mkdir -p ${PROJECT}/ref
cd ${PROJECT}/ref
wget http://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
wget http://ftp.ensembl.org/pub/release-109/gtf/mus_musculus/Mus_musculus.GRCm39.109.gtf.gz

Next we run cellatlas build to generate the commands to run via sbatch on the fastq files.

cd ${PROJECT}/seqspec
seqspec format -o fmt.yaml spec.v0.2.0.yaml
seqspec check fmt.yaml
FQ1=${PROJECT}/fastqs/SRR6750041_1.fastq.gz
FQ2=${PROJECT}/fastqs/SRR6750041_2.fastq.gz
cellatlas build -o ${PROJECT}/out -m rna -s fmt.yaml \
  -fa ${PROJECT}/ref/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz \
  -g ${PROJECT}/ref/Mus_musculus.GRCm39.109.gtf.gz ${FQ1} ${FQ2}
cp ${FQ1} ${PROJECT}/out/R1.fastq.gz
cp ${FQ2} ${PROJECT}/out/R2.fastq.gz

We next need to append these commands to a slurm template script for running via sbatch. Note that the fastq files were missing from the end of the kb count call, so these are added using sed. In addition, the -x <str> passed to kb count appears to default to -1 values. The -x 1,10,18,1,48,56,1,78,86:1,0,10:0,0,140 values from the SPLiTSeq example appears to broadly match that reported for SPLIT-SEQ by kb list. These values need to be manually amended to avoid an error, after which the job can be submitted via sbatch.

cp ~/templates/cellatlas.sh ${PROJECT}/out
~/software/jq-linux64  -r '.commands[] | values[] | join("\n")' ${PROJECT}/out/cellatlas_info.json >> ${PROJECT}/out/cellatlas.sh
sed -i.bak '$s|$|'"${PROJECT}/out/R1.fastq.gz ${PROJECT}/out/R2.fastq.gz"'|' ${PROJECT}/out/cellatlas.sh
chmod +x ${PROJECT}/out/cellatlas.sh
sbatch ${PROJECT}/out/cellatlas.sh