- java 1.8 (https://java.com/en/)
- UCSC blat (http://genome.ucsc.edu/FAQ/FAQblat.html#blat3)
- STAR 2.5.4
- Picard 1.138
- GATK 3.6
- SAMtools 0.1.19
- BEDTools 2.23.0
RNA-MosaicHunter requires a fasta file (.fasta, .fa) for your reference genome. It is better to make sure that your reference file has the same name and order of contigs to your .bam file(s) and .bed file(s). Reads aligned to any contigs which do not appear in the reference file will be ignored. When running RNA-MosaicHunter, you need to set the top parameter reference_file.
RNA-MosaicHunter currently accepts aligned reads from RNA sequencing (RNA-seq). We recommend STAR for read mapping and GATK/Picard for read pre-processing. These pre-processing steps are important in order to reduce false positives in identifying mosaic sites. The following is an example of how the reads are pre-processed.
Note that default config file and resources files are designed for GRCh37 (hg19). To run on other versions (e.g. GRCh38), users should prepare hg38 reference files and change the parameters "valid_references", "chr_x_name", "chr_y_name" in the config file to the corresponding chromosome names (e.g. with "chr" for GRCh38).
STAR --runThreadN ${THREAD_NUM} --twopassMode Basic --outSAMattributes All --outSAMtype BAM Unsorted --readFilesCommand zcat --genomeDir ${REFERENCE_DIR}/human_v37_contig_hg19_hs37d5_STAR_index10_gencode19 --readFilesIn "rnaseq/fastq/${IND_NAME}/${SAMPLE_ID}_1.fastq.gz" "rnaseq/fastq/${IND_NAME}/${SAMPLE_ID}_2.fastq.gz" --outFileNamePrefix "$OUTPUT_DIR/${IND_NAME}/${SAMPLE_ID}/"
java -jar picard.jar AddOrReplaceReadGroups INPUT=rnaseq/STAR/${IND_NAME}/${SAMPLE_ID}/Aligned.out.bam OUTPUT=rnaseq/Picard/${IND_NAME}_${SAMPLE_ID}.sorted.bam SO=coordinate ID=${IND_NAME}_${SAMPLE_ID} LB=unknown PL=illumina SM=${IND_NAME}_${SAMPLE_ID} PU=unknown
java -jar picard.jar MarkDuplicates INPUT=rnaseq/Picard/${IND_NAME}_${SAMPLE_ID}.sorted.bam OUTPUT=rnaseq/Picard/${IND_NAME}_${SAMPLE_ID}.masked.bam M=rnaseq/Picard/${IND_NAME}_${SAMPLE_ID}.matrix REMOVE_DUPLICATES=true AS=true VALIDATION_STRINGENCY=SILENT
samtools index rnaseq/Picard/${IND_NAME}_${SAMPLE_ID}.masked.bam
java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ${REFERENCE_DIR}/human_v37_contig_hg19_hs37d5/human_v37_contig_hg19_hs37d5.fasta -I rnaseq/Picard/${IND_NAME}_${SAMPLE_ID}.masked.f.bam -o rnaseq/GATK/${IND_NAME}_${SAMPLE_ID}.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt ${THREAD_NUM} -R ${REFERENCE_DIR}/human_v37_contig_hg19_hs37d5/human_v37_contig_hg19_hs37d5.fasta -I rnaseq/GATK/${IND_NAME}_${SAMPLE_ID}.split.bam -o rnaseq/GATK/${IND_NAME}_${SAMPLE_ID}.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ${REFERENCE_DIR}/human_v37_contig_hg19_hs37d5/human_v37_contig_hg19_hs37d5.fasta -I rnaseq/GATK/${IND_NAME}_${SAMPLE_ID}.split.bam -targetIntervals rnaseq/GATK/${IND_NAME}_${SAMPLE_ID}.intervals -o rnaseq/GATK/${IND_NAME}_${SAMPLE_ID}.realigned.bam
echo "Job ended at $(date)"
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ${REFERENCE_DIR}/human_v37_contig_hg19_hs37d5/human_v37_contig_hg19_hs37d5.fasta -I rnaseq/GATK/${IND_NAME}_${SAMPLE_ID}.realigned.bam -knownSites ${REFERENCE_DIR}/dbsnp_137.b37.vcf -o rnaseq/GATK/${IND_NAME}_${SAMPLE_ID}.recal_data.grp
java -jar GenomeAnalysisTK.jar -T PrintReads -nct ${THREAD_NUM} -R ${REFERENCE_DIR}/human_v37_contig_hg19_hs37d5/human_v37_contig_hg19_hs37d5.fasta -I rnaseq/GATK/${IND_NAME}_${SAMPLE_ID}.realigned.bam -BQSR rnaseq/GATK/${IND_NAME}_${SAMPLE_ID}.recal_data.grp -o rnaseq/GATK/${IND_NAME}_${SAMPLE_ID}.BQSR.bam
samtools view -f 2 -F 768 -h rnaseq/GATK/${IND_NAME}_${SAMPLE_ID}.BQSR.bam | samtools view -Sb - > rnaseq/bam_ready/${IND_NAME}_${SAMPLE_ID}.final.bam
samtools index rnaseq/bam_ready/${IND_NAME}_${SAMPLE_ID}.final.bam
/BEDtools-2.23.0/bin/coverageBed -abam rnaseq/bam_ready/${IND_NAME}_${SAMPLE_ID}.final.bam -b ${REFERENCE_DIR}/human_gene_gencode19.b37.exon.bed -hist | awk '$1=="all"' > rnaseq/coverageBed/${IND_NAME}_${SAMPLE_ID}.tsv
{Path to example data}
java -Xmx64G -jar ${MOSAICHUNTER_DIR}/build/mosaichunter.jar -C ${MOSAICHUNTER_CONFIG} -P input_file=rnaseq/bam_ready/${IND_NAME}_${SAMPLE_ID}.final.bam -P mosaic_filter.sex=${SEX} -P output_dir=${OUTPUT_DIR} -P common_site_filter.bed_file=${ERROR_PRONE_BED} -P misaligned_reads_filter.max_NM=3
cat rnaseq/MosaicHunter/${IND_NAME}/${SAMPLE_ID}/final.passed.tsv | awk '$3==$7||$3==$9' | awk '$11~"N/A"&&$11~"1.0"' > rnaseq/MosaicHunter/${IND_NAME}/${SAMPLE_ID}/final.clean.tsv
We have two different mode of RNA editint filters:
- full-removal: removes all A>G / T>C mutations
Rscript RNA_Editing_Filter_MH.R final.clean.tsv full-removal final.clean.filter_RNA_edit.tsv
- filter-based-removal: removes RNA editing sites reported in DARNED [1] and REDIportal [2], and A>G mutations on transcribed strand and T>C mutations on untranscribed strand
Rscript RNA_Editing_Filter_MH.R final.clean.tsv filter-based-removal final.clean.filter_RNA_edit.tsv
All of the output files are in the specified directory output_dir, including the final candidate list final.passed.tsv, the filtered and remained remaining variant lists of each filters (.filtered.tsv, .passed.tsv), and other temporary files, such as blat input and output.
The final.clean.filter_RNA_edit.tsv files are in tab-separated format, whose columns' meanings for single mode are listed below:
-
Contig / chromosome name
-
Position / coordinate on the contig (1-based)
-
Base of reference allele
-
Total depth of this site
-
Pileuped sequencing bases at this site
-
Pileuped sequencing baseQs at this site
-
Base of major allele
-
Depth of major allele
-
Base of minor allele
-
Depth of minor allele
-
dbSNP allele frequency of major and minor alleles
-
log10 prior probability of major-homozygous genotype
-
log10 prior probability of heterozygous genotype
-
log10 prior probability of minor-homozygous genotype
-
log10 prior probability of mosaic genotype
-
log10 likelihood of major-homozygous genotype
-
log10 likelihood of heterozygous genotype
-
log10 likelihood of minor-homozygous genotype
-
log10 likelihood of mosaic genotype
-
log10 posterior probability of major-homozygous genotype
-
log10 posterior probability of heterozygous genotype
-
log10 posterior probability of minor-homozygous genotype
-
log10 posterior probability of mosaic genotype
-
Mosaic posterior probability
You can get a high-confidence candidate list of mosaic sites by filtering the 24th column (mosaic posterior probability) or sorting this column from high to low.
[1] Kiran,A. and Baranov,P.V. (2010) DARNED: a DAtabase of RNa EDiting in humans. Bioinformatics, 26, 1772–1776.
[2] Picardi,E., D’Erchia,A.M., Lo Giudice,C. and Pesole,G. (2017) REDIportal: a comprehensive database of A-to-I RNA editing events in humans. Nucleic Acids Res., 45, D750–D757.