RNA-seq Quality Assessment Assignment - Bi 623 (Summer 2021)

#See this README.md well as (QAA/QAA.pdf)

Be sure to upload all relevant materials by the deadline and double check to be sure that your off-line repository is up-to-date with your on-line repository. Answers to the questions can be submitted as Github markdown or pdf.

Objectives

The objectives of this assignment are to use existing tools for quality assessment and adaptor trimming, compare the quality assessments to those from your own software, and to demonstrate your ability to summarize other important information about this RNA-Seq data set.

Data:

Each of you will be working with 2 of the demultiplexed file pairs. For all steps below, process the two libraries separately. Library assignments are here: /projects/bgmp/shared/Bi623/QAA_data_assignments.txt

My particular files begin with: 29_4E_fox_S21_L008 and 8_2F_fox_S7_L008

The demultiplexed, gzipped .fastq files are here: /projects/bgmp/shared/2017_sequencing/demultiplexed/

Do not move, copy, or unzip these data!

Part 1 – Read quality score distributions

Using FastQC via the command line on Talapas, produce plots of quality score distributions for R1 and R2 reads. Also, produce plots of the per-base N content, and comment on whether or not they are consistent with the quality score plots.

The per-base N content plots appear to be be consistent with the quality score distributions.

29_4E_fox_S21_L008 R1

29_4E_fox_S21_L008 R2

8_2F_fox_S7_L008 R1

8_2F_fox_S7_L008 R2

Run your quality score plotting script from your Demultiplexing assignment from Bi622. Describe how the FastQC quality score distribution plots compare to your own. If different, propose an explanation. Also, does the runtime differ? If so, why?

The amount of time it took to run my plotting script was considerably slower than the fastqc plot generation. It takes a while to do the same thing in Python, while Java might just be comparably faster, since that's what fastqc is programmed in.

29_4E_fox_S21_L008 R1

29_4E_fox_S21_L008 R2

8_2F_fox_S7_L008 R1

8_2F_fox_S7_L008 R2

Comment on the overall data quality of your two libraries.

Both libraries are honestly very good in my opinion. Qscores are higher than I've seen as of now.

Part 2 – Adaptor trimming comparison

Create a new conda environment called QAA and install cutadapt and Trimmomatic. Google around if you need a refresher on how to create conda environments. Make sure you check your installations with:
- cutadapt --version (should be 3.4)
- trimmomatic -version (should be 0.39)
Using cutadapt, properly trim adapter sequences from your assigned files. Be sure to read how to use cutadapt. Use default settings. What proportion of reads (both forward and reverse) were trimmed?

Try to determine what the adapters are on your own. If you cannot (or if you do, and want to confirm), click here to see the actual adapter sequences used.

R1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA

R2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
- Sanity check: Use your Unix skills to search for the adapter sequences in your datasets and confirm the expected sequence orientations. Report the commands you used, the reasoning behind them, and how you confirmed the adapter sequences.

One example of how I checked for adapter sequences was the following unix command: cat 8_2F_fox_S7_L008_R1_001.out.1.fastq | grep "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" | head I used this to search for that adapter sequence in any of the lines, but I wouldn't have known where to begin, to look for the sequence itself.

Use Trimmomatic to quality trim your reads. Specify the following, in this order:
- LEADING: quality of 3
- TRAILING: quality of 3
- SLIDING WINDOW: window size of 5 and required quality of 15
- MINLENGTH: 35 bases
Be sure to output compressed files and clear out any intermediate files.
Plot the trimmed read length distributions for both R1 and R2 reads (on the same plot). You can produce 2 different plots for your 2 different RNA-seq samples. There are a number of ways you could possibly do this. One useful thing your plot should show, for example, is whether R1s are trimmed more extensively than R2s, or vice versa. Comment on whether you expect R1s and R2s to be adapter-trimmed at different rates.

I expect the R2s to be adapter-trimmed at a higher rate than the R1s.

Part 3 – Alignment and strand-specificity

Install sofware. In your QAA environment, use conda to install:
- star
- numpy
- pysam
- matplotlib
Then pip install HTSeq
Find publicly available mouse genome fasta files (Ensemble release 104) and generate an alignment database from them. Align the reads to your mouse genomic database using a splice-aware aligner. Use the settings specified in PS8 from Bi621.

Hint - you will need to use gene models to perform splice-aware alignment, see PS8 from Bi621.
Using your script from PS8 in Bi621, report the number of mapped and unmapped reads from each of your 2 sam files. Make sure that your script is looking at the bitwise flag to determine if reads are primary or secondary mapping (update your script if necessary).

29_4E_fox_S21_L008

mapped read total 3966649
unmapped read total 116941

8_2F_fox_S7_L008

mapped read total 67070958
unmapped read total 2511364

Count reads that map to features using htseq-count. You should run htseq-count twice: once with --stranded=yes and again with --stranded=no. Use default parameters otherwise.


htseq-count --stranded=yes Mus_musculus.GRCm39.dna.primary_assembly.ens104.STAR_2.7.9a1Aligned.out.sam ./mouse_fasta/Mus_musculus.GRCm39.104.gtf > HTS_S_yes_8_2F.tsv 

__no_feature	30658192
__ambiguous	26184
__too_low_aQual	61684
__not_aligned	1222586
__alignment_not_unique	1588617


htseq-count --stranded=no Mus_musculus.GRCm39.dna.primary_assembly.ens104.STAR_2.7.9a1Aligned.out.sam ./mouse_fasta/Mus_musculus.GRCm39.104.gtf > HTS_S_no_8_2F.tsv

__no_feature	3148692
__ambiguous	1555558
__too_low_aQual	61684
__not_aligned	1222586
__alignment_not_unique	1588617


htseq-count --stranded=yes Mus_musculus.GRCm39.dna.primary_assembly.ens104.STAR_2.7.9aAligned.out.sam ./mouse_fasta/Mus_musculus.GRCm39.104.gtf > HTS_S_yes_29_4E.tsv

__no_feature	1811726
__ambiguous	1334   *****
__too_low_aQual	3642
__not_aligned	56506
__alignment_not_unique	87551

htseq-count --stranded=no Mus_musculus.GRCm39.dna.primary_assembly.ens104.STAR_2.7.9aAligned.out.sam ./mouse_fasta/Mus_musculus.GRCm39.104.gtf  > HTS_S_no_29_4E.tsv

__no_feature	119362
__ambiguous	98624   *****
__too_low_aQual	3642
__not_aligned	56506
__alignment_not_unique	87551

Demonstrate convincingly whether or not the data are from “strand-specific” RNA-Seq libraries. Include any comands/scripts used. Briefly describe your evidence, using quantitative statements (e.g. "I propose that these data are/are not strand-specific, because X% of the reads are y, as opposed to z.").

I propose that these data are not strand-specific, because 93.8% of the reads have no feature when stranded=yes is selected, as opposed to 6.1% when stranded=no is selected in the 29_4E sample.

*Hint* - recall ICA4 from Bi621.

To turn in your work for this assignment: Upload your Talapas batch script/code, FastQC plots, mapped/unmapped read counts, counts files generated from htseq-count, answers to questions, and any additional plots/code to github. You should create at most 2 files for submission (R markdown and the rendered pdf file) containing all these elements. The three parts of the assignment should be clearly labeled. Be sure to title and write a figure legend for each image/graph/table you present.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
metadata		metadata
.gitignore		.gitignore
QAA.pdf		QAA.pdf
README.md		README.md
cutadap_batch.srun		cutadap_batch.srun
fastq_gzipper.srun		fastq_gzipper.srun
htseq-er.srun		htseq-er.srun
trimmo_sbatch.srun		trimmo_sbatch.srun
wrapper.genomedbmake.srun		wrapper.genomedbmake.srun
wrapper.pyparse.py		wrapper.pyparse.py
wrapper.pyparse1.py		wrapper.pyparse1.py
wrapper.staralign.srun		wrapper.staralign.srun

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RNA-seq Quality Assessment Assignment - Bi 623 (Summer 2021)

Objectives

Data:

Do not move, copy, or unzip these data!

Part 1 – Read quality score distributions

Part 2 – Adaptor trimming comparison

Part 3 – Alignment and strand-specificity

About

Releases

Packages

Languages

johncheney/QAA

Folders and files

Latest commit

History

Repository files navigation

RNA-seq Quality Assessment Assignment - Bi 623 (Summer 2021)

Objectives

Data:

Do not move, copy, or unzip these data!

Part 1 – Read quality score distributions

Part 2 – Adaptor trimming comparison

Part 3 – Alignment and strand-specificity

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages