raw_seq_reads_to_kallisto.Rmd

---
title: "Raw Sequencing Reads to kallisto pseudoalignment"
author: "Aaron Mitchell-Dick"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
##knitr::opts_knit$set(root.dir = 'C:/Users/AMDprograms/Desktop/Untrimmed/')
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(error = FALSE)
```

<br/>
<br/>


## RNA-Seq Analysis: Mapping Raw Sequencing Reads to Gene Transcripts Using kallisto


This is the first part of a tutorial designed to walk you through a full RNA-Seq workflow.  In this section, we will take raw sequencing reads, process them, and map them to gene transcripts.  In the second section, we will take the abundance counts of gene transcripts for each sample and for each conditions, and combine them to perform differential expression analysis.  To skip ahead to that section, [use this link to skip to part 2](https://aaronmitchd.github.io/RNA_Seq/kallisto_h5_to_tximport_to_DESeq2_eval.html).  This tutorial is designed and commented so that users with little to no experience in programming can perform robust, simple, RNA-Seq analysis. For users that desire to perform an analysis without having to write any code, see [Galaxy](https://galaxyproject.org/), and [UseGalaxy](https://usegalaxy.org/), for an open-source, web-based platform.

<br/>


#### Make sure you can run bash/linux

For Mac Users: you already run bash in the terminal so you are good to go.

For Windows Users: you need to download and install WSL (Windows Subsystem for Linux), then use the command `bash` on the command line to open the bash/Linux shell.  The easiest way to do this is the 1) enable WSL and 2) download Ubuntu for windows by following the instructions [here](https://ubuntu.com/wsl).  WARNING: This install can take up to an hour on a laptop.  Time for a coffee break.

<br/>


#### Make sure you have anaconda/bioconda and python installed

An effective way to download and run bioinformatics packages is through Anaconda using bioconductor.  You'll need to [install Anaconda](https://docs.anaconda.com/anaconda/install/).  This will install Anaconda on Mac or Windows, and will also install python.  You can then run Anaconda from your programs list in Mac/Windows.

<br/>


#### Install bioconda

```{bash, eval=FALSE}
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
```

<br/>


#### Install all the programs and packages you'll need for this workflow

To see the list of programs and packages you'll need, navigate to the [Install Resources](https://aaronmitchd.github.io/RNA_Seq/Install-Resources.html) page in this tutorial.  You'll want to install all the packages, programs, and libraries you need right at the beginning, for a good flow when you're programming.  This may take a while, including installing RStudio and all the R packages for the second half of this tutorial.

<br/>


#### Download your sequencing files and run FastQC

Download your sequencing files (likely they are fastq.gz files) and ask your sequencing core or sequencing service what processing they performed on the reads.  They likely de-multiplexed them, and may have trimmed them.  If they did not do anything, you'll need to run quality checks and trim sequencing adapters from all of your files.  Sometimes they perform this step for you, and you'll be able to tell from running FastQC what you need to do next.  Download and run FastQC on each of your samples [FastQC](https://raw.githubusercontent.com/s-andrews/FastQC/master/INSTALL.txt) and [FastQC website](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

<br/>


#### Evaluate FastQC

In the output file, you'll find information on sequencing quality and presence or absence of adapter sequences.  If any samples are of poor quality, you'll need to make the determination whether to include them or throw them out.  Then you can move forward with processing

<br/>


#### Open bash terminal

On Mac open the Terminal, or on Windows open the command prompt and type `bash`.

<br/>


#### Update conda

```{bash, eval=FALSE}
conda update all
```

<br/>


#### Create a new conda environment

Notice that to the left of your cursor, the line begins with (base) or something like it.  This is telling you what environment you are in.  You want to make a new environment for each workflow you do on your computer, so that the packages you download are specific to that environment, and don't contaminate the global environment in case versions of packages are not compatible with the version of python you have, for example.  To create a new conda environment, run:

```{bash, eval=FALSE}
conda create --name RNA
```

<br/>


#### Initialize the conda environment

And every time you open bash to perform this RNA-Seq analysis workflow in a new session, you'll want to initialize this environment to run the packages you download here.

```{bash, eval=FALSE}
conda activate RNA
```

<br/>


#### Install TrimGalore

TrimGalore depends on cutadapt and FastQC (optional), so you'll need to install those.  Once you are in your conda RNA environment (you can check by running `conda env list` and the `*` will tell you which one you are currently in), install cutadapt then FastQC:

```{bash, eval=FALSE}
conda install -c bioconda cutadapt
sudo apt install fastqc
```

Check versions to make sure they installed:

```{bash, eval=FALSE}
cutadapt --version 
fastqc -v
```

Then install TrimGalore if you need to trim adapters and check read quality.

```{bash, eval=FALSE}
curl -fsSL https://github.com/FelixKrueger/TrimGalore/archive/0.6.5.tar.gz -o trim_galore.tar.gz
tar xvzf trim_galore.tar.gz
```

To run TrimGalore with default options and run FastQC after: `[your trimgalore directory]_trim_galore --fastqc <path and filename>`

```{bash, eval=FALSE}
/mnt/c/Users/Amdprograms/TrimGalore-0.6.5/trim_galore --fastqc <filename>
```

If you are advanced, you can run a short for loop to run trim galore on all files in a folder.  this could take several hours depending on filesize and number.

<br/>


#### Concatenate samples run on multiple lanes

If you mitigated batch effects by splitting samples and running them on multiple lanes, you'll need to concatenate those runs into a single file representing your full sample.  Concatenate any samples that were run on multiple lanes using:

```{bash, eval=FALSE}
Cat 5190-S17* > S17_DMSO_6h.fastq.gz
# NOTE: SAVE YOUR FILES AS fastq.gz
```

Or you can even cat files from one folder and then put the output file in a different folder:

```{bash, eval=FALSE}
cat ./Aaron_data_copy/Aaron_5190_181108B1/5190-S5_S105_L00* > ./Untrimmed/5190-S5.fastq.gz
```

The `*` means ignore any other characters other than the ones displayed.  This is an easy way to say "include all the files with the substring `*important_chars*` to the compiler.

Now that you have your files, and they are pre-processed, you can run kallisto.  kallisto is run in two steps:

1) generate an index file called a kallisto index file
2) pseudoalign each samples reads against the kallisto index file.

<br/>


#### Install kallisto

kallisto should already be installed, but if not, run

```{bash, eval=FALSE}
conda install kallisto
kallisto
```

which will install kallisto, then will check version and provide a list of possible commands.

<br/>


#### Create a kallisto index file

For common organisms, you can skip generating an index file, as the authors have created a repository for you.  [Download the tarball for your species here](https://github.com/pachterlab/kallisto-transcriptome-indices/releases).  It is highly recommended you follow this step if your organism is on this list, because it will make part two of this walkthrough easier.  Otherwise, build your index according to the manual. 

<br/>


#### Extract the tarball

run this code in bash and name the destination folder.

```{bash, eval=FALSE}
sudo tar -xvzf /mnt/c/PATH/TO/TAR-FILE/Desktop/FILE-NAME.tar.gz -C /mnt/c/PATH/TO/DESTINATION/FOLDER
```

Now, in the extracted folder you have your .idx file which you can just point to when running kallisto.  Or, you can copy the .idx file and paste it into the folder containing all your sequencing samples.  Side note, in the extracted folder you also have a .gtf file which you will use in part 2 of the walkthrough.

<br/>


#### Run kallisto

Now, we will run kallisto on each sample.  It is highly recommended you read the manual before you run your samples.  Here, we will provide some simple instruction.  Kallisto defaults to paired-end reads, so it will accept two files and compute the pairs.  If you have single-end reads, you'll provide an argument and it requires you to define your read length `-l` and standard deviation `-s` (you will know these values from your FASTQC run).

Navigate to the directory containing all your samples.

Here, we run kallisto on single-end reads with 100 bootstraps `-b`:

```{bash, eval=FALSE}
kallisto quant -i mouse.idx -o /mnt/c/Users/AMDprograms/Desktop/6h/sample1output -b 100 --single -l 51 -s 5 ./filename/of/working/directory.fastq.gz
```

Advanced users can run a for loop to iterate through the samples.  if you do this, MAKE SURE you indicate a new output folder each time.  something like this 
`-o ./kallisto/sample1output`
then
`-o ./kallist/sample2output`
otherwise you'll just keep overwriting your abundance.h5 files.

Once you have folders for each sample with your kallisto output files, you are ready to move onto part two of this walkthrough!