Tutorial
This tutorial walks through the process of analyzing the repeat content of the sugar beet, Beta vulgaris.
First, we need to get reper and the test data.
git clone https://github.com/nterhoeven/reper.git
docker pull nterhoeven/reper
alias reper='docker run --user=$(id -u):$(id -g) -it --rm -v "$(pwd)":/data nterhoeven/reper'
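Before continuing, it can help to confirm that the command-line tools this tutorial relies on are actually installed. The following is a minimal sketch, not part of reper itself; the helper name `have_cmd` is made up for illustration, and `fastq-dump` is included because it is used further down in the tutorial.

```shell
# Hypothetical helper: succeeds if the given command is found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Tools used in this tutorial: git and docker now, fastq-dump later on.
for tool in git docker fastq-dump; do
    if have_cmd "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool" >&2
    fi
done
```

A missing tool is only reported here, not treated as an error, so the loop can be pasted into an interactive shell without closing it.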
Move to the tutorial directory:
cd reper/tutorial/
Here we find a prepared reper config file (reper.conf) and bash scripts containing the needed commands.
We now have to download the sequencing data from the SRA.
The easiest way to do this is to use fastq-dump from the SRA Toolkit (sra-tools) package (01_download-data.sh):
fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRR952972
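Once the download has finished, a quick sanity check is to count the reads in each file. The sketch below assumes that the run is paired-end, so that `--split-files` writes `SRR952972_1.fastq` and `SRR952972_2.fastq`, and that the files use the standard four-line FASTQ record layout; the function name `count_fastq_reads` is chosen here for illustration.

```shell
# Count reads in a FASTQ file, assuming four lines per record
# (header, sequence, separator, qualities).
count_fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Assumed output names of fastq-dump --split-files for a paired-end run.
for f in SRR952972_1.fastq SRR952972_2.fastq; do
    if [ -s "$f" ]; then
        echo "$f: $(count_fastq_reads "$f") reads"
    else
        echo "$f: missing or empty" >&2
    fi
done
```

For paired-end data, both files should report the same number of reads.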
The next step is to configure the reference databases used to classify the detected repeats. For our data set, we can use REdat and RefSeq. reper comes with scripts to automatically download and configure these databases (02_prepare-databases.sh):
reper configure-refseq
reper configure-REdat
Everything is now set up and we can start reper. With the settings specified in the config file, the analysis requires 24 threads and 100 GB of memory and takes about 5-6 hours (03_run-reper.sh):
reper kmerCount
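Since the run is long and resource-hungry, it may be worth checking the machine against the requirements stated above before starting. This is a rough sketch for Linux hosts only: it reads the total memory from `/proc/meminfo` and the CPU count from `nproc`, and the variable and function names are illustrative rather than part of reper. The thresholds mirror the tutorial text and would need adjusting if the config file is changed.

```shell
# Thresholds taken from the tutorial text: 24 threads and 100 GB of memory.
REQUIRED_THREADS=24
REQUIRED_MEM_GB=100

# Succeeds when the first whole number is at least the second.
has_at_least() {
    [ "$1" -ge "$2" ]
}

# Available CPU count (nproc is part of GNU coreutils); falls back to 1.
threads=$(nproc 2>/dev/null || echo 1)
# Total memory from /proc/meminfo (reported in kB), converted to whole GiB;
# falls back to 0 where the file does not exist.
mem_gb=$(awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo 2>/dev/null || echo 0)

has_at_least "$threads" "$REQUIRED_THREADS" \
    || echo "Warning: only $threads threads available" >&2
has_at_least "$mem_gb" "$REQUIRED_MEM_GB" \
    || echo "Warning: only about $mem_gb GiB of memory detected" >&2
```

The memory figure is rounded down, so a machine sold as having 100 GB may still trigger the warning; treat the output as a hint rather than a hard limit.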
Now it is time to grab a coffee and let reper do the work.
Once reper has finished successfully, you will find the following important files:
- repeat-landscape-by-*: These files contain an overview of the repeats found, their assigned class, and their estimated share of the genome size.
- Trinity.fasta: This file contains all repeat sequences found.
- Trinity.fasta.exemplars.classified: This file contains the representative exemplar sequence for each cluster along with its classification.
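To get a first impression of the results, the number of sequences in the output files can be counted from the shell. The sketch below assumes it is run from the directory containing the output and that both files are in FASTA format, which the tutorial does not state explicitly for the classified exemplars; `count_fasta_records` is a name chosen for this example.

```shell
# Count sequences in a FASTA file: every record starts with a '>' header line.
count_fasta_records() {
    grep -c '^>' "$1"
}

# File names as listed above; the FASTA format of the second is an assumption.
for f in Trinity.fasta Trinity.fasta.exemplars.classified; do
    if [ -s "$f" ]; then
        echo "$f: $(count_fasta_records "$f") sequences"
    else
        echo "$f: not found or empty" >&2
    fi
done
```

The exemplar file should contain fewer sequences than Trinity.fasta, since it holds one representative per cluster.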
With the R script (plot-landscape.R) provided in the scripts subdirectory of reper, you can generate visualizations of the repeat landscape. Note that this requires the ggplot library (04_generate-plots.sh).