HybPiper2.Rmd

---
title: "HybPiper 2"
output:
  github_document:
    number_sections: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      eval = FALSE)
```

**Author**: [Léo-Paul Dagallier](https://github.com/LPDagallier)    
**Last update**: `r format(Sys.Date())`

***


HybPiper uses clean reads to create per samples assemblies and to extract the targeted sequences. See the full [HybPiper documentation](https://github.com/mossmatters/HybPiper/wiki/Full-pipeline-parameters) for more details. And see their [test dataset](https://github.com/mossmatters/HybPiper/raw/master/test_dataset.tar.gz) if you want to practice with a tutorial dataset.

The workflow is divided in 2 major steps:\
- [`hybpiper assemble`](#assembly): for each sample, assembles the targeted loci\
- [`hybpiper extract`](#locus-extraction): for each locus, extract the assembled sequences for the samples

The output from each step can be either stored into a same directory, or into two separate directories.\
Here the output from each step will be stored into separate directories. The practical reason for storing them separately is that you will usually first assemble a bunch of samples (e.g. samples from a complete sequencing plate), and secondly extract the sequences from either only a subset of samples, or from different assembly runs (e.g only some samples from different sequencing plates).\
But if you prefer, you can totally store the outputs from the `assemble` and `extract` steps into a same directory.

The following assumes that:

-   We defined a unique identifier `<analysis_ID>` for the analysed data and created a corresponding subfolder in the `DATA` directory (`<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/DATA`; see [here](Notes_for_advanced_wf.md) for details on folder organization and analysis naming)
-   The analysis will be run in a temporary directory whose path is `path_to_tmp`
-   The names of the output folders are composed of the unique identifier `<analysis_ID>` and of a step identifier (`hybpiper2_assemble`, `hybpiper2_extract`, etc.). The output folders will be stored in the `<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS/` directory.

Note: if you are running HybPiper in local (i.e. not on a cluster), you can directly run HybPiper in the output directory (i.e temporary directory and output directory are the same).

# Assembly {#assembly}

:point_right: :computer: See the [script for local use](PHYLOGENY_RECONSTRUCTION/SCRIPTS_local/hybpiper2_assemble.sh)\
:point_right: :woman_technologist: See the [script for cluster (SLURM) use](PHYLOGENY_RECONSTRUCTION/SCRIPTS_cluster/hybpiper2_assemble_TEMPLATE.sh).

## Preparation

As input, HybPiper needs:

-   the list of the samples to analyse (`namelist_<analysis_ID>.txt`).There must not be any duplicate.\
-   the target file `targetfile.fasta`\
-   the clean reads for each sample (R1 and R2 .fastq). They can be either copied manually, or you can create the following text file to batch copy the .fastq files:
    -   [`input_fastq.txt`](example_files/input_fastq.txt): contains the copy commands (`scp`) to transfer the clean .fastq R1 and R2 files from their storage directory to the working directory. Each line contains a `scp`command specific to a single sample, e.g. `scp <path_to_clean_reads_storage>/Sample1_CLEAN* .`
-   the .fastq file names have to match the names in `namelist_<analysis_ID>.txt`
    -   you can use [`files_renaming.txt`](example_files/files_renaming.txt) to batch rename the input .fastq files. Each line contains a `rename` command specific to a single sample, e.g. `rename -v 'Sample1_CLEAN' 'Sample1' *`

Note that `input_fastq.txt` and `files_renaming.txt` can be easily created using a simple spreadsheet (see 3 last columns in the [example spreadsheet](example_files/spreadsheet_example.xlsx))

`namelist_<analysis_ID>.txt`, `input_fastq.txt`, and `files_renaming.txt` are stored in the `DATA/<analysis_ID>` directory.

### Define the paths and variables

-   for the inputs:

```{bash}
analysis_ID="my_analysis_ID"
path_to_dir_in="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/DATA";

path_to_ref="<base_directory>/DATASETS/PHYLOGENOMICS/target_references"
reference_fasta_file="target_reference.FAA"
```

-   for the outputs:

```{bash}
step_ID="hybpiper2_assemble"
path_to_dir_out="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS/$analysis_ID"_"$step_ID"/";
```

**Note.** Since the release of [HybPiper 2.1.3](https://github.com/mossmatters/HybPiper/blob/master/change_log.md), the parameter `--hybpiper_output` or `-o` allows you to specify a directory where all the `hybpiper assemble` outputs will be stored. A strategy can be to store all the assemblies in a same directory (easier if you want to use only a subset of the samples or if you want to combine different assembly runs).


-   temporary directory for local users:

```{bash}
path_to_tmp=$path_to_dir_out
```

-   temporary directory for cluster users: depends on the cluster manager and its setup (please see with your cluster documentation), e.g. for SLURM on HiperGator:

```{bash}
path_to_tmp=$SLURM_TMPDIR
```

### Copy all the files in the working directory

Go to working directory.

```{bash}
cd $path_to_tmp
```

Copy the reference file.

```{bash}
scp $path_to_ref/$reference_fasta_file $path_to_tmp
```

Copy the data related files.

```{bash}
scp $path_to_dir_in/$analysis_ID/namelist_$analysis_ID".txt" $path_to_tmp
scp $path_to_dir_in/$analysis_ID/input_fastq.txt $path_to_tmp
scp $path_to_dir_in/$analysis_ID/files_renaming.txt $path_to_tmp
```

Text files written on Windows systems have a 'carriage return' (CR, \\r) and a 'line feed' (LF, \\f) at the end of each line. This is incompatible (in some situations) with Unix-based systems (Linux and MacOS) files, that only have a LF.\
Thus, if you wrote your text files on Windows system, you need to run to following command to make them compatible with Linux.

```{bash}
dos2unix *.txt
```

Copy the .fastq files, either manually or using the `input_fastq.txt` file.

```{bash}
sh input_fastq.txt
```

Each line in `input_fastq.txt` will be run at a time. To copy several samples in parallel (here, 8 in parallel), use:

```{bash}
parallel -j 8 < input_fastq.txt 
```

If your .fastq files are in .fastq.gz (compressed .fastq), you need to decompress them. One by one:

```{bash}
gunzip *.gz
```

Or in parallel:

```{bash}
ls *.gz | parallel -j 8 gunzip
```

Rename you .fastq files:

```{bash}
sh files_renaming.txt
```

## `hybpiper assemble`

Go to working directory.

```{bash}
cd $path_to_tmp
```

Rename `namelist_<analysis_ID>.txt` to `namelist.txt`.

```{bash}
mv $path_to_tmp/namelist_$analysis_ID".txt" $path_to_tmp/namelist.txt
```

Run `hybpiper assemble`.

```{bash}
while read name;
do hybpiper assemble -t_aa $reference_fasta_file -r $name*.fastq --prefix $name --cpu 8 --diamond --run_intronerate;
done < namelist.txt
```

Here, the assembly is run with an amino-acid reference and the Diamond aligner. The reference has to be in amino-acids and specified with `-t_aa` (note: if target reference is in nucleotide and specified with `-t_dna`, it will automatically be translated)

For the Melastomataceae probe set, I recommend to add the `--no_stitched_contig` parameter.

See the [full list of parameters](https://github.com/mossmatters/HybPiper/wiki/Full-pipeline-parameters#10-hybpiper-assemble) for more details.

### Faster `hybpiper assemble`

After different trials, I found out that running `hybpiper assemble` in parallel was faster. For example, on 8 CPUs, running 4 `hybpiper assemble --cpu 2` in parallel is faster than one single `hybpiper assemble --cpu 8`.

Initialize an empty text file that will contain all the commands to run in parallel.

```{bash}
rm hybpiper_parallel.txt
touch hybpiper_parallel.txt
```

Write the commands to be run in parallel in the file, 1 command per line.

```{bash}
while read name;
do 
  echo "hybpiper assemble -t_aa $reference_fasta_file -r $path_to_fastq/$name"_"*.fastq --prefix $name --cpu 2 --diamond --unpaired $path_to_fastq/$name.fastq --no_stitched_contig;" >> hybpiper_parallel.txt
done < namelist.txt
```

Run the commands in the text file on 4 parallel instances.

```{bash}
parallel -j 4 < hybpiper_parallel.txt
```

As an example, running `hybpiper assemble` on 240 samples (64Gb RAM) took the following times:

+-----------------------------+-----------------+----------------+
| Command                     | CPU time        | Wall time      |
+=============================+=================+================+
| `hybpiper assemble --cpu 8` | 147h and 6mins  | 30h and 40mins |
+-----------------------------+-----------------+----------------+
| `hybpiper assemble --cpu 2` | 147h and 16mins | 21h and 34mins |
| `parallel -j4`              |                 |                |
+-----------------------------+-----------------+----------------+

## Summary statistics

Compute the summary statistics for the exons.

```{bash}
hybpiper stats -t_aa $reference_fasta_file --seq_lengths_filename genes_sequences_lengths --stats_filename hybpiper_genes_statistics gene namelist.txt
```

Compute the summary statistics for the supercontigs (exons+introns).

```{bash}
hybpiper stats -t_aa $reference_fasta_file --seq_lengths_filename supercontigs_sequences_lengths --stats_filename hybpiper_supercontigs_statistics supercontig namelist.txt
```

Visualize the summary statistics.

```{bash}
hybpiper recovery_heatmap --heatmap_dpi 300 --heatmap_filetype pdf --heatmap_filename recovery_heatmap_exons genes_sequences_lengths.tsv
hybpiper recovery_heatmap --heatmap_dpi 300 --heatmap_filetype pdf --heatmap_filename recovery_heatmap_supercontigs supercontigs_sequences_lengths.tsv
```
![heatmap example](figures/hybpiper_heatmap_lowres.PNG)

([click here](figures/recovery_heatmap_exons_example.pdf) to see this figure in full resolution)


## Transfer to output directory

(only in cases the temporary directory is different from the output directory)

.fastq files can take up a lot of space so you may want to remove them.

```{bash}
rm $path_to_tmp/*.fastq
```

Transfer the assemblies to the output directory.

```{bash}
mkdir $path_to_dir_out
scp -rp $path_to_tmp/* $path_to_dir_out/
```

If your running on a cluster, remove the temporary directory with the following command (on some clusters (e.g. HiperGator), it is automatically removed at the end of the job).\
Do not remove the temporary directory if your temporary directory and your output directory are the same (local users).

```{bash}
# rm -r $path_to_tmp
```

# Locus extraction {#locus-extraction}

:point_right: :computer: See the [script for local use](PHYLOGENY_RECONSTRUCTION/SCRIPTS_local/hybpiper2_extract.sh)\
:point_right: :woman_technologist: See the [script for cluster (SLURM) use](PHYLOGENY_RECONSTRUCTION/SCRIPTS_cluster/hybpiper2_extract_TEMPLATE.sh).

## Preparation

For extracting the loci sequences, you will usually be in 2 situations:

- the list of samples you want to extract sequences for is ***the same*** as the list of samples you just assembled sequences for (same as previous step)
- the list of samples you want to extract sequences for is ***different*** from the list of samples you just assembled sequences for (different from previous step): this can happen when you want to discard samples that have poor assembly statistics, or when you want to extract the sequences for samples that were assembled in 2 separate assembly runs.

### List of sample is *the same* as in the previous assembly step

To extract the sequences for the samples you assembled in the step before, you can simply define the temporary directory as the output directory of the previous step.

```{bash}
analysis_ID="my_analysis_ID"
path_to_assemble="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS/"$analysis_ID"_hybpiper2_assemble/";
path_to_tmp=$path_to_assemble
reference_fasta_file="target_reference.FAA"
```

### List of sample is *different* as in the previous assembly step

To extract the sequences for samples assembled in several different assembly runs, or for a subset of samples from the step before, you need:

-   a new `<analysis_ID>`,
-   a new `namelist_<analysis_ID>.txt` with the list of samples you want to extract sequences of
-   the different assemblies of the samples you want to extract sequence of. The assemblies folder can either be copied manually, or you can create a text file to copy the assemblies folders:
    -   [`input_assemblies.txt`](example_files/input_assemblies.txt): contains the `scp` commands to transfer the folders from their respective assembly output directories to the current working directory. Each line contains a command specific to a single sample, e.g. `scp -r path_to_assembly/Sample1 .`

Similarly to `input_fastq.txt` and `files_renaming.txt`, `input_assemblies.txt` can be easily created using a simple spreadsheet.

`namelist_<analysis_ID>.txt` and `input_assemblies.txt` are stored in the `DATA/<analysis_ID>` directory.

**Note.** Since the release of [HybPiper 2.1.3](https://github.com/mossmatters/HybPiper/blob/master/change_log.md), you can store all the assemblies outputs in a same directory thanks to the parameter `--hybpiper_output` or `-o`. You can then directly retrieve the assemblies from this directory for the extraction step (i.e. without copying the assemblies with `input_assemblies.txt`, which can be long).

#### Define the paths and variables

-   for the inputs

```{bash}
analysis_ID="<new_analysis_id>"
path_to_dir_in="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/DATA"

path_to_ref="<base_directory>/DATASETS/PHYLOGENOMICS/target_references"
reference_fasta_file="target_reference.FAA"
```

-   for the outputs

```{bash}
step_ID="hybpiper2_extract"
path_to_dir_out="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS/"$analysis_ID"_"$step_ID/;
```

-   temporary directory for local users

```{bash}
path_to_tmp=$path_to_dir_out
```

-   temporary directory for cluster users depends on the cluster manager and its setup (please see with your cluster documentation), e.g. for SLURM on HiperGator:

```{bash}
path_to_tmp=$SLURM_TMPDIR
```

#### Copy all the files in the working directory

Go to working directory, copy the reference file and data related files.

```{bash}
cd $path_to_tmp
scp $path_to_ref/$reference_fasta_file $path_to_tmp
scp $path_to_dir_in/$analysis_ID/namelist_$analysis_ID".txt" $path_to_tmp
mv $path_to_tmp/namelist_$analysis_ID".txt" $path_to_tmp/namelist.txt
scp $path_to_dir_in/$analysis_ID/input_assemblies.txt $path_to_tmp
dos2unix *.txt
```

Copy the assemblies, either manually, or using the `input_assemblies.txt` file.

```{bash}
parallel -j 8 < input_assemblies.txt 
```

#### Assembly statistics (optional)

You can re-compute the assembly statistics and plot heatmap for the samples you're focusing on. These will basically be the same than when run during the assembly step, except that they will only include the new list of samples.

Compute the summary statistics for the exons and supercontigs

```{bash}
hybpiper stats -t_aa $reference_fasta_file --seq_lengths_filename genes_sequences_lengths --stats_filename hybpiper_genes_statistics gene namelist.txt
hybpiper stats -t_aa $reference_fasta_file --seq_lengths_filename supercontigs_sequences_lengths --stats_filename hybpiper_supercontigs_statistics supercontig namelist.txt
```

Visualize the summary statistics.

```{bash}
hybpiper recovery_heatmap --heatmap_dpi 300 --heatmap_filetype pdf --heatmap_filename recovery_heatmap_exons genes_sequences_lengths.tsv
hybpiper recovery_heatmap --heatmap_dpi 300 --heatmap_filetype pdf --heatmap_filename recovery_heatmap_supercontigs supercontigs_sequences_lengths.tsv
```

## Sequences extraction

Go to working directory.

```{bash}
cd $path_to_tmp
```

To recover the exon sequences, first create a subdirectory, then run the extraction command

```{bash}
mkdir retrieved_exons
hybpiper retrieve_sequences dna -t_aa $reference_fasta_file --sample_names namelist.txt --fasta_dir retrieved_exons
```

To recover the supercontigs sequences, first create a subdirectory, then run the extraction command

```{bash}
mkdir retrieved_supercontigs
hybpiper retrieve_sequences supercontig -t_aa $reference_fasta_file --sample_names namelist.txt --fasta_dir retrieved_supercontigs
```

To recover the introns sequences, first create a subdirectory, then run the extraction command

```{bash}
mkdir retrieved_introns
hybpiper retrieve_sequences intron -t_aa $reference_fasta_file --sample_names namelist.txt --fasta_dir retrieved_introns
```

To recover the exons sequences in amino-acids, first create a subdirectory, then run the extraction command

```{bash}
mkdir retrieved_aa
hybpiper retrieve_sequences aa -t_aa $reference_fasta_file --sample_names namelist.txt --fasta_dir retrieved_aa
```

## Post-HybPiper files formatting

The .fasta files containing the retrieved sequences have to be slightly modified to be compatible with downstream analyses. In particular, HybPiper incorporates additional information in the sequences header that we don't need downstream.

-   for exons sequences

```{bash}
cd $path_to_tmp/retrieved_exons/
```

Create a subdirectory to contain the formatted fastas, and copy (not move) the original fastas to this subdirectory.

```{bash}
mkdir formatted_fastas
cp *.FNA formatted_fastas
```

Remove empty files

```{bash}
cd formatted_fastas
find . -type f -empty -delete
```

Write the sequence on a single line instead of wrapping the sequences text on several lines. New files `.FNA.oneline` are created.

```{bash}
ls -1 ./ | \
while read sample; \
do 
	cat $sample | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' > $sample.oneline
done
```

Remove the `.FNA` files and rename the `.FNA.oneline` to `.FNA`.

```{bash}
rm *.FNA
rename '.oneline' '' *
```

Removes everything after a space in the headers (i.e. keep only the sample name).

```{bash}
sed -i '/^>/s/[[:space:]].*//g' * # removes everything after a space in lines begining with >
```

Remove empty lines in the files.

```{bash}
sed -i '/^$/d' * #removes empty lines
```

-   same for supercontigs

```{bash}
cd $path_to_tmp/retrieved_supercontigs/
mkdir formatted_fastas
cp *.fasta formatted_fastas
cd formatted_fastas
find . -type f -empty -delete
ls -1 ./ | \
while read sample; \
do 
	cat $sample | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' > $sample.oneline
done
rm *.fasta
rename '.fasta.oneline' '.FNA' *
sed -i '/^>/s/[[:space:]].*//g' *
sed -i '/^$/d' *
```

-   same can be done for introns sequences and amino-acids sequences if needed.

## Transfer to output directory

(only in cases the temporary directory is different from the output directory)

Transfer the extracted sequences in their respective subdirectory to the output directory.

```{bash}
mkdir $path_to_dir_out
scp -rp $path_to_tmp/* $path_to_dir_out/
```

If your running on a cluster, remove the temporary directory with the following command (on some clusters, e.g. HiperGator, it is automatically removed at the end of the job).\
Do not remove the temporary directory if your temporary directory and your output directory are the same (local users).

```{bash}
# rm -r $path_to_tmp
```