-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathHybPiper2.Rmd
497 lines (340 loc) · 19 KB
/
HybPiper2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
---
title: "HybPiper 2"
output:
github_document:
number_sections: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
eval = FALSE)
```
**Author**: [Léo-Paul Dagallier](https://github.com/LPDagallier)
**Last update**: `r format(Sys.Date())`
***
HybPiper uses clean reads to create per samples assemblies and to extract the targeted sequences. See the full [HybPiper documentation](https://github.com/mossmatters/HybPiper/wiki/Full-pipeline-parameters) for more details. And see their [test dataset](https://github.com/mossmatters/HybPiper/raw/master/test_dataset.tar.gz) if you want to practice with a tutorial dataset.
The workflow is divided in 2 major steps:\
- [`hybpiper assemble`](#assembly): for each sample, assembles the targeted loci\
- [`hybpiper extract`](#locus-extraction): for each locus, extract the assembled sequences for the samples
The output from each step can be either stored into a same directory, or into two separate directories.\
Here the output from each step will be stored into separate directories. The practical reason for storing them separately is that you will usually first assemble a bunch of samples (e.g. samples from a complete sequencing plate), and secondly extract the sequences from either only a subset of samples, or from different assembly runs (e.g only some samples from different sequencing plates).\
But if you prefer, you can totally store the outputs from the `assemble` and `extract` steps into a same directory.
The following assumes that:
- We defined a unique identifier `<analysis_ID>` for the analysed data and created a corresponding subfolder in the `DATA` directory (`<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/DATA`; see [here](Notes_for_advanced_wf.md) for details on folder organization and analysis naming)
- The analysis will be run in a temporary directory whose path is `path_to_tmp`
- The names of the output folders are composed of the unique identifier `<analysis_ID>` and of a step identifier (`hybpiper2_assemble`, `hybpiper2_extract`, etc.). The output folders will be stored in the `<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS/` directory.
Note: if you are running HybPiper in local (i.e. not on a cluster), you can directly run HybPiper in the output directory (i.e temporary directory and output directory are the same).
# Assembly {#assembly}
:point_right: :computer: See the [script for local use](PHYLOGENY_RECONSTRUCTION/SCRIPTS_local/hybpiper2_assemble.sh)\
:point_right: :woman_technologist: See the [script for cluster (SLURM) use](PHYLOGENY_RECONSTRUCTION/SCRIPTS_cluster/hybpiper2_assemble_TEMPLATE.sh).
## Preparation
As input, HybPiper needs:
- the list of the samples to analyse (`namelist_<analysis_ID>.txt`).There must not be any duplicate.\
- the target file `targetfile.fasta`\
- the clean reads for each sample (R1 and R2 .fastq). They can be either copied manually, or you can create the following text file to batch copy the .fastq files:
- [`input_fastq.txt`](example_files/input_fastq.txt): contains the copy commands (`scp`) to transfer the clean .fastq R1 and R2 files from their storage directory to the working directory. Each line contains a `scp`command specific to a single sample, e.g. `scp <path_to_clean_reads_storage>/Sample1_CLEAN* .`
- the .fastq file names have to match the names in `namelist_<analysis_ID>.txt`
- you can use [`files_renaming.txt`](example_files/files_renaming.txt) to batch rename the input .fastq files. Each line contains a `rename` command specific to a single sample, e.g. `rename -v 'Sample1_CLEAN' 'Sample1' *`
Note that `input_fastq.txt` and `files_renaming.txt` can be easily created using a simple spreadsheet (see 3 last columns in the [example spreadsheet](example_files/spreadsheet_example.xlsx))
`namelist_<analysis_ID>.txt`, `input_fastq.txt`, and `files_renaming.txt` are stored in the `DATA/<analysis_ID>` directory.
### Define the paths and variables
- for the inputs:
```{bash}
analysis_ID="my_analysis_ID"
path_to_dir_in="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/DATA";
path_to_ref="<base_directory>/DATASETS/PHYLOGENOMICS/target_references"
reference_fasta_file="target_reference.FAA"
```
- for the outputs:
```{bash}
step_ID="hybpiper2_assemble"
path_to_dir_out="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS/$analysis_ID"_"$step_ID"/";
```
**Note.** Since the release of [HybPiper 2.1.3](https://github.com/mossmatters/HybPiper/blob/master/change_log.md), the parameter `--hybpiper_output` or `-o` allows you to specify a directory where all the `hybpiper assemble` outputs will be stored. A strategy can be to store all the assemblies in a same directory (easier if you want to use only a subset of the samples or if you want to combine different assembly runs).
- temporary directory for local users:
```{bash}
path_to_tmp=$path_to_dir_out
```
- temporary directory for cluster users: depends on the cluster manager and its setup (please see with your cluster documentation), e.g. for SLURM on HiperGator:
```{bash}
path_to_tmp=$SLURM_TMPDIR
```
### Copy all the files in the working directory
Go to working directory.
```{bash}
cd $path_to_tmp
```
Copy the reference file.
```{bash}
scp $path_to_ref/$reference_fasta_file $path_to_tmp
```
Copy the data related files.
```{bash}
scp $path_to_dir_in/$analysis_ID/namelist_$analysis_ID".txt" $path_to_tmp
scp $path_to_dir_in/$analysis_ID/input_fastq.txt $path_to_tmp
scp $path_to_dir_in/$analysis_ID/files_renaming.txt $path_to_tmp
```
Text files written on Windows systems have a 'carriage return' (CR, \\r) and a 'line feed' (LF, \\f) at the end of each line. This is incompatible (in some situations) with Unix-based systems (Linux and MacOS) files, that only have a LF.\
Thus, if you wrote your text files on Windows system, you need to run to following command to make them compatible with Linux.
```{bash}
dos2unix *.txt
```
Copy the .fastq files, either manually or using the `input_fastq.txt` file.
```{bash}
sh input_fastq.txt
```
Each line in `input_fastq.txt` will be run at a time. To copy several samples in parallel (here, 8 in parallel), use:
```{bash}
parallel -j 8 < input_fastq.txt
```
If your .fastq files are in .fastq.gz (compressed .fastq), you need to decompress them. One by one:
```{bash}
gunzip *.gz
```
Or in parallel:
```{bash}
ls *.gz | parallel -j 8 gunzip
```
Rename you .fastq files:
```{bash}
sh files_renaming.txt
```
## `hybpiper assemble`
Go to working directory.
```{bash}
cd $path_to_tmp
```
Rename `namelist_<analysis_ID>.txt` to `namelist.txt`.
```{bash}
mv $path_to_tmp/namelist_$analysis_ID".txt" $path_to_tmp/namelist.txt
```
Run `hybpiper assemble`.
```{bash}
while read name;
do hybpiper assemble -t_aa $reference_fasta_file -r $name*.fastq --prefix $name --cpu 8 --diamond --run_intronerate;
done < namelist.txt
```
Here, the assembly is run with an amino-acid reference and the Diamond aligner. The reference has to be in amino-acids and specified with `-t_aa` (note: if target reference is in nucleotide and specified with `-t_dna`, it will automatically be translated)
For the Melastomataceae probe set, I recommend to add the `--no_stitched_contig` parameter.
See the [full list of parameters](https://github.com/mossmatters/HybPiper/wiki/Full-pipeline-parameters#10-hybpiper-assemble) for more details.
### Faster `hybpiper assemble`
After different trials, I found out that running `hybpiper assemble` in parallel was faster. For example, on 8 CPUs, running 4 `hybpiper assemble --cpu 2` in parallel is faster than one single `hybpiper assemble --cpu 8`.
Initialize an empty text file that will contain all the commands to run in parallel.
```{bash}
rm hybpiper_parallel.txt
touch hybpiper_parallel.txt
```
Write the commands to be run in parallel in the file, 1 command per line.
```{bash}
while read name;
do
echo "hybpiper assemble -t_aa $reference_fasta_file -r $path_to_fastq/$name"_"*.fastq --prefix $name --cpu 2 --diamond --unpaired $path_to_fastq/$name.fastq --no_stitched_contig;" >> hybpiper_parallel.txt
done < namelist.txt
```
Run the commands in the text file on 4 parallel instances.
```{bash}
parallel -j 4 < hybpiper_parallel.txt
```
As an example, running `hybpiper assemble` on 240 samples (64Gb RAM) took the following times:
+-----------------------------+-----------------+----------------+
| Command | CPU time | Wall time |
+=============================+=================+================+
| `hybpiper assemble --cpu 8` | 147h and 6mins | 30h and 40mins |
+-----------------------------+-----------------+----------------+
| `hybpiper assemble --cpu 2` | 147h and 16mins | 21h and 34mins |
| `parallel -j4` | | |
+-----------------------------+-----------------+----------------+
## Summary statistics
Compute the summary statistics for the exons.
```{bash}
hybpiper stats -t_aa $reference_fasta_file --seq_lengths_filename genes_sequences_lengths --stats_filename hybpiper_genes_statistics gene namelist.txt
```
Compute the summary statistics for the supercontigs (exons+introns).
```{bash}
hybpiper stats -t_aa $reference_fasta_file --seq_lengths_filename supercontigs_sequences_lengths --stats_filename hybpiper_supercontigs_statistics supercontig namelist.txt
```
Visualize the summary statistics.
```{bash}
hybpiper recovery_heatmap --heatmap_dpi 300 --heatmap_filetype pdf --heatmap_filename recovery_heatmap_exons genes_sequences_lengths.tsv
hybpiper recovery_heatmap --heatmap_dpi 300 --heatmap_filetype pdf --heatmap_filename recovery_heatmap_supercontigs supercontigs_sequences_lengths.tsv
```
![heatmap example](figures/hybpiper_heatmap_lowres.PNG)
([click here](figures/recovery_heatmap_exons_example.pdf) to see this figure in full resolution)
## Transfer to output directory
(only in cases the temporary directory is different from the output directory)
.fastq files can take up a lot of space so you may want to remove them.
```{bash}
rm $path_to_tmp/*.fastq
```
Transfer the assemblies to the output directory.
```{bash}
mkdir $path_to_dir_out
scp -rp $path_to_tmp/* $path_to_dir_out/
```
If your running on a cluster, remove the temporary directory with the following command (on some clusters (e.g. HiperGator), it is automatically removed at the end of the job).\
Do not remove the temporary directory if your temporary directory and your output directory are the same (local users).
```{bash}
# rm -r $path_to_tmp
```
# Locus extraction {#locus-extraction}
:point_right: :computer: See the [script for local use](PHYLOGENY_RECONSTRUCTION/SCRIPTS_local/hybpiper2_extract.sh)\
:point_right: :woman_technologist: See the [script for cluster (SLURM) use](PHYLOGENY_RECONSTRUCTION/SCRIPTS_cluster/hybpiper2_extract_TEMPLATE.sh).
## Preparation
For extracting the loci sequences, you will usually be in 2 situations:
- the list of samples you want to extract sequences for is ***the same*** as the list of samples you just assembled sequences for (same as previous step)
- the list of samples you want to extract sequences for is ***different*** from the list of samples you just assembled sequences for (different from previous step): this can happen when you want to discard samples that have poor assembly statistics, or when you want to extract the sequences for samples that were assembled in 2 separate assembly runs.
### List of sample is *the same* as in the previous assembly step
To extract the sequences for the samples you assembled in the step before, you can simply define the temporary directory as the output directory of the previous step.
```{bash}
analysis_ID="my_analysis_ID"
path_to_assemble="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS/"$analysis_ID"_hybpiper2_assemble/";
path_to_tmp=$path_to_assemble
reference_fasta_file="target_reference.FAA"
```
### List of sample is *different* as in the previous assembly step
To extract the sequences for samples assembled in several different assembly runs, or for a subset of samples from the step before, you need:
- a new `<analysis_ID>`,
- a new `namelist_<analysis_ID>.txt` with the list of samples you want to extract sequences of
- the different assemblies of the samples you want to extract sequence of. The assemblies folder can either be copied manually, or you can create a text file to copy the assemblies folders:
- [`input_assemblies.txt`](example_files/input_assemblies.txt): contains the `scp` commands to transfer the folders from their respective assembly output directories to the current working directory. Each line contains a command specific to a single sample, e.g. `scp -r path_to_assembly/Sample1 .`
Similarly to `input_fastq.txt` and `files_renaming.txt`, `input_assemblies.txt` can be easily created using a simple spreadsheet.
`namelist_<analysis_ID>.txt` and `input_assemblies.txt` are stored in the `DATA/<analysis_ID>` directory.
**Note.** Since the release of [HybPiper 2.1.3](https://github.com/mossmatters/HybPiper/blob/master/change_log.md), you can store all the assemblies outputs in a same directory thanks to the parameter `--hybpiper_output` or `-o`. You can then directly retrieve the assemblies from this directory for the extraction step (i.e. without copying the assemblies with `input_assemblies.txt`, which can be long).
#### Define the paths and variables
- for the inputs
```{bash}
analysis_ID="<new_analysis_id>"
path_to_dir_in="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/DATA"
path_to_ref="<base_directory>/DATASETS/PHYLOGENOMICS/target_references"
reference_fasta_file="target_reference.FAA"
```
- for the outputs
```{bash}
step_ID="hybpiper2_extract"
path_to_dir_out="<base_directory>/DATA_ANALYSES/PHYLOGENY_RECONSTRUCTION/JOBS_OUTPUTS/"$analysis_ID"_"$step_ID/;
```
- temporary directory for local users
```{bash}
path_to_tmp=$path_to_dir_out
```
- temporary directory for cluster users depends on the cluster manager and its setup (please see with your cluster documentation), e.g. for SLURM on HiperGator:
```{bash}
path_to_tmp=$SLURM_TMPDIR
```
#### Copy all the files in the working directory
Go to working directory, copy the reference file and data related files.
```{bash}
cd $path_to_tmp
scp $path_to_ref/$reference_fasta_file $path_to_tmp
scp $path_to_dir_in/$analysis_ID/namelist_$analysis_ID".txt" $path_to_tmp
mv $path_to_tmp/namelist_$analysis_ID".txt" $path_to_tmp/namelist.txt
scp $path_to_dir_in/$analysis_ID/input_assemblies.txt $path_to_tmp
dos2unix *.txt
```
Copy the assemblies, either manually, or using the `input_assemblies.txt` file.
```{bash}
parallel -j 8 < input_assemblies.txt
```
#### Assembly statistics (optional)
You can re-compute the assembly statistics and plot heatmap for the samples you're focusing on. These will basically be the same than when run during the assembly step, except that they will only include the new list of samples.
Compute the summary statistics for the exons and supercontigs
```{bash}
hybpiper stats -t_aa $reference_fasta_file --seq_lengths_filename genes_sequences_lengths --stats_filename hybpiper_genes_statistics gene namelist.txt
hybpiper stats -t_aa $reference_fasta_file --seq_lengths_filename supercontigs_sequences_lengths --stats_filename hybpiper_supercontigs_statistics supercontig namelist.txt
```
Visualize the summary statistics.
```{bash}
hybpiper recovery_heatmap --heatmap_dpi 300 --heatmap_filetype pdf --heatmap_filename recovery_heatmap_exons genes_sequences_lengths.tsv
hybpiper recovery_heatmap --heatmap_dpi 300 --heatmap_filetype pdf --heatmap_filename recovery_heatmap_supercontigs supercontigs_sequences_lengths.tsv
```
## Sequences extraction
Go to working directory.
```{bash}
cd $path_to_tmp
```
To recover the exon sequences, first create a subdirectory, then run the extraction command
```{bash}
mkdir retrieved_exons
hybpiper retrieve_sequences dna -t_aa $reference_fasta_file --sample_names namelist.txt --fasta_dir retrieved_exons
```
To recover the supercontigs sequences, first create a subdirectory, then run the extraction command
```{bash}
mkdir retrieved_supercontigs
hybpiper retrieve_sequences supercontig -t_aa $reference_fasta_file --sample_names namelist.txt --fasta_dir retrieved_supercontigs
```
To recover the introns sequences, first create a subdirectory, then run the extraction command
```{bash}
mkdir retrieved_introns
hybpiper retrieve_sequences intron -t_aa $reference_fasta_file --sample_names namelist.txt --fasta_dir retrieved_introns
```
To recover the exons sequences in amino-acids, first create a subdirectory, then run the extraction command
```{bash}
mkdir retrieved_aa
hybpiper retrieve_sequences aa -t_aa $reference_fasta_file --sample_names namelist.txt --fasta_dir retrieved_aa
```
## Post-HybPiper files formatting
The .fasta files containing the retrieved sequences have to be slightly modified to be compatible with downstream analyses. In particular, HybPiper incorporates additional information in the sequences header that we don't need downstream.
- for exons sequences
```{bash}
cd $path_to_tmp/retrieved_exons/
```
Create a subdirectory to contain the formatted fastas, and copy (not move) the original fastas to this subdirectory.
```{bash}
mkdir formatted_fastas
cp *.FNA formatted_fastas
```
Remove empty files
```{bash}
cd formatted_fastas
find . -type f -empty -delete
```
Write the sequence on a single line instead of wrapping the sequences text on several lines. New files `.FNA.oneline` are created.
```{bash}
ls -1 ./ | \
while read sample; \
do
cat $sample | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' > $sample.oneline
done
```
Remove the `.FNA` files and rename the `.FNA.oneline` to `.FNA`.
```{bash}
rm *.FNA
rename '.oneline' '' *
```
Removes everything after a space in the headers (i.e. keep only the sample name).
```{bash}
sed -i '/^>/s/[[:space:]].*//g' * # removes everything after a space in lines begining with >
```
Remove empty lines in the files.
```{bash}
sed -i '/^$/d' * #removes empty lines
```
- same for supercontigs
```{bash}
cd $path_to_tmp/retrieved_supercontigs/
mkdir formatted_fastas
cp *.fasta formatted_fastas
cd formatted_fastas
find . -type f -empty -delete
ls -1 ./ | \
while read sample; \
do
cat $sample | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' > $sample.oneline
done
rm *.fasta
rename '.fasta.oneline' '.FNA' *
sed -i '/^>/s/[[:space:]].*//g' *
sed -i '/^$/d' *
```
- same can be done for introns sequences and amino-acids sequences if needed.
## Transfer to output directory
(only in cases the temporary directory is different from the output directory)
Transfer the extracted sequences in their respective subdirectory to the output directory.
```{bash}
mkdir $path_to_dir_out
scp -rp $path_to_tmp/* $path_to_dir_out/
```
If your running on a cluster, remove the temporary directory with the following command (on some clusters, e.g. HiperGator, it is automatically removed at the end of the job).\
Do not remove the temporary directory if your temporary directory and your output directory are the same (local users).
```{bash}
# rm -r $path_to_tmp
```