forked from LUMC/ribosome-profiling-analysis-framework
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME
851 lines (699 loc) · 42.2 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
RIBOSOME PROFILING ANALYSIS FRAMEWORK; de Klerk E., Fokkema I.F.A.C et al (2015).
FOOTPRINT ALIGNMENT, TRIPLET PERIODICITY ANALYSIS, PEAK CALLING.
================================================================================
This work is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International License. To view a
copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/
or send a letter to:
Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
The latest version of the scripts and this README file are available at:
http://lumc.github.io/ribosome-profiling-analysis-framework/
================================================================================
WORKFLOW
================================================================================
This framework has been developed on Linux and tested on various Linux systems.
It may run on Windows, but this is not supported. Several provided examples for
running the scripts on a large number of files are written in Bash and will
therefore not run on Windows without additional software, like Cygwin.
This framework requires the installation of PHP-CLI (Command Line Interface),
available from http://php.net/.
This framework has been developed and tested using PHP 5.4 and up, but may work
with earlier versions.
OVERVIEW OF THE FRAMEWORK
--------------------------------------------------------------------------------
- Alignment of ribosome footprint reads
- Analysis of read length distribution of ribosome footprints
- Creating wiggle files with genomic coordinates from transcriptome alignment
- Merging wiggle files from transcriptome and genome alignments
- Retrieving coordinates relative to the annotated open reading frames (ORFs)
using Mutalyzer
- Analysis of triplet periodicity
- Merging biological replicates prior to peak calling (identification of
translation initiation sites (TISs))
- Peak calling: identification of primary ORFs, alternative ORFs and upstream
ORFs
- Merging result files
- Analysis of switches in TIS usage
- TIS location and motif analysis
SAMPLE NAMING
--------------------------------------------------------------------------------
Throughout the documentation of this framework, we provide example commands.
The examples provided here are from the study of de Klerk E., Fokkema I.F.A.C.
et al. (2015), in which two different treatments were performed (harringtonin
and cycloheximide) in two different timepoints, with three biological
replicates. Different treatments or timepoints are named A to Z, with replicates
named in numbers (1 to 9). We recommend using similar, or equal, sample naming.
This framework will read the sample names from the beginning of the file names,
if separated with an underscore or period.
Examples of possible FASTQ file names:
A1_pool7213_TATAT_L002_R1_001.fastq (sample name: A1)
D3.fastq (sample name: D3)
I2_ACACAC_L008_R1_001.fastq (sample name: I2)
REFERENCE SEQUENCES
--------------------------------------------------------------------------------
This framework is based on the Mus Musculus mm10 genome assembly but it can also
be adapted for the Homo Sapiens hg19 and hg38 genome assemblies. Other
assemblies are currently not supported due to their absence in the Mutalyzer
tool, which is used to translate genomic coordinates into transcriptomic
coordinates and vice versa. See https://www.mutalyzer.nl for more information.
The transcriptome reference sequence was downloaded from
ftp://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.rna.fna.gz
Last modified date: 2013-05-08.
If mouse.rna.fna.gz is not available, you can download the separate files (named
mouse.1.rna.fna.gz, mouse.2.rna.fna.gz and so on) and combine them using bash
with:
for file in mouse.?.rna.fna.gz; do cat $file >> mouse.rna.fna.gz; done
The genome reference sequence was downloaded from
ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/
Last modified date: 2012-02-09.
ALIGNMENT OF RIBOSOME FOOTPRINT READS
--------------------------------------------------------------------------------
To prevent loss of reads or misalignment of short reads (30 nt) adjacent or
spanning an exon-exon junction, reads are first mapped to the transcriptome
reference, and unaligned reads are subsequently mapped to the genome reference.
For more information about the alignment method, please see de Klerk E., Fokkema
I.F.A.C. et al (2015).
1) The SAM file obtained from the transcriptome alignment reports positions of
mapped reads relative to the start of the transcript. To be able to merge the
results of the genome and the transcriptome alignment these positions need to be
converted to genomic positions. Transcriptome sequences do not always properly
align to the genome, for instance due to the presence of insertions, deletions,
chimeric alignments, or they align to random chromosomes. Because the
coordinates of the reads mapped to the transcriptome reference cannot be
correctly converted to genomic coordinates for transcripts that do not properly
align, first the transcriptome reference needs to be filtered for these
transcripts. For the remaining transcripts, the
mm10_transcript_positions_create.php script is used to create a file that
indicates the location of the exons of the transcripts on the genome.
For more information on this script and the alignment, please see the help file:
help/mm10_transcript_positions_create.txt.
Usage:
./mm10_transcript_positions_create.php TRANSCRIPTOME_ALIGNMENT_SAM_FILE
Produces:
- mm10_transcript_positions.txt
This file contains the genomic coordinates of all exons for transcripts that
align to the genome, excluding random chromosomes, without insertions or
deletions or chimeric alignment. These can therefore be used for the
transcriptome alignment.
- transcriptome_alignment_unsupported_transcripts.txt
Contains all unsupported transcripts and the reasons for rejection.
2) The unsupported transcripts are then removed from the transcriptome
reference, using the following bash code:
cut -f 1 transcriptome_alignment_unsupported_transcripts.txt | grep -v '^#' \
> transcriptome_alignment_unsupported_transcripts_list.txt
tr '\n' '+' < mouse.rna.fna | sed 's/+>/\n>/g' \
| grep -vFf transcriptome_alignment_unsupported_transcripts_list.txt \
| tr '+' '\n' > mouse.rna.fna.no_unsupported.fa
The resulting file (mouse.rna.fna.no_unsupported.fa) is then used to create a
Bowtie index file suitable for alignment using bowtie.
3) Before alignment, the adapter sequences need to be removed from the reads.
There are various tools available which do this, such as cutadapt.
Make sure that after this step, your files are named *.trunc.
4) The following set of commands runs Bowtie to align the reads to the
transcriptome reference, store the unaligned reads, and finally align the
unaligned reads to the genome reference.
For the alignments on the transcriptome, only the forward mapped reads are
selected, and then the SAM files are filtered, requiring a minimum of 25
nucleotides per read. SAM files obtained from the alignment on the genome are
also filtered, requiring a minimum of 25 nucleotides per read. SAM files from
the genome alignment are converted to BAM files, which are then sorted and
converted into mpileup files. The mpileup files are then converted to 5' end
wiggle files, one for each strand. For the creation of 5' end wiggle files from
the genome alignment, use the SageWiggle.py script included in this framework.
In the commands below, replace "mouse.rna.fa.no_unsupported.fa.bowtieindex" with
the correct location of the bowtie index file of your transcriptome reference
sequence file, filtered for unsupported transcripts (see step 2).
Replace "mm10.reference.fa.bowtieindex" with the correct location of the
bowtie index file of your genome reference sequence file.
for file in *.trunc; do bowtie -k 1 -m 20 -n 1 --best --strata -p 8 -S --chunkmbs 2000 --norc --un "${file}.transcriptome_unaligned" mouse.rna.fa.no_unsupported.fa.bowtieindex "$file" "${file}.transcriptome_aligned.sam" 2>> results.transcriptome_alignment.txt; done;
for file in *.transcriptome_unaligned; do bowtie -k 1 -m 2 -n 1 -p 8 -S --best --strata --chunkmbs 2000 mm10.reference.fa.bowtieindex "$file" "${file}.genome_aligned.sam" 2>> results.transcriptome_unaligned.genome_alignment_mm10.txt; done;
for file in *.transcriptome_unaligned.genome_aligned.sam; do echo $file; head -n 100 "$file" | grep "^@" > "${file}.M25.sam"; awk '($6 > 25M)' "$file" >> "${file}.M25.sam"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam; do echo $file; samtools view -bS "$file" > "${file}.bam"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam; do echo $file; samtools sort "$file" "${file}.sorted"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam.sorted.bam; do echo $file; samtools index "$file"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam.sorted.bam; do echo $file; samtools mpileup -d 10000000 "$file" > "${file}.mpileup"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam.sorted.bam.mpileup; do echo $file; python sageWiggle.py -i "$file" -o "${file}.F.wig5" "${file}.R.wig5"; done;
for file in *.transcriptome_aligned.sam; do echo $file; awk '($2 == 0)' "$file" > "${file}.mappedforward.sam"; done;
for file in *.transcriptome_aligned.sam.mappedforward.sam; do echo $file; awk '($6 > 25M)' "$file" > "${file}.M25.sam"; done;
for file in *.transcriptome_aligned.sam.mappedforward.sam.M25.sam; do echo $file; cut -f 1,3,4 "$file" > "${file}.3col.sam"; done
5) Multiple scripts have transcript names but need either strand or gene
information, or vice versa. To provide this, download a file from the UCSC table
browser, listing the transcripts, the strand and the genes.
To build this list:
- Open the UCSC table browser: http://genome.ucsc.edu/cgi-bin/hgTables
- Select genome, assembly, track: "RefSeq Genes".
- Table: "refGene", output format: "selected fields from primary and related
tables".
- Click "get output".
- On this next page, select "name", "strand" and "name2" in the field list, and
click "get output".
- Save in a file and sort. On linux, this can be done with the sort command.
- In this README, this file is referred to as mm10_gene_list.txt.
ANALYSIS OF READ LENGTH DISTRIBUTION OF RIBOSOME FOOTPRINTS
--------------------------------------------------------------------------------
6) The SAM files generated by the alignment of the ribosome footprint reads can
be used to retreive information on the read length distribution. A simple bash
script that calculates the read length distribution for reads aligning to the
transcriptome (NM and NR separately) and reads aligning to the genome, is
described in help/read_length_distribution.txt.
The read length distribution can also be calculated per gene. For that, use the
get_read_length_per_gene.php script. The script takes gene IDs in an input file
and determines the read length distribution for its transcripts per sample, from
both transcriptome and genome alignment SAM files. It requires the
mm10_gene_list.txt file to see which transcripts belongs to which gene, the
mm10_transcript_positions.txt file to check the strand and the positions of the
transcripts, and the input SAM files.
Usage:
./get_read_length_per_gene.php FILE_WITH_GENES_TO_ANALYZE mm10_gene_list \
mm10_transcript_positions.txt SAM_FILE [SAM_FILE [SAM_FILE [...]]]
Example:
./get_read_length_per_gene.php genes_to_analyze.txt mm10_gene_list.txt \
mm10_transcript_positions.txt *.M25.sam
Produces:
- For each transcript of the given genes, one
gene_transcript_read_length_distribution.txt file, with the read lengths in
the first column, and the number of reads with this length per sample in the
next columns.
- One summary file, ALL_ALL_read_length_distribution.txt, with all read lengths
summed up.
This README will continue describing the steps to follow for triplet periodicity
analysis and ORF analysis (peak calling), regardless of whether the read length
distribution has been analyzed or not.
GROUPING READS FROM SAM FILES OBTAINED FROM TRANSCRIPTOME ALIGNMENT
AND BASIC STATISTICS
--------------------------------------------------------------------------------
7) Mapped reads in the SAM files are first grouped together to generate a file
that sums the coverage per transcript per position.
For this, use the pack_sam_files.php file.
Usage:
./pack_sam_files.php SAM_FILE [SAM_FILE [SAM_FILE [...]]]
Example:
./pack_sam_files.php *.transcriptome_aligned.*3col.sam
Produces:
- For each given SAM file, a SAM.packed file.
Contains the transcript ID, the position relative to the start of the tran-
script, and the total coverage for that position, i.e. the number of reads
aligned starting at this position.
8) The packed SAM files can be used to run some statistics, using
packed_sam2coverage_per_gene.php.
Usage:
./packed_sam2coverage_per_gene.php mm10_gene_list.txt \
PACKED_SAM_FILE [PACKED_SAM_FILE [...]]
Example:
./packed_sam2coverage_per_gene.php mm10_gene_list.txt *.packed
Produces:
- One file which name depends on the given SAM file(s), ending in the suffix
.coverage_per_gene.txt, containing (per SAM file) which transcripts can not
be matched to a gene, listing the top 5 unknown transcripts based on their
coverage, and a table containing coverage per gene per sample. The
genes are sorted on total coverage in all samples together.
Other statistics can be run directly on the packed SAM files, for instance how
many reads are mapped to non-coding reads.
CREATING WIGGLE FILES WITH GENOMIC COORDINATES FROM TRANSCRIPTOME ALIGNMENT
--------------------------------------------------------------------------------
9) The packed SAM files are used to generate wiggle files. For this, use the
packed_sam2wiggle.php script, which needs the transcript positions file created
in 1).
Usage:
./packed_sam2wiggle.php mm10_transcript_positions.txt \
PACKED_SAM_FILE [PACKED_SAM_FILE [PACKED_SAM_FILE [...]]]
Example:
./packed_sam2wiggle.php mm10_transcript_positions.txt *.packed
Produces:
- 4 wiggle files per SAM file;
+ F unfiltered,
+ F filtered (NR and XR removed),
+ R unfiltered,
+ R filtered (NR and XR removed).
Unfiltered 5' wiggle files are used for visualization purposes only, whereas
filtered 5' wiggle files, containing positions of reads mapping only to coding
transcripts, are used for triplet periodicity analysis and peak calling.
Positions mapping to non-coding transcripts won't be processed with this
analysis framework.
MERGING WIGGLE FILES FROM TRANSCRIPTOME AND GENOME ALIGNMENTS
--------------------------------------------------------------------------------
10) The merging of the wiggle files from the alignment on the transcriptome and
the alignment on the genome is done using the "wiggelen" package written by
Martijn Vermaat and Jeroen Laros. The merge_wigglefiles.sh script is a wrapper
around the wiggelen package.
For detailed information, see the help file help/merge_wigglefiles.txt.
Usage:
./merge_wigglefiles.sh
(it should be run in the directory where all the wiggle files reside)
Produces:
- 4 wiggle files per sample, using the transcriptome and genome alignment wiggle
files for each created file:
+ F unfiltered,
+ F filtered (NR and XR removed),
+ R unfiltered,
+ R filtered (NR and XR removed).
Unfiltered 5' wiggle files are used for visualization purposes only, whereas
filtered 5' wiggle files, containing positions of reads mapping only to coding
transcripts, are used for triplet periodicity analysis and peak calling.
Positions mapping to non-coding transcripts won't be processed with this
analysis framework.
RETRIEVING COORDINATES RELATIVE TO THE ANNOTATED OPEN READING FRAMES (ORFs)
USING MUTALYZER
--------------------------------------------------------------------------------
11) To translate the genomic coordinates to coordinates relative to the
annotated Translation Initiation Site (TIS), we use the Mutalyzer service.
Mutalyzer's position converter has a batch-feature that allows for the upload of
large files containing genomic postions. Through email you will be notified when
your batch job is done, after which you can download the result file.
Wiggle files are converted to Mutalyzer position converter batch files using the
wig2batchfile.php script. Please note that since Mutalyzer is a service aimed
at genetic variation, all positions are converted into genetic variants.
Batch files are created for the filtered 5' wiggle files only.
Usage:
./wig2batchfile.php WIGGLE_FILE
Example:
for file in *.filtered.wig5; do ./wig2batchfile.php "$file"; done
Produces:
- One Mutalyzer batch input file.
12) The position converter batch files are manually loaded into Mutalyzer:
https://mutalyzer.nl/batchPositionConverter
or
https://test.mutalyzer.nl/batchPositionConverter
Do not run more than a few files at a time. It will considerably slow down the
process, and will increase the chance of failures, after which you might need to
re-upload your file. Wait until Mutalyzer reports that the file has been
uploaded before uploading another file.
The test installation is slightly faster but there is no guarantee that it is
always online.
13) When done, download the result files from Mutalyzer. If you're not sure
anymore which results file belongs to which batchfile, you can match these files
by using the match_mutalyzer_output_to_batchfile.sh script.
Usage:
./match_mutalyzer_output_to_batchfile.sh \
MUTALYZER_RESULTS_FILE [MUTALYZER_RESULTS_FILE [...]] \
BATCH_FILE [BATCH FILE [...]]
Example:
./match_mutalyzer_output_to_batchfile.sh batch-job-* *mutalyzer_batchfile.txt
This script will rename the Mutalyzer results files according to the batch input
files.
ANALYSIS OF TRIPLET PERIODICITY
--------------------------------------------------------------------------------
14) The analyze_triplet_periodicity.php script reports the number of reads
relative to the annotated start codon. A standard feature of ribosome profiling
data is the presence of a major peak at position -12 relative to the annotated
start codon (harringtonin-treated samples), and an higher amount of reads whose
5' ends map to the first nucleotide of a codon (cycloheximide-treated samples).
In the output file of the analyze_triplet_periodicity.php script, numbers in
positions %1, %2 and %3 represent the sum of reads whose 5' end mapped to the
first, second and third nucleotide of the annotated codons, respectively.
Usage:
./analyze_triplet_periodicity.php MUTALYZER_RESULTS_FILE WIGGLE_FILE \
mm10_gene_list.txt STRAND
(Valid values for STRAND: +, -, F and R)
Example:
./analyze_triplet_periodicity.php \
A1.merged_wiggle.F.filtered.wig5_mutalyzer_batchfile_results.txt \
A1.merged_wiggle.F.filtered.wig5 mm10_gene_list.txt F \
> A1.merged_wiggle.F.filtered.wig5.periodicity.txt
Produces:
- Direct output to the screen (which should then be saved to a file if needed)
with detailed statistics about the mapping of the positions to the
transcriptome by Mutalyzer, followed by the list of positions relative to the
TIS, and the total coverage for each position. Of the positions in the coding
region, only the first 6 are shown individually. All positions in the coding
region, including the positions up to 15 bases upstream of the coding region,
are grouped by their location in the annotated codons (first, second and third
base), and displayed as positions %1, %2 and %3 respectively, with the summed
coverages to assess the overall triplet periodicity.
Positions starting with an asterisk (*) are positions in the 3' UTR. The given
number is the relative distance to the stop codon: *1 is the first base after
the stop codon.
The script does the following things:
+ Read out the coverage information from the wiggle file.
+ Read the gene list to find out which genes (transcripts, actually) are on
which strand.
+ Read the Mutalyzer result file to see which locations map to which
transcripts, determining the position within the transcript.
* Filter low coverage positions (< 3).
* Mappings on non-coding transcripts are ignored.
* Mappings on transcripts on the other strand are ignored.
* Transcripts not in the gene list are ignored and reported.
* Filter positions without transcript mapping.
* Filter positions with only intronic mapping.
* Filter positions too far from translation start or stop sites (> 500bp).
* The remaining possible mappings are processed.
Please note that the "coding region" is defined as the region starting at
-15 nucleotides from the ATG to the end of the open reading frame.
Please note that this script does not see the difference between the UTR and
the intergenic sequence. Both are referred to as "UTR mappings".
- 5' UTR or 3' UTR mappings, while mappings to the coding region are also
available, are discarded.
- If there are 5' UTR and 3' UTR mappings, but none in the coding
region, we assume an intergenic situation and if one of the two
positions is clearly closer to the coding region than the
other (> 100bp closer), then the closest position is picked.
If not, the position is discarded, and reported to the user.
- After this, all mappings are counted by their coverage and
reported.
MERGING BIOLOGICAL REPLICATES PRIOR TO PEAK CALLING
(IDENTIFICATION OF TRANSLATION INITIATION SITES (TISs))
--------------------------------------------------------------------------------
15) To increase the statistical power and to lower background signal, merge the
biological replicates of the harringtonin-treated samples. Both the wiggle files
as well as the Mutalyzer result files, need to be merged.
Examples, assuming A and C are harringtonin samples:
Merging the wiggle files:
for sample in A C;
do
for strand in F R;
do
wiggelen merge ${sample}?.merged_wiggle.${strand}.filtered.wig5 \
| sed 's/^\([0-9]\+\) \([0-9]\+\)/\1\t\2/' \
> merged_${sample}.merged_wiggle.${strand}.filtered.wig5;
rm ${sample}?.merged_wiggle.${strand}.filtered.wig5.idx;
done
done
Merging the Mutalyzer files:
for sample in A C;
do
for strand in F R;
do
head -n 1 ${sample}1.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt \
> merged_${sample}.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt
grep -h "^chr" ${sample}?.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt \
| sort -g | uniq \
>> merged_${sample}.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt
done
done
The order of the variants in the Mutalyzer results file is sorted differently,
but this has no effect on the following analysis results.
PEAK CALLING: IDENTIFICATION OF PRIMARY ORFs, ALTERNATIVE ORFs and UPSTREAM ORFs
--------------------------------------------------------------------------------
16) The find_ORFs.php script is based on a local dynamic peak calling algorithm,
which tries to identify peaks (TISs) relative to the coverage of the surrounding
region.
NOTE: Run this script on the merged replicates of harringtonin samples, only.
Usage:
./find_ORFs.php MUTALYZER_RESULTS WIGGLE_FILE mm10_gene_list.txt STRAND
(Valid values for STRAND: +, -, F and R)
Example:
./find_ORFs.php \
merged_A.merged_wiggle.F.filtered.wig5_mutalyzer_batchfile_results.txt \
merged_A.merged_wiggle.F.filtered.wig5 mm10_gene_list.txt F
Produces:
- *.ORF_analysis_results_stats.txt
Statistics on the ORF finding run; information on the loaded gene list file
and wiggle file, mentions the number of positions left out on each filtering
step, and mentions the number of genes left with found translation initiation
sites.
- *.ORF_analysis_results.txt
The results file shows, per gene, the number of positions found, the number
actually analyzed as candidate peaks, and the number of TISs found. These are
then reported with the genomic position, the coverage, and per transcript, the
position relative to the annotated TIS on the transcript.
- *.ORF_analysis_results_after_cutoff.txt
This file has the same format of the results file, but shows only TISs found
in the annotated coding region, at least 5000 bases from the annotated TIS.
(the value of 5000 can be configured in the script)
For more information on this cutoff, please see de Klerk E., Fokkema I.F.A.C.
et al (2015).
The script does the following things:
+ Read out the coverage information from the wiggle file
+ Read the gene list to find out which genes (transcripts, actually) are on
which strand
+ Read the Mutalyzer result file to see which locations map to which
transcripts, determining the position within the transcript
* Filter low coverage positions (< 3)
* Mappings on non-coding transcripts are ignored
* Mappings on transcripts on the other strand are ignored
* Transcripts not in the gene list are ignored and reported (so they could be
fixed == added to the gene info file)
* Filter positions without transcript mapping
* Filter positions with only intronic mapping
* Filter positions too far from translation start or stop sites (> 500bp)
* The remaining possible mappings are processed.
Please note that the "coding region" is defined as the region starting at
-15 nucleotides from the ATG to the end of the open reading frame.
Please note that this script does not see the difference between the UTR and
the intergenic sequence. Both are referred to as "UTR mappings".
- 3' UTR mappings, while mappings to the coding region or in the 5' UTR are
also available, are discarded.
- After this, all positions are stored, per gene, for the next step.
+ Peaks are searched for by walking through the positions, starting upstream.
Each position, with at least a coverage of 20, is analyzed.
(Please note, that 20 has been chosen for analysis of merged biological
replicates. Otherwise, 10 was used.)
* Its coverage should be higher than that of the positions located 3, 6, 9, 12
and 15 nucleotides upstream of it.
* Its coverage should be at least as high as that of the positions located 1
and 2 nucleotides downstream of it.
* It should show a triplet periodicity and a clear "harringtonin pattern":
- The first nucleotide of the analyzed codon should have a coverage of at
least 60% of the total coverage of that codon (configurable setting).
- The following 5 codons should also not have a higher maximum coverage than
this position's coverage.
- If any of the following 5 codons have a maximum coverage higher than 10%
of the coverage of the position we're analyzing, that codon must not show
a conflicting triplet periodicity pattern (see first two points above).
* If all these rules apply, the position is stored as a possible TIS.
+ Per gene, from all its found possible TISs, we will take the one with the
highest coverage as a reference, and discard any other candidate TISs that do
not have at least a coverage (on that position) of 10% of the reference
(highest candidate).
+ All remaining candidate TISs will be reported.
+ Genes that have no candidate peaks left, are ignored.
+ For all genes remaining, the positions of the candidate TISs are reported
relative to all known transcripts of the gene in question. The original number
of positions with high coverage (>10) is mentioned along with the number of
remaining candidate TISs.
Results are displayed as: genomic position, coverage, position on transcript.
The last column may be repeated, depending on the number of known transcripts.
+ Additionally, a separate file reports all candidate peaks found at a position
higher than 5Kb from a known TIS, so they can easily be checked more
thoroughly if they are false positives.
MERGING RESULT FILES
--------------------------------------------------------------------------------
17) To be able to statistically compare the possible switches in TIS usage, the
results of the ORF analyses need to be combined into one file, per strand (F and
R). The merge_ORF_files.php script merges the files into one summary file, but
does not take care of the strands, therefore make sure you don't mix forward and
reverse files with each other.
Usage:
./merge_ORF_files.php ORF_FILE [ORF_FILE [ORF_FILE [...]]]
Example:
./merge_ORF_files.php *.F.*.ORF_analysis_results.txt
Produces:
- One file which name depends on the given input files, ending in the suffix
.merged_ORF_analyses.txt. The format is described below.
The script analyzes the file names to isolate the sample IDs. It assumes that
the sample IDs are the only differences between the input file names. The sample
IDs are also used in the headers of the output.
While grouping, it ignores the positions of the peaks on the transcripts. It
reports the chromosomal position, the coverage per sample, and the gene name.
Every peak that has been called in at least one sample, is displayed.
NOTE: If the original wiggle files of the individual biological replicates are
present in the same directory, these are all read out while merging the ORF
analysis result files, and the resulting file will report the individual
coverages of all biological replicates, per TIS peak.
If the wiggle files are not found, the coverages are taken from the ORF analysis
result files, which means when a coverage of 0 is reported, it simply means that
the position was not recognized as a TIS in that sample. The actual measured
coverage in that sample may not be 0.
Example output:
# Chromosome Position A1 A2 A3 C1 C2 C3 Gene
chr1 4857920 32 14 27 6 3 10 Tcea1
chr1 4857929 24 6 24 2 4 6 Tcea1
ANALYSIS OF SWITCHES IN TIS USAGE
--------------------------------------------------------------------------------
The output file of the merge_ORF_files.php script is used to test switches in
TIS usage. For this, the lme4.0 R package is used.
The input file for the following R code is a tab delimited file. The first line
containins a list of all sample names (in the below example, 6), separated by a
tab. All following lines contain as first column the gene symbols and then, per
sample, the actual individual coverages. The file is imported into R as a
dataframe, containing 6 columns and n rows (n=number of called peaks).
Example input for R:
A1 A2 A3 C1 C2 C3
Tcea1 32 14 27 6 3 10
Tcea1 24 6 24 2 4 6
Below is the R code used for the analysis.
library(lme4.0)
load("TIS_analysis.Rdata")
data_samples<-data_small[,2:7]
Myotubes <- grepl("C", names(data_samples))
coeff_summary_fit_col3_col4 <- list()
for(i in 1: length(list_data.multinom))
{
print(i)
mydata <- list_data.multinom [[i]]
loc <- mydata[,1]
N <- nrow( mydata)
M <- 6
weight <- colSums(mydata[,2:7])
if ((!weight[1] && !weight[2] && !weight[3]) || (!weight[4] && !weight[5] && !weight[6])) {
print("No coverage available in at least one condition")
next
}
dat <- as.vector(as.matrix(mydata[,2:7]))
loc_all <- rep(loc, M)
Myotubes_all <- rep(Myotubes, each=N)
weight_all <- rep(weight, each=N)
subj_all <- factor(rep(colnames(mydata[,2:7]), each=N))
fit<- lmer( dat/weight_all ~ 0 + Myotubes_all*loc_all - Myotubes_all + (loc_all|subj_all), family=binomial(), weight=weight_all)
fit0 <- lmer( dat/weight_all ~ 0+loc_all + (loc_all|subj_all), family=binomial(), weight=weight_all)
coeff_summary_fit_col3_col4[[i]] <- coef(summary(fit))[(1+nrow(coef(summary(fit)))/2):nrow(coef(summary(fit))),3:4]
}
sink("coeff_summary_fit_col3_col4.csv")
coeff_summary_fit_col3_col4
sink()
TIS LOCATION AND MOTIF ANALYSIS
--------------------------------------------------------------------------------
18) The generate_stats_peaks_per_location.php script is used to (i) categorize
TISs based on their location relative to different gene regions, (ii) report
codon motifs, (iii) report the sequence of potential uORFs (sequence between
each TIS located in the 5'UTR and the annotated TIS. The script uses the ORF
analysis results files of the merged replicates. It takes all ORF analysis
results files, both the results after the cutoff as the ones before, and sums up
the number of peaks found per location: 5' UTR, annotated TIS, coding, 3' UTR,
or multiple.
"Multiple" means that the TIS location was mapping to multiple transcripts, and
in at least two of these, the location was different. An exception here is when
one of the locations where the TIS maps to, is "annotated_TIS". Then, it is
assumed that the other transcripts with other positions are not the translated
transcripts.
Note that to determine the locations of the reads, the positions that are given
by the find_ORFs.php script have to be increased by 12 bases, since these are
the 5' ends of the reads. This means that positions -12 through -10 are counted
as annotated TIS. Lower numbers are counted as 5'UTR, while higher numbers
(> -10) are counted as coding. Positions starting with an asterisk (*) are
counted as 3'UTR.
This also means that read start positions at the end of the coding region, less
than 12 bp away from the 3'UTR, are counted as coding while in fact in reality
they represent a read that lies in the 3'UTR. This can not be detected however,
because we don't know the length of the coding region of the transcripts.
To be able to report the TIS codon motifs and the sequence between each 5'UTR
TIS and the annotated TIS, this script downloads the RefSeq sequences of the
transcripts that the read was aligned to, automatically from the NCBI website
using the URL format:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NM_007386.2.gb&rettype=gb
To prevent repeated downloads on successive runs of the script, it requires an
"NM cache" directory where all downloaded sequences are stored and can be re-
used by successive runs. The name of the directory is currently configured in
the $_SETT settings array within the script. If the directory is not found or
not writable, the script will complain and refuse execution. When that happens,
check if the directory can be made writable. Otherwise create an empty directory
and set its name in the settings in the script source code.
Usage:
./generate_stats_peaks_per_location.php ORF_FILE [ORF_FILE [ORF_FILE [...]]]
Example:
./generate_stats_peaks_per_location.php *ORF_analysis_results*
Produces:
- *.ORF_analysis_results.stats_peaks_per_location.txt (one per sample)
Reports the total number of TISs found as well as the total coverage of found
TISs per gene region (5'UTR, annotated TIS, coding, 3'UTR, multiple). Both are
reported since a high number of identified TISs may not correlate to also
having high coverage in this region.
The statistics are reported separately for all peaks before the cutoff and all
peaks including the ones after the cutoff.
- *.ORF_analysis_results.peaks_classification.txt (one per sample)
Reports all found TISs before the set cutoff, sorted on category, strand and
genomic position. Other fields reported are the "real" genomic position of the
TIS, calculated by adding (or substracting, for the negative strand) an offset
of 12 nucleotides to the 5' read genomic positions, the gene, the summed
coverage of the replicates in that position, the transcript RefSeq ID (only
one reported), the position on that transcript including the position with the
offset of 12 applied, optional status messages, and the sequence of the codon
at the reported TIS.
For TISs reported in the category "multiple", the information in the last four
columns is missing.
In case the RefSeq file has no CDS annotated, instead of the motif sequence
the error "could_not_parse_CDS" is given.
In case the RefSeq file has no 5'UTR annotated (CDS starts at position 1), or
in case the RefSeq file has not enough 5'UTR annotated to fetch the full motif
sequence (CDS starts at a position smaller than the distance of the TIS to the
annotated TIS), the status is set to "no_5UTR" or "unannotated_5UTR",
respectively, and a "slice" of the genomic sequence is downloaded. This slice
starts from the corrected genomic position, and 75000 bases downstream. From
this file, this script attempts to find the correct mRNA and CDS tags, takes
the sequence until the annotated CDS, removing any annotated introns before
the CDS. When the mRNA tag can not be found, the status also mentions the code
"no_mRNA_definition". The motif is still reported, but the output file
described below will not contain the sequence until the annotated TIS.
- *.ORF_analysis_results.peaks_classification_5UTR.txt (one per sample)
Reports all TISs found in the 5'UTR, sorted on strand and genomic position.
The fields reported are very similar to the fields mentioned in the previous
file, with the following exceptions: the motif field is missing, and the
following two fields are added: the DNA sequence starting at the TIS until the
annotated TIS, and the translation of this sequence to protein sequence. If a
found TIS can be mapped to different transcripts at different distances from
the annotated TIS, all these transcripts are reported on a separate line.
The protein sequence can contain a *, which indicates a stop-codon. A ?
indicates an incomplete or unrecognizable codon. This should happen only if
the sequence is in a different frame compared with the annotated TIS; in this
case the DNA sequence given will end with an incomplete codon and the trans-
lated sequence will end with a ?.
Any messages that might be reported in the status column, are explained above
in the description of the previous file.
The script is checking the names of input files to automatically ignore its own
result files, the statistics files generated by the find_ORFs.php script and
merged ORF files generated by the merge_ORF_files.php script. Input file names
need to contain the strand information ('.F.' or '.R.') and must end with either
'ORF_analysis_results.txt' or 'ORF_analysis_results_after_cutoff.txt'. Files are
grouped by their file name's prefix (until the strand information). As such, per
sample the script can find four files; two forward strand and two reverse
strand; each a file with peaks before cutoff, and one with peaks after the
cutoff.
If files are passed for which the file name is not recognized, they are reported
and the script will stop processing.
19) The result files of the generate_stats_peaks_per_location.php script contain
the motif sequences of every found TIS. The code to analyze the motif sequences
used is below. If needed, change the first line to use the sample names you wish
to analyze.
for sample in A C;
do
echo "Analyzing sample ${sample}:" > merged_${sample}.motif_analysis_results.txt;
for category in 5UTR unannotated_5UTR annotated_TIS coding 3UTR multiple;
do
if [[ $category == '5UTR' ]];
then
TOTAL=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 11 | grep -v '^#' | grep -v '^$' | wc -l`;
TOTAL_COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 6,11 | grep -E "^[0-9]+\s...$" | grep -v '^#' | cut -f 1 | paste -sd+ | bc`;
elif [[ $category == 'unannotated_5UTR' ]];
then
TOTAL=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | wc -l`;
TOTAL_COVERAGE=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s...$" | grep -v '^#' | cut -f 1 | paste -sd+ | bc`;
else
TOTAL=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | wc -l`;
TOTAL_COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s...$" | grep -v '^#' | cut -f 1 | paste -sd+ | bc`;
fi
echo -e "${category}\tTotal TISs:\t${TOTAL}\tTotal coverage:\t${TOTAL_COVERAGE}";
echo -e "Motif\tTISs\tCoverage\tPercTISs\tPercCoverage";
if [[ $category == '5UTR' ]];
then
MOTIFS=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 11 | grep -v '^#' | grep -v '^$' | sort | uniq`;
elif [[ $category == 'unannotated_5UTR' ]];
then
MOTIFS=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | sort | uniq`;
else
MOTIFS=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | sort | uniq`;
fi
for motif in $MOTIFS;
do
if [[ $category == '5UTR' ]];
then
TISs=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 11 | grep "^${motif}$" | wc -l`;
COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 6,11 | grep -E "^[0-9]+\s${motif}$" | cut -f 1 | paste -sd+ | bc`;
elif [[ $category == 'unannotated_5UTR' ]];
then
TISs=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep "^${motif}$" | wc -l`;
COVERAGE=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s${motif}$" | cut -f 1 | paste -sd+ | bc`;
else
TISs=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep "^${motif}$" | wc -l`;
COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s${motif}$" | cut -f 1 | paste -sd+ | bc`;
fi
PERC_TISs=`echo "scale=2; $TISs * 100 / $TOTAL" | bc`;
PERC_COVERAGE=`echo "scale=2; $COVERAGE * 100 / $TOTAL_COVERAGE" | bc`;
echo -e "${motif}\t${TISs}\t${COVERAGE}\t${PERC_TISs}%\t${PERC_COVERAGE}%";
done
done >> merged_${sample}.motif_analysis_results.txt;
done
Produces:
- Per sample, one file containing per TIS category (5UTR, unannotated_5UTR,
annotated_TIS, coding, 3UTR, multiple): the analyzed motif, the number of TISs
with this motif, the total coverage of TISs with this motif, the percentage of
TISs with this motif, and the percentage of reads on TISs with this motif.