Merge pull request #62 from hiruna72/dev

Dev
hiruna72 · May 6, 2024 · d52e29a · d52e29a
2 parents d160083 + 5fe7b41
commit d52e29a
Show file tree

Hide file tree

Showing 14 changed files with 44,271 additions and 181 deletions.
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@ squigualiser is a tool to Visualise nanopore raw signal-base alignment.
 
 signals (**squig**gles) + vis**ualiser** = **squigualiser**
 
-Google Chrome is the recommended web browser to visualise these plots.
+**Google Chrome** is the recommended web browser to visualise these plots.
 
 Watch [the video](https://youtu.be/kClYH4KpOjk) to learn a few tricks to get the best out of the plots.
 
@@ -34,6 +34,7 @@ Squigualiser preprint - https://www.biorxiv.org/content/10.1101/2024.02.19.58111
    1. [Option 1 - Using f5c eventalign](#option-1-f5c-eventalign)
    2. [Option 2 - Using basecaller move table](#option-2---basecaller-move-table-1)
    3. [Option 3 - Using squigulator signal simulation](#option-3---squigulator-signal-simulation-1)
+   4. [Option 4 - Using uncalled4 align](#option-4-uncalled4-align)
 6. [Pileup view](#pileup-view)
 7. [Plot multiple tracks](#plot-multiple-tracks)
 8. [BED annotations](#bed-annotations)
@@ -44,6 +45,7 @@ Squigualiser preprint - https://www.biorxiv.org/content/10.1101/2024.02.19.58111
 11. [Plot conventions](#plot-conventions)
 12. [Calculate alignment statistics](#calculate-alignment-statistics)
 13. [Notes](#notes)
+    1. [FAST5 and POD5 support](#fast5-and-pod5-support)
 14. [Examples](#examples)
 
 
@@ -466,6 +468,37 @@ tabix -0 -b 9 -e 8 -s 6 ${ALIGNMENT}
 </div>
 </details>
 
+#### Option 4: Uncalled4 align
+<details><summary>Steps for using uncalled4 align</summary>
+<div markdown=1>
+
+1. Align reads to reference genome using uncalled4 following the [steps mentioned in uncalled4 guide](https://github.com/skovaka/uncalled4?tab=readme-ov-file#align)
+
+````
+REF=genome.fa #reference
+MAP_BAM=mapped.bam
+FASTQ=read.fastq
+SIGNAL=reads.blow5
+ALIGNMENT=uncalled4.bam
+
+samtools fastq -T "mv,ts" ${BASECALLER_MOVES_BAM} > ${FASTQ}
+minimap2 -y -ax map-ont ${REF} -t32 --secondary=no ${FASTQ} | samtools sort -o ${MAP_BAM}
+samtools index ${MAP_BAM}
+uncalled4 align --kit "SQK-LSK114" ${REF} ${SIGNAL} --bam-in ${MAP_ BAM} --bam-f5c -o ${ALIGNMENT}
+samtools index ${ALIGNMENT}
+
+````
+2. Plot signal to reference alignment.
+
+````
+OUTPUT_DIR=output_dir
+REGION=chr1:6811404-6811443
+squigualiser plot -f ${REF} -s ${SIGNAL_FILE} -a ${ALIGNMENT} -o ${OUTPUT_DIR} --region ${REGION} --tag_name "uncalled4"
+````
+
+</div>
+</details>
+
 ## Pileup view
 ![image](docs/figures/pileup/igv.png)
 ![image](docs/figures/pileup/pileup_plot.png)
@@ -577,6 +610,12 @@ Check [here](docs/different_alignments.md) for an example.
 7. Squigulator's signal simulation is a good way to understand the nature of the alignments. Please refer to the documentation about [real_vs_simulated_signal](docs/real_vs_simulated_signal.md).
 8. For a explanation of the Guppy move table explanation see please refer [here](docs/move_table.md).
 
+### FAST5 and POD5 support
+Squigualiser randomly access signal files from BLOW5.
+Fast5 and Pod5 do not have such random access functionality.
+We provide methods to convert FAST5 and POD5 to BLOW5.
+1. FAST5 - `slow5tools f2s FAST5 -o BLOW5` [check here](https://github.com/hasindu2008/slow5tools?tab=readme-ov-file#examples)
+2. POD5 - `blue-crab p2s example.pod5 -o example.blow5` [check here](https://github.com/Psy-Fer/blue-crab?tab=readme-ov-file#usage)
 
 ## Examples
 

diff --git a/docs/commands.md b/docs/commands.md
@@ -95,6 +95,8 @@ Plot read/reference - signal alignments.
    Specifies name/location of the output directory. A valid relative or absolute path can be provided. Data will be overwritten but the directory will not be recreated.
 * `--tag_name STR`
 	(optional) A tag name to easily identify the plot.
+* `-k, --kmer_length INT`
+   (optional) kmer length to consider for the base shift calculation and plotting.
 * `--plot_reverse`
 	(optional) Plot only the reverse mapped reads [default value: false].
 * `--rna`
@@ -123,8 +125,10 @@ Plot read/reference - signal alignments.
 	(optional) The base width when plotting with fixed base width [default value: 10]. 
 * `--base_shift INT`
 	(optional) The number of bases to shift to align fist signal move [default value: 0]. More information on this can be found at [here](pore_model.md)
+* `--auto`
+	(optional) Calculate and automatically set the base shift. [default value: false]
 *  `--profile STR`:<br/>
-   (optional) This is used to determine base shift using preset values. The available profiles can be listed using `--list_profile`  
+   (optional) This is used to determine base shift using preset values. The available profiles can be listed using `--list_profile`. The precedence is in the order of `--base_shift < --auto < --profile`.
 *  `--list_profile`:<br/>
     (optional) Print the available profiles and exit.
 * `--plot_limit INT`
@@ -169,6 +173,8 @@ Plot reference - signal alignment pileups.
 	(optional) Path to a file containing a list of read_ids to plot only the reads listed.
 * `--tag_name STR`
 	(optional) A tag name to easily identify the plot
+* `-k, --kmer_length INT`
+   (optional) kmer length to consider for the base shift calculation and plotting.
 * `--plot_reverse`
 	(optional) Plot only the reverse mapped reads.
 * `--rna`
@@ -191,8 +197,10 @@ Plot reference - signal alignment pileups.
 	(optional) The base width when plotting with fixed base width [default value: 10]. 
 * `--base_shift INT`
 	(optional) The number of bases to shift to align fist signal move [default value: 0]. More information on this can be found at [here](pore_model.md)
+* `--auto`
+	(optional) Calculate and automatically set the base shift. [default value: false]
 *  `--profile STR`:<br/>
-   (optional) This is used to determine base shift using preset values. The available profiles can be listed using `--list_profile`  
+   (optional) This is used to determine base shift using preset values. The available profiles can be listed using `--list_profile`. The precedence is in the order of `--base_shift < --auto < --profile`.
 *  `--list_profile`:<br/>
     (optional) Print the available profiles and exit.
 * `--plot_limit INT`

diff --git a/src/__init__.py b/src/__init__.py
@@ -26,8 +26,8 @@ def main():
         description='squigualiser - A simple tool to Visualise nanopore raw signal-base alignment.',
         epilog='''
 See https://slow5.bioinf.science/squigualiser for a detailed description of these command-line options.
-
-Citation:...
+    
+Citation: Samarakoon, H., Liyanage, K., Ferguson, J.M., Parameswaran, S., Gamaarachchi, H. and Deveson, I.W., 2024. Interactive visualisation of raw nanopore signal data with Squigualiser. bioRxiv, pp.2024-02.
                ''',
         formatter_class=RawTextHelpFormatter)
 

diff --git a/src/calculate_offsets.py b/src/calculate_offsets.py
@@ -11,7 +11,7 @@
 from matplotlib.backends.backend_pdf import PdfPages
 import seaborn as sns
 
-DEFAULT_KMER_SIZE = 6
+DEFAULT_KMER_SIZE = 9
 BASE_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
 BASE_MAP = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}
 MAX_DIST_THRESHOLD = 0.0
@@ -51,10 +51,11 @@ def plot_distributions(kmer_length, test_array, output_pdf, plt_title):
     for offset in range(start_offset, end_offset):
         i = 0
         for base in test_array[offset]:
+            # print(base)
             if kmer_length == 1:
                 sns.kdeplot(base, label=BASE_MAP[i])
             else:
-                sns.kdeplot(base, label=BASE_MAP[i], ax=axes[offset-start_offset])
+                sns.kdeplot(base, label=BASE_MAP[i], ax=axes[offset-start_offset], warn_singular=False)
             i += 1
         if kmer_length == 1:
             axes.set_title('base shift (offset): {}'.format(-1*offset), size=10, loc='right')
@@ -64,7 +65,7 @@ def plot_distributions(kmer_length, test_array, output_pdf, plt_title):
     plt.suptitle("{}".format(plt_title), size=10)
     plt.draw()
     plt.savefig(output_pdf, format='pdf')
-def calculate_distance(kmer_length, test_array):
+def calculate_distance(kmer_length, test_array, verbose=0):
     start_offset = 0
     end_offset = kmer_length
     offset_dist = []
@@ -73,13 +74,16 @@ def calculate_distance(kmer_length, test_array):
         max_median = -1
         min_median = 10000
         for base in test_array[offset]:
+            if len(base) == 0:
+                continue
             median = np.median(base)
             if median < min_median:
                 min_median = median
             if median > max_median:
                 max_median = median
         distance = max_median - min_median
-        print("offset: {} max_median: {} min_median: {} max_median-min_median: {}".format(offset, max_median, min_median, distance))
+        if verbose:
+            print("offset: {} max_median: {} min_median: {} max_median-min_median: {}".format(offset, max_median, min_median, distance))
         offset_dist.append(distance)
 
     # for offset in range(start_offset, end_offset):
@@ -181,6 +185,72 @@ def calculate_offsets(args, sig_move_offset, output_pdf, s5):
         raise Exception("Error: no reads were processed. Check the dataset and the read_id if provided.")
     index = np.argsort(max_offset_arr)[len(max_offset_arr)//2]
     return max_offset_arr[index], max_dist_arr[index]
+
+def calculate_base_shift(moves, raw_signal, sequence, kmer_length, record_is_reverse, rna):
+    if not 'T' in sequence and 'A' in sequence and 'C' in sequence and 'G' in sequence:
+        if 'N' in sequence:
+            sequence = sequence.replace('N', 'T')
+        elif 'U' in sequence:
+            sequence = sequence.replace('U', 'T')
+
+    len_seq = len(sequence)
+    model = {}
+    start_raw = 0
+    if kmer_length > len_seq:
+        print("The sequence length ({}) is smaller than the kmer length ({}). Cannot calculate an accurate approximation for the base shift. Skipping base shift calculation...".format(len_seq, kmer_length))
+        return 0
+
+    base_index = 0
+    for i in moves:
+        if 'D' in i:
+            i = re.sub('D', '', i)
+            base_index += int(i)
+        elif 'I' in i:
+            i = re.sub('I', '', i)
+            start_raw += int(i)
+        else:
+            end_raw = start_raw + int(i)
+            value = raw_signal[start_raw: end_raw]
+            start_raw = end_raw
+            key = sequence[base_index:base_index+kmer_length]
+            if key not in model:
+                model[key] = np.median(value)
+            else:
+                value_ = np.append(value, model[key])
+                model[key] = np.median(value_)
+            base_index += 1
+        if base_index >= len_seq - kmer_length + 1:
+            break
+    # for key, value in model.items():
+    #     print("{}:{}".format(key, value))
+    num_kmers = len(model)
+
+    test_array = []
+    for base_offset in range(0, kmer_length):
+        freq = [[], [], [], []]
+        for kmer, value in model.items():
+            freq[BASE_INDEX[kmer[base_offset]]].append(value)
+        test_array.append(freq)
+    max_offset, max_dist = calculate_distance(kmer_length, test_array)
+    forward_shift = -1 * max_offset
+    reverse_shift = -1 * (kmer_length - max_offset - 1)
+
+    # print("forward_shift: {} reverse_shift: {}".format(forward_shift, reverse_shift))
+    # output_pdf = PdfPages("delete.pdf")
+    # if rna:
+    #     plt_title = "RNA\n{}\nkmer length: {}\ndifference between highest and lowest medians of the distributions: {}\nbest base shift (offset) for forward mapped reads (derived): {}\nbest base shift (offset) for reverse mapped reads (shown below): {}\n".format("", kmer_length, str(round(max_dist, 4)), reverse_shift, forward_shift)
+    # else:
+    #     plt_title = "DNA\n{}\nkmer length: {}\ndifference between highest and lowest medians of the distributions: {}\nbest base shift (offset) for forward mapped reads (shown below): {}\nbest base shift (offset) for reverse mapped reads (derived): {}\n".format("", kmer_length, str(round(max_dist, 4)), forward_shift, reverse_shift)
+    # plot_distributions(kmer_length, test_array, output_pdf, plt_title)
+    # output_pdf.close()
+
+    if record_is_reverse or rna:
+        print("Only {} ditinct {}-mers were found in the sequence. The calculated base shift value ({}) might not be correct. Reduce kmer length (-k) or increase the region interval or use a preset base shift (--profile)".format(num_kmers, kmer_length, reverse_shift))
+        return reverse_shift
+    else:
+        print("Only {} ditinct {}-mers were found in the sequence. The calculated base shift value ({}) might not be correct. Reduce kmer length (-k) or increase the region interval or use a preset base shift (--profile)".format(num_kmers, kmer_length, forward_shift))
+        return forward_shift
+
 def run(args):
     if args.kmer_length < 1:
         raise Exception("Error: kmer length must be a positive integer")