Skip to content

Commit

Permalink
Merge pull request #62 from hiruna72/dev
Browse files Browse the repository at this point in the history
Dev
  • Loading branch information
hiruna72 authored May 6, 2024
2 parents d160083 + 5fe7b41 commit d52e29a
Show file tree
Hide file tree
Showing 14 changed files with 44,271 additions and 181 deletions.
41 changes: 40 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ squigualiser is a tool to Visualise nanopore raw signal-base alignment.

signals (**squig**gles) + vis**ualiser** = **squigualiser**

Google Chrome is the recommended web browser to visualise these plots.
**Google Chrome** is the recommended web browser to visualise these plots.

Watch [the video](https://youtu.be/kClYH4KpOjk) to learn a few tricks to get the best out of the plots.

Expand Down Expand Up @@ -34,6 +34,7 @@ Squigualiser preprint - https://www.biorxiv.org/content/10.1101/2024.02.19.58111
1. [Option 1 - Using f5c eventalign](#option-1-f5c-eventalign)
2. [Option 2 - Using basecaller move table](#option-2---basecaller-move-table-1)
3. [Option 3 - Using squigulator signal simulation](#option-3---squigulator-signal-simulation-1)
4. [Option 4 - Using uncalled4 align](#option-4-uncalled4-align)
6. [Pileup view](#pileup-view)
7. [Plot multiple tracks](#plot-multiple-tracks)
8. [BED annotations](#bed-annotations)
Expand All @@ -44,6 +45,7 @@ Squigualiser preprint - https://www.biorxiv.org/content/10.1101/2024.02.19.58111
11. [Plot conventions](#plot-conventions)
12. [Calculate alignment statistics](#calculate-alignment-statistics)
13. [Notes](#notes)
1. [FAST5 and POD5 support](#fast5-and-pod5-support)
14. [Examples](#examples)


Expand Down Expand Up @@ -466,6 +468,37 @@ tabix -0 -b 9 -e 8 -s 6 ${ALIGNMENT}
</div>
</details>

#### Option 4: Uncalled4 align
<details><summary>Steps for using uncalled4 align</summary>
<div markdown=1>

1. Align reads to reference genome using uncalled4 following the [steps mentioned in uncalled4 guide](https://github.com/skovaka/uncalled4?tab=readme-ov-file#align)

````
REF=genome.fa #reference
MAP_BAM=mapped.bam
FASTQ=read.fastq
SIGNAL=reads.blow5
ALIGNMENT=uncalled4.bam
samtools fastq -T "mv,ts" ${BASECALLER_MOVES_BAM} > ${FASTQ}
minimap2 -y -ax map-ont ${REF} -t32 --secondary=no ${FASTQ} | samtools sort -o ${MAP_BAM}
samtools index ${MAP_BAM}
uncalled4 align --kit "SQK-LSK114" ${REF} ${SIGNAL} --bam-in ${MAP_ BAM} --bam-f5c -o ${ALIGNMENT}
samtools index ${ALIGNMENT}
````
2. Plot signal to reference alignment.

````
OUTPUT_DIR=output_dir
REGION=chr1:6811404-6811443
squigualiser plot -f ${REF} -s ${SIGNAL_FILE} -a ${ALIGNMENT} -o ${OUTPUT_DIR} --region ${REGION} --tag_name "uncalled4"
````

</div>
</details>

## Pileup view
![image](docs/figures/pileup/igv.png)
![image](docs/figures/pileup/pileup_plot.png)
Expand Down Expand Up @@ -577,6 +610,12 @@ Check [here](docs/different_alignments.md) for an example.
7. Squigulator's signal simulation is a good way to understand the nature of the alignments. Please refer to the documentation about [real_vs_simulated_signal](docs/real_vs_simulated_signal.md).
8. For a explanation of the Guppy move table explanation see please refer [here](docs/move_table.md).

### FAST5 and POD5 support
Squigualiser randomly access signal files from BLOW5.
Fast5 and Pod5 do not have such random access functionality.
We provide methods to convert FAST5 and POD5 to BLOW5.
1. FAST5 - `slow5tools f2s FAST5 -o BLOW5` [check here](https://github.com/hasindu2008/slow5tools?tab=readme-ov-file#examples)
2. POD5 - `blue-crab p2s example.pod5 -o example.blow5` [check here](https://github.com/Psy-Fer/blue-crab?tab=readme-ov-file#usage)

## Examples

Expand Down
12 changes: 10 additions & 2 deletions docs/commands.md
Original file line number Diff line number Diff line change
Expand Up @@ -95,6 +95,8 @@ Plot read/reference - signal alignments.
Specifies name/location of the output directory. A valid relative or absolute path can be provided. Data will be overwritten but the directory will not be recreated.
* `--tag_name STR`
(optional) A tag name to easily identify the plot.
* `-k, --kmer_length INT`
(optional) kmer length to consider for the base shift calculation and plotting.
* `--plot_reverse`
(optional) Plot only the reverse mapped reads [default value: false].
* `--rna`
Expand Down Expand Up @@ -123,8 +125,10 @@ Plot read/reference - signal alignments.
(optional) The base width when plotting with fixed base width [default value: 10].
* `--base_shift INT`
(optional) The number of bases to shift to align fist signal move [default value: 0]. More information on this can be found at [here](pore_model.md)
* `--auto`
(optional) Calculate and automatically set the base shift. [default value: false]
* `--profile STR`:<br/>
(optional) This is used to determine base shift using preset values. The available profiles can be listed using `--list_profile`
(optional) This is used to determine base shift using preset values. The available profiles can be listed using `--list_profile`. The precedence is in the order of `--base_shift < --auto < --profile`.
* `--list_profile`:<br/>
(optional) Print the available profiles and exit.
* `--plot_limit INT`
Expand Down Expand Up @@ -169,6 +173,8 @@ Plot reference - signal alignment pileups.
(optional) Path to a file containing a list of read_ids to plot only the reads listed.
* `--tag_name STR`
(optional) A tag name to easily identify the plot
* `-k, --kmer_length INT`
(optional) kmer length to consider for the base shift calculation and plotting.
* `--plot_reverse`
(optional) Plot only the reverse mapped reads.
* `--rna`
Expand All @@ -191,8 +197,10 @@ Plot reference - signal alignment pileups.
(optional) The base width when plotting with fixed base width [default value: 10].
* `--base_shift INT`
(optional) The number of bases to shift to align fist signal move [default value: 0]. More information on this can be found at [here](pore_model.md)
* `--auto`
(optional) Calculate and automatically set the base shift. [default value: false]
* `--profile STR`:<br/>
(optional) This is used to determine base shift using preset values. The available profiles can be listed using `--list_profile`
(optional) This is used to determine base shift using preset values. The available profiles can be listed using `--list_profile`. The precedence is in the order of `--base_shift < --auto < --profile`.
* `--list_profile`:<br/>
(optional) Print the available profiles and exit.
* `--plot_limit INT`
Expand Down
4 changes: 2 additions & 2 deletions src/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,8 +26,8 @@ def main():
description='squigualiser - A simple tool to Visualise nanopore raw signal-base alignment.',
epilog='''
See https://slow5.bioinf.science/squigualiser for a detailed description of these command-line options.
Citation:...
Citation: Samarakoon, H., Liyanage, K., Ferguson, J.M., Parameswaran, S., Gamaarachchi, H. and Deveson, I.W., 2024. Interactive visualisation of raw nanopore signal data with Squigualiser. bioRxiv, pp.2024-02.
''',
formatter_class=RawTextHelpFormatter)

Expand Down
78 changes: 74 additions & 4 deletions src/calculate_offsets.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@
from matplotlib.backends.backend_pdf import PdfPages
import seaborn as sns

DEFAULT_KMER_SIZE = 6
DEFAULT_KMER_SIZE = 9
BASE_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
BASE_MAP = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}
MAX_DIST_THRESHOLD = 0.0
Expand Down Expand Up @@ -51,10 +51,11 @@ def plot_distributions(kmer_length, test_array, output_pdf, plt_title):
for offset in range(start_offset, end_offset):
i = 0
for base in test_array[offset]:
# print(base)
if kmer_length == 1:
sns.kdeplot(base, label=BASE_MAP[i])
else:
sns.kdeplot(base, label=BASE_MAP[i], ax=axes[offset-start_offset])
sns.kdeplot(base, label=BASE_MAP[i], ax=axes[offset-start_offset], warn_singular=False)
i += 1
if kmer_length == 1:
axes.set_title('base shift (offset): {}'.format(-1*offset), size=10, loc='right')
Expand All @@ -64,7 +65,7 @@ def plot_distributions(kmer_length, test_array, output_pdf, plt_title):
plt.suptitle("{}".format(plt_title), size=10)
plt.draw()
plt.savefig(output_pdf, format='pdf')
def calculate_distance(kmer_length, test_array):
def calculate_distance(kmer_length, test_array, verbose=0):
start_offset = 0
end_offset = kmer_length
offset_dist = []
Expand All @@ -73,13 +74,16 @@ def calculate_distance(kmer_length, test_array):
max_median = -1
min_median = 10000
for base in test_array[offset]:
if len(base) == 0:
continue
median = np.median(base)
if median < min_median:
min_median = median
if median > max_median:
max_median = median
distance = max_median - min_median
print("offset: {} max_median: {} min_median: {} max_median-min_median: {}".format(offset, max_median, min_median, distance))
if verbose:
print("offset: {} max_median: {} min_median: {} max_median-min_median: {}".format(offset, max_median, min_median, distance))
offset_dist.append(distance)

# for offset in range(start_offset, end_offset):
Expand Down Expand Up @@ -181,6 +185,72 @@ def calculate_offsets(args, sig_move_offset, output_pdf, s5):
raise Exception("Error: no reads were processed. Check the dataset and the read_id if provided.")
index = np.argsort(max_offset_arr)[len(max_offset_arr)//2]
return max_offset_arr[index], max_dist_arr[index]

def calculate_base_shift(moves, raw_signal, sequence, kmer_length, record_is_reverse, rna):
if not 'T' in sequence and 'A' in sequence and 'C' in sequence and 'G' in sequence:
if 'N' in sequence:
sequence = sequence.replace('N', 'T')
elif 'U' in sequence:
sequence = sequence.replace('U', 'T')

len_seq = len(sequence)
model = {}
start_raw = 0
if kmer_length > len_seq:
print("The sequence length ({}) is smaller than the kmer length ({}). Cannot calculate an accurate approximation for the base shift. Skipping base shift calculation...".format(len_seq, kmer_length))
return 0

base_index = 0
for i in moves:
if 'D' in i:
i = re.sub('D', '', i)
base_index += int(i)
elif 'I' in i:
i = re.sub('I', '', i)
start_raw += int(i)
else:
end_raw = start_raw + int(i)
value = raw_signal[start_raw: end_raw]
start_raw = end_raw
key = sequence[base_index:base_index+kmer_length]
if key not in model:
model[key] = np.median(value)
else:
value_ = np.append(value, model[key])
model[key] = np.median(value_)
base_index += 1
if base_index >= len_seq - kmer_length + 1:
break
# for key, value in model.items():
# print("{}:{}".format(key, value))
num_kmers = len(model)

test_array = []
for base_offset in range(0, kmer_length):
freq = [[], [], [], []]
for kmer, value in model.items():
freq[BASE_INDEX[kmer[base_offset]]].append(value)
test_array.append(freq)
max_offset, max_dist = calculate_distance(kmer_length, test_array)
forward_shift = -1 * max_offset
reverse_shift = -1 * (kmer_length - max_offset - 1)

# print("forward_shift: {} reverse_shift: {}".format(forward_shift, reverse_shift))
# output_pdf = PdfPages("delete.pdf")
# if rna:
# plt_title = "RNA\n{}\nkmer length: {}\ndifference between highest and lowest medians of the distributions: {}\nbest base shift (offset) for forward mapped reads (derived): {}\nbest base shift (offset) for reverse mapped reads (shown below): {}\n".format("", kmer_length, str(round(max_dist, 4)), reverse_shift, forward_shift)
# else:
# plt_title = "DNA\n{}\nkmer length: {}\ndifference between highest and lowest medians of the distributions: {}\nbest base shift (offset) for forward mapped reads (shown below): {}\nbest base shift (offset) for reverse mapped reads (derived): {}\n".format("", kmer_length, str(round(max_dist, 4)), forward_shift, reverse_shift)
# plot_distributions(kmer_length, test_array, output_pdf, plt_title)
# output_pdf.close()

if record_is_reverse or rna:
print("Only {} ditinct {}-mers were found in the sequence. The calculated base shift value ({}) might not be correct. Reduce kmer length (-k) or increase the region interval or use a preset base shift (--profile)".format(num_kmers, kmer_length, reverse_shift))
return reverse_shift
else:
print("Only {} ditinct {}-mers were found in the sequence. The calculated base shift value ({}) might not be correct. Reduce kmer length (-k) or increase the region interval or use a preset base shift (--profile)".format(num_kmers, kmer_length, forward_shift))
return forward_shift

def run(args):
if args.kmer_length < 1:
raise Exception("Error: kmer length must be a positive integer")
Expand Down
Loading

0 comments on commit d52e29a

Please sign in to comment.