-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Completed HW4 task #18
base: main
Are you sure you want to change the base?
Changes from all commits
5550a71
b097912
9dd03a8
16dcf81
7a9758c
5e352af
c508ba8
d838f80
2bdfba8
3a1f1d2
3a49817
4db490e
db01584
c64ea1a
848a7e3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,126 @@ | ||
# AminoAcidTools | ||
|
||
## Overview | ||
|
||
AminoAcidTools is a mini-program designed to work with amino acid sequences. The program provides some useful procedures that one can find handy in routine scientific work, for example, calculation of various physical and chemical parameters, to find possible protease cleavage sites etc. | ||
|
||
## Usage | ||
|
||
To use AminoAcidTools simply import `run_aminoacid_tools` into your `*.py` script as shown below: | ||
```python | ||
from aminoacid_tools import run_aminoacid_tools | ||
``` | ||
#### Input | ||
The program has two required input parameters: | ||
* (`str` type) Amino acid sequence(s), variable argument. Sequences to be analyzed are not case-specific; | ||
* (`str` type) Name of the operation to be executed, keyword argument. | ||
|
||
:exclamation: The amino acid sequence must be written in single-letter form and in uppercase. | ||
|
||
The name of an operation is defined as a keyword argument `operation = 'option'`: | ||
|
||
```python | ||
run_aminoacid_tools('ARDF', operation = 'calculate_percentage') # correct | ||
run_aminoacid_tools('ARDF', 'DEH', operation = 'calculate_pI') # correct | ||
``` | ||
```python | ||
run_aminoacid_tools('ARDF') # incorrect | ||
run_aminoacid_tools(ARDtt, deh, calculate_pI) # incorrect | ||
``` | ||
|
||
:exclamation: You must use one of predefined operation names described in the "Options" section below. | ||
|
||
#### Output | ||
A string with details of performed operation. | ||
|
||
## Options | ||
|
||
The program implements the following operations: | ||
|
||
#### `calculate_molecular_weight` | ||
Calculate the molecular weight of an amino acid sequence. | ||
|
||
Calculate the molecular weight of the input amino acid sequences based on the mass of each amino acid residue. Reference values for the masses of amino acid residues are taken from [The University of Washington's Proteomics Resource (UWPR)](https://proteomicsresource.washington.edu/protocols06/masses.php) and rounded to three decimal places. The calculations took into account the presence of *H* and *OH* groups at the termini of the sequences. | ||
The input is a string with amino acid sequence. The output is a string with the molecular weight of sequence in Daltons (the result is rounded to two decimal places): | ||
```python | ||
run_aminoacid_tools('ARG', operation = 'calculate_molecular_weight') # input | ||
|
||
Molecular weight of the sequence ARG: 302.33 Da # output | ||
``` | ||
|
||
#### `calculate_percentage` | ||
Calculate the percentage of amino acids in a sequence. | ||
|
||
Calculate the percentage of each amino acid in the sequence. The input is a string with amino acid sequence. The output is a string containing the percentage of each amino acid in the sequence (the result is rounded to two decimal places): | ||
```python | ||
run_aminoacid_tools('ARG', operation = 'calculate_percentage') # input | ||
|
||
Amino acids percentage of the sequence ARG: {A: 33.33, R: 33.33, G: 33.33} # output | ||
``` | ||
|
||
#### `calculate_pI` | ||
Calculate the isoelectric point of aminoacids sequence. | ||
|
||
The function operation is based on the formula for determining the isoelectric point: | ||
|
||
$$pI = \dfrac{(pK_1 + pK_2 + ... + pK_n)}{n},$$ | ||
|
||
where $pK$ - dissociation constant of free $NH_2$ and $COOH$ radicals in amino acids. | ||
|
||
The input is a string with amino acid sequences. The output is a string containing the $pI$ (isoelectric point) of sequence: | ||
```python | ||
run_aminoacid_tools('ARG', operation = 'calculate_pI') # input | ||
|
||
Isoelectric point for the sequence ARG: 8.14 # output | ||
``` | ||
|
||
#### `calculate_hydrophobicity_eisenberg` | ||
Calculate estimation of hydrophilicity/hydrophobicity of amino acid sequence. | ||
|
||
The function operation is based on the Einzenberg hydrophilicity/hydrophobicity scale of amino acids. | ||
The input is a string with amino acid sequences. The output is a string containing the rough estimation of hydrophilicity/hydrophobicity of amino acid sequence: | ||
```python | ||
run_aminoacid_tools('ARG', operation = 'calculate_hydrophobicity_eisenberg') # input | ||
|
||
Sequence ARG: Hydrophilic # output | ||
``` | ||
|
||
#### `get_cleavage_sites` | ||
Return amount and coordinates of cleavage sites for motif-specific proteases (casp3, casp6, casp7, enterokinase). | ||
|
||
The function finds traces of motifs in amino acid sequence that can be recognized by next site-specific proteases: caspases 3, 6, 7 and enterokinase, then calculates amount of each protease's site and its coordinates. The coordinate is a position of amino acid in C-end of potentially cleaved peptide. Motifs that can be recognized by these proteases were taken from [PeptideCutter](https://web.expasy.org/peptide_cutter/peptidecutter_enzymes.html) documentation. | ||
|
||
Some of the proteases permit more than one possible amino acids at a single position. For example, Caspase 6 protease motif is defined as `V, E, H or I, D`. It means there are two possible sequences `...VEHD...` and `...VEID...` containing the protease motif. Please note the function implementation supports 'OR' condition only. | ||
|
||
The input is 1) a string with amino acid sequence to be analyzed, 2) subsequence to be found in a sequence. Subsequence is specified as list of lists. Each nested list means more than one possible amino acid at a single position and checked by `or` condition. The output is a string with input amino acid sequence, number and coordinates of cleavage sites for each protease separately: | ||
```python | ||
# input | ||
run_aminoacid_tools('ESDMQDMDSGISLDNDDEEKMQ', operation = 'get_cleavage_sites') | ||
|
||
# output | ||
ESDMQDMDSGISLDNDDEEKMQ | ||
1 protease cleavage site(s) for Caspase 3: [6] | ||
0 protease cleavage site(s) for Caspase 6: [] | ||
0 protease cleavage site(s) for Caspase 7: [] | ||
1 protease cleavage site(s) for Enterokinase: [20] | ||
``` | ||
|
||
## Troubleshooting | ||
|
||
The program automatically checks whether the input amino acid sequence (contains only letters corresponding to one-letter amino acid abbreviations) and the name of the operation are correct. If an error is detected, the program will be interrupted and a corresponding message will be displayed on the screen. Examples: | ||
```python | ||
# incorrect input sequence | ||
ValueError: Incoming sequence ESDMQDMDSGaISLDNDDEEKMQ is not a peptide | ||
|
||
# incorrect operation | ||
ValueError: Incorrect operation value | ||
Supported operations: ['get_cleavage_sites', 'calculate_molecular_weight', 'calculate_percentage', 'calculate_pI', 'calculate_hydrophobicity_eisenberg'] | ||
``` | ||
|
||
## Authors | ||
|
||
* Kseniia Matveeva (team lead, author of the `get_cleavage_sites` and `run_aminoacid_tools` functions) | ||
* Anastasiya Ivanova (author of the `calculate_molecular_weight` and `calculate_percentage` functions) | ||
* Danila Chernikov (author of the `calculate_pI` and `calculate_hydrophobicity_eisenberg` functions) | ||
|
||
![Team photo](team_photo.png) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,214 @@ | ||
from typing import Dict | ||
|
||
|
||
def calculate_percentage(seq: str) -> str: | ||
""" | ||
Calculates the percentage of amino acids in the entered amino acid sequence | ||
Arguments: | ||
- seq (str): amino acid sequences to be analyzed | ||
Return: | ||
- str: a string with the percentage of each amino acid | ||
""" | ||
amino_acid_counts: Dict[str, int] = {} # dict to store count of each amino acid | ||
for amino_acid in seq: | ||
if amino_acid in amino_acid_counts: | ||
amino_acid_counts[amino_acid] += 1 | ||
else: | ||
amino_acid_counts[amino_acid] = 1 | ||
total_amino_acids = len(seq) | ||
amino_acid_percentages = {} # dict to store each amino acid and its % | ||
for amino_acid, count in amino_acid_counts.items(): | ||
percentage = round(((count / total_amino_acids) * 100), 2) | ||
amino_acid_percentages[amino_acid] = percentage | ||
Comment on lines
+19
to
+22
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. В целом-то вы могли сразу пересчитывать значения в |
||
return f'Amino acids percentage of the sequence {seq}: {amino_acid_percentages}' | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ну и еще раз -- |
||
|
||
|
||
def calculate_molecular_weight(seq: str) -> str: | ||
""" | ||
Calculates the molecular weight of entered amino acid sequence | ||
Arguments: | ||
- seq (str): amino acid sequences to be analyzed | ||
Return: | ||
- str: a string with the molecular weight value for amino acid sequence | ||
""" | ||
amino_acid_weights = { | ||
'G': 57.051, 'A': 71.078, 'S': 87.077, 'P': 97.115, 'V': 99.131, | ||
'T': 101.104, 'C': 103.143, 'I': 113.158, 'L': 113.158, 'N': 114.103, | ||
'D': 115.087, 'Q': 128.129, 'K': 128.172, 'E': 129.114, 'M': 131.196, | ||
'H': 137.139, 'F': 147.174, 'R': 156.186, 'Y': 163.173, 'W': 186.210 | ||
} | ||
weight = 18.02 # for the H and OH at the termini | ||
for amino_acid in seq: | ||
weight += amino_acid_weights[amino_acid] | ||
return f'Molecular weight of the sequence {seq}: {round(weight, 2)} Da' | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. И опять |
||
|
||
|
||
def calculate_hydrophobicity_eisenberg(sequence): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. нет аннотации типов :( |
||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. и докстринги тоже нет :( |
||
# Amino acid hydrophilicity/hydrophobicity scale by Eisengerg | ||
hydrophobicity_values = { | ||
'A': 0.5, 'R': 0.65, 'N': 1.0, 'D': 1.3, 'C': -0.15, | ||
'Q': 1.0, 'E': 1.5, 'G': 0.75, 'H': 0.7, 'I': -1.3, | ||
'L': -1.3, 'K': 0.75, 'M': -1.1, 'F': -1.9, 'P': 0.55, | ||
'S': 0.6, 'T': 0.3, 'W': -0.5, 'Y': -1.65, 'V': -0.9 | ||
} | ||
|
||
# Calculate sum of hydrophilicities for all amino acids in the sequence | ||
hydrophobicity_sum = sum(hydrophobicity_values.get(aa, 0) for aa in sequence) | ||
|
||
# Determine hydrophilicity/hydrophobicity of sequence | ||
if hydrophobicity_sum > 0: | ||
return f"Sequence {sequence}: Hydrophilic" | ||
elif hydrophobicity_sum < 0: | ||
return f"Sequence {sequence}: Hydrophobic" | ||
else: | ||
return f"Sequence {sequence}: Neutral" | ||
Comment on lines
+60
to
+65
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ну и вот опять же, лучше возвращать уж |
||
|
||
|
||
def calculate_pI(sequence): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. а где аннотация типов? ))) |
||
"""Create a dictionary of pK values (COO-, NH3+, R) information taken | ||
from source http://www.sev-chem.narod.ru/spravochnik/piaminoacid.htm""" | ||
pK_values = { | ||
'A': (2.34, 9.60), | ||
'R': (2.17, 9.04, 12.48), | ||
'N': (2.02, 8.80), | ||
'D': (2.09, 9.82, 3.86), | ||
'C': (1.71, 8.33, 10.30), | ||
'Q': (2.17, 9.13), | ||
'E': (2.19, 9.76, 4.25), | ||
'G': (2.34, 9.60), | ||
'H': (1.82, 9.17, 6.00), | ||
'I': (2.32, 9.76), | ||
'L': (2.36, 9.60), | ||
'K': (2.18, 8.95, 10.5), | ||
'M': (2.28, 9.21), | ||
'F': (2.58, 9.24), | ||
'P': (2.00, 10.60), | ||
'S': (2.21, 9.15), | ||
'T': (2.63, 10.43), | ||
'W': (1.22, 9.39), | ||
'Y': (2.20, 9.11, 10.10), | ||
'V': (2.29, 9.72) | ||
} | ||
|
||
# Initialization of variables for leftmost and rightmost elements | ||
N_end_pK = None | ||
C_end_pK = None | ||
|
||
# Find the marginal elements and their corresponding pKs | ||
for amino_acid in sequence: | ||
if amino_acid in pK_values: | ||
pK_list = pK_values[amino_acid] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Кажется, что название |
||
if len(pK_list) >= 2: | ||
if N_end_pK is None: | ||
N_end_pK = pK_list[1] # Второй pK | ||
C_end_pK = pK_list[0] # Первый pK | ||
|
||
# If no amino acid sequence is specified - return None | ||
if N_end_pK is None or C_end_pK is None: | ||
return None | ||
|
||
# Calculate pI | ||
total_pK = N_end_pK + C_end_pK | ||
count = 2 # We take into account the found pKs - there are at least 2 | ||
|
||
# Also add pK of AA radicals - the dictionary contains 3 pK values | ||
for amino_acid in sequence: | ||
if amino_acid in pK_values: | ||
pK_list = pK_values[amino_acid] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Кажется, что название |
||
if len(pK_list) >= 3: | ||
total_pK += pK_list[2] # Третий pK | ||
count += 1 | ||
|
||
# Substitute all found values into the formula and calculate pI | ||
pI = total_pK / count | ||
return f"Isoelectric point for the sequence {sequence}: {pI}" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Всё-таки в таком случае хочется вернуть просто |
||
|
||
|
||
def find_cleavage_sites(seq: str, motif: list) -> list: | ||
"""Find cleavage sites for motif-specific proteases. | ||
Arguments: | ||
- seq - string sequence to be analyzed | ||
- motif - subsequence to be found in a sequence. Subsequence is specified as list of lists. | ||
Each nested list means more than one possible aminoacid at a single position (checked by OR condition). | ||
Return: | ||
- list of cleavage sites coordinates (C-end aminoacid of *potentially* cleaved sequence) | ||
""" | ||
cleavage_sites = [] | ||
seq_idx = 0 | ||
while seq_idx < len(seq): | ||
motif_idx = 0 | ||
chars_at_motif_idx = motif[motif_idx] | ||
seq_char = seq[seq_idx] | ||
if seq_char in chars_at_motif_idx: | ||
motif_idx += 1 | ||
while motif_idx < len(motif): | ||
chars_at_motif_idx = motif[motif_idx] | ||
seq_char = seq[seq_idx+motif_idx] | ||
if seq_char in chars_at_motif_idx: | ||
motif_idx += 1 | ||
else: | ||
break | ||
if motif_idx == len(motif): | ||
cleavage_sites.append(seq_idx + motif_idx) | ||
seq_idx += 1 | ||
return cleavage_sites | ||
Comment on lines
+128
to
+155
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Че-то вы тут прям оч сильно переусложнили (скажем мягко) ... ))) Тогда что нам мешает просто идти окном по последовательности и проверять эти 4 символа )) def find_cleavage_sites_(seq: str, motif: List[List[str]]) -> List[int]:
pos_1, pos_2, pos_3, pos_4 = motif
cleavage_sites = []
for i in range(len(seq) - 3):
window = seq[i:i + 4]
if (
window[0] in pos_1 and
window[1] in pos_2 and
window[2] in pos_3 and
window[3] in pos_4
):
cleavage_sites.append(i + 4)
return cleavage_sites P.S. функцию не тестировал, так что может ломаться (особенно на каких-нибудь крайностях) Ну то есть да, для произвольного мотива (где не обязательно 4 сайта) такое не подойдет ... но конкретно в этом случае +- норм ))) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Вообще предполагалось, что в словарь с сайтами распознавания пептидаз ( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ну ... оке, тогда что-нить в таком духе: def find_cleavage_sites(seq: str, motif: List[List[str]]) -> List[int]:
cleavage_sites = []
for i in range(len(seq) - len(motif)):
window = seq[i: i + len(motif)]
is_match = True
for position_value_idx in range(len(motif)):
value_in_seq = window[position_value_idx]
value_in_motif = motif[position_value_idx]
if value_in_seq not in value_in_motif:
is_match = False
if is_match:
cleavage_sites.append(i + len(motif))
return cleavage_sites (P.S. опять же нужно дописать, т.к. 100% где-то ломается) Но вот так хоть +- понятно. Просто ради интереса попробуй в следующем семестре прочитать этот свой алгоритм ))) |
||
|
||
|
||
motif_dict = { | ||
'Caspase 3': [['D'], ['M'], ['Q'], ['D']], | ||
'Caspase 6': [['V'], ['E'], ['H', 'I'], ['D']], | ||
'Caspase 7': [['D'], ['E'], ['V'], ['D']], | ||
'Enterokinase': [['D', 'E'], ['D', 'E'], ['D', 'E'], ['K']] | ||
} | ||
Comment on lines
+158
to
+163
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. А вот это уже какая-то специфичная штука для функции |
||
|
||
|
||
def get_cleavage_sites(seq: str) -> str: | ||
"Return amount and coordinates of cleavage sites for proteases, specified in motif_dict" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Докстринга должна быть завернута в 3 кавычки: |
||
output = f'{seq}\n' | ||
for motif_name, motif_value in motif_dict.items(): | ||
sites = find_cleavage_sites(seq, motif_value) | ||
output += f'{len(sites)} protease cleavage site(s) for {motif_name}: {sites}\n' | ||
return output | ||
|
||
|
||
all_aminoacids = { | ||
'A', 'R', 'N', 'D', 'C', 'H', 'G', 'Q', 'E', 'I', | ||
'L', 'K', 'M', 'P', 'S', 'Y', 'T', 'W', 'F', 'V' | ||
} | ||
Comment on lines
+175
to
+178
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Нужно было сделать константой ( |
||
|
||
|
||
def is_peptide(seq: str) -> bool: | ||
"Check whether the incoming sequence is an aminoacid" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Докстринга должна быть завернута в 3 кавычки: """text""" |
||
if set(seq).issubset(all_aminoacids): # if set(seq) <= all_aminoacids | ||
return True | ||
raise ValueError(f'Incoming sequence {seq} is not a peptide') | ||
Comment on lines
+183
to
+185
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. А зачем тут кидать ошибку, если мы можем вернуть False? ))) return set(seq).issubset(all_aminoacids) |
||
|
||
|
||
operation_dict = { | ||
'get_cleavage_sites': get_cleavage_sites, | ||
'calculate_molecular_weight': calculate_molecular_weight, | ||
'calculate_percentage': calculate_percentage, | ||
'calculate_pI': calculate_pI, | ||
'calculate_hydrophobicity_eisenberg': calculate_hydrophobicity_eisenberg | ||
} | ||
|
||
|
||
def run_aminoacid_tools(*seqs: str, operation: str) -> str: | ||
"""Run AminoAcid Tools | ||
Arguments: | ||
- *seqs - one or more string sequences to be analyzed | ||
- operation - action to be done with sequence(s) | ||
Return: | ||
- string that contains incoming sequence and result of operation""" | ||
if operation == '': | ||
raise ValueError('Operation value is not specified') | ||
Comment on lines
+204
to
+205
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. А зачем это проверять? )) |
||
if operation not in operation_dict: | ||
raise ValueError(f'Incorrect operation value\nSupported operations: {list(operation_dict.keys())}') | ||
for seq in seqs: | ||
is_peptide(seq) | ||
output = '' | ||
for seq in seqs: | ||
output += operation_dict[operation](seq) | ||
output += '\n\n' | ||
Comment on lines
+208
to
+213
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
output = []
for seq in seqs:
is_peptide(seq)
output.append(operation_dict[operation](seq)) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Вот когда у нас будут ошибки, тогда и разберемся ))) |
||
return output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
неудачно название :(
буквально "посчитать процент" ... чего? )))