Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Completed HW4 task #18

Open
wants to merge 15 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
126 changes: 126 additions & 0 deletions HW4_Matveeva/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,126 @@
# AminoAcidTools

## Overview

AminoAcidTools is a mini-program designed to work with amino acid sequences. The program provides some useful procedures that one can find handy in routine scientific work, for example, calculation of various physical and chemical parameters, to find possible protease cleavage sites etc.

## Usage

To use AminoAcidTools simply import `run_aminoacid_tools` into your `*.py` script as shown below:
```python
from aminoacid_tools import run_aminoacid_tools
```
#### Input
The program has two required input parameters:
* (`str` type) Amino acid sequence(s), variable argument. Sequences to be analyzed are not case-specific;
* (`str` type) Name of the operation to be executed, keyword argument.

:exclamation: The amino acid sequence must be written in single-letter form and in uppercase.

The name of an operation is defined as a keyword argument `operation = 'option'`:

```python
run_aminoacid_tools('ARDF', operation = 'calculate_percentage') # correct
run_aminoacid_tools('ARDF', 'DEH', operation = 'calculate_pI') # correct
```
```python
run_aminoacid_tools('ARDF') # incorrect
run_aminoacid_tools(ARDtt, deh, calculate_pI) # incorrect
```

:exclamation: You must use one of predefined operation names described in the "Options" section below.

#### Output
A string with details of performed operation.

## Options

The program implements the following operations:

#### `calculate_molecular_weight`
Calculate the molecular weight of an amino acid sequence.

Calculate the molecular weight of the input amino acid sequences based on the mass of each amino acid residue. Reference values for the masses of amino acid residues are taken from [The University of Washington's Proteomics Resource (UWPR)](https://proteomicsresource.washington.edu/protocols06/masses.php) and rounded to three decimal places. The calculations took into account the presence of *H* and *OH* groups at the termini of the sequences.
The input is a string with amino acid sequence. The output is a string with the molecular weight of sequence in Daltons (the result is rounded to two decimal places):
```python
run_aminoacid_tools('ARG', operation = 'calculate_molecular_weight') # input

Molecular weight of the sequence ARG: 302.33 Da # output
```

#### `calculate_percentage`
Calculate the percentage of amino acids in a sequence.

Calculate the percentage of each amino acid in the sequence. The input is a string with amino acid sequence. The output is a string containing the percentage of each amino acid in the sequence (the result is rounded to two decimal places):
```python
run_aminoacid_tools('ARG', operation = 'calculate_percentage') # input

Amino acids percentage of the sequence ARG: {A: 33.33, R: 33.33, G: 33.33} # output
```

#### `calculate_pI`
Calculate the isoelectric point of aminoacids sequence.

The function operation is based on the formula for determining the isoelectric point:

$$pI = \dfrac{(pK_1 + pK_2 + ... + pK_n)}{n},$$

where $pK$ - dissociation constant of free $NH_2$ and $COOH$ radicals in amino acids.

The input is a string with amino acid sequences. The output is a string containing the $pI$ (isoelectric point) of sequence:
```python
run_aminoacid_tools('ARG', operation = 'calculate_pI') # input

Isoelectric point for the sequence ARG: 8.14 # output
```

#### `calculate_hydrophobicity_eisenberg`
Calculate estimation of hydrophilicity/hydrophobicity of amino acid sequence.

The function operation is based on the Einzenberg hydrophilicity/hydrophobicity scale of amino acids.
The input is a string with amino acid sequences. The output is a string containing the rough estimation of hydrophilicity/hydrophobicity of amino acid sequence:
```python
run_aminoacid_tools('ARG', operation = 'calculate_hydrophobicity_eisenberg') # input

Sequence ARG: Hydrophilic # output
```

#### `get_cleavage_sites`
Return amount and coordinates of cleavage sites for motif-specific proteases (casp3, casp6, casp7, enterokinase).

The function finds traces of motifs in amino acid sequence that can be recognized by next site-specific proteases: caspases 3, 6, 7 and enterokinase, then calculates amount of each protease's site and its coordinates. The coordinate is a position of amino acid in C-end of potentially cleaved peptide. Motifs that can be recognized by these proteases were taken from [PeptideCutter](https://web.expasy.org/peptide_cutter/peptidecutter_enzymes.html) documentation.

Some of the proteases permit more than one possible amino acids at a single position. For example, Caspase 6 protease motif is defined as `V, E, H or I, D`. It means there are two possible sequences `...VEHD...` and `...VEID...` containing the protease motif. Please note the function implementation supports 'OR' condition only.

The input is 1) a string with amino acid sequence to be analyzed, 2) subsequence to be found in a sequence. Subsequence is specified as list of lists. Each nested list means more than one possible amino acid at a single position and checked by `or` condition. The output is a string with input amino acid sequence, number and coordinates of cleavage sites for each protease separately:
```python
# input
run_aminoacid_tools('ESDMQDMDSGISLDNDDEEKMQ', operation = 'get_cleavage_sites')

# output
ESDMQDMDSGISLDNDDEEKMQ
1 protease cleavage site(s) for Caspase 3: [6]
0 protease cleavage site(s) for Caspase 6: []
0 protease cleavage site(s) for Caspase 7: []
1 protease cleavage site(s) for Enterokinase: [20]
```

## Troubleshooting

The program automatically checks whether the input amino acid sequence (contains only letters corresponding to one-letter amino acid abbreviations) and the name of the operation are correct. If an error is detected, the program will be interrupted and a corresponding message will be displayed on the screen. Examples:
```python
# incorrect input sequence
ValueError: Incoming sequence ESDMQDMDSGaISLDNDDEEKMQ is not a peptide

# incorrect operation
ValueError: Incorrect operation value
Supported operations: ['get_cleavage_sites', 'calculate_molecular_weight', 'calculate_percentage', 'calculate_pI', 'calculate_hydrophobicity_eisenberg']
```

## Authors

* Kseniia Matveeva (team lead, author of the `get_cleavage_sites` and `run_aminoacid_tools` functions)
* Anastasiya Ivanova (author of the `calculate_molecular_weight` and `calculate_percentage` functions)
* Danila Chernikov (author of the `calculate_pI` and `calculate_hydrophobicity_eisenberg` functions)

![Team photo](team_photo.png)
214 changes: 214 additions & 0 deletions HW4_Matveeva/aminoacid_tools.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,214 @@
from typing import Dict


def calculate_percentage(seq: str) -> str:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

неудачно название :(
буквально "посчитать процент" ... чего? )))

"""
Calculates the percentage of amino acids in the entered amino acid sequence
Arguments:
- seq (str): amino acid sequences to be analyzed
Return:
- str: a string with the percentage of each amino acid
"""
amino_acid_counts: Dict[str, int] = {} # dict to store count of each amino acid
for amino_acid in seq:
if amino_acid in amino_acid_counts:
amino_acid_counts[amino_acid] += 1
else:
amino_acid_counts[amino_acid] = 1
total_amino_acids = len(seq)
amino_acid_percentages = {} # dict to store each amino acid and its %
for amino_acid, count in amino_acid_counts.items():
percentage = round(((count / total_amino_acids) * 100), 2)
amino_acid_percentages[amino_acid] = percentage
Comment on lines +19 to +22

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

В целом-то вы могли сразу пересчитывать значения в amino_acid_counts

return f'Amino acids percentage of the sequence {seq}: {amino_acid_percentages}'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ну и еще раз -- return amino_acid_percentages



def calculate_molecular_weight(seq: str) -> str:
"""
Calculates the molecular weight of entered amino acid sequence
Arguments:
- seq (str): amino acid sequences to be analyzed
Return:
- str: a string with the molecular weight value for amino acid sequence
"""
amino_acid_weights = {
'G': 57.051, 'A': 71.078, 'S': 87.077, 'P': 97.115, 'V': 99.131,
'T': 101.104, 'C': 103.143, 'I': 113.158, 'L': 113.158, 'N': 114.103,
'D': 115.087, 'Q': 128.129, 'K': 128.172, 'E': 129.114, 'M': 131.196,
'H': 137.139, 'F': 147.174, 'R': 156.186, 'Y': 163.173, 'W': 186.210
}
weight = 18.02 # for the H and OH at the termini
for amino_acid in seq:
weight += amino_acid_weights[amino_acid]
return f'Molecular weight of the sequence {seq}: {round(weight, 2)} Da'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

И опять return weight



def calculate_hydrophobicity_eisenberg(sequence):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

нет аннотации типов :(


Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

и докстринги тоже нет :(

# Amino acid hydrophilicity/hydrophobicity scale by Eisengerg
hydrophobicity_values = {
'A': 0.5, 'R': 0.65, 'N': 1.0, 'D': 1.3, 'C': -0.15,
'Q': 1.0, 'E': 1.5, 'G': 0.75, 'H': 0.7, 'I': -1.3,
'L': -1.3, 'K': 0.75, 'M': -1.1, 'F': -1.9, 'P': 0.55,
'S': 0.6, 'T': 0.3, 'W': -0.5, 'Y': -1.65, 'V': -0.9
}

# Calculate sum of hydrophilicities for all amino acids in the sequence
hydrophobicity_sum = sum(hydrophobicity_values.get(aa, 0) for aa in sequence)

# Determine hydrophilicity/hydrophobicity of sequence
if hydrophobicity_sum > 0:
return f"Sequence {sequence}: Hydrophilic"
elif hydrophobicity_sum < 0:
return f"Sequence {sequence}: Hydrophobic"
else:
return f"Sequence {sequence}: Neutral"
Comment on lines +60 to +65

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ну и вот опять же, лучше возвращать уж hydrophobicity_sum. Или значения Hydrophilic, Hydrophobic, Neutral



def calculate_pI(sequence):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

а где аннотация типов? )))

"""Create a dictionary of pK values (COO-, NH3+, R) information taken
from source http://www.sev-chem.narod.ru/spravochnik/piaminoacid.htm"""
pK_values = {
'A': (2.34, 9.60),
'R': (2.17, 9.04, 12.48),
'N': (2.02, 8.80),
'D': (2.09, 9.82, 3.86),
'C': (1.71, 8.33, 10.30),
'Q': (2.17, 9.13),
'E': (2.19, 9.76, 4.25),
'G': (2.34, 9.60),
'H': (1.82, 9.17, 6.00),
'I': (2.32, 9.76),
'L': (2.36, 9.60),
'K': (2.18, 8.95, 10.5),
'M': (2.28, 9.21),
'F': (2.58, 9.24),
'P': (2.00, 10.60),
'S': (2.21, 9.15),
'T': (2.63, 10.43),
'W': (1.22, 9.39),
'Y': (2.20, 9.11, 10.10),
'V': (2.29, 9.72)
}

# Initialization of variables for leftmost and rightmost elements
N_end_pK = None
C_end_pK = None

# Find the marginal elements and their corresponding pKs
for amino_acid in sequence:
if amino_acid in pK_values:
pK_list = pK_values[amino_acid]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Кажется, что название pK_list тут не подходит )))
Особенно с учетом того, что там кортежи )))

if len(pK_list) >= 2:
if N_end_pK is None:
N_end_pK = pK_list[1] # Второй pK
C_end_pK = pK_list[0] # Первый pK

# If no amino acid sequence is specified - return None
if N_end_pK is None or C_end_pK is None:
return None

# Calculate pI
total_pK = N_end_pK + C_end_pK
count = 2 # We take into account the found pKs - there are at least 2

# Also add pK of AA radicals - the dictionary contains 3 pK values
for amino_acid in sequence:
if amino_acid in pK_values:
pK_list = pK_values[amino_acid]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Кажется, что название pK_list тут не подходит )))
Особенно с учетом того, что там кортежи )))

if len(pK_list) >= 3:
total_pK += pK_list[2] # Третий pK
count += 1

# Substitute all found values into the formula and calculate pI
pI = total_pK / count
return f"Isoelectric point for the sequence {sequence}: {pI}"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Всё-таки в таком случае хочется вернуть просто pI (т.е. число).
Может это значение нужно куда-то передавать для последующих расчетов



def find_cleavage_sites(seq: str, motif: list) -> list:
"""Find cleavage sites for motif-specific proteases.
Arguments:
- seq - string sequence to be analyzed
- motif - subsequence to be found in a sequence. Subsequence is specified as list of lists.
Each nested list means more than one possible aminoacid at a single position (checked by OR condition).
Return:
- list of cleavage sites coordinates (C-end aminoacid of *potentially* cleaved sequence)
"""
cleavage_sites = []
seq_idx = 0
while seq_idx < len(seq):
motif_idx = 0
chars_at_motif_idx = motif[motif_idx]
seq_char = seq[seq_idx]
if seq_char in chars_at_motif_idx:
motif_idx += 1
while motif_idx < len(motif):
chars_at_motif_idx = motif[motif_idx]
seq_char = seq[seq_idx+motif_idx]
if seq_char in chars_at_motif_idx:
motif_idx += 1
else:
break
if motif_idx == len(motif):
cleavage_sites.append(seq_idx + motif_idx)
seq_idx += 1
return cleavage_sites
Comment on lines +128 to +155

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Че-то вы тут прям оч сильно переусложнили (скажем мягко) ... )))
Лан, регэкспы нельзя было использовать ... но у вас же "архитектура" сайтов строго задана, не?
Если я правильно понял, у вас всегда идет выбор из 4х позиций. Внутри каждой позиции уже могут быть разные а.к., но позиций всего 4. Соответственно, если я правильно понял, режем после 4ой позиции.

Тогда что нам мешает просто идти окном по последовательности и проверять эти 4 символа ))
Буквально в таком духе:

def find_cleavage_sites_(seq: str, motif: List[List[str]]) -> List[int]:
    pos_1, pos_2, pos_3, pos_4 = motif
    cleavage_sites = []

    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        if (
            window[0] in pos_1 and
            window[1] in pos_2 and
            window[2] in pos_3 and
            window[3] in pos_4
        ):
            cleavage_sites.append(i + 4)

    return cleavage_sites

P.S. функцию не тестировал, так что может ломаться (особенно на каких-нибудь крайностях)

Ну то есть да, для произвольного мотива (где не обязательно 4 сайта) такое не подойдет ... но конкретно в этом случае +- норм )))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Вообще предполагалось, что в словарь с сайтами распознавания пептидаз (motif_dict) можно будет запихать последовательности разной длины и с вырожденностью аминокислот на разных позициях (предшествующих сайту протеолиза). И все они будут обрабатываться данной функцией. Так что да, код заведомо сложнее, чем требовалось для конкретных приведенных случаев🙂

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ну ... оке, тогда что-нить в таком духе:

def find_cleavage_sites(seq: str, motif: List[List[str]]) -> List[int]:
    cleavage_sites = []

    for i in range(len(seq) - len(motif)):
        window = seq[i: i + len(motif)]
        is_match = True
        
        for position_value_idx in range(len(motif)):
            value_in_seq = window[position_value_idx]
            value_in_motif = motif[position_value_idx]

            if value_in_seq not in value_in_motif:
                is_match = False

        if is_match:
            cleavage_sites.append(i + len(motif))

    return cleavage_sites

(P.S. опять же нужно дописать, т.к. 100% где-то ломается)

Но вот так хоть +- понятно. Просто ради интереса попробуй в следующем семестре прочитать этот свой алгоритм )))
Ну или хотя бы через месяц...



motif_dict = {
'Caspase 3': [['D'], ['M'], ['Q'], ['D']],
'Caspase 6': [['V'], ['E'], ['H', 'I'], ['D']],
'Caspase 7': [['D'], ['E'], ['V'], ['D']],
'Enterokinase': [['D', 'E'], ['D', 'E'], ['D', 'E'], ['K']]
}
Comment on lines +158 to +163

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

А вот это уже какая-то специфичная штука для функции get_cleavage_sites. Непонятно, почему она тут, а не в теле функции



def get_cleavage_sites(seq: str) -> str:
"Return amount and coordinates of cleavage sites for proteases, specified in motif_dict"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Докстринга должна быть завернута в 3 кавычки: """text"""

output = f'{seq}\n'
for motif_name, motif_value in motif_dict.items():
sites = find_cleavage_sites(seq, motif_value)
output += f'{len(sites)} protease cleavage site(s) for {motif_name}: {sites}\n'
return output


all_aminoacids = {
'A', 'R', 'N', 'D', 'C', 'H', 'G', 'Q', 'E', 'I',
'L', 'K', 'M', 'P', 'S', 'Y', 'T', 'W', 'F', 'V'
}
Comment on lines +175 to +178

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Нужно было сделать константой (ONE_LETTER_AMINOACIDS) и передвинуть вверх



def is_peptide(seq: str) -> bool:
"Check whether the incoming sequence is an aminoacid"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Докстринга должна быть завернута в 3 кавычки: """text"""

if set(seq).issubset(all_aminoacids): # if set(seq) <= all_aminoacids
return True
raise ValueError(f'Incoming sequence {seq} is not a peptide')
Comment on lines +183 to +185

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

А зачем тут кидать ошибку, если мы можем вернуть False? )))
Ну то есть буквально is_peptide --> True / False. У нас же не какой-нибудь check_...
Ну и в таком случае можно было бы сразу сделать

return set(seq).issubset(all_aminoacids)



operation_dict = {
'get_cleavage_sites': get_cleavage_sites,
'calculate_molecular_weight': calculate_molecular_weight,
'calculate_percentage': calculate_percentage,
'calculate_pI': calculate_pI,
'calculate_hydrophobicity_eisenberg': calculate_hydrophobicity_eisenberg
}


def run_aminoacid_tools(*seqs: str, operation: str) -> str:
"""Run AminoAcid Tools
Arguments:
- *seqs - one or more string sequences to be analyzed
- operation - action to be done with sequence(s)
Return:
- string that contains incoming sequence and result of operation"""
if operation == '':
raise ValueError('Operation value is not specified')
Comment on lines +204 to +205

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

А зачем это проверять? ))
В таком случае operation нужно вызывать всегда (это теперь тип keyword_only_args)

if operation not in operation_dict:
raise ValueError(f'Incorrect operation value\nSupported operations: {list(operation_dict.keys())}')
for seq in seqs:
is_peptide(seq)
output = ''
for seq in seqs:
output += operation_dict[operation](seq)
output += '\n\n'
Comment on lines +208 to +213

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Почему бы is_peptide не проверять в нижнем for'е? )))
  2. Зачем output делать строкой а потом добавлять "\n\n"?
    По сути этот код нужно было написать вот так:
output = []
for seq in seqs:
    is_peptide(seq)
    output.append(operation_dict[operation](seq))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Идея была в том, чтобы сначала сделать проверку последовательностей, и после нее уже вызывать функции. Типа, чтобы после обработки 188-ой последовательности из 190 не вывалилась бы ошибка и все оказалось бы впустую...
    Либо как-то обыграть эту историю так, чтобы ошибка тоже записывалась? Например, записывать результаты выполнения функции в словарь в виде { 'sequence' : result } и сюда же записать ошибку в случае ее возникновения? Просто чтобы в любом случае функция сделала какой-то return с тем, что выполнено...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Вот когда у нас будут ошибки, тогда и разберемся )))

return output
Binary file added HW4_Matveeva/team_photo.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.