HW4_Gorbarenko #14

Open
wants to merge 16 commits into
base: main
174 changes: 174 additions & 0 deletions HW4_Gorbarenko/README.md
@@ -0,0 +1,174 @@
# Welcome to amino_analyzer tool

## Overview
The amino_analyzer is an easy-to-use Python tool designed to facilitate the comprehensive analysis of protein sequences. It provides a broad functionality from basic checks for valid amino acid sequences to more complicated computations like molecular weights, hydrophobicity analysis, and cleavage site identification.

## :green_heart: Key features

### 1. Protein molecular weight calculation
The amino_analyzer offers the capability to calculate the molecular weight of a protein sequence. Users can choose between average and monoisotopic weights.
### 2. Hydrophobicity analysis
This function counts the quantity of hydrophobic and hydrophilic amino acids within a protein sequence.
### 3. Cleavage site identification
Researchers can identify cleavage sites in a given peptide sequence using a specified enzyme. The tool currently supports two commonly used enzymes, trypsin and chymotrypsin.
### 4. One-letter to three-Letter code conversion
The amino_analyzer provides a function to convert a protein sequence from the standard one-letter amino acid code to the three-letter code.
### 5. Sulphur-containing amino acid counting
The tool quickly determines the number of sulphur-containing amino acids, namely Cysteine (C) and Methionine (M), within a protein sequence.

## Usage

To run the amino_analyzer tool, call the ***run_amino_analyzer*** function with the following arguments:

```python
from amino_analyzer import run_amino_analyzer
run_amino_analyzer(sequence, procedure, *, weight_type='average', enzyme='trypsin')
```

- `sequence (str):` The input protein sequence in one-letter code.
- `procedure (str):` The procedure to perform over your protein sequence.
- `weight_type: str = 'average':` keyword argument passed to the `aa_weight` function; `weight_type = 'monoisotopic'` is the other option.
- `enzyme: str = 'trypsin':` keyword argument passed to the `peptide_cutter` function; `enzyme = 'chymotrypsin'` is the other option.


**Available procedures list**
- `aa_weight` - calculates the total weight of the amino acids in a protein sequence.
- `count_hydroaffinity` - counts the quantity of hydrophobic and hydrophilic amino acids in a protein sequence.
- `peptide_cutter` - identifies cleavage sites in a given peptide sequence using a specified enzyme (trypsin or chymotrypsin).
- `one_to_three_letter_code` - converts a protein sequence from one-letter amino acid code to three-letter code.
- `sulphur_containing_aa_counter` - counts sulphur-containing amino acids in a protein sequence.

You can also use each function separately by importing them in advance. Below are the available functions and their respective purposes:

#### 1. **aa_weight** function calculates the weight of amino acids in a protein sequence:
The `weight` argument sets the type of weight to use, either `average` or `monoisotopic`. Default is `average`.
```python
from amino_analyzer import aa_weight
aa_weight(seq: str, weight: str = 'average') -> float
```
```python
sequence = "VLDQRKSTMA"
result = aa_weight(sequence, weight='monoisotopic')
print(result) # Output: 1129.591
```

#### 2. **count_hydroaffinity** counts the quantity of hydrophobic and hydrophilic amino acids in a protein sequence:
```python
from amino_analyzer import count_hydroaffinity
count_hydroaffinity(seq: str) -> tuple
```
```python
sequence = "VLDQRKSTMA"
result = count_hydroaffinity(sequence)
print(result) # Output: (4, 6)
```
#### 3. **peptide_cutter** function identifies cleavage sites in a given peptide sequence using a specified enzyme, trypsin or chymotrypsin:
```python
from amino_analyzer import peptide_cutter
peptide_cutter(sequence: str, enzyme: str = "trypsin") -> str
```
```python
sequence = "VLDQRKSTMA"
result = peptide_cutter(sequence, enzyme="trypsin")
print(result) # Output: Found 2 trypsin cleavage sites at positions 5, 6
```
#### 4. **one_to_three_letter_code** converts a protein sequence from one-letter amino acid code to three-letter code.
```python
from amino_analyzer import one_to_three_letter_code
one_to_three_letter_code(sequence: str) -> str
```

```python
sequence = "VLDQRKSTMA"
result = one_to_three_letter_code(sequence)
print(result) # Output: ValLeuAspGlnArgLysSerThrMetAla
```

#### 5. **sulphur_containing_aa_counter** counts sulphur-containing amino acids in a protein sequence
```python
from amino_analyzer import sulphur_containing_aa_counter
sulphur_containing_aa_counter(sequence: str) -> str
```
```python
sequence = "VLDQRKSTMA"
result = sulphur_containing_aa_counter(sequence)
print(result) # Output: The number of sulphur-containing amino acids in the sequence is equal to 1
```

## Examples
To calculate protein molecular weight:
```python
run_amino_analyzer("VLSPADKTNVKAAW", "aa_weight") # Output: 1481.715

run_amino_analyzer("VLSPADKTNVKAAW", "aa_weight", weight_type = 'monoisotopic') # Output: 1480.804
```

To count hydroaffinity:
```python
run_amino_analyzer("VLSPADKTNVKAAW", "count_hydroaffinity") # Output: (8, 6)
```

To find trypsin/chymotrypsin cleavage sites:
```python
run_amino_analyzer("VLSPADKTNVKAAW", "peptide_cutter") # Output: 'Found 2 trypsin cleavage sites at positions 7, 11'

run_amino_analyzer("VLSPADKTNVKAAWW", "peptide_cutter", enzyme = 'chymotrypsin') # Output: 'Found 1 chymotrypsin cleavage sites at positions 14'
```

To convert to three-letter code and to count sulphur-containing amino acids:
```python
run_amino_analyzer("VLSPADKTNVKAAW", "one_to_three_letter_code") # Output: 'ValLeuSerProAlaAspLysThrAsnValLysAlaAlaTrp'

run_amino_analyzer("VLSPADKTNVKAAWM", "sulphur_containing_aa_counter") # Output: The number of sulphur-containing amino acids in the sequence is equal to 1
```

## Troubleshooting
Here are some common issues you may come across while using the amino-analyzer tool and their possible solutions:

1. **ValueError: Incorrect procedure**
If you receive this error, it means that you provided an incorrect procedure when calling `run_amino_analyzer`. Make sure you choose one of the following procedures: `aa_weight`, `count_hydroaffinity`, `peptide_cutter`, `one_to_three_letter_code`, or `sulphur_containing_aa_counter`.

Example:
```python
run_amino_analyzer("VLSPADKTNVKAAW", "incorrect_procedure")
# Output: ValueError: Incorrect procedure. Acceptable procedures: aa_weight, count_hydroaffinity, peptide_cutter, one_to_three_letter_code, sulphur_containing_aa_counter
```

2. **ValueError: Incorrect sequence**
This error occurs if the input sequence provided to run_amino_analyzer contains characters that are not valid amino acids. Make sure your sequence only contains valid amino acid characters (V, I, L, E, Q, D, N, H, W, F, Y, R, K, S, T, M, A, G, P, C, v, i, l, e, q, d, n, h, w, f, y, r, k, s, t, m, a, g, p, c).

Example:
```python
run_amino_analyzer("VLSPADKTNVKAAW!", "aa_weight")
# Output: ValueError: Incorrect sequence. Only amino acids are allowed (V, I, L, E, Q, D, N, H, W, F, Y, R, K, S, T, M, A, G, P, C, v, i, l, e, q, d, n, h, w, f, y, r, k, s, t, m, a, g, p, c).
```

3. **You have chosen an enzyme that is not provided**
This message is returned if you provide an enzyme other than "trypsin" or "chymotrypsin" when calling peptide_cutter. Note that peptide_cutter returns it as a string rather than raising a ValueError. Make sure to use one of the specified enzymes.

Example:
```python
peptide_cutter("VLSPADKTNVKAAW", "unknown_enzyme")
# Output: You have chosen an enzyme that is not provided. Please choose between trypsin and chymotrypsin.
```
4. **AttributeError: 'int' object has no attribute 'upper'**
If you encounter this error, it means that you passed a non-string value (for example, a number) as the sequence. Ensure that you're using the correct function and passing the sequence as a `str`.

Example:
```python
result = count_hydroaffinity(123)
# Output: AttributeError: 'int' object has no attribute 'upper'
```
## Development team:
![image](https://github.com/CaptnClementine/HW4_Gorbarenko/assets/131146976/ad89e427-5b2a-4b32-b65f-519d284fcaa7)

**Anastasia Gorbarenko** - team leader, author of aa_weight and count_hydroaffinity functions
**Anna Ogurtsova** - author of peptide_cutter and one_to_three_letter_code functions
**Ilya Popov** - author of main and sulphur_containing_aa_counter functions

## Contacts
If you have any questions, suggestions, or encounter any issues while using the amino-analyzer tool, feel free to reach out:


- **GitHub**: [Cucumberan](https://github.com/YourGitHubUsername), [CaptnClementine](https://github.com/YourGitHubUsername), [iliapopov17](https://github.com/YourGitHubUsername)

196 changes: 196 additions & 0 deletions HW4_Gorbarenko/amino_analyzer.py
@@ -0,0 +1,196 @@
from typing import List

Suggested change

It's not that the import is unnecessary; I added a blank line. As far as I remember, PEP 8 allows one blank line after imports, but almost all the code I have looked at uses two. I did not deduct points for this.

def is_aa(seq: str) -> bool:
    """
    Check if a sequence contains only amino acids.

    Args:
        seq (str): The input sequфence to be checked.

Suggested change
        seq (str): The input sequфence to be checked.
        seq (str): The input sequence to be checked.

    Returns:
        bool: True if the sequence contains only amino acids, False otherwise.
    """
    aa_list = ['V', 'I', 'L', 'E', 'Q', 'D', 'N', 'H', 'W', 'F', 'Y', 'R', 'K', 'S', 'T', 'M', 'A', 'G', 'P', 'C',
               'v', 'i', 'l', 'e', 'q', 'd', 'n', 'h', 'w', 'f', 'y', 'r', 'k', 's', 't', 'm', 'a', 'g', 'p', 'c']
Comment on lines +13 to +14

Better to make this a constant.

    unique_chars = set(seq)
    amino_acids = set(aa_list)

In that case, it is better to create a set right away, since you define it yourself.

    return unique_chars <= amino_acids
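
The two review comments on `is_aa` (make the alphabet a constant, and store it as a set from the start) could be combined roughly as follows. This is a minimal sketch rather than the code from the pull request; the constant name `AMINO_ACIDS` is hypothetical.

```python
# Hypothetical module-level constant: the 20 standard residues in both cases.
AMINO_ACIDS = frozenset("VILEQDNHWFYRKSTMAGPC" + "vileqdnhwfyrkstmagpc")


def is_aa(seq: str) -> bool:
    """Return True if every character of seq is a one-letter amino acid code."""
    # set(seq) <= AMINO_ACIDS is a subset test; an empty string also passes.
    return set(seq) <= AMINO_ACIDS
```

A frozenset is built once at import time and cannot be modified by accident, which is why it is used here instead of a list converted on every call.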



def choose_weight(weight: str) -> List[float]:
    """
    Choose the weight type of amino acids - average or monoisotopic.

    Args:
        weight (str): The type of weight to choose, either 'average' or 'monoisotopic'.

    Returns:
        List[float]: A list of amino acid weights based on the chosen type.
    """
    if weight == 'average':
        average_weights = [71.0788, 156.1875, 114.1038, 115.0886, 103.1388, 129.1155, 128.1307, 57.0519, 137.1411, 113.1594,
                           113.1594, 128.1741, 131.1926, 147.1766, 97.1167, 87.0782, 101.1051, 186.2132, 163.1760, 99.1326]
Comment on lines +32 to +33

  • I don't like this notation conceptually. I understand that you build the dictionary on the fly in the aa_weight function, but this is very inconvenient to read. What if you made a mistake in the weight of a particular amino acid? Then you would have to count along to the right number, which is neither convenient nor readable. It is better to make a dictionary here right away.

  • There is no need to create these weight lists under different variable names; you could simply name them weights_aa, and then the list (or better, a dictionary) that passed the if/elif condition would be returned.

Author

The simplest way is just to make the lists first and then the dictionary :) (because lists in the right order can be found on the internet, but dictionaries cannot :D)
I'll keep it in mind for the future, thanks!
I didn't quite understand the second comment: in the end the function does return exactly the list that is needed.

Look, I googled a dictionary of weights and found the one we need in the very first link: https://stackoverflow.com/questions/27361877/how-i-can-add-the-molecular-weight-of-my-new-list-of-strings-using-a-mw-dictiona . With lists it is harder to verify later that there is no mistake anywhere.

The second comment was about the fact that, for example, in the if branch you first create the list average_weights and on the next line you create the variable weights_aa = average_weights. There is simply no need to create the separate variables average_weights and monoisotopic_weights here.

        weights_aa = average_weights
    elif weight == 'monoisotopic':
        monoisotopic_weights = [71.03711, 156.10111, 114.04293, 115.02694, 103.00919, 129.04259, 128.05858, 57.02146, 137.05891, 113.08406,
                                113.08406, 128.09496, 131.04049, 147.06841, 97.05276, 87.03203, 101.04768, 186.07931, 163.06333, 99.06841]
        weights_aa = monoisotopic_weights
    else:
        raise ValueError(f"I do not know what '{weight}' is :( \n Read help or just do not write anything except your sequence")

    return weights_aa


def aa_weight(seq: str, weight: str = 'average') -> float:
    """
    Calculate the amino acids weight in a protein sequence.

    Args:
        seq (str): The amino acid sequence to calculate the weight for.
        weight (str, optional): The type of weight to use, either 'average' or 'monoisotopic'. Default is 'average'.

    Returns:
        float: The calculated weight of the amino acid sequence.
    """
    aa_list = str('A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V').split(', ')

Why write it this way? First you take a string, then convert it to a string again, and then make a list out of the string. It is better to create the list right away and declare it as a constant (and move it out of the function).

Suggested change
    aa_list = str('A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V').split(', ')
    AA_LIST = ['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']

    weights_aa = choose_weight(weight)
    aa_to_weight = dict(zip(aa_list, weights_aa))
    final_weight = 0
    for i in seq.upper():
        final_weight += aa_to_weight.get(i, 0)

The get method is not needed here. It is better to simply look the value up in the dictionary.

Suggested change
        final_weight += aa_to_weight.get(i, 0)
        final_weight += aa_to_weight[i]

    return round(final_weight, 3)
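
The review comments on `choose_weight` and `aa_weight` ask for dictionaries instead of parallel lists and for direct lookup instead of `get`. A minimal sketch of that idea is shown below. It is an illustration under assumptions, not the submitted code: the constant name `AA_WEIGHTS` is hypothetical, and the numbers are the ones from the lists above, paired with residues in the order of `aa_list`.

```python
# Hypothetical constant: residue weights keyed by weight type, then by one-letter code.
AA_WEIGHTS = {
    'average': {
        'A': 71.0788, 'R': 156.1875, 'N': 114.1038, 'D': 115.0886, 'C': 103.1388,
        'E': 129.1155, 'Q': 128.1307, 'G': 57.0519, 'H': 137.1411, 'I': 113.1594,
        'L': 113.1594, 'K': 128.1741, 'M': 131.1926, 'F': 147.1766, 'P': 97.1167,
        'S': 87.0782, 'T': 101.1051, 'W': 186.2132, 'Y': 163.1760, 'V': 99.1326,
    },
    'monoisotopic': {
        'A': 71.03711, 'R': 156.10111, 'N': 114.04293, 'D': 115.02694, 'C': 103.00919,
        'E': 129.04259, 'Q': 128.05858, 'G': 57.02146, 'H': 137.05891, 'I': 113.08406,
        'L': 113.08406, 'K': 128.09496, 'M': 131.04049, 'F': 147.06841, 'P': 97.05276,
        'S': 87.03203, 'T': 101.04768, 'W': 186.07931, 'Y': 163.06333, 'V': 99.06841,
    },
}


def aa_weight(seq: str, weight: str = 'average') -> float:
    """Sum the residue weights of seq for the chosen weight type."""
    if weight not in AA_WEIGHTS:
        raise ValueError(f"Unknown weight type: '{weight}'")
    weights = AA_WEIGHTS[weight]
    # Direct indexing raises KeyError on an unexpected character instead of
    # silently adding 0, as weights.get(residue, 0) would.
    return round(sum(weights[residue] for residue in seq.upper()), 3)


print(aa_weight("VLSPADKTNVKAAW"))                  # 1481.715
print(aa_weight("VLSPADKTNVKAAW", 'monoisotopic'))  # 1480.804
```

With each weight written next to its residue, a wrong value can be spotted without counting positions in a list.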


def count_hydroaffinity(seq: str) -> tuple:
    """
    Count the quantity of hydrophobic and hydrophilic amino acids in a protein sequence.

    Args:
        seq (str): The protein sequence for which to count hydrophobic and hydrophilic amino acids.

    Returns:
        tuple: A tuple containing the count of hydrophobic and hydrophilic amino acids, respectively.
    """
    hydrophobic_aa = ['A', 'V', 'L', 'I', 'P', 'F', 'W', 'M']
    hydrophilic_aa = ['R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'K', 'S', 'T', 'Y']
Comment on lines +75 to +76

Better to make these constants.

    hydrophobic_count = 0
    hydrophilic_count = 0

    seq = seq.upper()

    for aa in seq:
        if aa in hydrophobic_aa:
            hydrophobic_count += 1
        elif aa in hydrophilic_aa:
            hydrophilic_count += 1

    return hydrophobic_count, hydrophilic_count
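
The suggestion to turn the two residue groups into constants might look like the sketch below. The names `HYDROPHOBIC_AA` and `HYDROPHILIC_AA` are hypothetical; the residue groups themselves are the ones used in the function above.

```python
# Hypothetical module-level constants holding the two residue groups.
HYDROPHOBIC_AA = frozenset("AVLIPFWM")
HYDROPHILIC_AA = frozenset("RNDCQEGHKSTY")


def count_hydroaffinity(seq: str) -> tuple:
    """Return (hydrophobic_count, hydrophilic_count) for seq, ignoring case."""
    seq = seq.upper()
    # Each membership test yields True or False, and sum() counts the True values.
    hydrophobic_count = sum(aa in HYDROPHOBIC_AA for aa in seq)
    hydrophilic_count = sum(aa in HYDROPHILIC_AA for aa in seq)
    return hydrophobic_count, hydrophilic_count


print(count_hydroaffinity("VLSPADKTNVKAAW"))  # (8, 6)
```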

Suggested change

def peptide_cutter(sequence: str, enzyme: str = "trypsin") -> str:
    """
    This function identifies cleavage sites in a given peptide sequence using a specified enzyme.

    Args:
        sequence (str): The input peptide sequence.
        enzyme (str): The enzyme to be used for cleavage. Choose between "trypsin" and "chymotrypsin". Default is "trypsin".

    Returns:
        str: A message indicating the number and positions of cleavage sites, or an error message if an invalid enzyme is provided.
    """
    cleavage_sites = []
    if enzyme not in ("trypsin", "chymotrypsin"):
        return "You have chosen an enzyme that is not provided. Please choose between trypsin and chymotrypsin."

    if enzyme == "trypsin": # Trypsin cuts peptide chains mainly at the carboxyl side of the amino acids lysine or arginine.
        for i in range(len(sequence)-1):

Pay more attention to naming. i is a poor name here.

Suggested change
        for i in range(len(sequence)-1):
        for i in range(len(sequence) - 1):

            if sequence[i] in ['K', 'R', 'k', 'r'] and sequence[i+1] not in ['P','p']:
                cleavage_sites.append(i+1)
Comment on lines +108 to +109

Suggested change
            if sequence[i] in ['K', 'R', 'k', 'r'] and sequence[i+1] not in ['P','p']:
                cleavage_sites.append(i+1)
            if sequence[i] in ['K', 'R', 'k', 'r'] and sequence[i + 1] not in ['P','p']:
                cleavage_sites.append(i + 1)

    if enzyme == "chymotrypsin": # Chymotrypsin preferentially cleaves at Trp, Tyr and Phe in position P1(high specificity)
        for i in range(len(sequence)-1):
            if sequence[i] in ['W', 'Y', 'F', 'w', 'y', 'f'] and sequence[i+1] not in ['P','p']:
                cleavage_sites.append(i+1)

    if cleavage_sites:
        return f"Found {len(cleavage_sites)} {enzyme} cleavage sites at positions {', '.join(map(str, cleavage_sites))}"
    else:
        return f"No {enzyme} cleavage sites were found."
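
The two enzyme branches above differ only in the set of residues they look for, so the rule (cut after a target residue unless the next residue is proline) can be written once. The sketch below is one possible way to do it, with descriptive loop names as the review asks. The names `CLEAVAGE_RESIDUES` and `find_cleavage_sites` are hypothetical, and raising `ValueError` for an unknown enzyme is an assumption that differs from the submitted code, which returns a message string.

```python
# Hypothetical constant: residues after which each enzyme cleaves.
CLEAVAGE_RESIDUES = {
    "trypsin": frozenset("KR"),
    "chymotrypsin": frozenset("WYF"),
}


def find_cleavage_sites(sequence: str, enzyme: str = "trypsin") -> list:
    """Return 1-based positions after which the enzyme cuts the sequence."""
    if enzyme not in CLEAVAGE_RESIDUES:
        raise ValueError(f"Unsupported enzyme '{enzyme}'. "
                         f"Choose between {' and '.join(CLEAVAGE_RESIDUES)}.")
    targets = CLEAVAGE_RESIDUES[enzyme]
    sequence = sequence.upper()
    sites = []
    # Pair each residue with its successor; the last residue has no successor,
    # so no site is reported at the C-terminus (same as the loops above).
    pairs = zip(sequence, sequence[1:])
    for position, (residue, next_residue) in enumerate(pairs, start=1):
        if residue in targets and next_residue != "P":
            sites.append(position)
    return sites


print(find_cleavage_sites("VLSPADKTNVKAAW"))                   # [7, 11]
print(find_cleavage_sites("VLSPADKTNVKAAWW", "chymotrypsin"))  # [14]
```

Returning a list of positions keeps the formatting of the human-readable message separate from the search itself.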


def one_to_three_letter_code(sequence: str) -> str:
    """
    This function converts a protein sequence from one-letter amino acid code to three-letter code.

    Args:
        sequence (str): The input protein sequence in one-letter code.

    Returns:
        str: The converted protein sequence in three-letter code.
    """
    amino_acids = {
        'A': 'Ala', 'C': 'Cys', 'D': 'Asp', 'E': 'Glu', 'F': 'Phe',
        'G': 'Gly', 'H': 'His', 'I': 'Ile', 'K': 'Lys', 'L': 'Leu',
        'M': 'Met', 'N': 'Asn', 'P': 'Pro', 'Q': 'Gln', 'R': 'Arg',
        'S': 'Ser', 'T': 'Thr', 'V': 'Val', 'W': 'Trp', 'Y': 'Tyr'
    }
Comment on lines +132 to +137

Better to make this a constant and move it outside the functions.

    three_letter_code = [amino_acids.get(aa.upper()) for aa in sequence]
    return ''.join(three_letter_code)

I would rather use '-'.join(three_letter_code): the output is more readable that way (and, if memory serves, that is how three-letter amino acid residues are written). For example, Val-Ile-Leu-Glu (instead of ValIleLeuGlu).

Suggested change
    return ''.join(three_letter_code)
    return '-'.join(three_letter_code)
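
Taken together, the two suggestions for this function (a module-level constant and a hyphen-separated result) could be sketched as follows. The constant name `THREE_LETTER_CODES` is hypothetical, and direct indexing is used here on the assumption that the sequence has already been validated.

```python
# Hypothetical module-level constant with the same mapping as in the function above.
THREE_LETTER_CODES = {
    'A': 'Ala', 'C': 'Cys', 'D': 'Asp', 'E': 'Glu', 'F': 'Phe',
    'G': 'Gly', 'H': 'His', 'I': 'Ile', 'K': 'Lys', 'L': 'Leu',
    'M': 'Met', 'N': 'Asn', 'P': 'Pro', 'Q': 'Gln', 'R': 'Arg',
    'S': 'Ser', 'T': 'Thr', 'V': 'Val', 'W': 'Trp', 'Y': 'Tyr',
}


def one_to_three_letter_code(sequence: str) -> str:
    """Convert a one-letter sequence to hyphen-separated three-letter codes."""
    # Indexing raises KeyError for a character that is not an amino acid.
    return '-'.join(THREE_LETTER_CODES[aa] for aa in sequence.upper())


print(one_to_three_letter_code("VLSP"))  # Val-Leu-Ser-Pro
```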


A blank line got lost here.

Suggested change

def sulphur_containing_aa_counter(sequence: str) -> str:
    """
    This function counts sulphur-containing amino acids (Cysteine and Methionine) in a protein sequence.

    Args:
        sequence (str): The input protein sequence in one-letter code.

    Returns:
        str: The number of sulphur-containing amino acids in a protein sequence.
    """
    counter = 0
    for i in sequence:

Poor naming. It would have been better to call it aminoacid, or at least aa, rather than i.

        if i == 'C' or i == 'M':

This logic introduces a bug: you check only uppercase letters, while the error raised for an invalid sequence in your main function says "Only amino acids are allowed (V, I, L, E, Q, D, N, H, W, F, Y, R, K, S, T, M, A, G, P, C, v, i, l, e, q, d, n, h, w, f, y, r, k, s, t, m, a, g, p, c".

So I might want to pass the sequence 'mmmmm', and the result will be 0.

            counter += 1
    answer = str(counter)
    return 'The number of sulphur-containing amino acids in the sequence is equal to ' + answer
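
The case bug described in the review comment (a lowercase sequence such as 'mmmmm' counts as 0) disappears once the sequence is upper-cased before comparison. A minimal sketch follows. Returning an `int` instead of the message string is an assumption made for the illustration, and `SULPHUR_AA` is a hypothetical name.

```python
# Hypothetical constant: one-letter codes of the sulphur-containing residues.
SULPHUR_AA = frozenset("CM")


def sulphur_containing_aa_counter(sequence: str) -> int:
    """Count cysteine (C) and methionine (M) residues, ignoring case."""
    return sum(aa in SULPHUR_AA for aa in sequence.upper())


print(sulphur_containing_aa_counter("mmmmm"))            # 5
print(sulphur_containing_aa_counter("VLSPADKTNVKAAWM"))  # 1
```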

Suggested change

def run_amino_analyzer(sequence: str, procedure: str, *, weight_type: str = 'average', enzyme: str = 'trypsin'):
    """
    This is the main function to run the amino-analyzer.py tool.

    Args:
        sequence (str): The input protein sequence in one-letter code.
        procedure (str): amino-analyzer.py tool has 5 functions at all:
            1. aa_weight - Calculate the amino acids weight in a protein sequence.
            2. count_hydroaffinity - Count the quantity of hydrophobic and hydrophilic amino acids in a protein sequence.
            3. peptide_cutter - This function identifies cleavage sites in a given peptide sequence using a specified enzyme.
            4. one_to_three_letter_code - This function converts a protein sequence from one-letter amino acid code to three-letter code.
            5. sulphur_containing_aa_counter - This function counts sulphur-containing amino acids in a protein sequence.
        weight_type = 'average': default argument for 'aa_weight' function. weight_type = 'monoisotopic' can be used as a second option.
        enzyme = 'trypsin': default argument for 'peptide_cutter' function. enzyme = 'chymotrypsin' can be used as a second option.

    Returns:
        The result of the procedure.
    """

    procedures = ['aa_weight', 'count_hydroaffinity', 'peptide_cutter', 'one_to_three_letter_code', 'sulphur_containing_aa_counter']
    if procedure not in procedures:
        raise ValueError(f"Incorrect procedure. Acceptable procedures: {', '.join(procedures)}")

    for i in sequence:
        if not is_aa(sequence):
Comment on lines +182 to +183

The for loop is not needed here. The is_aa function already converts the string to a set and analyzes the sequence.

Suggested change
    for i in sequence:
        if not is_aa(sequence):
    if not is_aa(sequence):

            raise ValueError("Incorrect sequence. Only amino acids are allowed (V, I, L, E, Q, D, N, H, W, F, Y, R, K, S, T, M, A, G, P, C, v, i, l, e, q, d, n, h, w, f, y, r, k, s, t, m, a, g, p, c).")

    if procedure == 'aa_weight':
        result = aa_weight(sequence, weight_type)
    elif procedure == 'count_hydroaffinity':
        result = count_hydroaffinity(sequence)
    elif procedure == 'peptide_cutter':
        result = peptide_cutter(sequence, enzyme)
    elif procedure == 'one_to_three_letter_code':
        result = one_to_three_letter_code(sequence)
    elif procedure == 'sulphur_containing_aa_counter':
        result = sulphur_containing_aa_counter(sequence)
    return result
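
The procedure list and the if/elif chain in `run_amino_analyzer` repeat the same five names. One common alternative, shown here only as a sketch, is a dictionary that maps each procedure name to its function, so the name check and the call use the same table. Everything in the block is hypothetical: `count_sulphur` and `sequence_length` are stand-ins that let the example run on its own, and `PROCEDURES` and `run_procedure` are illustrative names, not part of the pull request.

```python
# Stand-in procedures so the sketch runs on its own; in the real module these
# would be aa_weight, count_hydroaffinity, peptide_cutter and so on.
def count_sulphur(sequence: str) -> int:
    return sum(aa in "CM" for aa in sequence.upper())


def sequence_length(sequence: str) -> int:
    return len(sequence)


# Hypothetical dispatch table: procedure name -> callable.
PROCEDURES = {
    "count_sulphur": count_sulphur,
    "sequence_length": sequence_length,
}


def run_procedure(sequence: str, procedure: str, **kwargs):
    """Validate the procedure name, then call the matching function."""
    if procedure not in PROCEDURES:
        raise ValueError(
            f"Incorrect procedure. Acceptable procedures: {', '.join(PROCEDURES)}"
        )
    # Extra keyword arguments (e.g. a weight type or an enzyme) are passed through.
    return PROCEDURES[procedure](sequence, **kwargs)


print(run_procedure("VLSPADKTNVKAAWM", "count_sulphur"))    # 1
print(run_procedure("VLSPADKTNVKAAWM", "sequence_length"))  # 15
```

Adding a new procedure then means adding one dictionary entry, and the error message stays in sync with the table automatically.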