HW4 Shtompel #23

Open
wants to merge 12 commits into
base: main
from
50 changes: 50 additions & 0 deletions HW4_Shtompel/README.md
@@ -0,0 +1,50 @@
# Protein_tools
Member

A nice, solid README! Overall you have covered everything that is needed. The one thing: it is good to have an actual usage example, a line of code that can be copied and pasted right away to see what the output is (and compare it with what the README says). That is slightly missing here :)

### Overview
**Protein_tools** is a tool for basic analysis of protein and polypeptide sequences. With it you can determine the sequence length, charge, aminoacid composition and mass of a protein, calculate its aliphatic index and count the sites where the protein can be cleaved by trypsin.

### Usage
To use **Protein_tools**, clone this repository with `git clone`. To run the tool, call:
`run_protein_tools('<sequence>', '<procedure>')`, where `<sequence>` is the protein sequence (or several sequences) to be analysed and `<procedure>` is the name of the option to apply to the sequence(s). Pass the sequences and the option name as quoted strings separated by commas, with the option name last; use only one option per call and make sure that the sequences use the one-letter names of aminoacids (the case is not important).

### Options
1. `count_seq_length`: counts the length of the protein sequence and outputs the number of aminoacids.
2. `classify_aminoacids`: classifies all aminoacids from the input sequence according to the 'AA_ALPHABET' classification. An aminoacid that is not included in this list is classified as 'Unusual'.

AA_ALPHABET classification:
| Class | Aminoacids |
|----------|-----------|
| Nonpolar | G, A, V, I, L, P|
| Polar uncharged | S, T, C, M, N, Q |
| Aromatic | F, W, Y |
| Polar with negative charge | D, E |
| Polar with positive charge | K, R, H |

3. `check_unusual_aminoacids`: checks the aminoacid composition and returns the list of unusual aminoacids if they are present in the sequence. An aminoacid is called unusual when it does not belong to the list of proteinogenic aminoacids (see AA_ALPHABET classification).
4. `count_charge`: calculates the charge of the protein as the number of positively charged aminoacids minus the number of negatively charged ones.
5. `count_protein_mass`: calculates the total mass of all aminoacids of the input sequence in g/mol.
6. `count_aliphatic_index`: calculates the aliphatic index, the relative proportion of aliphatic aminoacids in the input peptide. The higher the aliphatic index, the higher the thermostability of the peptide.
7. `count_trypsin_sites`: counts the number of valid trypsin cleavage sites: Arginine/any aminoacid and Lysine/any aminoacid pairs, except when the second aminoacid is Proline. If the peptide has no trypsin cleavage sites, it returns zero.

### Examples
An illustration of the capabilities of **Protein_tools** using a random protein sequence is presented below:
*sequence:* CVWGWAMGEACPNPIKINISAYAKTWYQNGPIGRCCCWVGYTAIRFPHQEMQQNTRFNKP

| Option | Output |
|--------|---------|
| count_seq_length | 60 |
| classify_aminoacids | 'Nonpolar': 22, 'Polar uncharged': 20, 'Aromatic': 9, 'Polar with negative charge': 2, 'Polar with positive charge': 7, 'Unusual': 0 |
| check_unusual_aminoacids | This sequence contains only proteinogenic aminoacids. |
| count_charge | 5 |
| count_protein_mass | 6918.99 |
| count_aliphatic_index | 0.5049999999999999 |
| count_trypsin_sites | 5 |
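
The `count_seq_length`, `classify_aminoacids` and `count_charge` rows above can be cross-checked without the module itself. What follows is a minimal, self-contained sketch rather than the project's code: `AA_GROUPS`, `classify` and `EXAMPLE` are hypothetical names, and the groups are copied from the classification table in the Options section.

```python
# Hypothetical stand-alone check; groups copied from the README classification table.
AA_GROUPS = {
    'Nonpolar': 'GAVILP',
    'Polar uncharged': 'STCMNQ',
    'Aromatic': 'FWY',
    'Polar with negative charge': 'DE',
    'Polar with positive charge': 'KRH',
}


def classify(sequence: str) -> dict:
    """Count residues per group; anything outside the groups is 'Unusual'."""
    counts = dict.fromkeys([*AA_GROUPS, 'Unusual'], 0)
    for residue in sequence.upper():
        for group, letters in AA_GROUPS.items():
            if residue in letters:
                counts[group] += 1
                break
        else:  # no group matched this residue
            counts['Unusual'] += 1
    return counts


EXAMPLE = 'CVWGWAMGEACPNPIKINISAYAKTWYQNGPIGRCCCWVGYTAIRFPHQEMQQNTRFNKP'
classes = classify(EXAMPLE)
# Charge = positively charged residues minus negatively charged residues.
charge = classes['Polar with positive charge'] - classes['Polar with negative charge']
print(len(EXAMPLE), charge)  # → 60 5
```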

### Limitations and troubleshooting
**Protein_tools** has several limitations that can cause errors while the program runs. Here are some of them:
1. **Protein_tools** works only with protein sequences that contain letters of the Latin alphabet (the case is not important); every aminoacid must also be encoded by one letter. If there are other symbols in the sequence, the tool raises `ValueError` *"One of these sequences is not protein sequence or does not match the rools of input. Please select another sequence."*. In this case, check whether your sequence contains punctuation marks, spaces or other symbols.
2. Be careful to work only with sequences in which aminoacids are encoded with one letter. If your sequence is "SerMetAlaGly", **Protein_tools** reads it as "SERMETALAGLY".
Member

A very good point! It would be great to add support for the three-letter code in version 2.0 (half joking).
But really, I think almost everyone remembers the three-letter codes off the top of their head, and not everyone remembers all of the one-letter ones.

3. The available functions are listed in the "Options" section. If you see `ValueError` *"This procedure is not available. Please choose another procedure."*, the name of the function is probably misspelled. Please check the name of the chosen procedure and make sure that it is available in **Protein_tools**.

Member

🔥

### Contribution and contacts
- Shtompel Anastasia (Telegram: @Aenye) — teamlead, developer (options 'count_protein_mass', 'count_aliphatic_index', 'count_trypsin_sites')
- Chevokina Elizaveta (Telegram: @lzchv) — developer (options 'count_seq_length', 'classify_aminoacids', 'check_unusual_aminoacids', 'count_charge'), author of README file
36 changes: 36 additions & 0 deletions HW4_Shtompel/pre-commit
Member

Well, this probably does not belong here.

@@ -0,0 +1,36 @@
#!/bin/bash

echo "Hi! I'm your pre-commit code checker."

FILE="dna_rna_tools.py"
TESTS="${FILE%.py}_test.py"

if [ -f $FILE ]; then

    if [ ! -f hooks_env/bin/activate ]; then
        echo "For the first time I need to prepare an environment, give me a minute..."
        python3 -m venv hooks_env
        source hooks_env/bin/activate
        python3 -m pip install --upgrade pip --quiet
        pip install pytest flake8 flake8-bugbear pep8-naming flake8-builtins flake8-functions-names flake8-variables-names pep8-naming pylint mypy --quiet
        echo "hooks_env" >> .gitignore
        echo ".gitignore" >> .gitignore
    else
        source hooks_env/bin/activate
    fi

    echo "$(tput setab 7 setaf 1)>>>> Code quality checks <<<<$(tput sgr 0)"
    echo ">>>> flake8 check"
    flake8 $FILE
    echo ">>>> pylint check"
    pylint $FILE
    echo ">>>> mypy check"
    mypy $FILE

    deactivate

else

    echo "Seems no python code to be checked. You can configure me in .git/hook/pre-commit"

fi
200 changes: 200 additions & 0 deletions HW4_Shtompel/protein_tools.py
@@ -0,0 +1,200 @@
"""
Global variables:
Member

Well, these are constants rather than globals. I have never actually come across this kind of comment for constants, but if the module has a docstring anyway, adding this information seems to make sense.

- AA_ALPHABET — a dictionary variable that contains a list of proteinogenic aminoacids classes.
- ALL_AMINOACIDS — a set variable that contains a list of all proteinogenic aminoacids.
- AMINO_ACIDS_MASSES — a dictionary variable that contains masses of all proteinogenic aminoacids.
"""

AA_ALPHABET = {'Nonpolar': ['G', 'A', 'V', 'I', 'L', 'P'],
Member

Great that you formatted this properly!

Member

Although this is not really an alphabet :)
These are amino acid groups.

Suggested change
AA_ALPHABET = {'Nonpolar': ['G', 'A', 'V', 'I', 'L', 'P'],
AA_GROUPS = {'Nonpolar': ['G', 'A', 'V', 'I', 'L', 'P'],

               'Polar uncharged': ['S', 'T', 'C', 'M', 'N', 'Q'],
               'Aromatic': ['F', 'W', 'Y'],
               'Polar with negative charge': ['D', 'E'],
               'Polar with positive charge': ['K', 'R', 'H']
               }

ALL_AMINOACIDS = set(('G', 'A', 'V', 'I', 'L', 'P', 'S', 'T', 'C', 'M', 'N', 'Q', 'F', 'W', 'Y', 'D', 'E', 'K', 'R', 'H'))
Member

Suggested change
ALL_AMINOACIDS = set(('G', 'A', 'V', 'I', 'L', 'P', 'S', 'T', 'C', 'M', 'N', 'Q', 'F', 'W', 'Y', 'D', 'E', 'K', 'R', 'H'))
AMINOACIDS = {'G', 'A', 'V', 'I', 'L', 'P',
              'S', 'T', 'C', 'M', 'N', 'Q',
              'F', 'W', 'Y', 'D', 'E', 'K',
              'R', 'H'
              }


AMINO_ACIDS_MASSES = {
    'G': 57.05, 'A': 71.08, 'S': 87.08, 'P': 97.12, 'V': 99.13,
    'T': 101.1, 'C': 103.1, 'L': 113.2, 'I': 113.2, 'N': 114.1,
    'D': 115.1, 'Q': 128.1, 'K': 128.2, 'E': 129.1, 'M': 131.2,
    'H': 137.1, 'F': 147.2, 'R': 156.2, 'Y': 163.2, 'W': 186.2
}


def is_protein(seq: str) -> bool:
    """
    Input: a protein sequence (a str type).
    Output: boolean value.
    'is_protein' function checks if the sequence contains only letters in the upper case.
Member

This should be the first line of the docstring, followed by the arguments and the result.

    """
    if seq.isalpha() and seq.isupper():
        return True
Comment on lines +31 to +32
Member

And what about the else case? It returns None, although logically it should return False. There was no need to overcomplicate it here:

Suggested change
    if seq.isalpha() and seq.isupper():
        return True
    return seq.isalpha() and seq.isupper()



def count_seq_length(seq: str) -> int:
    """
    Input: a protein sequence (a str type).
    Output: length of protein sequence (an int type).
    'count_seq_length' function counts the length of protein sequence.
    """
    return len(seq)
Comment on lines +35 to +41
Member

Well, this is quite a primitive function :)
With support for the three-letter code it would at least be somewhat better.



def classify_aminoacids(seq: str) -> dict:
Member

Cool idea, 👍

    """
    Input: a protein sequence (a str type).
    Output: a classification of all aminoacids from the sequence (a dict type, the 'all_aminoacids_classes' variable).
    'classify_aminoacids' function classifies all aminoacids from the input sequence in accordance with the 'AA_ALPHABET' classification. If aminoacid is not included in this list,
    it should be classified as 'Unusual'.
    """
    all_aminoacids_classes = dict.fromkeys(['Nonpolar', 'Polar uncharged', 'Aromatic', 'Polar with negative charge', 'Polar with positive charge', 'Unusual'], 0)
    for aminoacid in seq:
        aminoacid = aminoacid.upper()
        if aminoacid not in ALL_AMINOACIDS:
            all_aminoacids_classes['Unusual'] += 1
        for aa_key, aa_value in AA_ALPHABET.items():
            if aminoacid in aa_value:
                all_aminoacids_classes[aa_key] += 1
    return all_aminoacids_classes


def check_unusual_aminoacids(seq: str) -> str:
    """
    Input: a protein sequence (a str type).
    Output: an answer whether the sequence contains unusual aminoacids (a str type).
    'check_unusual_aminoacids' function checks the composition of aminoacids and returns the list of unusual aminoacids if they are present in the sequence. We call the aminoacid
    unusual when it does not belong to the list of proteinogenic aminoacids (see 'ALL_AMINOACIDS' global variable).
    """
    seq_aminoacids = set()
    for aminoacid in seq:
        aminoacid = aminoacid.upper()
        seq_aminoacids.add(aminoacid)
    if seq_aminoacids <= ALL_AMINOACIDS:
        return 'This sequence contains only proteinogenic aminoacids.'
    else:
        unusual_aminoacids = seq_aminoacids - ALL_AMINOACIDS
        unusual_aminoacids_str = ''
        for elem in unusual_aminoacids:
            unusual_aminoacids_str += elem
            unusual_aminoacids_str += ', '
        return f'This protein contains unusual aminoacids: {unusual_aminoacids_str[:-2]}.'
Comment on lines +62 to +81
Member

Overall this is a cool idea as well. Two comments:

  1. The functionality overlaps very heavily with the function above. It looks as if part of the shared logic is written in both places. This could be reworked a little so that one function produces a single big dictionary with all the information and these two take from it only what they need :)
  2. Returning human-readable text is not a great idea. You could, for example, return the count or a list of them and print such a message alongside. Within a program it is better to pass programmatic objects around.
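
The second point can be illustrated with a small sketch that returns a list and leaves the message to the caller. It is only an illustration under assumptions: `PROTEINOGENIC` and `find_unusual_aminoacids` are hypothetical names, not part of this PR.

```python
# Hypothetical names; the letter set mirrors ALL_AMINOACIDS from the diff.
PROTEINOGENIC = set('GAVILPSTCMNQFWYDEKRH')


def find_unusual_aminoacids(protein: str) -> list:
    """Return the sorted one-letter codes in `protein` that are not proteinogenic."""
    return sorted(set(protein.upper()) - PROTEINOGENIC)


# The caller decides how to present the result to a human.
unusual = find_unusual_aminoacids('MKXXB')
if unusual:
    print(f"This protein contains unusual aminoacids: {', '.join(unusual)}.")
else:
    print('This sequence contains only proteinogenic aminoacids.')
```

Returning a sorted list also makes the result deterministic, whereas iterating over a set gives no guaranteed order.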



def count_charge(seq: str) -> int:
    """
    Input: a protein sequence (a str type).
    Output: a charge of the sequence (an int type).
    'count_charge' function counts the charge of the protein by the subtraction between the number of positively and negatively charged aminoacids.
    """
    seq_classes = classify_aminoacids(seq)
    positive_charge = seq_classes['Polar with positive charge']
    negative_charge = seq_classes['Polar with negative charge']
    sum_charge = positive_charge - negative_charge
    return sum_charge
Comment on lines +84 to +94
Member

Heh, this is great!
It would be cool to extend it to somehow account for the fact that the charge of the amino acids at the ends differs from that of the ones inside.

Member

Also, `seq` here and everywhere in this work is not a great word, I think. It reads more as being about DNA/RNA than about proteins.



def count_protein_mass(seq: str) -> float:
    """
    Calculates mass of all aminoacids of input peptide in g/mol scale.
    Arguments:
    - seq (str): one-letter code peptide sequence, case is not important;
    Output:
    Returns mass of peptide (float).
    """
    aa_mass = 0
    for aminoacid in seq.upper():
        if aminoacid in AMINO_ACIDS_MASSES:
            aa_mass += AMINO_ACIDS_MASSES[aminoacid]
Comment on lines +107 to +108
Member

And what about the else case? Do we simply add nothing? Either you already validate the input, in which case the check is not needed here, or you do not, and then it has to be done.

    return aa_mass
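
One way to address the missing else branch is to fail loudly on an unknown residue instead of silently skipping it. The sketch below assumes that behaviour is wanted; `count_protein_mass_strict` is a hypothetical name and the mass table is copied from the diff so that the block runs on its own.

```python
# Residue masses copied from AMINO_ACIDS_MASSES in the diff.
AMINO_ACIDS_MASSES = {
    'G': 57.05, 'A': 71.08, 'S': 87.08, 'P': 97.12, 'V': 99.13,
    'T': 101.1, 'C': 103.1, 'L': 113.2, 'I': 113.2, 'N': 114.1,
    'D': 115.1, 'Q': 128.1, 'K': 128.2, 'E': 129.1, 'M': 131.2,
    'H': 137.1, 'F': 147.2, 'R': 156.2, 'Y': 163.2, 'W': 186.2
}


def count_protein_mass_strict(protein: str) -> float:
    """Sum residue masses; raise ValueError on a letter with no known mass."""
    total = 0.0
    for residue in protein.upper():
        if residue not in AMINO_ACIDS_MASSES:
            raise ValueError(f'Unknown aminoacid: {residue!r}')
        total += AMINO_ACIDS_MASSES[residue]
    return total


print(round(count_protein_mass_strict('GA'), 2))  # → 128.13
```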


def count_aliphatic_index(seq: str) -> float:
Member

Nice :)

    """
    Calculates aliphatic index - relative proportion of aliphatic aminoacids in input peptide.
    The higher aliphatic index the higher thermostability of peptide.
    Argument:
    - seq (str): one-letter code peptide sequence, letter case is not important.
    Output:
    Returns aliphatic index (float).
    """
    ala_count = seq.count('A') / len(seq)
    val_count = seq.count('V') / len(seq)
    lei_count = seq.count('L') / len(seq)
    izlei_count = seq.count('I') / len(seq)
Comment on lines +121 to +124
Member

So you traverse the protein four times and, on top of that, compute its length four times.
Computationally Python will most likely not care; this is more a point about logic.
It would be better to go through it in a single loop and increment each counter by 1 in the matching case.
Also, the variables are named count, although they actually hold fractions (relative content) rather than integers (numbers of items).

In general, the division by the length could be done just once in the final formula :)

    aliph_index = ala_count + 2.9 * val_count + 3.9 * lei_count + 3.9 * izlei_count
    return aliph_index
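
The single-pass variant suggested in the comment above might look like the following sketch. `aliphatic_index_single_pass` is a hypothetical name; unlike the function in the diff it uppercases the input first, and like the original it assumes a non-empty sequence.

```python
from collections import Counter


def aliphatic_index_single_pass(protein: str) -> float:
    """Aliphatic index with one traversal and one division by the length."""
    residue_counts = Counter(protein.upper())  # a single pass over the protein
    weighted_sum = (residue_counts['A']
                    + 2.9 * residue_counts['V']
                    + 3.9 * (residue_counts['L'] + residue_counts['I']))
    return weighted_sum / len(protein)


example = 'CVWGWAMGEACPNPIKINISAYAKTWYQNGPIGRCCCWVGYTAIRFPHQEMQQNTRFNKP'
print(round(aliphatic_index_single_pass(example), 3))  # → 0.505
```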


def not_trypsin_cleaved(seq: str) -> int:
Member

The name is not very informative about what the function does and returns :)

    """
    Counts non-cleavable sites of trypsin: Arginine/Proline (RP) and Lysine/Proline (KP) pairs.
    Argument:
    - seq (str): one-letter code peptide sequence, case is not important.
    Output:
    Returns number of exception sites that cannot be cleaved by trypsin (int).
    """
    not_cleavage_count = 0
    not_cleavage_count += seq.upper().count('RP')
    not_cleavage_count += seq.upper().count('KP')
Comment on lines +138 to +139
Member

Again, this could be computed in a single pass :)

Member

But it is great that you use +=!

Member

And upper() also ends up being called many times; the uppercasing could be moved out to the start of the function.

    return not_cleavage_count


def count_trypsin_sites(seq: str) -> int:
Member

Cool! This is really non-obvious and useful for something like mass spec. An extra like for taking the non-cleavable sites into account. Really great.

    """
    Counts number of valid trypsin cleavable sites:
    Arginine/any aminoacid and Lysine/any aminoacid (except Proline).
    Argument:
    - seq (str): one-letter code peptide sequence, case is not important.
    Output:
    Returns number of valid trypsin cleavable sites (int).
    If peptide has not any trypsin cleavable sites, it will return zero.
    """
    arginine_value = seq.upper().count('R')
    lysine_value = seq.upper().count('K')
    count_cleavage = arginine_value + lysine_value - not_trypsin_cleaved(seq)
Member

Suggested change
    count_cleavage = arginine_value + lysine_value - not_trypsin_cleaved(seq)
    cleavage_sites_count = arginine_value + lysine_value - not_trypsin_cleaved(seq)

When count comes first, it reads more like the name of a function :)

    return count_cleavage
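
The same count can be obtained in one pass with a single `upper()` call, for example with a regular expression that uses a negative lookahead. This is a sketch rather than the PR's implementation; `TRYPSIN_SITE` and `count_trypsin_sites_regex` are hypothetical names. Like the formula above, it counts a K or R at the very end of the sequence as a site.

```python
import re

# K or R that is not immediately followed by P.
TRYPSIN_SITE = re.compile(r'[KR](?!P)')


def count_trypsin_sites_regex(protein: str) -> int:
    """Count K/R residues not followed by proline in a single scan."""
    return len(TRYPSIN_SITE.findall(protein.upper()))


example = 'CVWGWAMGEACPNPIKINISAYAKTWYQNGPIGRCCCWVGYTAIRFPHQEMQQNTRFNKP'
print(count_trypsin_sites_regex(example))  # → 5
```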


OPERATIONS = {'count_protein_mass': count_protein_mass,
              'count_aliphatic_index': count_aliphatic_index,
              'count_trypsin_sites': count_trypsin_sites,
              'count_seq_length': count_seq_length,
              'classify_aminoacids': classify_aminoacids,
              'check_unusual_aminoacids': check_unusual_aminoacids,
              'count_charge': count_charge}


def protein_tools(*args: str) -> list:
Member

Suggested change
def protein_tools(*args: str) -> list:
def run_protein_tools(*args: str) -> list:

    """
Member

Whoa, a really powerful docstring!

    Calculates protein physical properties: mass, charge, length, aliphatic index;
    as well as defines biological features: aminoacid composition, trypsin cleavable sites.

    Input: a list of protein sequences and one procedure that should be done with these sequences (str type, several values).

    Valid operations:
    Protein_tools include several operations:
    - count_seq_length: returns length of protein (int);
    - classify_aminoacids: returns collection of classified aminoacids, included in the protein (dict);
    - check_unusual_aminoacids: informs about whether unusual aminoacids are included in the protein (str);
    - count_charge: returns charge value of protein (int);
    - count_protein_mass: calculates mass of all aminoacids of input peptide in g/mol scale (float);
    - count_aliphatic_index: calculates relative proportion of aliphatic aminoacids in input peptide (float);
    - count_trypsin_sites: counts number of valid trypsin cleavable sites.

    Output: a list of outputs from the chosen procedure (list type).
    'run_protein_tools' function takes the protein sequences and the name of the procedure that the user gives and applies this procedure by one of the available functions
    to all the given sequences. Also this function checks the availability of the procedure and raises the ValueError when the procedure is not in the list of available
    functions (see 'OPERATIONS' global variable).
    """
    operation = args[-1]
    parsed_seq_list = []
    for seq in args[0:-1]:
Comment on lines +190 to +192
Member

Better to arrange the arguments differently. The way it was in HW 3 is actually not very convenient. Let the proteins be accepted either positionally or even as a single list, and make the operation a keyword argument, so that everything is kept separate rather than in one pile.

        if not is_protein(seq):
            raise ValueError("One of these sequences is not protein sequence or does not match the rools of input. Please select another sequence.")
Comment on lines +193 to +194
Member

This looks nice :)

        else:
            if operation in OPERATIONS:
                parsed_seq_list.append(OPERATIONS[operation](seq))
            else:
                raise ValueError("This procedure is not available. Please choose another procedure.")
    return parsed_seq_list
Member

By the way, you could return a dictionary here instead of a list. The key is the protein from the input and the value is the result of the program. But this is purely an idea; the current way is fine too.
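
The two suggestions on this function (a keyword argument for the operation and a dictionary as the return value) could be combined roughly as follows. This is only a sketch under stated assumptions: `OPERATIONS` is cut down to a single stand-in entry so that the block runs on its own, the validation mirrors `is_protein` from the diff, and the error text for invalid sequences is illustrative.

```python
from typing import Callable, Dict

# Stand-in for the full OPERATIONS table from the diff (assumption for this sketch).
OPERATIONS: Dict[str, Callable[[str], object]] = {
    'count_seq_length': len,
}


def run_protein_tools(*proteins: str, operation: str) -> dict:
    """Apply one named operation to every protein; `operation` is keyword-only."""
    if operation not in OPERATIONS:
        raise ValueError('This procedure is not available. Please choose another procedure.')
    results = {}
    for protein in proteins:
        # Same check as is_protein in the diff: letters only, upper case.
        if not (protein.isalpha() and protein.isupper()):
            raise ValueError(f'Not a valid protein sequence: {protein!r}')
        results[protein] = OPERATIONS[operation](protein)
    return results


print(run_protein_tools('MKT', 'GAVLI', operation='count_seq_length'))
# → {'MKT': 3, 'GAVLI': 5}
```

Because `operation` follows `*proteins` in the signature, Python only accepts it by name, so a sequence can no longer be mistaken for the procedure.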
