Python-BI-2023 · anshtompel · Sep 30, 2023
diff --git a/HW4_Shtompel/README.md b/HW4_Shtompel/README.md
@@ -0,0 +1,50 @@
+# Protein_tools
+### Overview
+**Protein_tools** is a tool for basic analysis of protein and polypeptide sequences. With it you can estimate the length, charge, amino acid composition and mass of a protein, calculate its aliphatic index, and count the sites where trypsin can cleave it.
+
+### Usage
+To use **Protein_tools**, clone this repository with `git clone` and call the main function:
+`run_protein_tools('<sequence>', '<procedure>')`, where `<sequence>` is the protein sequence (or several sequences) to analyse and `<procedure>` is the name of the option to apply to the sequence(s). Pass the sequences and the option name as quoted strings separated by commas, with the option name last. Use only one option per call, and make sure that the sequences use one-letter amino acid codes (the case is not important).
+
+### Options
+1. `count_seq_length`: counts the length of the protein sequence and returns the number of amino acids.
+2. `classify_aminoacids`: classifies all amino acids of the input sequence according to the 'AA_ALPHABET' classification. An amino acid that is not included in this list is classified as 'Unusual'.
+
+    AA_ALPHABET classification:
+    | Class | Aminoacids |
+    |----------|-----------|
+    | Nonpolar | G, A, V, I, L, P|
+    | Polar uncharged | S, T, C, M, N, Q |
+    | Aromatic | F, W, Y |
+    | Polar with negative charge | D, E |
+    | Polar with positive charge | K, R, H |
+
+3. `check_unusual_aminoacids`: checks the amino acid composition and reports the unusual amino acids if any are present in the sequence. An amino acid is called unusual when it does not belong to the list of proteinogenic amino acids (see AA_ALPHABET classification).
+4. `count_charge`: calculates the charge of the protein as the number of positively charged amino acids minus the number of negatively charged ones.
+5. `count_protein_mass`: calculates the total mass of all amino acids of the input sequence in g/mol.
+6. `count_aliphatic_index`: calculates the aliphatic index, the relative proportion of aliphatic amino acids in the input peptide. The higher the aliphatic index, the higher the thermostability of the peptide.
+7. `count_trypsin_sites`: counts the valid trypsin cleavage sites: after Arginine or Lysine, unless the next amino acid is Proline. If the peptide has no trypsin cleavage sites, it returns zero.
+
+### Examples
+The capabilities of **Protein_tools** are illustrated below on a random protein sequence:
+*sequence:* CVWGWAMGEACPNPIKINISAYAKTWYQNGPIGRCCCWVGYTAIRFPHQEMQQNTRFNKP
+
+| Option | Output |
+|--------|---------|
+| count_seq_length | 60 |
+| classify_aminoacids | 'Nonpolar': 22, 'Polar uncharged': 20, 'Aromatic': 9, 'Polar with negative charge': 2, 'Polar with positive charge': 7, 'Unusual': 0 |
+| check_unusual_aminoacids | This sequence contains only proteinogenic aminoacids. |
+| count_charge | 5 |
+| count_protein_mass | 6918.99 |
+| count_aliphatic_index | 0.5049999999999999 |
+| count_trypsin_sites | 5 |
+
+### Limitations and troubleshooting
+**Protein_tools** has several limitations that can cause errors. Here are some of them:
+1. **Protein_tools** works only with protein sequences that consist of letters of the Latin alphabet (the case is not important); every amino acid must be coded by one letter. If the sequence contains other symbols, the tool raises `ValueError` *"One of these sequences is not protein sequence or does not match the rools of input. Please select another sequence."*. In this case, check your sequence for punctuation marks, spaces or other symbols.
+2. Make sure that your sequences use one-letter amino acid codes. If your sequence is "SerMetAlaGly", **Protein_tools** reads it as "SERMETALAGLY".
+3. The available functions are listed in the "Options" section. If you see `ValueError` *"This procedure is not available. Please choose another procedure."*, the name of the function is probably misspelled. Check the name of the chosen procedure and make sure that it is available in **Protein_tools**.
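Regarding limitation 2, three-letter input can be converted before it is passed to the tool. The following is a minimal sketch, assuming a hypothetical helper `three_to_one` and a hypothetical `THREE_TO_ONE` table; neither is part of **Protein_tools**, and the table covers only the 20 proteinogenic residues.

```python
# Hypothetical helper, not part of Protein_tools: converts three-letter
# residue codes (e.g. "SerMetAlaGly") into the one-letter form the tool expects.
THREE_TO_ONE = {
    'Gly': 'G', 'Ala': 'A', 'Val': 'V', 'Ile': 'I', 'Leu': 'L', 'Pro': 'P',
    'Ser': 'S', 'Thr': 'T', 'Cys': 'C', 'Met': 'M', 'Asn': 'N', 'Gln': 'Q',
    'Phe': 'F', 'Trp': 'W', 'Tyr': 'Y', 'Asp': 'D', 'Glu': 'E', 'Lys': 'K',
    'Arg': 'R', 'His': 'H',
}


def three_to_one(seq: str) -> str:
    """Convert a string of three-letter codes into one-letter codes."""
    if len(seq) % 3 != 0:
        raise ValueError('The length of a three-letter sequence must be a multiple of 3.')
    # Split into chunks of three and normalise the case ("SER" -> "Ser").
    codes = [seq[i:i + 3].capitalize() for i in range(0, len(seq), 3)]
    unknown = [code for code in codes if code not in THREE_TO_ONE]
    if unknown:
        raise ValueError(f'Unknown three-letter codes: {unknown}')
    return ''.join(THREE_TO_ONE[code] for code in codes)


print(three_to_one('SerMetAlaGly'))  # SMAG
```

The converted string can then be passed to the tool as a regular one-letter sequence.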
+
+### Contribution and contacts
+- Shtompel Anastasia (Telegram: @Aenye) — teamlead, developer (options 'count_protein_mass', 'count_aliphatic_index', 'count_trypsin_sites')
+- Chevokina Elizaveta (Telegram: @lzchv) — developer (options 'count_seq_length', 'classify_aminoacids', 'check_unusual_aminoacids', 'count_charge'), author of README file
diff --git a/HW4_Shtompel/pre-commit b/HW4_Shtompel/pre-commit
@@ -0,0 +1,36 @@
+#!/bin/bash
+
+echo "Hi! I'm your pre-commit code checker."
+
+FILE="dna_rna_tools.py"
+TESTS="${FILE%.py}_test.py"
+
+if [ -f "$FILE" ]; then
+
+	if [ ! -f hooks_env/bin/activate ]; then
+		echo "For the first time I need to prepare an environment, give me a minute..."
+		python3 -m venv hooks_env
+		source hooks_env/bin/activate
+		python3 -m pip install --upgrade pip --quiet
+		pip install pytest flake8 flake8-bugbear pep8-naming flake8-builtins flake8-functions-names flake8-variables-names pylint mypy --quiet
+		echo "hooks_env" >> .gitignore
+		echo ".gitignore" >> .gitignore
+	else
+		source hooks_env/bin/activate
+	fi
+
+	echo "$(tput setab 7 setaf 1)>>>>     Code quality checks     <<<<$(tput sgr 0)"
+	echo ">>>>  flake8 check"
+	flake8 "$FILE"
+	echo ">>>>  pylint check"
+	pylint "$FILE"
+	echo ">>>>  mypy check"
+	mypy "$FILE"
+
+	deactivate
+
+else
+
+	echo "No Python code found to check. You can configure me in .git/hooks/pre-commit"
+
+fi
diff --git a/HW4_Shtompel/protein_tools.py b/HW4_Shtompel/protein_tools.py
@@ -0,0 +1,200 @@
+"""
+Global variables:
+- AA_ALPHABET — a dictionary variable that contains a list of proteinogenic aminoacids classes.
+- ALL_AMINOACIDS — a set variable that contains a list of all proteinogenic aminoacids.
+- AMINO_ACIDS_MASSES — a dictionary variable that contains masses of all proteinogenic aminoacids.
+"""
+
+AA_ALPHABET = {'Nonpolar': ['G', 'A', 'V', 'I', 'L', 'P'],
+                'Polar uncharged': ['S', 'T', 'C', 'M', 'N', 'Q'],
+                'Aromatic': ['F', 'W', 'Y'],
+                'Polar with negative charge': ['D', 'E'],
+                'Polar with positive charge': ['K', 'R', 'H']
+                }
+
+ALL_AMINOACIDS = {'G', 'A', 'V', 'I', 'L', 'P',
+                  'S', 'T', 'C', 'M', 'N', 'Q',
+                  'F', 'W', 'Y', 'D', 'E', 'K',
+                  'R', 'H'
+                  }
+
+
+AMINO_ACIDS_MASSES = {
+    'G': 57.05, 'A': 71.08, 'S': 87.08, 'P': 97.12, 'V': 99.13,
+    'T': 101.1, 'C': 103.1, 'L': 113.2, 'I': 113.2, 'N': 114.1,
+    'D': 115.1, 'Q': 128.1, 'K': 128.2, 'E': 129.1, 'M': 131.2,
+    'H': 137.1, 'F': 147.2, 'R': 156.2, 'Y': 163.2, 'W': 186.2    
+}
+
+
+def is_protein(seq: str) -> bool:
+    """
+    Input: a protein sequence (a str type).
+    Output: boolean value.
+    'is_protein' function checks if the sequence contains only letters in the upper case.
+    """
+    return seq.isalpha() and seq.isupper()
+
+
+def count_seq_length(seq: str) -> int:
+    """
+    Input: a protein sequence (a str type).
+    Output: length of protein sequence (an int type).
+    'count_seq_length' function counts the length of protein sequence.
+    """
+    return len(seq)
+
+
+def classify_aminoacids(seq: str) -> dict:
+    """
+    Input: a protein sequence (a str type).
+    Output: the counts of the amino acids of the sequence per class (a dict type, the 'all_aminoacids_classes' variable).
+    'classify_aminoacids' function classifies all amino acids from the input sequence in accordance with the 'AA_ALPHABET' classification. If an amino acid is not included in this list,
+    it is classified as 'Unusual'.
+    """
+    all_aminoacids_classes = dict.fromkeys(['Nonpolar', 'Polar uncharged', 'Aromatic', 'Polar with negative charge', 'Polar with positive charge', 'Unusual'], 0)
+    for aminoacid in seq:
+        aminoacid = aminoacid.upper()
+        if aminoacid not in ALL_AMINOACIDS:
+            all_aminoacids_classes['Unusual'] += 1
+        for aa_key, aa_value in AA_ALPHABET.items():
+            if aminoacid in aa_value:
+                all_aminoacids_classes[aa_key] += 1
+    return all_aminoacids_classes
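The classification above can also be expressed with a reverse lookup table, which avoids scanning every class for every residue. This is a minimal sketch, assuming a hypothetical `classify_with_lookup` function and `RESIDUE_TO_CLASS` table that are not part of this PR; the class table is copied from `AA_ALPHABET`, and the sequence is the one from the README example.

```python
from collections import Counter

# Same classes as the AA_ALPHABET dictionary in protein_tools.py.
AA_ALPHABET = {'Nonpolar': ['G', 'A', 'V', 'I', 'L', 'P'],
               'Polar uncharged': ['S', 'T', 'C', 'M', 'N', 'Q'],
               'Aromatic': ['F', 'W', 'Y'],
               'Polar with negative charge': ['D', 'E'],
               'Polar with positive charge': ['K', 'R', 'H']}

# Invert the table once: residue letter -> class name (hypothetical helper table).
RESIDUE_TO_CLASS = {aa: cls for cls, residues in AA_ALPHABET.items() for aa in residues}


def classify_with_lookup(seq: str) -> dict:
    """Hypothetical variant of classify_aminoacids based on a reverse lookup."""
    # Start from zero for every class so that absent classes are still reported.
    counts = dict.fromkeys([*AA_ALPHABET, 'Unusual'], 0)
    counts.update(Counter(RESIDUE_TO_CLASS.get(aa, 'Unusual') for aa in seq.upper()))
    return counts


SEQ = 'CVWGWAMGEACPNPIKINISAYAKTWYQNGPIGRCCCWVGYTAIRFPHQEMQQNTRFNKP'
print(classify_with_lookup(SEQ))
```

For the README sequence this gives the same counts as the example table: 22 nonpolar, 20 polar uncharged, 9 aromatic, 2 negatively charged, 7 positively charged and no unusual residues.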
+
+
+def check_unusual_aminoacids(seq: str) -> str:
+    """
+    Input: a protein sequence (a str type).
+    Output: a message telling whether the sequence contains unusual amino acids (a str type).
+    'check_unusual_aminoacids' function checks the amino acid composition and reports the unusual amino acids if they are present in the sequence. An amino acid is called
+    unusual when it does not belong to the list of proteinogenic amino acids (see 'ALL_AMINOACIDS' global variable).
+    """
+    unusual_aminoacids = set(seq.upper()) - ALL_AMINOACIDS
+    if not unusual_aminoacids:
+        return 'This sequence contains only proteinogenic aminoacids.'
+    # Sort the residues so that the message does not depend on set iteration order.
+    unusual_aminoacids_str = ', '.join(sorted(unusual_aminoacids))
+    return f'This protein contains unusual aminoacids: {unusual_aminoacids_str}.'
+
+
+def count_charge(seq: str) -> int:
+    """
+    Input: a protein sequence (a str type).
+    Output: a charge of the sequence (an int type).
+    'count_charge' function calculates the charge of the protein as the number of positively charged amino acids minus the number of negatively charged ones.
+    """
+    seq_classes = classify_aminoacids(seq)
+    positive_charge = seq_classes['Polar with positive charge']
+    negative_charge = seq_classes['Polar with negative charge']
+    sum_charge = positive_charge - negative_charge
+    return sum_charge
+
+
+def count_protein_mass(seq: str) -> float:
+    """
+    Calculates the total mass of all amino acids of the input peptide in g/mol.
+    Arguments:
+    - seq (str): one-letter code peptide sequence, case is not important;
+    Output:
+    Returns mass of peptide (float).
+    """
+    aa_mass = 0
+    for aminoacid in seq.upper():
+        if aminoacid in AMINO_ACIDS_MASSES:
+            aa_mass += AMINO_ACIDS_MASSES[aminoacid]
+    return aa_mass
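The values in `AMINO_ACIDS_MASSES` are residue masses (for example, glycine at 57.05 rather than 75.07 for the free amino acid), so the sum above does not include the water of the free termini. The following is a sketch only, assuming a hypothetical `peptide_mass` helper and a `WATER_MASS` constant of 18.02 g/mol, neither of which is in this PR; it shows both the residue sum from the README example and the variant with one water added.

```python
import math

# Residue masses copied from AMINO_ACIDS_MASSES in protein_tools.py (g/mol).
AMINO_ACIDS_MASSES = {
    'G': 57.05, 'A': 71.08, 'S': 87.08, 'P': 97.12, 'V': 99.13,
    'T': 101.1, 'C': 103.1, 'L': 113.2, 'I': 113.2, 'N': 114.1,
    'D': 115.1, 'Q': 128.1, 'K': 128.2, 'E': 129.1, 'M': 131.2,
    'H': 137.1, 'F': 147.2, 'R': 156.2, 'Y': 163.2, 'W': 186.2,
}
WATER_MASS = 18.02  # assumption: average mass of H2O in g/mol


def residue_mass_sum(seq: str) -> float:
    """Sum of residue masses; unknown letters are skipped, as in the PR."""
    return sum(AMINO_ACIDS_MASSES[aa] for aa in seq.upper() if aa in AMINO_ACIDS_MASSES)


def peptide_mass(seq: str) -> float:
    """Hypothetical variant: residue sum plus one water for the free termini."""
    return residue_mass_sum(seq) + WATER_MASS


SEQ = 'CVWGWAMGEACPNPIKINISAYAKTWYQNGPIGRCCCWVGYTAIRFPHQEMQQNTRFNKP'
print(round(residue_mass_sum(SEQ), 2))  # 6918.99
print(round(peptide_mass(SEQ), 2))      # 6937.01
```

Whether the water term belongs in `count_protein_mass` depends on what the authors intend the function to report, so this is offered as an option rather than a correction.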
+
+
+def count_aliphatic_index(seq: str) -> float:
+    """
+    Calculates aliphatic index, the relative proportion of aliphatic amino acids in input peptide.
+    The higher the aliphatic index, the higher the thermostability of the peptide.
+    Argument:
+    - seq (str): one-letter code peptide sequence, letter case is not important.
+    Output:
+    Returns aliphatic index (float).
+    """
+    seq = seq.upper()  # make the residue counts case-insensitive, as documented
+    ala_count = seq.count('A') / len(seq)
+    val_count = seq.count('V') / len(seq)
+    leu_count = seq.count('L') / len(seq)
+    ile_count = seq.count('I') / len(seq)
+    aliph_index = ala_count + 2.9 * val_count + 3.9 * leu_count + 3.9 * ile_count
+    return aliph_index
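As a worked check of the formula, the README sequence contains 5 Ala, 2 Val, 0 Leu and 5 Ile among 60 residues, so the index is (5 + 2.9 × 2 + 3.9 × 5) / 60 = 0.505. The sketch below reproduces this with a hypothetical `aliphatic_index` function; the `as_percent` flag is an assumption added for illustration, for readers who compare against sources that state the index in mole percent rather than fractions.

```python
import math


def aliphatic_index(seq: str, as_percent: bool = False) -> float:
    """Hypothetical re-statement of count_aliphatic_index for illustration."""
    seq = seq.upper()
    # Mole fraction of each aliphatic residue in the sequence.
    fractions = {aa: seq.count(aa) / len(seq) for aa in 'AVLI'}
    index = fractions['A'] + 2.9 * fractions['V'] + 3.9 * (fractions['L'] + fractions['I'])
    # as_percent is an assumed convenience switch, not part of the PR.
    return index * 100 if as_percent else index


SEQ = 'CVWGWAMGEACPNPIKINISAYAKTWYQNGPIGRCCCWVGYTAIRFPHQEMQQNTRFNKP'
print(math.isclose(aliphatic_index(SEQ), 0.505))  # True
```

The comparison uses `math.isclose` because the floating-point result is 0.5049999999999999 or a neighbouring value, depending on the order of the additions.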
+
+
+def not_trypsin_cleaved(seq: str) -> int:
+    """
+    Counts non-cleavable sites of trypsin: Arginine/Proline (RP) and Lysine/Proline (KP) pairs.
+    Argument:
+    - seq (str): one-letter code peptide sequence, case is not important.
+    Output:
+    Returns number of exception sites that cannot be cleaved by trypsin (int).
+    """
+    not_cleavage_count = 0
+    not_cleavage_count += seq.upper().count('RP')
+    not_cleavage_count += seq.upper().count('KP')
+    return not_cleavage_count
+
+
+def count_trypsin_sites(seq: str) -> int:
+    """
+    Counts number of valid trypsin cleavable sites:
+    Arginine/any aminoacid and Lysine/any aminoacid (except Proline).
+    Argument:
+    - seq (str): one-letter code peptide sequence, case is not important.
+    Output:
+    Returns number of valid trypsin cleavable sites (int).
+    If the peptide has no trypsin cleavable sites, it returns zero.
+    """
+    arginine_value = seq.upper().count('R')
+    lysine_value = seq.upper().count('K')
+    cleavage_sites_count = arginine_value + lysine_value - not_trypsin_cleaved(seq)
+    return cleavage_sites_count
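The same rule (K or R, unless followed by P) can be written as a single regular expression with a negative lookahead. This is an alternative formulation, not the method used in this PR; `count_trypsin_sites_regex` is a hypothetical name. Like the subtraction above, it counts a K or R at the very end of the sequence as a site.

```python
import re


def count_trypsin_sites_regex(seq: str) -> int:
    """Hypothetical variant: count K/R residues that are not followed by P."""
    # (?!P) is a negative lookahead: it matches only if the next letter is not P.
    return len(re.findall(r'[KR](?!P)', seq.upper()))


SEQ = 'CVWGWAMGEACPNPIKINISAYAKTWYQNGPIGRCCCWVGYTAIRFPHQEMQQNTRFNKP'
print(count_trypsin_sites_regex(SEQ))  # 5
```

The README sequence has three K and three R, and only the final K is followed by P, which gives the 5 sites listed in the example table.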
+
+
+OPERATIONS = {'count_protein_mass': count_protein_mass,
+              'count_aliphatic_index': count_aliphatic_index,
+              'count_trypsin_sites': count_trypsin_sites,
+              'count_seq_length': count_seq_length,
+              'classify_aminoacids': classify_aminoacids,
+              'check_unusual_aminoacids': check_unusual_aminoacids,
+              'count_charge': count_charge}
+
+
+def run_protein_tools(*args: str) -> list:
+    """
+    Calculates physical properties of proteins: mass, charge, length, aliphatic index;
+    and describes biological features: amino acid composition, trypsin cleavable sites.
+
+    Input: one or more protein sequences followed by the name of one procedure to apply to them (all of str type).
+
+    Valid operations:
+    - count_seq_length: returns length of protein (int);
+    - classify_aminoacids: returns the counts of the amino acids of the protein per class (dict);
+    - check_unusual_aminoacids: reports whether the protein contains unusual amino acids (str);
+    - count_charge: returns charge value of protein (int);
+    - count_protein_mass: calculates mass of all amino acids of input peptide in g/mol (float);
+    - count_aliphatic_index: calculates relative proportion of aliphatic amino acids in input peptide (float);
+    - count_trypsin_sites: counts number of valid trypsin cleavable sites (int).
+
+    Output: a list of outputs from the chosen procedure (list type).
+    'run_protein_tools' function takes the protein sequences and the name of the procedure that the user gives and applies this procedure
+    to all the given sequences. It also checks the availability of the procedure and raises ValueError when the procedure is not in the list of available
+    functions (see 'OPERATIONS' global variable).
+    """
+    operation = args[-1]
+    parsed_seq_list = []
+    for seq in args[:-1]:
+        # Normalise the case first: the README promises case-insensitive input.
+        seq = seq.upper()
+        if not is_protein(seq):
+            raise ValueError("One of these sequences is not protein sequence or does not match the rools of input. Please select another sequence.")
+        if operation not in OPERATIONS:
+            raise ValueError("This procedure is not available. Please choose another procedure.")
+        parsed_seq_list.append(OPERATIONS[operation](seq))
+    return parsed_seq_list
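The dispatch pattern used here (sequences first, procedure name last, a dictionary mapping names to functions) can be tried in isolation. This is a standalone sketch under stated assumptions: `run_tools_sketch` is a hypothetical name, only two simplified operations are registered, and the procedure name is validated once before the loop, which is a design variation rather than what the PR does.

```python
def count_seq_length(seq: str) -> int:
    return len(seq)


def count_charge(seq: str) -> int:
    # Simplified stand-in: positive residues (K, R, H) minus negative ones (D, E).
    positive = sum(seq.count(aa) for aa in 'KRH')
    negative = sum(seq.count(aa) for aa in 'DE')
    return positive - negative


# Two operations are enough to illustrate the dispatch table.
OPERATIONS = {'count_seq_length': count_seq_length, 'count_charge': count_charge}


def run_tools_sketch(*args: str) -> list:
    """Hypothetical variant of run_protein_tools, for illustration only."""
    *seqs, operation = args  # the procedure name is the last argument
    if operation not in OPERATIONS:
        raise ValueError('This procedure is not available. Please choose another procedure.')
    results = []
    for seq in seqs:
        seq = seq.upper()  # accept lower-case input
        if not seq.isalpha():
            raise ValueError(f'Not a protein sequence: {seq!r}')
        results.append(OPERATIONS[operation](seq))
    return results


print(run_tools_sketch('mkr', 'DEK', 'count_charge'))  # [2, -1]
```

Checking the procedure name before the loop means a misspelled name is reported even when no sequences are passed, at the cost of changing which error is raised first when both the name and a sequence are invalid.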