In this study, we proposed a protein co-conservation weighted network (PCCN) model, based solely on protein sequence, to investigate the effects of mutations on the spike protein. We then compared the network topological features of mutation and non-mutation sites at the residue level and the variant level. Finally, we mined the correlation between these topological features and the effects of mutations on spike protein stability changes and binding free energy changes.
- Data preparation.
  - S protein FASTA sequence: data/YP_009724390.1.txt.
  - Variants of concern: data/summary - 20211201.xlsx.
- Find the protein family.
  - FASTA sequences: data/protein-matching-IPR042578.fasta.
  - Detailed information: data/protein-matching-IPR042578.json.
- The protein family contains too many sequences, so filter_fanmily_protein.py filters out sequences with the same name and saves the result.
  - FASTA sequences: data/protein-matching-IPR042578.filter.fasta.
  - Detailed information: data/protein-matching-IPR042578.filter.csv.
- Run a multiple sequence alignment (MSA) with clustalX2 and export it as data/protein-matching-IPR042578.filter.fasta.aligned.
- cal_procon.py calculates the conservation score; the result is data/procon/type{}.txt.
  - Because the default output format is hard to analyze, it is converted to a CSV file: data/procon/type{}_parse.csv.
- analysis_procon.py finds the conservation of the variants and saves it as data/procon/analysis.json.
- v2/output_procon_analysis constructs the network and analyzes it.
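The same-name filtering step in the list above could look roughly like the sketch below. This is not the code in filter_fanmily_protein.py: the helper names `read_fasta` and `filter_same_name` are hypothetical, and what counts as the "name" of a sequence is an assumption (here the second `|`-separated field of a made-up header layout), so the `name_of` function should be adapted to the real headers.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_same_name(
    records: Iterable[Record],
    name_of: Callable[[str], str] = lambda header: header,
) -> List[Record]:
    """Keep only the first record seen for each name."""
    seen = set()
    kept = []
    for header, seq in records:
        name = name_of(header)
        if name not in seen:
            seen.add(name)
            kept.append((header, seq))
    return kept


# Hypothetical headers of the form "accession|name"; the real layout may differ.
example = """>A1|Spike glycoprotein
MFVF
LVLL
>A2|Spike glycoprotein
MFIF
>A3|Surface glycoprotein
MKLL
""".splitlines()

unique = filter_same_name(read_fasta(example), name_of=lambda h: h.split("|")[1])
```

In this toy input, `A2` shares its name with `A1`, so only `A1` and `A3` remain. Keeping the first occurrence is one possible policy; keeping the longest sequence per name would be an equally reasonable choice.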
2022.5.16
- Back up the previous results in data/v1/
- Update the variants: summary - 20220516
- Run analysis_procon.py
2022.6.8.
Read the network, stability, and affinity results and look for relationships among them.
Network intermediate files:
- group distribution statistic information.xlsx: variants result
- 1/4 1/2 3/4 quantile: quantile
- mean: average value
- score: variant score
- t p: t statistic and p-value of the t-test on the means
- result
- name: variant name
- index
- aas_info.csv: network characteristics of positions
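As a rough illustration of the quantile, mean, and t-test fields listed above, the sketch below computes them for two groups of scores using only the standard library. The function names, the use of Welch's t statistic, and the sample numbers are all assumptions for illustration; the project may use a different test or a different quantile method.

```python
import math
import statistics


def describe(scores):
    """Return the quartiles and mean of a list of scores."""
    q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {"1/4": q1, "1/2": q2, "3/4": q3, "mean": statistics.fmean(scores)}


def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / standard_error


# Made-up numbers purely for illustration, not results from this project.
mutation_sites = [1.0, 2.0, 3.0, 4.0, 5.0]
other_sites = [2.0, 3.0, 4.0, 5.0, 6.0]

summary = describe(mutation_sites)
t = welch_t(mutation_sites, other_sites)
```

The standard library does not provide the p-value; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns both the statistic and the p-value.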
2022.7.9 ECDC doesn't provide the complete mutations, so they need to be acquired from other sources
Mutation source:
- https://cov-lineages.org/ for querying variants
- https://outbreak.info/situation-reports?pango=B.1.617.2 for querying mutations
Flow
- Update variant: https://www.ecdc.europa.eu/en/covid-19/variants-concern
- Crawl mutations with crawler.py
- Filter data, remove:
- duplicates
- entries without evidence
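The filtering step in this flow could be sketched as follows. The record layout (dicts with `variant`, `mutation`, and `evidence` keys) is hypothetical and is not taken from crawler.py; the sample rows are placeholders, not real data.

```python
def filter_mutations(rows):
    """Drop rows without evidence, then drop repeated (variant, mutation) pairs."""
    seen = set()
    kept = []
    for row in rows:
        # Treat a missing or empty "evidence" field as "without evidence".
        if not row.get("evidence"):
            continue
        key = (row["variant"], row["mutation"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


# Placeholder rows for illustration only.
rows = [
    {"variant": "V1", "mutation": "A1B", "evidence": "yes"},
    {"variant": "V1", "mutation": "A1B", "evidence": "yes"},  # duplicate
    {"variant": "V1", "mutation": "C2D", "evidence": ""},     # no evidence
    {"variant": "V2", "mutation": "A1B", "evidence": "yes"},
]

filtered = filter_mutations(rows)
```

Here the same mutation is kept once per variant; if duplicates should instead be removed across all variants, the key can be reduced to the mutation alone.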
Network color
- Node
- mutation #bf643b size 20
- normal #008ea0 size 10
- Edge
- neighbour #f66b0e
- mutation #ffc300
- normal #b7e5dd
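One way to keep this color scheme in a single place is a small lookup table, sketched below. The colors and sizes are the ones listed above; the function names and the idea of passing the edge kind in as a string are assumptions, since the notes do not say how an edge is classified as neighbour, mutation, or normal.

```python
NODE_STYLE = {
    "mutation": {"color": "#bf643b", "size": 20},
    "normal": {"color": "#008ea0", "size": 10},
}

EDGE_COLOR = {
    "neighbour": "#f66b0e",
    "mutation": "#ffc300",
    "normal": "#b7e5dd",
}


def node_style(position, mutation_sites):
    """Return the color and size for a residue position."""
    kind = "mutation" if position in mutation_sites else "normal"
    return NODE_STYLE[kind]


def edge_color(kind):
    """Return the color for an edge of the given kind."""
    return EDGE_COLOR[kind]


# Hypothetical mutation positions, used only to exercise the lookup.
sites = {10, 25}
style_mutated = node_style(10, sites)
style_normal = node_style(11, sites)
```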
2022.8.4.
- Gephi
- Only 2 colors for edges
- path and not path
- ♥ mutation and non-mutation
- position size: conservation score
...
2022.9.28
- Update network in Gephi
- Calculate DeepDDG for 6VXX and 6VYB
- Tried to submit only part of the mutations, but it failed
- So, calculate all possible mutations
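To give a sense of what "all possible mutations" means in size, the sketch below enumerates every single-residue substitution of a sequence. The function name, the `start` numbering parameter, and the `M1A`-style labels are assumptions for illustration; this is not the input or output format of the prediction tool.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def all_single_substitutions(sequence, start=1):
    """Yield labels such as 'M1A' for every single-residue substitution."""
    for offset, wild in enumerate(sequence):
        for mutant in AMINO_ACIDS:
            if mutant != wild:
                yield f"{wild}{start + offset}{mutant}"


# Toy three-residue sequence, not the spike sequence.
mutations = list(all_single_substitutions("MFV"))
```

Each standard residue contributes 19 substitutions, so a sequence of length n made of standard residues yields 19 * n labels.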
2022.12.13
- Improve project