Commit
nataliarosa9 committed Jan 3, 2024 (1 parent d676f1e, commit 7b9b3fd)
Showing 1 changed file with 107 additions and 55 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,6 +15,7 @@ | |
|
||
} | ||
|
||
|
||
</script> | ||
{% endblock%} | ||
|
||
|
@@ -27,76 +28,119 @@ | |
<div class="col-md-12 text-center text-md-start fs-1 mb-5"> | ||
<h1 class="fw-bold mb-4">About - Methods</h1> | ||
<p class="text-justify fs-0"> | ||
We developed PRECOGx, a machine learning predictor of GPCR interactions with G-protein and β-arrestin, by using the ESM1b protein embeddings as features and experimental binding datasets. | ||
We developed PRECOGx, a machine learning predictor of GPCR interactions with G-protein and | ||
β-arrestin, by using the ESM1b protein embeddings as features and experimental binding datasets. | ||
</p> | ||
<h2 class="fw-bold mb-4">Embeddings generation</h2> | ||
|
||
<p class="text-justify fs-0">Embeddings of the protein sequences were generated using recently released pre-trained protein language models. | ||
We computed embeddings from FASTA sequences using the extract.py script of the <a href="https://github.com/facebookresearch/esm">ESM library</a> | ||
and specifying the ESM-1b model (esm1b_t33_650M_UR50S), with embeddings for individual amino acids as well as averaged over the full | ||
<p class="text-justify fs-0">Embeddings of the protein sequences were generated using recently released | ||
pre-trained protein language models. | ||
We computed embeddings from FASTA sequences using the extract.py script of | ||
the <a href="https://github.com/facebookresearch/esm">ESM library</a> | ||
and specifying the ESM-1b model (esm1b_t33_650M_UR50S), with embeddings for individual amino acids | ||
as well as averaged over the full | ||
sequence using the option <i>“--include mean per_tok”</i>. | ||
</p> | ||
<p class="text-justify fs-0">We generated embeddings for each individual layer separately, including the final one, by specifying its corresponding | ||
<p class="text-justify fs-0">We generated embeddings for each individual layer separately, including | ||
the final one, by specifying its corresponding | ||
number in the <i>“--repr_layers”</i> option. | ||
</p> | ||
<h2>Data sets</h2> | ||
<p class="text-justify fs-0"> | ||
We obtained experimental binding affinities from two distinct sources: the TGF assay (12), which captures the binding | ||
affinities of 148 GPCRs with 11 chimeric G-proteins, and the ebBRET assay (referred to below as the GEMTA assay), which profiles the binding affinities of 97 | ||
GPCRs with 12 G-proteins and 3 β-arrestin/GRK binders, available at <a href="https://gpcrdb.org/">GPCRdb</a>. We also used an integrated | ||
meta-coupling dataset derived from a meta-analysis of the aforementioned assays, entailing binding affinities of 164 GPCRs | ||
for 14 G-proteins. For the TGF assay, we considered a receptor coupled to a G-protein if the logarithm (base 10) | ||
of the relative intrinsic activity (logRAi) was greater than -1, and not-coupled otherwise. Similarly, for the GEMTA assay, | ||
we considered a receptor coupled to a G-protein (or β-arrestins/GRK) if the binding efficacy (dnorm Emax) was greater than 0, | ||
and not-coupled otherwise. For the integrated meta-coupling dataset, | ||
we considered a receptor coupled to a G-protein if the integrated binding affinity was greater than 0, and not-coupled otherwise. | ||
<p class="text-justify fs-0"> | ||
We obtained experimental binding affinities from two distinct sources: the TGF assay (12), which captures | ||
the binding | ||
affinities of 148 GPCRs with 11 chimeric G-proteins, and the ebBRET assay (referred to below as the GEMTA assay), which profiles the | ||
binding affinities of 97 | ||
GPCRs with 12 G-proteins and 3 β-arrestin/GRK binders, available at <a href="https://gpcrdb.org/">GPCRdb</a>. | ||
We also used an integrated | ||
meta-coupling dataset derived from a meta-analysis of the aforementioned assays, entailing binding | ||
affinities of 164 GPCRs | ||
for 14 G-proteins. For the TGF assay, we considered a receptor coupled to a G-protein if the | ||
logarithm (base 10) | ||
of the relative intrinsic activity (logRAi) was greater than -1, and not-coupled otherwise. | ||
Similarly, for the GEMTA assay, | ||
we considered a receptor coupled to a G-protein (or β-arrestins/GRK) if the binding efficacy (dnorm | ||
Emax) was greater than 0, | ||
and not-coupled otherwise. For the integrated meta-coupling dataset, | ||
we considered a receptor coupled to a G-protein if the integrated binding affinity was greater than | ||
0, and not-coupled otherwise. | ||
</p> | ||
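The labelling rules in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration only: the dictionary keys, the function name, and the example values are assumptions made for the sketch, and only the three thresholds come from the text.

```python
# Thresholds quoted in the text; the key names are illustrative.
THRESHOLDS = {
    "TGF": -1.0,   # coupled if logRAi > -1
    "GEMTA": 0.0,  # coupled if dnorm Emax > 0
    "META": 0.0,   # coupled if the integrated binding affinity > 0
}


def coupling_label(assay: str, value: float) -> int:
    """Return 1 (coupled) if value is strictly above the assay threshold, else 0."""
    return int(value > THRESHOLDS[assay])


# Made-up measurements, used only to exercise the rule.
examples = [("TGF", -0.4), ("TGF", -1.0), ("GEMTA", 0.25), ("META", -0.1)]
labels = [coupling_label(assay, value) for assay, value in examples]
print(labels)  # [1, 0, 1, 0]
```

The comparison is strict, so a value exactly at the threshold is labelled not-coupled, matching the "greater than" wording in the text.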
|
||
<img class="pt-md-0 center" src="static/img/gallery/workflow.png" alt="Method workflow"/> | ||
<h2 class="fw-bold mb-4">Model training</h2> | ||
<p class="text-justify fs-0">We developed the new PRECOGx by training multiple models using the protein embeddings | ||
derived from the pre-trained ESM-1b model as features. For every pair of a coupling group (G-protein/β-arrestins) | ||
and assay dataset (TGF/GEMTA assays), we created a training matrix with vectors, each containing the decomposed | ||
PCA values of a receptor embedding along with the binary label (coupled/not-coupled) as the last element. We implemented | ||
the predictor using either a logistic regression or a support vector classifier from the <a href="https://scikit-learn.org/">scikit-learn</a> library. | ||
<p class="text-justify fs-0">We developed the new PRECOGx by training multiple models using the protein | ||
embeddings | ||
derived from the pre-trained ESM-1b model as features. For every pair of a coupling group | ||
(G-protein/β-arrestins) | ||
and assay dataset (TGF/GEMTA assays), we created a training matrix with vectors, each containing the | ||
decomposed | ||
PCA values of a receptor embedding along with the binary label (coupled/not-coupled) as the last | ||
element. We implemented | ||
the predictor using either a logistic regression or a support vector classifier from the <a | ||
href="https://scikit-learn.org/">scikit-learn</a> library. | ||
A grid search was performed using | ||
a stratified 5-fold cross validation (CV) to select the best hyperparameters of the classifier. | ||
We repeated the process 10 times to minimize variance. We generated a total of 34 models per G-protein (or β-arrestin) and assay. | ||
The best models were chosen based on the highest AUC (Area Under the Curve) score during the 5-fold cross-validation. | ||
We repeated the process 10 times to minimize variance. We generated a total of 34 models per | ||
G-protein (or β-arrestin) and assay. | ||
The best models were chosen based on the highest AUC (Area Under the Curve) score during the 5-fold | ||
cross-validation. | ||
</p> | ||
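The training procedure described above might look roughly like the following scikit-learn sketch. It is not the PRECOGx code: the data are synthetic, the parameter grid is invented, PCA is fitted inside the cross-validation pipeline, and the "10 repetitions" are modelled with RepeatedStratifiedKFold, all of which are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for per-receptor embeddings (not real ESM-1b vectors).
n_receptors, n_dims = 100, 64
X = rng.normal(size=(n_receptors, n_dims))
y = np.array([0, 1] * (n_receptors // 2))  # binary coupled / not-coupled labels
X[y == 1, :5] += 1.5  # shift a few dimensions so the classes are separable

# PCA decomposition followed by a logistic regression classifier.
pipeline = Pipeline([
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical hyperparameter grid.
param_grid = {"pca__n_components": [5, 10], "clf__C": [0.1, 1.0, 10.0]}

# Stratified 5-fold CV repeated 10 times; models are ranked by ROC AUC.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Swapping LogisticRegression for sklearn.svm.SVC (with a matching grid) would give the support vector variant mentioned in the text.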
<h2 class="fw-bold mb-4">Model testing</h2> | ||
<p class="text-justify fs-0">We benchmarked our method against PRECOG, the web-server for GPCR/G-protein coupling predictions that we | ||
previously developed. We obtained independent lists of 117 receptors (<strong>TGF assay</strong> data as the training set) | ||
and 160 receptors (<strong>GEMTA assay</strong> data as the training set) from the GtoPdb that are absent from both assay datasets. | ||
Since <a href="http://www.guidetopharmacology.org/">GtoPdb</a> lacks a proper true negative set, we used Recall (REC) as a measure | ||
<p class="text-justify fs-0">We benchmarked our method against PRECOG, the web-server for GPCR/G-protein | ||
coupling predictions that we | ||
previously developed. We obtained independent lists of 117 receptors (<strong>TGF assay</strong> data as the | ||
training set) | ||
and 160 receptors (<strong>GEMTA assay</strong> data as the training set) from the GtoPdb that are absent | ||
from both assay datasets. | ||
Since <a href="http://www.guidetopharmacology.org/">GtoPdb</a> lacks a proper true negative set, we | ||
used Recall (REC) as a measure | ||
to compare the performance of PRECOGx with PRECOG. To assess over-fitting, we performed | ||
the <a href="https://link.springer.com/article/10.1023/A:1009752403260">randomization test</a> by randomly shuffling the original | ||
labels of the training matrix, while preserving the ratio of the number of coupled to not-coupled receptors. | ||
the <a href="https://link.springer.com/article/10.1023/A:1009752403260">randomization test</a> by | ||
randomly shuffling the original | ||
labels of the training matrix, while preserving the ratio of the number of coupled to not-coupled | ||
receptors. | ||
</p> | ||
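The two evaluation ideas above, recall on a positives-only test set and label shuffling for the randomization test, can be illustrated with plain Python. The function names and toy labels are hypothetical. Because a permutation only reorders the labels, the coupled to not-coupled ratio is preserved automatically.

```python
import random


def shuffle_labels(labels, seed=None):
    """Return a shuffled copy of labels; a permutation keeps the class ratio unchanged."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def recall(y_true, y_pred):
    """Recall = TP / (TP + FN), computed over the known positives only."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy training labels: 3 coupled, 5 not-coupled.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
randomized = shuffle_labels(labels, seed=42)
print(sum(randomized), len(randomized))  # 3 8

# Toy test set containing only known couplings (no true negatives).
print(recall([1, 1, 1, 1], [1, 0, 1, 1]))  # 0.75
```

A model retrained on the randomized labels is expected to perform near chance, so a large gap between it and the real model argues against over-fitting.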
<h2 class="fw-bold mb-4">PCA of the GPCRome embedded space</h2> | ||
<p class="text-justify fs-0">We generated embeddings for the human GPCRome, comprising a total of 377 | ||
receptors (279 Class A, 15 Class B1, 17 Class B2, 17 Class C, 11 Class F, 25 Taste receptors and 14 in other classes). | ||
We used the embeddings generated by considering all the layers. Embeddings were subjected to | ||
Principal Component Analysis (PCA), using the PCA function from the scikit-learn library. Each GPCR sequence within | ||
the embedding is annotated with functional information: (i) coupling specificities (known from the TGF assay, the GEMTA assay, | ||
the GtoPdb, and, for β-arrestins, the STRING database); (ii) GPCR class membership (known from the GtoPdb). | ||
receptors (279 Class A, 15 Class B1, 17 Class B2, 17 Class C, 11 Class F, 25 Taste receptors and 14 | ||
in other classes). | ||
We used the embeddings generated by considering all the layers. Embeddings were | ||
subjected to | ||
Principal Component Analysis (PCA), using the PCA function from the scikit-learn library. Each GPCR | ||
sequence within | ||
the embedding is annotated with functional information: (i) coupling specificities (known from the | ||
TGF assay, the GEMTA assay, | ||
the GtoPdb, and, for β-arrestins, the STRING database); (ii) GPCR class membership (known from the | ||
GtoPdb). | ||
</p> | ||
<h2 class="fw-bold mb-4">Contact analysis</h2> | ||
<p class="text-justify fs-0">To interpret the determinants of binding specificity, we first calculated predicted contacts | ||
for each sequence using a logistic regression over the model's attention maps, available in the ESM library through | ||
the predict_contacts function. Then, the predicted contact maps were grouped on the basis of G-protein binding specificity | ||
<p class="text-justify fs-0">To interpret the determinants of binding specificity, we first calculated | ||
predicted contacts | ||
for each sequence using a logistic regression over the model's attention maps, available in the ESM | ||
library through | ||
the predict_contacts function. Then, the predicted contact maps were grouped on the basis of | ||
G-protein binding specificity | ||
and contrasted with the contact maps of all the GPCRs, which were used as | ||
background. We computed differential contact maps by calculating a log-odds ratio, employing the following formula: | ||
background. We computed differential contact maps by calculating a log-odds ratio, employing the | ||
following formula: | ||
</p> | ||
<img class="pt-md-0 center" src="static/img/gallery/formula.png" alt="Math formula"/> | ||
<p class="text-justify fs-0">Where the AA and BB terms represent the number of GPCRs coupled to a specific G-protein, depending on the assay, | ||
that have or do not have a specific contact pair, respectively. The CC and DD terms represent the number of GPCRs not coupled to a | ||
specific G-protein, depending on the assay, that have or do not have a specific contact pair, respectively. Contacts contributed | ||
by the loops, N-terminus, and C-terminus of the GPCR were aggregated. Contact pairs considered are those with a probability higher | ||
than 0.5, as calculated by the predict_contacts function, and those appearing in at least 15% of all GPCRs in the assays. We computed the log-odds | ||
ratio using the Table2x2 class from <a href="https://www.statsmodels.org/">StatsModels</a>. The resulting log-odds ratio was normalized using the MaxAbsScaler | ||
from scikit-learn. Contacts with a positive log-odds ratio (enriched) are seen more frequently in receptors coupled to a specific G-protein, | ||
while contacts with a negative log-odds ratio (depleted) are seen less frequently in receptors coupled to a specific G-protein. | ||
<p class="text-justify fs-0">Where the AA and BB terms represent the number of GPCRs coupled to a specific | ||
G-protein, depending on the assay, | ||
that have or do not have a specific contact pair, respectively. The CC and DD terms represent the number | ||
of GPCRs not coupled to a | ||
specific G-protein, depending on the assay, that have or do not have a specific contact pair, | ||
respectively. Contacts contributed | ||
by the loops, N-terminus, and C-terminus of the GPCR were aggregated. Contact pairs considered are | ||
those with a probability higher | ||
than 0.5, as calculated by the predict_contacts function, and those appearing in at least 15% of all | ||
GPCRs in the assays. We computed the log-odds | ||
ratio using the Table2x2 class from <a href="https://www.statsmodels.org/">StatsModels</a>. The | ||
resulting log-odds ratio was normalized using the MaxAbsScaler | ||
from scikit-learn. Contacts with a positive log-odds ratio (enriched) are seen more frequently in | ||
receptors coupled to a specific G-protein, | ||
while contacts with a negative log-odds ratio (depleted) are seen less frequently in receptors | ||
coupled to a specific G-protein. | ||
</p> | ||
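Since the formula itself is only available as an image, the following is a sketch under the assumption that it is the standard log-odds ratio of a 2x2 contingency table, log((a*d)/(b*c)). The helper names, the contact-pair keys, and the counts are invented for illustration; the zero-cell correction (adding 0.5 to every cell) is a common convention and may or may not match what was actually used.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Log-odds ratio log((a*d)/(b*c)) for the 2x2 table [[a, b], [c, d]].

    a, b: coupled receptors with / without the contact pair
    c, d: not-coupled receptors with / without the contact pair
    If any cell is zero, 0.5 is added to every cell to avoid division by zero.
    """
    cells = [a, b, c, d]
    if any(x == 0 for x in cells):
        cells = [x + 0.5 for x in cells]
    a, b, c, d = cells
    return math.log((a * d) / (b * c))


def max_abs_scale(values):
    """Divide by the largest absolute value so that scores fall in [-1, 1]."""
    peak = max(abs(v) for v in values)
    return [v / peak if peak else 0.0 for v in values]


# Hypothetical counts per contact pair: (a, b, c, d) as defined above.
counts = {
    "pair_1": (30, 10, 20, 40),  # contact more common among coupled receptors
    "pair_2": (8, 32, 30, 30),   # contact less common among coupled receptors
}
raw = {pair: log_odds_ratio(*cells) for pair, cells in counts.items()}
scaled = dict(zip(raw, max_abs_scale(list(raw.values()))))
print(raw)
print(scaled)
```

With these counts pair_1 gets a positive (enriched) score and pair_2 a negative (depleted) one, and scaling by the maximum absolute value keeps the sign while bounding the range.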
<img class="pt-md-0 center" src="static/img/gallery/contact.png" alt="contact analysis"/> | ||
|
||
|
@@ -110,17 +154,25 @@ <h2 class="fw-bold mb-4">Attention maps</h2> | |
<h2 class="fw-bold mb-4">Libraries used</h2> | ||
<p class="text-justify fs-0">The following libraries were used to build the webserver:</p> | ||
<ul> | ||
<li class="text-justify fs-0"> ESM </li> | ||
<li class="text-justify fs-0"> NGL viewer </li> | ||
<li class="text-justify fs-0"> jQuery </li> | ||
<li class="text-justify fs-0"> neXtProt </li> | ||
<li class="text-justify fs-0"> Bootstrap </li> | ||
<li class="text-justify fs-0"> Flask </li> | ||
<li class="text-justify fs-0"> Scikit-learn </li> | ||
<li class="text-justify fs-0"> DataTables </li> | ||
<li class="text-justify fs-0"> Plotly </li> | ||
<li class="text-justify fs-0"> ESM</li> | ||
<li class="text-justify fs-0"> NGL viewer</li> | ||
<li class="text-justify fs-0"> jQuery</li> | ||
<li class="text-justify fs-0"> neXtProt</li> | ||
<li class="text-justify fs-0"> Bootstrap</li> | ||
<li class="text-justify fs-0"> Flask</li> | ||
<li class="text-justify fs-0"> Scikit-learn</li> | ||
<li class="text-justify fs-0"> DataTables</li> | ||
<li class="text-justify fs-0"> Plotly</li> | ||
</ul> | ||
<h5 class="fw-bold mb-4">Contact:</h5> | ||
<h2 class="fw-bold">Cite</h2> | ||
<p class="text-justify fs-0"> | ||
Marin Matic, Gurdeep Singh, Francesco Carli, Natalia De Oliveira Rosa, Pasquale Miglionico, Lorenzo | ||
Magni, J Silvio Gutkind, Robert B Russell, Asuka Inoue, Francesco Raimondi, PRECOGx: exploring GPCR | ||
signaling mechanisms with deep protein representations, Nucleic Acids Research, Volume 50, Issue W1, | ||
5 July 2022, Pages W598–W610, | ||
<a href="https://doi.org/10.1093/nar/gkac426" | ||
target="_blank">https://doi.org/10.1093/nar/gkac426</a></p> | ||
<h5 class="fw-bold mb-3">Contact</h5> | ||
<p class="text-justify fs-0">Francesco Raimondi - [email protected]</p> | ||
<p class="text-justify fs-0">Marin Matic - [email protected]</p> | ||
<p class="text-justify fs-0">Gurdeep Singh - [email protected]</p> | ||
|