cite

raimondilab · Jan 3, 2024 · 7b9b3fd
1 parent d676f1e
commit 7b9b3fd
Showing 1 changed file with 107 additions and 55 deletions.
diff --git a/templates/about.html b/templates/about.html
@@ -15,6 +15,7 @@
 
 }
 
+
 </script>
 {% endblock%}
 
@@ -27,76 +28,119 @@
             <div class="col-md-12 text-center text-md-start fs-1 mb-5">
                 <h1 class="fw-bold mb-4">About - Methods</h1>
                 <p class="text-justify fs-0">
-                    We developed PRECOGx, a machine learning predictor of GPCR interactions with G-protein and β-arrestin, by using the ESM1b protein embeddings as features and experimental binding datasets.
+                    We developed PRECOGx, a machine learning predictor of GPCR interactions with G-protein and
+                    β-arrestin, by using the ESM1b protein embeddings as features and experimental binding datasets.
                 </p>
                 <h2 class="fw-bold mb-4">Embeddings generation</h2>
 
-                <p class="text-justify fs-0">Embeddings of the protein sequences were generated by using pre-trained protein language models that have
-                    been recently released. We computed embeddings from fasta sequence using the extract.py function of the <a href= "https://github.com/facebookresearch/esm">ESM library </a>
-                    and by specifying the ESM-1b model (esm1b_t33_650M_UR50S) with embedding for individual amino acids as well as averaged over the full
+                <p class="text-justify fs-0">Embeddings of the protein sequences were generated using recently
+                    released
+                    pre-trained protein language models. We computed embeddings from FASTA sequences using the extract.py script of
+                    the <a href="https://github.com/facebookresearch/esm">ESM library</a>,
+                    specifying the ESM-1b model (esm1b_t33_650M_UR50S), with embeddings for individual amino acids
+                    as well as averaged over the full
                     sequence using the option <i>“--include mean per_tok”</i>.
                 </p>
-                <p class="text-justify fs-0">We generated embeddings for each individual layers separately, including the final one, by specifying their corresponding
+                <p class="text-justify fs-0">We generated embeddings for each individual layer separately, including
+                    the final one, by specifying their corresponding
                     number in the <i>“--repr-layers” option</i>.
                 </p>
                 <h2>Data sets</h2>
-		    <p class="text-justify fs-0">
-                We obtained experimental binding affinities from two distinct sources:  TGF assay(12), which captures the binding
-                affinities of 148 GPCRs with 11 chimeric G-proteins, and the ebBRET assay, which profiles the binding affinities of 97
-                GPCRs with 12 G-proteins and 3 β-arrestins/GRKs binders, available at <a href="https://gpcrdb.org/">gpcrdb</a>. We also used an integrated
-                meta-coupling dataset derived from a meta-analysis of the aforementioned assays, entailing binding affinities of 164 GPCRs
-                for 14 G-proteins. For the TGF assay, we considered a receptor coupled to a G-protein if the logarithm (base 10)
-                of the relative intrinsic activity (logRAi) was greater than -1, and not-coupled otherwise. Similarly, for the GEMTA assay,
-                we considered a receptor coupled to a G-protein (or β-arrestins/GRK) if the binding efficacy (dnorm Emax) was greater than 0,
-                and not-coupled otherwise. For the integrated meta-coupling dataset,
-                we considered a receptor coupled to a G-protein if the integrated binding affinity was greater than 0, and not-coupled otherwise.
+                <p class="text-justify fs-0">
+                    We obtained experimental binding affinities from two distinct sources: the TGF assay (12), which captures
+                    the binding
+                    affinities of 148 GPCRs with 11 chimeric G-proteins, and the ebBRET assay, which profiles the
+                    binding affinities of 97
+                    GPCRs with 12 G-proteins and 3 β-arrestins/GRKs binders, available at <a href="https://gpcrdb.org/">gpcrdb</a>.
+                    We also used an integrated
+                    meta-coupling dataset derived from a meta-analysis of the aforementioned assays, entailing binding
+                    affinities of 164 GPCRs
+                    for 14 G-proteins. For the TGF assay, we considered a receptor coupled to a G-protein if the
+                    logarithm (base 10)
+                    of the relative intrinsic activity (logRAi) was greater than -1, and not-coupled otherwise.
+                    Similarly, for the GEMTA assay,
+                    we considered a receptor coupled to a G-protein (or β-arrestins/GRK) if the binding efficacy (dnorm
+                    Emax) was greater than 0,
+                    and not-coupled otherwise. For the integrated meta-coupling dataset,
+                    we considered a receptor coupled to a G-protein if the integrated binding affinity was greater than
+                    0, and not-coupled otherwise.
                 </p>
 
                 <img class="pt-md-0 center" src="static/img/gallery/workflow.png" alt="Method workflow"/>
                 <h2 class="fw-bold mb-4">Model training</h2>
-                <p class="text-justify fs-0">We developed the new PRECOGx by training multiple models using the protein embeddings
-                    derived from the pre-trained ESM-1b model as features. For every pair of  a coupling group (G-protein/β-arrestins)
-                    and assay dataset (TGF/GEMTA assays), we created a training matrix with vectors, each containing the decomposed
-                    PCA values of a receptor embedding along with the binary label (coupled/not-coupled) as the last element. We implemented
-                    the predictor using either a logistic regression or support vector classifier from the <a href="https://scikit-learn.org/">Scikit library</a> library.
+                <p class="text-justify fs-0">We developed the new PRECOGx by training multiple models using the protein
+                    embeddings
+                    derived from the pre-trained ESM-1b model as features. For every pair of a coupling group
+                    (G-protein/β-arrestins)
+                    and assay dataset (TGF/GEMTA assays), we created a training matrix with vectors, each containing the
+                    decomposed
+                    PCA values of a receptor embedding along with the binary label (coupled/not-coupled) as the last
+                    element. We implemented
+                    the predictor using either a logistic regression or support vector classifier from the <a
+                            href="https://scikit-learn.org/">scikit-learn</a> library.
                     A grid search was performed using
                     a stratified 5-fold cross validation (CV) to select the best hyperparameters of the classifier.
-                    We repeated the process 10 times to ensure a minimum variance. We generated a total of 34 models per G-protein (or β-arrestin) and assay.
-                    The best models were chosen based on the highest AUC (Area Under the Curve) score during the 5-fold cross-validation.
+                    We repeated the process 10 times to ensure a minimum variance. We generated a total of 34 models per
+                    G-protein (or β-arrestin) and assay.
+                    The best models were chosen based on the highest AUC (Area Under the Curve) score during the 5-fold
+                    cross-validation.
                 </p>
                 <h2 class="fw-bold mb-4">Model testing</h2>
-                <p class="text-justify fs-0">We benchmarked our method against PRECOG, the web-server for GPCR/G-protein coupling predictions that we
-                    previously developed. We obtained an independent list of 117 (<strong>TGF assay</strong> data as the training set),
-                    and 160 receptors (<strong>GEMTA assay</strong>  as the training set) from the GtoPdb that are absent in both the assay datasets.
-                    Since <a href="http://www.guidetopharmacology.org/">GtoPdb</a> lacks a proper true negative set, we used Recall (REC) as a measure
+                <p class="text-justify fs-0">We benchmarked our method against PRECOG, the web-server for GPCR/G-protein
+                    coupling predictions that we
+                    previously developed. We obtained an independent list of 117 (<strong>TGF assay</strong> data as the
+                    training set),
+                    and 160 receptors (<strong>GEMTA assay</strong> as the training set) from the GtoPdb that are absent
+                    in both the assay datasets.
+                    Since <a href="http://www.guidetopharmacology.org/">GtoPdb</a> lacks a proper true negative set, we
+                    used Recall (REC) as a measure
                     to compare the performance of PRECOGx with PRECOG. To assess over-fitting, we performed
-                    the <a href="https://link.springer.com/article/10.1023/A:1009752403260">randomization test</a> by randomly shuffling the original
-                    labels of the training matrix, while preserving the ratio of the number of coupled to not-coupled receptors.
+                    the <a href="https://link.springer.com/article/10.1023/A:1009752403260">randomization test</a> by
+                    randomly shuffling the original
+                    labels of the training matrix, while preserving the ratio of the number of coupled to not-coupled
+                    receptors.
                 </p>
                 <h2 class="fw-bold mb-4">PCA of the GPCRome embedded space</h2>
                 <p class="text-justify fs-0">We generated embeddings for the human GPCRome, comprising a total of 377
-                    receptors (279 Class A, 15 Class B1, 17 Class B2, 17 class C, 11 class F, 25 Taste receptors and 14 in other classes).
-                    We considered either the embeddings generated by considering all the layers. Embeddings were subjected to
-                    Principal Component Analysis (PCA), using the PCA function from the Scikit library. Each GPCR sequence within
-                    the embedding is annotated with functional information (i) coupling specificities (known from  the TGF assay,  GEMTA assay,
-                    the GtoPdb, and the STRING (for β-arrestins) databases; (ii) GPCR class membership (known from the GtoPdb).
+                    receptors (279 Class A, 15 Class B1, 17 Class B2, 17 class C, 11 class F, 25 Taste receptors and 14
+                    in other classes).
+                    We considered the embeddings generated from all the layers. Embeddings were
+                    subjected to
+                    Principal Component Analysis (PCA), using the PCA function from the scikit-learn library. Each GPCR
+                    sequence within
+                    the embedding is annotated with functional information: (i) coupling specificities (known from the
+                    TGF assay, GEMTA assay,
+                    the GtoPdb, and the STRING (for β-arrestins) databases); (ii) GPCR class membership (known from the
+                    GtoPdb).
                 </p>
                 <h2 class="fw-bold mb-4">Contact analysis</h2>
-                <p class="text-justify fs-0">To interpret the determinants of binding specificity, we first calculated predicted contacts
-                    for each sequence using a logistic regression over the model's attention maps, available in the ESM library through
-                    the predict_contacts function. Then, the predicted contact maps were grouped on the basis of G-protein binding specificity
+                <p class="text-justify fs-0">To interpret the determinants of binding specificity, we first calculated
+                    predicted contacts
+                    for each sequence using a logistic regression over the model's attention maps, available in the ESM
+                    library through
+                    the predict_contacts function. Then, the predicted contact maps were grouped on the basis of
+                    G-protein binding specificity
                     and contrasted with the contact maps of all the GPCRs, which were used as a
-                    background. We computed a differential contact maps by calculating a log-odds ratio, employing the following formula:
+                    background. We computed differential contact maps by calculating a log-odds ratio, employing the
+                    following formula:
                 </p>
                 <img class="pt-md-0 center" src="static/img/gallery/formula.png" alt="Math formula"/>
-                <p class="text-justify fs-0">Where AA and BB terms represent a number of coupled GPCR to a specific G-protein depending on the assay ,
-                    that has or does not have a specific contact pair respectively. CC and DD terms represent the number of not-uncoupled GPCR for a
-                    specific G-protein depending on the assay, that has or does not have a specific contact pair respectively. Contacts contributed
-                    from the loops, N-terminal, and C-terminal of the GPCR were aggregated. Contact pairs considered are those with a probability higher
-                    than 0.5 calculated based on predict_contacts function and those appearing in at least 15% of all GPCR in the assays.  We computed log-odds
-                    ratio using the Table2x2 function from <a href="https://www.statsmodels.org/">StatsModels</a>. The resulting log-odds ratio was normalized using the MaxAbsScaler
-                    from Sscikit-learn.Contacts with a positive log-odds ratio (enriched) are seen more frequently in receptors coupled to a specific G-protein,
-                    while contacts with a negative log-odds ratio (depleted) are seen less frequently in receptors coupled to a specific G-protein.
+                <p class="text-justify fs-0">Where the A and B terms represent the number of GPCRs coupled to a specific
+                    G-protein, depending on the assay,
+                    that have or do not have a specific contact pair, respectively. The C and D terms represent the number
+                    of not-coupled GPCRs for a
+                    specific G-protein, depending on the assay, that have or do not have a specific contact pair,
+                    respectively. Contacts contributed
+                    by the loops, N-terminus, and C-terminus of the GPCR were aggregated. Contact pairs considered are
+                    those with a probability higher
+                    than 0.5, as calculated by the predict_contacts function, and appearing in at least 15% of all
+                    GPCRs in the assays. We computed the log-odds
+                    ratio using the Table2x2 class from <a href="https://www.statsmodels.org/">StatsModels</a>. The
+                    resulting log-odds ratio was normalized using the MaxAbsScaler
+                    from scikit-learn. Contacts with a positive log-odds ratio (enriched) are seen more frequently in
+                    receptors coupled to a specific G-protein,
+                    while contacts with a negative log-odds ratio (depleted) are seen less frequently in receptors
+                    coupled to a specific G-protein.
                 </p>
                 <img class="pt-md-0 center" src="static/img/gallery/contact.png" alt="contact analysis"/>
 
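Note on the contact-analysis paragraph in the hunk above: the formula itself is only present as an image (formula.png), so the following is a minimal sketch rather than the authors' implementation. It assumes the standard odds ratio of a 2x2 table, (A·D)/(B·C), where A/B count coupled receptors with/without a contact pair and C/D count not-coupled receptors with/without it. The page states that StatsModels' Table2x2 and scikit-learn's MaxAbsScaler were used; this sketch reimplements both steps with the standard library only, and the 0.5 zero-cell correction mirrors what Table2x2 does by default as far as I know. The contact-pair names and counts are invented for illustration.

```python
import math


def log_odds_ratio(a, b, c, d, shift_zeros=True):
    """Log-odds ratio of the 2x2 table [[a, b], [c, d]].

    a, b: coupled receptors with / without the contact pair
    c, d: not-coupled receptors with / without the contact pair
    Assumption: if any cell is zero, 0.5 is added to every cell so the
    ratio stays finite.
    """
    cells = [float(a), float(b), float(c), float(d)]
    if shift_zeros and min(cells) == 0:
        cells = [x + 0.5 for x in cells]
    a, b, c, d = cells
    return math.log((a * d) / (b * c))


def max_abs_scale(values):
    """Divide every value by the largest absolute value (range [-1, 1])."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)
    return [v / peak for v in values]


# Hypothetical counts (A, B, C, D) for two illustrative contact pairs.
counts = {
    ("TM3", "TM6"): (30, 10, 20, 40),
    ("TM5", "ICL3"): (5, 35, 30, 30),
}

raw = {pair: log_odds_ratio(*abcd) for pair, abcd in counts.items()}
scaled = dict(zip(raw, max_abs_scale(list(raw.values()))))

for pair, value in scaled.items():
    label = "enriched" if value > 0 else "depleted"
    print(pair, round(value, 3), label)
```

A positive scaled value marks a contact seen more often in coupled receptors, a negative one a contact seen less often, matching the enriched/depleted reading described in the text.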
@@ -110,17 +154,25 @@ <h2 class="fw-bold mb-4">Attention maps</h2>
                 <h2 class="fw-bold mb-4">Libraries used</h2>
                 <p class="text-justify fs-0">The following libraries were used to build the webserver:</p>
                 <ul>
-                <li class="text-justify fs-0"> ESM </li>
-                <li class="text-justify fs-0">  NGL viewer </li>
-                <li class="text-justify fs-0"> jQuery </li>
-                <li class="text-justify fs-0">  neXtProt </li>
-                <li class="text-justify fs-0">  Bootstrap </li>
-                <li class="text-justify fs-0">  Flask </li>
-                <li class="text-justify fs-0">  Scikit-learn </li>
-                    <li class="text-justify fs-0">  DataTables </li>
-                <li class="text-justify fs-0">  Plotly </li>
+                    <li class="text-justify fs-0"> ESM</li>
+                    <li class="text-justify fs-0"> NGL viewer</li>
+                    <li class="text-justify fs-0"> jQuery</li>
+                    <li class="text-justify fs-0"> neXtProt</li>
+                    <li class="text-justify fs-0"> Bootstrap</li>
+                    <li class="text-justify fs-0"> Flask</li>
+                    <li class="text-justify fs-0"> Scikit-learn</li>
+                    <li class="text-justify fs-0"> DataTables</li>
+                    <li class="text-justify fs-0"> Plotly</li>
                 </ul>
-                <h5 class="fw-bold mb-4">Contact:</h5>
+                <h2 class="fw-bold">Cite</h2>
+                <p class="text-justify fs-0">
+                    Marin Matic, Gurdeep Singh, Francesco Carli, Natalia De Oliveira Rosa, Pasquale Miglionico, Lorenzo
+                    Magni, J Silvio Gutkind, Robert B Russell, Asuka Inoue, Francesco Raimondi, PRECOGx: exploring GPCR
+                    signaling mechanisms with deep protein representations, Nucleic Acids Research, Volume 50, Issue W1,
+                    5 July 2022, Pages W598–W610,
+                    <a href="https://doi.org/10.1093/nar/gkac426"
+                       target="_blank">https://doi.org/10.1093/nar/gkac426</a></p>
+                <h5 class="fw-bold mb-3">Contact</h5>
                 <p class="text-justify fs-0">Francesco Raimondi - [email protected]</p>
                 <p class="text-justify fs-0">Marin Matic - [email protected]</p>
                 <p class="text-justify fs-0">Gurdeep Singh - [email protected]</p>