Commit
Showing 19 changed files with 4,862 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,4 +3,3 @@ | |
*.csv | ||
|
||
/.quarto/ | ||
/_site/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
[ | ||
{ | ||
"objectID": "index.html", | ||
"href": "index.html", | ||
"title": "EnrichGT Document", | ||
"section": "", | ||
"text": "install.packages(\"pak\")\npak::pkg_install(\"ZhimingYe/EnrichGT\")\nor\nlibrary(devtools)\ninstall_github(\"ZhimingYe/EnrichGT\")\nThe AnnotationDbi, fgsea, reactome.db and GO.db packages come from Bioconductor and may be slow to install. If installation fails, check your network connection, update R and Bioconductor, or, on an older version of R, install through Posit Package Manager." | ||
}, | ||
{ | ||
"objectID": "index.html#core-function", | ||
"href": "index.html#core-function", | ||
"title": "EnrichGT Document", | ||
"section": "Core Function", | ||
"text": "Core Function\n\nEnrichment of genes\nThis is a C++-accelerated over-representation analysis (ORA) tool.\n\n\n\n\n\n\nThe difference from other tools\n\n\n\n\n\nEnrichGT differs slightly from clusterProfiler, the most popular enrichment tool, mainly to accommodate wet-lab researchers. First, most beginners are confused by the default input of clusterProfiler, which is “ENTREZ ID.” Most people familiar with biology are used to Gene Symbols, and even Ensembl IDs are not widely known, let alone a series of seemingly random numbers. Therefore, EnrichGT uses Gene Symbol as the default input, which integrates seamlessly with most downstream results from various companies and makes it more suitable for non-experts in the lab.\nSecond, clusterProfiler outputs an S4 object, which may be too complex for beginners (this is no joke), whereas EnrichGT outputs a simple table. The time of non-experts is precious, so I made these two adjustments. The only downside is that the GSEA peak plot is difficult to generate, but in practice we focus more on NES and p-values, and for those, bar plots are more convincing.\nFurthermore, the pre-processing step of the hypergeometric test in EnrichGT’s ORA function (which determines overlap) is accelerated using hash tables in C++, making it over five times faster than clusterProfiler::enricher(), which is a pure R implementation.\n\n\n\nres <- egt_enrichment_analysis(genes = DEGtable$Genes,\ndatabase = database_GO_BP())\n\nres <- egt_enrichment_analysis(genes = c(\"TP53\",\"CD169\",\"CD68\",\"CD163\",\n \"You can add more genes\"),\ndatabase = database_GO_ALL())\n\nres <- egt_enrichment_analysis(genes = c(\"TP53\",\"CD169\",\"CD68\",\"CD163\",\n \"You can add more genes\"),\ndatabase = database_from_gmt(\"MsigDB_Hallmark.gmt\"))\n\n\n\n\n\n\nExample of ORA\n\n\n\n\n\n\nlibrary(dplyr)\nlibrary(tibble)\nlibrary(org.Hs.eg.db)\nlibrary(gt)\nlibrary(testthat)\nlibrary(withr)\nlibrary(EnrichGT)\nlibrary(readr)\n\n\nDEGexample <- 
read_csv(\"./DEG.csv\")\n\nNew names:\nRows: 15903 Columns: 7\n── Column specification\n──────────────────────────────────────────────────────── Delimiter: \",\" chr\n(1): ...1 dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj\nℹ Use `spec()` to retrieve the full column specification for this data. ℹ\nSpecify the column types or set `show_col_types = FALSE` to quiet this message.\n• `` -> `...1`\n\nDEGexample_UpReg <- DEGexample |> dplyr::filter(pvalue<0.05,log2FoldChange>0.7)\nora_result <- egt_enrichment_analysis(genes = DEGexample_UpReg$...1,database = database_GO_BP(org.Hs.eg.db))\n\n\n✔ success loaded database, time used : 17.1571810245514\n\nhead(ora_result)\n\n ID Description GeneRatio\n1 GO:0035249 synaptic transmission, glutamatergic 19/457\n2 GO:0051966 regulation of synaptic transmission, glutamatergic 16/457\n3 GO:0050804 modulation of chemical synaptic transmission 39/457\n4 GO:0099177 regulation of trans-synaptic signaling 39/457\n5 GO:0050808 synapse organization 36/457\n6 GO:0048168 regulation of neuronal synaptic plasticity 12/457\n BgRatio pvalue p.adjust qvalue\n1 111/18870 2.055491e-11 8.053235e-08 8.409591e-05\n2 79/18870 5.742011e-11 8.053235e-08 8.409591e-05\n3 489/18870 7.257625e-11 8.053235e-08 8.409591e-05\n4 490/18870 7.711980e-11 8.053235e-08 8.409591e-05\n5 483/18870 2.512088e-09 2.098598e-06 1.109159e-03\n6 56/18870 7.497368e-09 4.183196e-06 1.109159e-03\n geneID\n1 ATP1A2/GRIA4/GRID2/GRIK2/GRIK3/GRIN1/GRIN2A/GRIN2B/GRIN2D/GRM1/GRM5/GRM8/DGKI/NRXN1/NLGN1/UNC13A/MAPK8IP2/CACNG5/UNC13C\n2 ATP1A2/GRIK2/GRIK3/GRIN1/GRIN2A/GRIN2B/GRIN2D/GRM1/GRM5/GRM8/DGKI/NRXN1/NLGN1/UNC13A/MAPK8IP2/CACNG5\n3 ACHE/APOE/ATP1A2/CA2/CAMK2B/CDC20/GFAP/GRIA4/GRID2/GRIK2/GRIK3/GRIN1/GRIN2A/GRIN2B/GRIN2D/GRM1/GRM5/GRM8/HRAS/MAP1B/NTRK2/SLC6A1/CNTN2/VGF/WNT5A/INA/DGKI/DLGAP1/NRXN1/RIMS3/NMU/NLGN1/UNC13A/MAPK8IP2/ERC2/CACNG5/LRFN2/UNC13C/SHISA9\n4 
ACHE/APOE/ATP1A2/CA2/CAMK2B/CDC20/GFAP/GRIA4/GRID2/GRIK2/GRIK3/GRIN1/GRIN2A/GRIN2B/GRIN2D/GRM1/GRM5/GRM8/HRAS/MAP1B/NTRK2/SLC6A1/CNTN2/VGF/WNT5A/INA/DGKI/DLGAP1/NRXN1/RIMS3/NMU/NLGN1/UNC13A/MAPK8IP2/ERC2/CACNG5/LRFN2/UNC13C/SHISA9\n5 ACHE/APOE/KIF1A/CAMK2B/CDC20/CDH6/CTNNA2/DSCAM/GAP43/GRID2/GRIN2B/GRM5/MAP1B/NRCAM/NTRK2/RAC3/SIX1/SLC6A1/CNTN2/WNT5A/INA/NRXN1/NLGN1/UNC13A/ERC2/IL1RAPL2/SEZ6L2/TREM2/LRFN2/IGSF9/BCAN/SYNDIG1/DNER/ADGRF1/LHFPL4/UNC13C\n6 APOE/CAMK2B/GRIK2/GRIN1/GRIN2A/GRIN2B/GRIN2D/GRM5/HRAS/CNTN2/VGF/SHISA9\n Count\n1 19\n2 16\n3 39\n4 39\n5 36\n6 12\n\n\n\n\n\n\n\n\n\n\n\nHave many sources of genes?\n\n\n\nThis function also supports multiple groups of genes: pass them as a named list.\n# For many groups of genes\nres <- egt_enrichment_analysis(list(Macrophages=c(\"CD169\",\"CD68\",\"CD163\"),\nFibroblast=c(\"COL1A2\",\"COL1A3\"),\"You can add more groups\"),\n database = database_from_gmt(\"panglaoDB.gmt\"))\n\n\n\n\nEnrichment of weighted genes (GSEA)\nGenes with specific weights (e.g. the log2FC) can be analyzed with the GSEA method. The input must be a pre-ranked gene set. The backend is the C++-accelerated fgsea::fgsea(), so it is also very fast.\n\n\n\n\n\n\nHow to build a pre-ranked gene set?\n\n\n\nThe genes_with_weights(genes, weights) function builds the pre-ranked gene set for GSEA analysis.\n\n\n# From DEG analysis Results\nres <- egt_gsea_analysis(genes = \n genes_with_weights(genes = DEG$genes, \n weights = DEG$log2FoldChange),\n database = database_GO_BP()\n )\n\n# From PCA\nres <- egt_gsea_analysis(genes = genes_with_weights(genes = PCA_res$genes,\n weights =PCA_res$PC1_loading),\n database = database_from_gmt(\"MsigDB_Hallmark.gmt\")\n )" | ||
}, | ||
{ | ||
"objectID": "index.html#featured-function", | ||
"href": "index.html#featured-function", | ||
"title": "EnrichGT Document", | ||
"section": "Featured Function", | ||
"text": "Featured Function\n\nEnrichment of Enriched Results\nIs the enriched result too messy? Clean it up!\n\n\n\n\n\n\nFrom clusterProfiler?\n\n\n\nThis also supports results from clusterProfiler, so you can run the initial enrichment with either tool.\n\n\n\n\n\n\n\n\nWhy is re-enrichment necessary?\n\n\n\n\n\n\nChallenges in Biological Gene Enrichment Analysis\nGene enrichment analysis can often be misleading due to the redundancy within gene set databases and the limitations of most enrichment tools. Many tools, by default, only display a few top results and fail to filter out redundancy. This can result in both biological misinterpretation and valuable information being overlooked.\nFor instance, high expression of certain immune genes can cause many immune-related gene sets to appear overrepresented. However, a closer look often reveals that these gene sets are derived from the same group of genes, which might represent only a small fraction (less than 10%) of the differentially expressed genes (DEGs). What about the other 90%? Do they hold no biological significance?\n\n\nCurrent Solutions\nclusterProfiler is one of the most powerful tools in R for enrichment analysis. It’s designed with pathway redundancy in mind and includes the clusterProfiler::simplify function to address this issue. This method, based on GOSemSim for GO similarity evaluation, is scientifically robust and highly effective.\nHowever, there are some drawbacks:\n\nGOSemSim is not fast, particularly when dealing with large or complex gene sets.\nIt doesn’t support databases like KEGG or Reactome.\n\nUsing GOSemSim to measure the semantic similarity between pathways is, theoretically, the best way to tackle redundancy. 
However, in practical cases, especially in experimental bioinformatics validation, researchers are more focused on the genes behind these pathways than on the pathways themselves.\n\n\nAlternative Approaches\nAlthough clustering pathways based on gene overlap has received some criticism, it remains a viable approach in many situations. For this reason, I developed BioThemeFinder a few years ago to solve this problem. However, that tool is poorly written (I am not good at coding…).\nToday, two excellent alternatives exist:\n\nsimplifyEnrichment: This package is more scientifically rigorous (based on semantic similarity) and creates beautiful visualizations. However, it also doesn’t support databases like KEGG or Reactome.\naPEAR: A simpler and faster tool that better aligns with practical needs, making it my preferred choice.\n\nHowever, both of these tools have a common limitation: their visualizations are optimized for publication purposes rather than for exploratory research. I often find myself exporting CSV files or struggling with RStudio’s preview pane to fully explore enrichment tables. They are also slow. This inspired me to develop a more efficient solution.\n\n\nGoals of This Package\nThe main purpose of developing this package is to provide a lightweight and practical solution to the problems mentioned above. Specifically, this package aims to:\nCluster enrichment results based on hit genes or core enrichment from GSEA using term frequency analysis (from the output of the powerful clusterProfiler). This provides a clearer view of biological relevance by focusing on the genes that matter most.\n\n\n\n\n# From results generated before\nres <- egt_enrichment_analysis(genes = DEGtable$Genes,\ndatabase = database_GO_BP())\n\nre_enrichment_results <- egt_recluster_analysis(\n res,\n ClusterNum = 17,\n P.adj = 0.05,\n force = FALSE,\n nTop = 10,\n method = \"ward.D2\"\n)\nYou can see the structure of egt_obj. 
The first slot is the result table, and the second slot contains the gt table.\n\nstr(re_enrich,max.level = 2)\n\nFormal class 'EnrichGT_obj' [package \"EnrichGT\"] with 7 slots\n ..@ enriched_result : tibble [103 × 7] (S3: tbl_df/tbl/data.frame)\n ..@ gt_object :List of 17\n .. ..- attr(*, \"class\")= chr [1:2] \"gt_tbl\" \"list\"\n ..@ gene_modules :List of 16\n ..@ pathway_clusters :List of 16\n ..@ document_term_matrix:Formal class 'dgCMatrix' [package \"Matrix\"] with 6 slots\n ..@ clustering_tree :List of 7\n .. ..- attr(*, \"class\")= chr \"hclust\"\n ..@ raw_enriched_result :'data.frame': 175 obs. of 7 variables:" | ||
}, | ||
{ | ||
"objectID": "index.html#html-reports-gt-table", | ||
"href": "index.html#html-reports-gt-table", | ||
"title": "EnrichGT Document", | ||
"section": "HTML reports (gt table)", | ||
"text": "HTML reports (gt table)\nBecause a messy result table is hard to read, EnrichGT can also convert it into well-formatted gt HTML tables. This is supported only for re-enriched results.\n\nThe gt_object slot is a plain gt object, so you can use any gt function on it, for example:\nre_enrichment_results@gt_object |> gtsave(\"test.html\") # Save it with basic gt functions. \nFor further usage of the gt package, please refer to https://gt.rstudio.com/articles/gt.html.\nSee the re-enrichment example for a further demo." | ||
}, | ||
{ | ||
"objectID": "index.html#ploting-functions", | ||
"href": "index.html#ploting-functions", | ||
"title": "EnrichGT Document", | ||
"section": "Plotting functions", | ||
"text": "Plotting functions\n\n\n\n\n\n\nWarning\n\n\n\nThe dot plot supports both a simple enrichment-result data.frame and a re-enriched egt_object, but the UMAP plot supports only a re-enriched egt_object.\n\n\nThe HTML gt table covers most needs, but not all. We don’t want this package to become complex (you can always draw your own figures from the enriched tables with ggplot2), but we still provide a limited set of plotting functions.\n\nDot Plot\n\negt_plot_results(re_enrich)\n\n\n\n\n\n\n\n\n\n\nUMAP Plot\n\negt_plot_umap(re_enrich)\n\nWarning: ggrepel: 8 unlabeled data points (too many overlaps). Consider\nincreasing max.overlaps" | ||
}, | ||
{ | ||
"objectID": "index.html#databases-helpers", | ||
"href": "index.html#databases-helpers", | ||
"title": "EnrichGT Document", | ||
"section": "Database Helpers", | ||
"text": "Database Helpers\n\n\n\n\n\n\nHow to specify species?\n\n\n\nEnrichGT uses AnnotationDbi for this. You can use org.Hs.eg.db for human and org.Mm.eg.db for mouse. For other species, please refer to Bioconductor.\nFor databases that do not come from AnnotationDbi, you do not need to provide this; for example, database_CollecTRI_human() returns a human-only database.\n\n\n\nBuilt-in databases from AnnotationDbi\nPass the OrgDB argument to fetch them.\nExample:\ndatabase_GO_BP(OrgDB = org.Hs.eg.db)\n\nGO Database\ndatabase_GO_BP(), database_GO_CC(), database_GO_MF(), database_GO_ALL()\n\n\nReactome Database\ndatabase_Reactome()\n\n\nProgeny Database\nFor pathway activity inference: database_progeny_human() and database_progeny_mouse()\n\n\nCollecTRI Database\nFor transcription factor inference: database_CollecTRI_human() and database_CollecTRI_mouse()\n\n\n\nRead Additional Gene Sets from Local Files\nEnrichGT supports reading `GMT` files. You can obtain `GMT` files from MSigDB.\ndatabase_from_gmt(\"Path_to_your_Gmt_file.gmt\")\n\n\nWhere is KEGG?\nKEGG restricts commercial use, and its data must be downloaded through the KEGG REST API. I have not had time to implement this yet, but you can use the KEGG gene sets from MSigDB instead (KEGG_MED and KEGG_Classical).\n\n\nReading is slow?\nSince version 0.5.0, EnrichGT includes a cache system, so loading the same database a second time is much faster.\n\ntest <- database_GO_MF(org.Hs.eg.db)\n\n✔ success loaded database, time used : 5.44952607154846\n\ntest_reload <- database_GO_MF(org.Hs.eg.db)\n\n✔ Use cached database: GO_MF_org.Hs.eg.db" | ||
} | ||
] |