Master Thesis

Exploiting network embedding in pharmacogenomics to study drug-disease associations

Workflow

Material

network

STRING PPI network, version 11: https://string-db.org
Node names were converted with mygene: https://docs.mygene.info/en/latest/

pathway database

NCI & KEGG

https://www.kegg.jp
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2686461/
https://maayanlab.cloud/Enrichr/#libraries

REACTOME

https://reactome.org

MSigDB

http://www.gsea-msigdb.org/gsea/msigdb/index.jsp

disease gene set

from https://www.sciencedirect.com/science/article/pii/S2666389920301185

https://maayanlab.cloud/covid19/

CRISPR-i: https://www.nature.com/articles/s41467-021-21213-4
CRISPR-a: https://pubmed.ncbi.nlm.nih.gov/34431042/
interactome: https://pubmed.ncbi.nlm.nih.gov/33357464/

drug gene set

Downloaded from the Enrichr libraries page:

https://maayanlab.cloud/Enrichr/#libraries

Tools

Set2Gaussian

https://github.com/wangshenguiuc/set2gaussian

Enrichr

https://maayanlab.cloud/Enrichr/

machine learning

SVM
Random forest
XGBoost
Decision tree

All selected models use the default parameter settings of the scikit-learn package.

Result

network preprocess

Using powerlaw (Python package).

No filter
After filtering (confidence score 0.7)
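The confidence filter can be sketched as follows. STRING link files store `combined_score` as an integer from 0 to 1000, so a confidence of 0.7 corresponds to 700. The function name `filter_links`, the inclusive `>=` comparison, and the toy rows are assumptions for illustration; the repository's notebook may do this differently.

```python
import csv
import io


def filter_links(lines, min_score=700):
    """Keep edges whose combined_score is at least min_score.

    `lines` is any iterable of text lines in the space-separated layout
    of a STRING links file (header: protein1 protein2 combined_score).
    """
    reader = csv.DictReader(lines, delimiter=" ")
    kept = []
    for row in reader:
        score = int(row["combined_score"])
        if score >= min_score:  # assumption: the threshold is inclusive
            kept.append((row["protein1"], row["protein2"], score))
    return kept


# Made-up rows, not real STRING identifiers.
sample = io.StringIO(
    "protein1 protein2 combined_score\n"
    "9606.P_A 9606.P_B 490\n"
    "9606.P_A 9606.P_C 900\n"
    "9606.P_B 9606.P_C 700\n"
)
edges = filter_links(sample)
print(edges)
```

For a real download, the same function could be given an open file handle instead of the `io.StringIO` sample.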

positive control and negative control

This step checks whether the dimension-reduction output changes dramatically depending on whether or not the genes belong to the gene set.

pathway member identification (multi-class classification)

The node embedding output is used as the feature matrix. To create the target matrix, the gene set lists are converted by one-hot encoding. Models are evaluated with five-fold cross-validation, using accuracy as the scoring metric. The training:testing ratio is 2:1.
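A minimal sketch of this setup with scikit-learn, assuming random toy embeddings and three hypothetical pathways. Because a gene may belong to several gene sets, the target here is an indicator matrix built with `MultiLabelBinarizer`; that choice, the random forest, and all names and sizes are assumptions, not the repository's actual code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

rng = np.random.default_rng(0)

# Hypothetical toy data: 60 genes with 8-dimensional embedding vectors.
genes = [f"G{i}" for i in range(60)]
X = rng.normal(size=(len(genes), 8))

# Hypothetical gene sets (pathway name -> member genes).
gene_sets = {
    "pathway_A": genes[:30],
    "pathway_B": genes[20:50],
    "pathway_C": genes[40:],
}

# One list of pathway labels per gene, then one-hot (indicator) encoding.
labels = [[p for p, members in gene_sets.items() if g in members] for g in genes]
Y = MultiLabelBinarizer().fit_transform(labels)

# training:testing = 2:1
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=1 / 3, random_state=0
)

# Default parameters apart from the seed; five-fold CV scored by accuracy.
clf = RandomForestClassifier(random_state=0)
cv_scores = cross_val_score(clf, X_train, Y_train, cv=5, scoring="accuracy")

clf.fit(X_train, Y_train)
test_accuracy = clf.score(X_test, Y_test)
print(cv_scores.mean(), test_accuracy)
```

With an indicator target, accuracy counts a gene as correct only when all of its pathway labels are predicted correctly, so scores on random features like these are expected to be low.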

pathway prioritization

Disease gene sets are used to extract the corresponding gene vectors from the node embedding output. Together with the pathway locations in the embedding space (the mu output file), these vectors are used to analyze the distance between a disease and a pathway. The shorter the distance, the closer the relationship between the disease and the pathway.

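Both an arithmetic-mean version and a median version are implemented (the formulas are in the thesis).
The results are also in the thesis.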
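The idea can be sketched as below: summarize the disease gene vectors by their mean or median, then rank pathways by distance to that point. The exact formulas are in the thesis; the Euclidean distance, the function names, and the toy vectors here are assumptions for illustration.

```python
import numpy as np


def disease_center(embedding, genes, how="mean"):
    """Summarize the embedding vectors of a disease gene set.

    `embedding` maps gene name -> vector; genes missing from it are skipped.
    """
    vectors = np.array([embedding[g] for g in genes if g in embedding])
    if how == "mean":
        return vectors.mean(axis=0)
    return np.median(vectors, axis=0)


def rank_pathways(center, pathway_mu):
    """Sort pathways by Euclidean distance (an assumption) to the disease center."""
    distances = {
        name: float(np.linalg.norm(center - np.asarray(mu)))
        for name, mu in pathway_mu.items()
    }
    return sorted(distances.items(), key=lambda item: item[1])


# Toy 2-D embedding and pathway locations (stand-ins for the mu output).
embedding = {"A": [0.0, 0.0], "B": [2.0, 0.0], "C": [10.0, 10.0]}
pathway_mu = {"far": [9.0, 9.0], "near": [1.0, 1.0]}

center = disease_center(embedding, ["A", "B"], how="mean")
ranking = rank_pathways(center, pathway_mu)
print(ranking)
```

Passing `how="median"` gives the median variant with the same ranking step.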

drug repurposing

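As in the method above, the disease gene set is used to select the vectors of the corresponding genes in the embedding space, and their arithmetic mean is taken. Drug gene sets are handled in the same way. This part also computes a z-score: with the disease gene set fixed, drug gene sets are randomly selected 100 times to compute the distance between the disease and the whole drug database. The resulting mean and standard error give a z-score, which is used for ranking. This relative distance (z-score) describes the relationship between a drug and a disease; a smaller value means a shorter distance.

The number of random selections may not be large enough, which makes the results somewhat unstable. Other GSEA approaches use 100 selections, but with our method and choice of drug database the number of random selections may need to be re-determined.
Both an arithmetic-mean version and a median version are implemented (the formulas are in the thesis).
For each disease gene set, 300 candidate drugs are ranked.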
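A rough sketch of the z-score ranking, under several assumptions: Euclidean distance between mean vectors, a null distribution built by drawing drug gene sets at random (with replacement) from the drug database, and the standard deviation of those random distances as the scale. The thesis formulas may differ, and every name below is hypothetical.

```python
import numpy as np


def center(embedding, genes):
    """Arithmetic mean of the embedding vectors of the given genes."""
    return np.mean([embedding[g] for g in genes if g in embedding], axis=0)


def zscore_ranking(embedding, disease_genes, drug_sets, n_random=100, seed=0):
    """Rank drugs by z-score of their distance to a fixed disease gene set."""
    rng = np.random.default_rng(seed)
    d_center = center(embedding, disease_genes)
    dist = {
        drug: float(np.linalg.norm(d_center - center(embedding, genes)))
        for drug, genes in drug_sets.items()
    }
    # Null distribution: distances to randomly selected drug gene sets.
    sampled = rng.choice(list(drug_sets), size=n_random, replace=True)
    null = np.array([dist[str(name)] for name in sampled])
    mu, sigma = null.mean(), null.std()
    z = {drug: (d - mu) / sigma for drug, d in dist.items()}
    return sorted(z.items(), key=lambda item: item[1])  # smallest z first


# Toy 2-D embedding and drug gene sets.
embedding = {
    "g1": [0.0, 0.0], "g2": [2.0, 0.0], "g3": [1.0, 1.0],
    "g4": [4.0, 4.0], "g5": [10.0, 0.0], "g6": [12.0, 0.0],
}
drug_sets = {"far_drug": ["g5", "g6"], "near_drug": ["g3"], "mid_drug": ["g4"]}

ranking = zscore_ranking(embedding, ["g1", "g2"], drug_sets)
print([name for name, _ in ranking])
```

Raising `n_random` is one way to probe the stability concern noted above.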

drug combination

Implementation of https://www.nature.com/articles/s41467-019-09186-x. It mainly relies on two formulas (both given in the paper) and is implemented with networkx.

shortest distance
z-score
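separation value

The shortest distances between genes were later precomputed and stored in a table (for speed-up). To satisfy the drug combination theory, the z-score between the disease and each drug must be below zero, and the separation value between the two drugs must be above zero. We therefore computed the z-score for all 300 candidate drugs. Only one disease gene set had candidate drugs with a z-score below zero. For the candidate drugs that met this condition, we computed the pairwise separation values, which all turned out to be greater than zero. Finally, the candidates were ranked in two different ways.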
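The precomputed distance table and the separation value can be sketched with networkx as below. The sketch assumes the commonly used closest-distance form, s_AB = <d_AB> - (<d_AA> + <d_BB>) / 2, and a connected graph; the authoritative definitions are in the cited paper, and the function names are hypothetical.

```python
import networkx as nx


def distance_table(graph):
    """Precompute all pairwise shortest-path lengths (the speed-up table)."""
    return dict(nx.all_pairs_shortest_path_length(graph))


def within_distance(table, nodes):
    """Mean distance from each node to its nearest other node in the same set."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0.0
    return sum(min(table[a][b] for b in nodes if b != a) for a in nodes) / len(nodes)


def between_distance(table, set_a, set_b):
    """Mean closest distance, taken over the nodes of both sets."""
    total = sum(min(table[a][b] for b in set_b) for a in set_a)
    total += sum(min(table[b][a] for a in set_a) for b in set_b)
    return total / (len(set_a) + len(set_b))


def separation(table, set_a, set_b):
    """s_AB > 0: the two target sets are topologically separated."""
    d_aa = within_distance(table, set_a)
    d_bb = within_distance(table, set_b)
    return between_distance(table, set_a, set_b) - (d_aa + d_bb) / 2


# Toy network: a path 0-1-2-3-4-5 standing in for the interactome.
graph = nx.path_graph(6)
table = distance_table(graph)

separated = separation(table, {0, 1}, {4, 5})
overlapping = separation(table, {1, 2}, {2, 3})
print(separated, overlapping)
```

On the toy path graph, the distant sets give a positive separation and the overlapping sets a negative one, which is the sign convention the drug combination criterion relies on.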

Appendix

Cancer driver genes from DriverDBv3 (http://driverdb.tms.cmu.edu.tw) were analyzed in order to verify two things.

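For the same cancer gene set, different tools use different methods and assumptions, yet the cancer driver genes they find play similar roles in the network.
Network embedding algorithms are able to capture the roles and properties of nodes in the network.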
