National Yang-Ming Chiao-Tung University, Institute of Biomedical Informatics, Cho-Yi Chen, Ph.D. lab, master's thesis

steven556610/Exploiting-network-embedding-in-pharmacogenomics-data-to-study-drug-disease-association

Master Thesis

Exploiting network embedding in pharmacogenomics to study drug-disease associations

Workflow

Material

network

STRING PPI network, version 11 (https://string-db.org). Node names were converted with mygene (https://docs.mygene.info/en/latest/).

pathway database

NCI & KEGG

https://www.kegg.jp
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2686461/
https://maayanlab.cloud/Enrichr/#libraries

REACTOME

https://reactome.org

MSigDB

http://www.gsea-msigdb.org/gsea/msigdb/index.jsp

disease gene set

from https://www.sciencedirect.com/science/article/pii/S2666389920301185

https://maayanlab.cloud/covid19/

  • CRISPR-i: https://www.nature.com/articles/s41467-021-21213-4
  • CRISPR-a: https://pubmed.ncbi.nlm.nih.gov/34431042/
  • interactome: https://pubmed.ncbi.nlm.nih.gov/33357464/

drug gene set

Downloaded from Enrichr:

https://maayanlab.cloud/Enrichr/#libraries

Tools

Set2Gaussian

https://github.com/wangshenguiuc/set2gaussian

Enrichr

https://maayanlab.cloud/Enrichr/

machine learning

  • SVM
  • Random forest
  • XGBoost
  • Decision tree

All selected models use the default parameter settings of the scikit-learn package.

Result

network preprocess

Using powerlaw (Python package)

  • No filter
  • After filtering (0.7 confidence score)
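The confidence filter can be sketched as follows. This is a minimal illustration, not the thesis code: the edge list and gene names are hypothetical, and treating the cutoff as "greater than or equal to" is an assumption. STRING reports its combined score on a 0 to 1000 scale, so a confidence of 0.7 corresponds to 700. The degree counts are the kind of input a power-law fit would take.

```python
# Hypothetical edge list: (node_a, node_b, combined_score).
# STRING reports combined_score on a 0-1000 scale, so 0.7 maps to 700.
EDGES = [
    ("TP53", "MDM2", 999),
    ("TP53", "EGFR", 650),
    ("EGFR", "GRB2", 998),
    ("GRB2", "SOS1", 700),
]


def filter_edges(edges, min_confidence=0.7):
    """Keep edges whose combined score is at least min_confidence (assumed >=)."""
    threshold = int(round(min_confidence * 1000))
    return [(a, b, score) for a, b, score in edges if score >= threshold]


def degree_counts(edges):
    """Count the degree of every node that appears in the edge list."""
    degrees = {}
    for a, b, _ in edges:
        degrees[a] = degrees.get(a, 0) + 1
        degrees[b] = degrees.get(b, 0) + 1
    return degrees


filtered = filter_edges(EDGES)
print(len(EDGES), "edges before filtering,", len(filtered), "after")
print(degree_counts(filtered))
```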

positive control and negative control

This step tests whether a gene's membership in a gene set changes the dimension reduction output dramatically.

pathway member identification(multi-class classification)

The node embedding output is used as the feature matrix. To create the target matrix, the gene set lists are converted by one-hot encoding. Models are evaluated with five-fold cross-validation, scored by accuracy. training:testing = 2:1
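The encoding step can be sketched as below. The pathway names and gene identifiers are hypothetical placeholders, and the function name is illustrative rather than taken from the thesis code. Each row is a gene and each column a gene set; a gene that belongs to several sets gets several ones in its row.

```python
# Hypothetical pathway gene sets; names and members are placeholders.
GENE_SETS = {
    "pathway_A": ["G1", "G2"],
    "pathway_B": ["G2", "G3"],
    "pathway_C": ["G4"],
}


def build_target_matrix(gene_sets):
    """Return (genes, pathways, matrix) where matrix[i][j] is 1
    if gene i is a member of pathway j, else 0."""
    pathways = sorted(gene_sets)
    genes = sorted({g for members in gene_sets.values() for g in members})
    col = {p: j for j, p in enumerate(pathways)}
    row = {g: i for i, g in enumerate(genes)}
    matrix = [[0] * len(pathways) for _ in genes]
    for pathway, members in gene_sets.items():
        for gene in members:
            matrix[row[gene]][col[pathway]] = 1
    return genes, pathways, matrix


genes, pathways, matrix = build_target_matrix(GENE_SETS)
for gene, encoded in zip(genes, matrix):
    print(gene, encoded)
```

The rows of this matrix line up with the rows of the embedding feature matrix once both are ordered by the same gene list.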

pathway prioritization

The disease gene set is used to extract the corresponding gene vectors from the node embedding output. These are compared with each pathway's location in the embedding space (the mu output file) to analyze the distance between disease and pathway. The shorter the distance, the closer the relationship between the disease and the pathway.

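  • Both an arithmetic-mean version and a median version are implemented (the formulas are in the thesis)
  • The results are also in the thesis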
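A minimal sketch of the prioritization step is given below. The exact formulas are in the thesis; here the Euclidean distance, the toy 2-D vectors, and all names are assumptions made for illustration. The disease is summarized by the per-dimension mean (or median) of its gene vectors, and pathways are ranked by their distance to that point.

```python
import math
import statistics

# Hypothetical 2-D node embeddings and pathway centres (stand-ins for the "mu" output).
GENE_VECTORS = {"G1": [0.0, 0.0], "G2": [2.0, 0.0], "G3": [4.0, 6.0]}
PATHWAY_MU = {"pathway_A": [1.0, 0.0], "pathway_B": [5.0, 5.0]}


def centroid(genes, vectors, use_median=False):
    """Per-dimension mean (or median) of the vectors of genes present in the embedding."""
    found = [vectors[g] for g in genes if g in vectors]
    aggregate = statistics.median if use_median else statistics.mean
    return [aggregate(dim) for dim in zip(*found)]


def rank_pathways(disease_genes, vectors, pathway_mu, use_median=False):
    """Return (distance, pathway) pairs sorted from closest to farthest.
    Euclidean distance is an assumption of this sketch."""
    centre = centroid(disease_genes, vectors, use_median)
    return sorted((math.dist(centre, mu), name) for name, mu in pathway_mu.items())


ranking = rank_pathways(["G1", "G2"], GENE_VECTORS, PATHWAY_MU)
for distance, name in ranking:
    print(f"{name}: {distance:.3f}")
```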

drug repurposing

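As in the method above, the disease gene set is used to select the vectors of the corresponding genes in the embedding space, and their arithmetic mean is taken. Drug gene sets are handled the same way. This part also computes a z-score: with the disease gene set fixed, drug gene sets are randomly selected 100 times to compute the distance between the disease and the whole drug database. The mean and standard error of these distances give a z-score, which is used for ranking. This relative distance (z-score) describes the relationship between a drug and a disease; a smaller value means a shorter distance.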

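  • The number of random selections may not be large enough, which makes the results somewhat unstable. Other GSEA approaches use 100 selections, but with our method and choice of drug database the number of random selections may need to be re-determined
  • Both an arithmetic-mean version and a median version are implemented (the formulas are in the thesis)
  • For each disease gene set, 300 candidate drugs are ranked.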
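The z-score ranking can be sketched as follows. The exact formula is in the thesis; this illustration makes several assumptions: distances are Euclidean, random drug gene sets are drawn from the library with replacement, and the denominator is the sample standard deviation of the random distances. The vectors, drug names, and function names are hypothetical.

```python
import math
import random
import statistics

# Hypothetical 2-D embeddings; D* are disease genes, T* are drug target genes.
VECTORS = {"D1": [0.0, 0.0], "D2": [0.0, 2.0],
           "T1": [0.0, 1.0], "T2": [3.0, 1.0], "T3": [10.0, 1.0]}
DRUG_SETS = {"drug_near": ["T1"], "drug_mid": ["T2"], "drug_far": ["T3"]}


def centroid(genes, vectors):
    """Per-dimension arithmetic mean of the gene vectors."""
    return [statistics.mean(dim) for dim in zip(*(vectors[g] for g in genes))]


def drug_z_score(disease_genes, drug_name, drug_sets, vectors, n_random=100, seed=0):
    """z = (observed distance - mean of random distances) / their standard deviation."""
    rng = random.Random(seed)
    disease_centre = centroid(disease_genes, vectors)

    def distance(name):
        return math.dist(disease_centre, centroid(drug_sets[name], vectors))

    names = sorted(drug_sets)
    # Background: distances to n_random drug gene sets drawn at random from the library.
    background = [distance(rng.choice(names)) for _ in range(n_random)]
    return (distance(drug_name) - statistics.mean(background)) / statistics.stdev(background)


scores = {name: drug_z_score(["D1", "D2"], name, DRUG_SETS, VECTORS) for name in DRUG_SETS}
for name in sorted(scores, key=scores.get):
    print(f"{name}: {scores[name]:.2f}")
```

Raising `n_random` is the knob to try if the ranking turns out to be unstable between runs.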

drug combination

This part implements https://www.nature.com/articles/s41467-019-09186-x. The procedure relies mainly on two formulas (both are given in the paper and are implemented with networkx).

  • shortest distance
  • z-score
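  • separation value

The shortest distances between some genes were later precomputed and stored in a table (for speed). To satisfy the drug combination theory, the z-score between the disease and each drug must be less than zero, and the separation value between the two drugs must be greater than zero. We therefore computed the z-score for all 300 candidate drugs. Only one disease gene set had candidate drugs with a z-score below zero. For the candidate drugs that met this condition, we computed the pairwise separation values, and all of them were greater than zero. Finally, the results were ranked in two different ways.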
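The separation value and the precomputed distance table can be sketched as below. The authoritative definitions are in the cited paper; this sketch assumes the form s_AB = <d_AB> - (<d_AA> + <d_BB>) / 2 with mean pairwise shortest distances, and it uses a plain breadth-first search in place of networkx so that it runs without dependencies. The toy graph and target sets are hypothetical.

```python
from collections import deque
from itertools import combinations
from statistics import mean

# Hypothetical toy network: a path a1 - a2 - x - b1 - b2.
ADJACENCY = {
    "a1": ["a2"], "a2": ["a1", "x"], "x": ["a2", "b1"],
    "b1": ["x", "b2"], "b2": ["b1"],
}


def bfs_distances(adjacency, source):
    """Unweighted shortest distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def build_distance_table(adjacency, nodes):
    """Precompute distances once for the genes of interest (the speed-up table)."""
    return {node: bfs_distances(adjacency, node) for node in nodes}


def mean_within(table, targets):
    """Mean shortest distance over distinct pairs inside one target set."""
    pairs = list(combinations(sorted(targets), 2))
    return mean(table[a][b] for a, b in pairs) if pairs else 0.0


def separation(table, targets_a, targets_b):
    """Assumed form: <d_AB> - (<d_AA> + <d_BB>) / 2. Assumes all targets are connected."""
    between = mean(table[a][b] for a in targets_a for b in targets_b)
    return between - (mean_within(table, targets_a) + mean_within(table, targets_b)) / 2


table = build_distance_table(ADJACENCY, ["a1", "a2", "b1", "b2"])
print(separation(table, ["a1", "a2"], ["b1", "b2"]))  # separated target sets
print(separation(table, ["a1", "a2"], ["a1", "a2"]))  # fully overlapping target sets
```

With networkx, `nx.shortest_path_length(G, source, target)` would take the place of the breadth-first search.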

Appendix

Cancer driver genes from DriverDBv3 (http://driverdb.tms.cmu.edu.tw) were analyzed to verify two things.

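  1. For the same cancer gene set, different tools use different search methods and assumptions, yet the driver genes they find play similar roles in the network.
  2. Network embedding algorithms are able to capture network roles and properties.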
