This repo is the implementation of "". The following dependencies are required:
- python==3.6
- matplotlib==3.5.1
- numpy==1.23.2
- openpyxl==3.0.9
- pandas==1.4.3
- scikit_learn==1.1.2
- seaborn==0.11.2
- torch==1.10.2+cu102
- transformers==4.17.0
- xgboost==1.6.1
Please follow the steps below, in order, to run our code.
- Run word2vec.py: Loads the DNA methylation dataset from the folder and trains a Word2Vec model. The trained parameters are saved to "net_word2vec.pth".
- Run extract word.py: Reads the vocabulary used in BERT pre-training and loads the pre-trained BERT model. The program filters out the English words, formats them as BERT input, and writes the vector for each English word to "BERT_vec.xlsx".
- Run compare.py: Reads "BERT_vec.xlsx" to obtain the vector for each English word, loads the trained Word2Vec parameters, and computes the vector representation of each 5-mer. 'DNA_Eng.csv' is generated by computing the cosine similarity between each 5-mer vector and each English word vector.
- Run fine-tuning.py: Reads the DNA methylation dataset and fine-tunes the network. To reproduce our experimental results, we provide the 'DNA_Eng.csv' used in our experiments. The fine-tuned network parameters are saved to 'Bestmodel_.pth'.
- Run Tsne_show.py: Loads the fine-tuned model and the test data, computes the new vector representations of the test set, and plots their distribution using t-SNE dimensionality reduction.
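For reference, the first step's tokenization of a DNA sequence into overlapping 5-mers (the "words" a Word2Vec model is trained on) might look like the sketch below; the function name and window handling are illustrative, not taken from word2vec.py:

```python
# Illustrative sketch: split a DNA sequence into overlapping 5-mers,
# which serve as the "words" for Word2Vec training.
def to_kmers(sequence, k=5):
    """Return the list of overlapping k-mers in `sequence`."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sentence = to_kmers("ACGTACGTAC")
print(sentence)  # ['ACGTA', 'CGTAC', 'GTACG', 'TACGT', 'ACGTA', 'CGTAC']
```

Each sequence becomes one "sentence" of 5-mers, so standard Word2Vec training can be applied unchanged.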
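The extraction step filters English words out of the BERT vocabulary; a minimal sketch of that filter follows, where the vocabulary list is a toy stand-in for BERT's real WordPiece vocab file:

```python
import re

# Toy stand-in for BERT's WordPiece vocabulary; the real one is
# loaded from the pre-trained tokenizer's vocab file.
vocab = ["[CLS]", "[SEP]", "the", "##ing", "protein", "dna2", "hello", "##s"]

# Keep only whole English words: purely alphabetic tokens that are
# neither special tokens ([CLS], [SEP], ...) nor subword pieces (##...).
english_words = [tok for tok in vocab if re.fullmatch(r"[a-zA-Z]+", tok)]
print(english_words)  # ['the', 'protein', 'hello']
```

In the real script, each surviving word would then be fed through the pre-trained BERT model and its embedding written to BERT_vec.xlsx.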
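The core computation of compare.py, pairing each 5-mer with its most similar English word by cosine similarity, can be sketched as below; the toy vectors, the demo output filename, and the exact CSV columns are assumptions:

```python
import csv
import numpy as np

# Toy embeddings; in the real pipeline these come from the trained
# Word2Vec model (5-mers) and from BERT_vec.xlsx (English words).
kmer_vecs = {"ACGTA": np.array([1.0, 0.0]), "CGTAC": np.array([0.0, 1.0])}
word_vecs = {"hello": np.array([0.9, 0.1]), "world": np.array([0.1, 0.9])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assign each 5-mer the English word with the highest cosine similarity.
rows = []
for kmer, kv in kmer_vecs.items():
    best = max(word_vecs, key=lambda w: cosine(kv, word_vecs[w]))
    rows.append((kmer, best, cosine(kv, word_vecs[best])))

# Demo output; the provided 'DNA_Eng.csv' is the real artifact.
with open("DNA_Eng_demo.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["5mer", "word", "cosine_similarity"])
    writer.writerows(rows)

print(rows[0][:2])  # ('ACGTA', 'hello')
```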
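The fine-tuning script keeps the parameters that perform best during training ('Bestmodel_.pth'). The checkpointing pattern is shown here with a toy logistic-regression stand-in rather than the actual network, so everything in this block is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy features/labels standing in for the DNA methylation dataset.
X = rng.normal(size=(200, 8))
true_w = rng.normal(size=8)
y = (X @ true_w > 0).astype(float)
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

w = np.zeros(8)
best_acc, best_w = 0.0, w.copy()

for epoch in range(50):
    # One step of gradient descent on the logistic loss.
    p = 1.0 / (1.0 + np.exp(-(X_train @ w)))
    w -= 0.1 * X_train.T @ (p - y_train) / len(y_train)

    # Keep the parameters with the best validation accuracy,
    # mirroring how the script saves 'Bestmodel_.pth'.
    val_acc = float(np.mean((X_val @ w > 0) == y_val))
    if val_acc > best_acc:
        best_acc, best_w = val_acc, w.copy()

print(best_acc >= 0.5)
```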
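Finally, the t-SNE visualization in Tsne_show.py follows the standard scikit-learn pattern; the random features below stand in for the fine-tuned model's representations of the test set:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for the fine-tuned model's vectors for the test set.
features = rng.normal(size=(100, 32))
labels = rng.integers(0, 2, size=100)

# Reduce to 2-D; perplexity must be smaller than the number of samples.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(features)

plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="coolwarm", s=10)
plt.savefig("tsne_demo.png")
print(emb.shape)  # (100, 2)
```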