Variables in the training data missing in newdata #1

Open
karamveerverma37 opened this issue May 31, 2024 · 7 comments

@karamveerverma37

Hi,
I am trying to run REPTILE with the pre-trained model mm_model_coreMarks.reptile using methylation data. Is there a problem with how I generated the bigWig files? I have methylation base-call BED files containing chromosome, start, end, and methylation rate. I converted them to bigWig using the following commands:
awk '{printf "%s\t%d\t%d\t%2.3f\n" , $1,$2,$3,$4}' myBed.bed > myFile.bedgraph
sort -k1,1 -k2,2n myFile.bedgraph > myFile_sorted.bedgraph
bedGraphToBigWig myFile_sorted.bedgraph myChrom.sizes myBigWig.bw
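
If the conversion itself is in doubt, the sorted bedGraph can be sanity-checked before running bedGraphToBigWig. The following is a minimal Python sketch, not part of REPTILE: `read_chrom_sizes` and `check_bedgraph` are hypothetical helpers that flag chromosome names absent from the chrom.sizes file, coordinates past the chromosome end, empty intervals, and records that are out of order or overlap. It assumes a four-column, tab-separated file like the one produced by the awk command above.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def read_chrom_sizes(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'name<whitespace>length' lines from a chrom.sizes file."""
    sizes: Dict[str, int] = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            sizes[fields[0]] = int(fields[1])
    return sizes


def check_bedgraph(lines: Iterable[str], chrom_sizes: Dict[str, int]) -> List[str]:
    """Return human-readable problems found in four-column bedGraph lines."""
    problems: List[str] = []
    prev: Optional[Tuple[str, int, int]] = None  # previous (chrom, start, end)
    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            problems.append(f"line {n}: expected 4 tab-separated fields, got {len(fields)}")
            continue
        chrom, start_s, end_s, value_s = fields
        try:
            start, end = int(start_s), int(end_s)
            float(value_s)
        except ValueError:
            problems.append(f"line {n}: non-numeric start, end, or value")
            continue
        if chrom not in chrom_sizes:
            problems.append(f"line {n}: chromosome {chrom!r} not in chrom.sizes")
        elif end > chrom_sizes[chrom]:
            problems.append(f"line {n}: end {end} is past the end of {chrom}")
        if start >= end:
            problems.append(f"line {n}: start {start} is not below end {end}")
        if prev is not None and prev[0] == chrom:
            if start < prev[1]:
                problems.append(f"line {n}: not sorted by start within {chrom}")
            elif start < prev[2]:
                problems.append(f"line {n}: overlaps the previous interval")
        prev = (chrom, start, end)
    return problems


if __name__ == "__main__":
    # File names are the ones used in the commands above.
    with open("myChrom.sizes") as fh:
        sizes = read_chrom_sizes(fh)
    with open("myFile_sorted.bedgraph") as fh:
        for problem in check_bedgraph(fh, sizes):
            print(problem)
```

Chromosome naming is worth checking first: a bedGraph that uses `1` where the chrom.sizes file uses `chr1` would be reported on every line by this check.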

I tried the Meth epimark alone as well as all four marks (H3K4me1 etc.) given for the mm_model_coreMarks.reptile model. The output of REPTILE_preprocess.py is the file preprocessed.region_with_epimark.tsv, which looks like this:
chr start end id Meth_E4 H3K4me1_E4 H3K4me3_E4 H3K27ac_E4
chr1 0 2000 bin_0 0.0 0.0 0.0 0.0
chr1 100 2100 bin_1 0.0 0.0 0.0 0.0
chr1 200 2200 bin_2 0.0 0.0 0.0 0.0
chr1 300 2300 bin_3 0.0 0.0 0.0 0.0
chr1 400 2400 bin_4 0.0 0.0 0.0 0.0
chr1 500 2500 bin_5 0.0 0.0 0.0 0.0
chr1 600 2600 bin_6 0.0 0.0 0.0 0.0
chr1 700 2700 bin_7 0.0 0.0 0.0 0.0
chr1 800 2800 bin_8 0.0 0.0 0.0 0.0
chr1 900 2900 bin_9 0.0 0.0 0.0 0.0
chr1 1000 3000 bin_10 0.0 0.0 0.0 0.0
.
.
chr1 3211200 3213200 bin_32112 5.0 5.0 5.0 5.0
chr1 3211300 3213300 bin_32113 5.0 5.0 5.0 5.0
chr1 3211400 3213400 bin_32114 5.0 5.0 5.0 5.0
chr1 3211500 3213500 bin_32115 4.0 4.0 4.0 4.0
chr1 3211600 3213600 bin_32116 3.3 3.3 3.3 3.3
chr1 3211700 3213700 bin_32117 2.54545 2.54545 2.54545 2.54545
chr1 3211800 3213800 bin_32118 2.69231 2.69231 2.69231 2.69231
chr1 3211900 3213900 bin_32119 3.0 3.0 3.0 3.0
chr1 3212000 3214000 bin_32120 2.85714 2.85714 2.85714 2.85714
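
In the excerpt above, all four mark columns carry the same value in every row shown. One possible explanation, not confirmed in this thread, is that every mark in the data info file points at the same bigWig. Whether the columns are identical across the whole file can be checked with a short Python sketch. `identical_mark_columns` is a hypothetical helper, and it assumes a tab-separated file whose first four columns are chr, start, end and id, as in the header shown.

```python
import csv
import io
from itertools import combinations
from typing import List, TextIO, Tuple


def identical_mark_columns(handle: TextIO, n_coord_cols: int = 4) -> List[Tuple[str, str]]:
    """Return pairs of mark columns whose values agree on every data row."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    marks = header[n_coord_cols:]
    # Start with every pair as a candidate and drop a pair at its first mismatch.
    pairs = set(combinations(range(len(marks)), 2))
    for row in reader:
        if len(row) != len(header):
            continue  # skip malformed or truncated rows
        values = row[n_coord_cols:]
        pairs = {(i, j) for (i, j) in pairs if values[i] == values[j]}
        if not pairs:
            break
    return [(marks[i], marks[j]) for i, j in sorted(pairs)]


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("...region_with_epimark.tsv").
    demo = io.StringIO(
        "chr\tstart\tend\tid\tMeth_E4\tH3K4me1_E4\n"
        "chr1\t0\t2000\tbin_0\t0.0\t0.0\n"
        "chr1\t100\t2100\tbin_1\t5.0\t5.0\n"
    )
    print(identical_mark_columns(demo))
```

Values are compared as strings, so the check is exact and makes no attempt to treat `5` and `5.0` as equal.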

Now when I run the compute score command:
REPTILE_compute_score.R -i data_info_file2 -m mm_model_coreMarks.reptile -a tmp/mm39_w2kb_s100bp_preprocessed.region_with_epimark.tsv -s E4 -o tmp/E4__compute_pred

I get the following error:
Error in predict.randomForest(reptile_classifier, epimark, type = "prob") :
variables in the training data missing in newdata
Calls: reptile_predict_genome_wide ... reptile_predict_one_mode -> predict -> predict.randomForest
Execution halted
Is there a trained model available that uses only DNA methylation data to predict enhancers?
Note: I tried both genome-wide and region-specific prediction.
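
The message comes from R's randomForest package: its predict method stops with this error when a predictor used during training has no column of the same name in the data passed for prediction. One way to narrow this down is to compare the names the model expects with the names actually present. The Python sketch below does only that comparison. `header_columns` and `missing_predictors` are hypothetical helpers, and the expected names have to be read from the model itself; how REPTILE stores or derives its feature names is not shown in this thread, so the names in the usage example are placeholders.

```python
from typing import Iterable, List


def header_columns(header_line: str, n_coord_cols: int = 4) -> List[str]:
    """Split a tab-separated header and drop the leading coordinate columns."""
    return header_line.rstrip("\n").split("\t")[n_coord_cols:]


def missing_predictors(expected: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return expected predictor names that are absent from `available`.

    Names are compared exactly, including case.
    """
    return sorted(set(expected) - set(available))


if __name__ == "__main__":
    # Header taken from the preprocessed file shown above.
    header = "chr\tstart\tend\tid\tMeth_E4\tH3K4me1_E4\tH3K4me3_E4\tH3K27ac_E4"
    # Placeholder list: substitute the predictor names stored in the model.
    expected_names = ["Meth_E4", "SomeOtherFeature"]
    print(missing_predictors(expected_names, header_columns(header)))
```

Any name printed by this comparison is a candidate for the column that the prediction step could not find.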

@yupenghe
Owner

yupenghe commented Jun 3, 2024

Do you mind sharing some dummy input file for me to reproduce the error?

> Is there a trained model available that uses only DNA methylation data to predict enhancers?

I found that methylation alone did not generate good enough predictions, so I did not pursue this further.

@karamveerverma37
Author

Please find attached the files used as input.
I generated the bigWig file from the BED file (Methylation_Calls.Pseudobulk.E4.5-5.5.bed) as described above, and I am using the mm39 genome for the query regions.
Preprocessing was done using:
REPTILE_preprocess.py data_info_file mm39_w2kb_s200bp.bed mm39_w2kb_s200bp_preprocessed -g
input: data_info_file, mm39_w2kb_s200bp.bed (query region file)
output: mm39_w2kb_s200bp_preprocessed_regions_with_epimark.tsv

compute_score gives error:
REPTILE_compute_score.R -i data_info_file -m tmp/REPTILE_model.reptile -a mm39_w2kb_s200bp_preprocessed_regions_with_epimark -s E4 -o E4_pred

Error in predict.randomForest(reptile_classifier, epimark, type = "prob") :
variables in the training data missing in newdata
Calls: reptile_predict_genome_wide ... reptile_predict_one_mode -> predict -> predict.randomForest
Execution halted

Files:
issue.zip

@karamveerverma37
Author

karamveerverma37 commented Jun 4, 2024 via email

@yupenghe
Owner

yupenghe commented Jun 4, 2024

Thanks. I will take a look. Using mm39 will be an issue, but it is probably not the cause of the error you saw. Unfortunately, all models were trained and tested on data processed against mm10. If I am able to fix the error and you still want to run REPTILE on your data, I would suggest reprocessing your data on mm10.

@karamveerverma37
Author

Thanks for the suggestion. I will try it with mm10 data as well, but my data is on mm39 and I am doing other complementary analyses on mm39, so I cannot convert it to mm10.

@yupenghe
Owner

yupenghe commented Jun 5, 2024

I see. REPTILE probably won't generate what you want. I would recommend using peak calls from H3K27ac, or the overlapping peaks of H3K27ac and H3K4me1, as predicted enhancers.

@karamveerverma37
Author

Hi, can you share the full dataset used for training? Only a subset (chr19) is available in the example data.
